This document contains instructions for both study administrators and Bridge Downstream developers. Once the study administrator has requested that a new study be set up and has alerted the developers to the first batch of data exported to Synapse, the developers will execute the steps described here to ensure that data is flowing smoothly from Bridge through the pipeline.
Prerequisites
Bridge must have already created the study project. Documentation on how to create a new study project in Synapse from Bridge is here: https://sagebionetworks.jira.com/l/cp/T5EPF30w.
It is not necessary for data to have already been exported by Bridge to the study project, except for the “Initiate data processing” step in both the Study Administrator and Bridge Downstream Developers sections below.
Study Administrators
Request Synapse study set up
...
We don’t want data to be submitted when the Parquet dataset specified in --diff-s3-uri does not exist. For example, if the dataset name were specified incorrectly, every invocation of the script would unknowingly submit all data in the study project. For this reason, the bootstrap trigger script must be run manually for the first batch of data, which ensures that the --diff-s3-uri dataset exists. To do this, run the same command as specified in the cron file after removing the following parameters:
--diff-s3-uri
...
--diff-parquet-field
...
--diff-file-view-field
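As an illustration of the manual first run, the sketch below derives the command from a cron entry by dropping the three diff parameters. It assumes each parameter takes exactly one value, written either as `--flag value` or `--flag=value`. The script name, `--other-arg`, and all argument values are hypothetical placeholders, not the real cron entry.

```python
import shlex

# Parameters named in the text above; each is ASSUMED to take exactly one value.
DIFF_FLAGS = {"--diff-s3-uri", "--diff-parquet-field", "--diff-file-view-field"}


def strip_diff_flags(command: str) -> str:
    """Return `command` with the diff parameters and their values removed."""
    kept = []
    skip_next = False
    for token in shlex.split(command):
        if skip_next:
            # This token is the value of a diff parameter that was just dropped.
            skip_next = False
            continue
        name = token.split("=", 1)[0]
        if name in DIFF_FLAGS:
            # For the "--flag value" form, the next token must be dropped too.
            skip_next = "=" not in token
            continue
        kept.append(token)
    return shlex.join(kept)


# Hypothetical cron command; the real script name and arguments will differ.
cron_command = (
    "python bootstrap_trigger.py --other-arg value "
    "--diff-s3-uri s3://example-bucket/example-dataset "
    "--diff-parquet-field recordid --diff-file-view-field recordId"
)
first_run_command = strip_diff_flags(cron_command)
print(first_run_command)
```

Editing the command by hand works just as well. The point of the sketch is only that all three parameters are removed together with their values.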
Once the /wiki/spaces/BD/pages/2746613791 completes, the /wiki/spaces/BD/pages/2749759500 can be manually run. This is a good opportunity to check the following:
Verify that workflows and jobs are completing successfully
...
.
Check that data is passing /wiki/spaces/BD/pages/2751594608
...
.
Ensure that the resulting Parquet datasets can be read using pyarrow and that all data from the study project is represented.
Once a full run through the pipeline is complete, study data will be automatically processed on a recurring schedule as specified in the crontab in the previous step.
...