This document contains instructions for both study administrators and Bridge Downstream developers. Once a study administrator has requested that a new study be set up and alerted the developers to the first batch of data exported to Synapse, the developers will execute the steps described here to ensure that data flows smoothly from Bridge and through the pipeline.

Prerequisites

Bridge must have already created the study project. Documentation on how to create a new study project in Synapse from Bridge is here: https://sagebionetworks.jira.com/l/cp/T5EPF30w.

It is not necessary for data to have already been exported by Bridge to the study project, except for the “Initiate data processing” steps in the Study Administrators and Bridge Downstream Developers sections below.

Study Administrators

Request Synapse study set up

...

  • The cron schedule – we want to stagger job start times as much as possible so that jobs do not overlap. See the example entry after this list.

  • The --file-view parameter – Update this to use the Bridge Raw Data View entity view created when using the bridge-analytics template tool above.

  • The --raw-folder-id parameter – Update this to specify the Bridge Raw Data folder for this study.

  • The --glue-workflow parameter – Update this to specify the Glue workflow for this study.

  • The --diff-s3-uri parameter – This Parquet dataset may not exist yet, in which case this job will raise an exception. To ensure that the Parquet dataset exists, complete the “Initiate data processing” step below.
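For reference, a cron.d entry might look like the sketch below. Every value here is a placeholder (the script path, Synapse IDs, workflow name, S3 URI, and field names are all hypothetical); copy the actual command and parameter values from the existing entries in the cron file.

    # Hypothetical entry; stagger the minute/hour fields relative to other studies.
    # cron.d format: minute hour day-of-month month day-of-week user command
    15 2 * * * ec2-user /path/to/bootstrap_trigger.py --file-view syn11111111 --raw-folder-id syn22222222 --glue-workflow example-study-workflow --diff-s3-uri s3://example-bucket/parquet/dataset_diff/ --diff-parquet-field <parquet-field> --diff-file-view-field <file-view-field>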

Note

There is no automatic deployment of the crontab to the EC2 instance running the cron job – unless you happen to be deploying the EC2 instance from scratch. For updates to propagate, you must write the new crontab to /etc/cron.d/bootstrap-trigger on the instance itself.
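For example, assuming the updated crontab is saved locally as bootstrap-trigger (a hypothetical local filename), it can be written into place on the instance like so:

    # Copy the updated crontab onto the instance and make it readable by cron.
    sudo cp bootstrap-trigger /etc/cron.d/bootstrap-trigger
    sudo chmod 644 /etc/cron.d/bootstrap-trigger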

Initiate data processing

Because we don’t want data to be submitted if the Parquet dataset specified in --diff-s3-uri does not exist (for example, if we incorrectly specified the dataset name, we would unknowingly submit all data in the study project each time the script is invoked), the bootstrap trigger script must be run manually for the first batch of data to ensure that the --diff-s3-uri dataset exists. Run the same command as specified in the cron file after removing the following parameters (see the sketch after this list):

  • --diff-s3-uri

...

  • --diff-parquet-field

...

  • --diff-file-view-field
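As a sketch, assuming the hypothetical entry shown earlier, the one-time manual invocation would look like this (all values are placeholders; keep every other parameter from the cron file as-is):

    /path/to/bootstrap_trigger.py --file-view syn11111111 --raw-folder-id syn22222222 --glue-workflow example-study-workflow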

Once the /wiki/spaces/BD/pages/2746613791 completes, the /wiki/spaces/BD/pages/2749759500 can be manually started. This is a good opportunity to:

  • Verify that workflows and jobs are completing successfully

...


  • Ensure that the resulting Parquet datasets can be read using pyarrow and that all data from the study project is represented. See the sketches below.
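To spot-check workflow runs programmatically, a minimal boto3 sketch (the workflow name is a placeholder; use the value passed to --glue-workflow):

    import boto3

    glue = boto3.client("glue")

    # "example-study-workflow" is a placeholder for the --glue-workflow value.
    response = glue.get_workflow_runs(Name="example-study-workflow", MaxResults=5)
    for run in response["Runs"]:
        print(run["WorkflowRunId"], run["Status"])

And to confirm that a resulting Parquet dataset is readable, a minimal pyarrow sketch (the S3 URI is a placeholder for the study's actual dataset location):

    import pyarrow.dataset as ds

    # Placeholder URI; substitute the study's actual Parquet dataset location.
    dataset = ds.dataset("s3://example-bucket/parquet/dataset_example/", format="parquet")
    table = dataset.to_table()

    # A readable schema and a plausible row count are quick sanity checks that
    # the dataset exists and contains the study's data.
    print(table.schema)
    print(table.num_rows)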

Once a full run-through of the pipeline is complete, study data will be automatically processed on the recurring schedule specified in the crontab in the previous step.