Setting Up a New Study
This document contains instructions for both study administrators and Bridge Downstream developers. Once the study administrator has requested that a new study be set up and alerted the developers to the first batch of data exported to Synapse, the developers will execute the steps described here to ensure that data flows smoothly from Bridge through the pipeline.
Prerequisites
Bridge must have already created the study project. Documentation on how to create a new study project in Synapse from Bridge is here: https://sagebionetworks.jira.com/l/cp/T5EPF30w.
It is not necessary for data to have already been exported by Bridge to the study project, except for the “Initiate data processing” steps in the Study Administrators and Bridge Downstream Developers sections below.
Study Administrators
Request Synapse study set up
Please fill out this Jira Service Desk ticket (to the best of your ability) requesting that a new study be set up: Submit a Data Processing request (e.g., ETL, Feature Engineering)
A link to the study on Synapse is sufficient to complete the request.
Initiate data processing
Once a first batch of data has been exported from Bridge to Synapse, alert the developers so that they can initiate processing of the first batch of data as described in the developer section below.
Bridge Downstream Developers
Complete Synapse study set up
Study setup can be completed by running the following script: https://github.com/Sage-Bionetworks/bridge-analytics-template/blob/main/src/copy_from_template.py
You will need to authenticate with both Synapse and AWS to run the script successfully. Comprehensive documentation on how to run the above script and on its behavior can be found here: https://sagebionetworks.jira.com/wiki/spaces/BD/pages/2758410253/Bridge-analytics-template+tool+documentation
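As a minimal sketch of those prerequisites (assuming Synapse authentication via a personal access token and AWS authentication via a named profile; the script’s exact arguments are covered in the documentation linked above, and the requirements.txt path is an assumption):

```bash
# Minimal sketch, not the authoritative procedure: authenticate to Synapse and AWS,
# then run the template-copy script. See the linked documentation for its arguments.
export SYNAPSE_AUTH_TOKEN="<synapse-personal-access-token>"   # used by the Synapse Python client
export AWS_PROFILE="<aws-profile-with-pipeline-access>"       # or another AWS credential source

git clone https://github.com/Sage-Bionetworks/bridge-analytics-template.git
cd bridge-analytics-template
pip install -r requirements.txt      # assumption: dependencies are listed in requirements.txt
python src/copy_from_template.py ... # supply arguments per the linked documentation
```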
Deploy the study
This step can be completed either before or after setting up the project in Synapse. Additionally, this step may be completed in conjunction with updating the crontab as described in the section below. Add a study stack configuration with the filename {app-identifier}-{study-identifier}.yaml to the following location and submit a PR to the main branch: https://github.com/Sage-Bionetworks/BridgeDownstream/tree/main/config/prod/studies
Unless there are unique considerations specific to this study (e.g., custom transformations or other major changes to the pipeline), use one of the existing study stack config files as a template for the new study stack config. Study stack configs use the template https://github.com/Sage-Bionetworks/BridgeDownstream/blob/main/templates/study-pipeline-infra.j2, so any parameterization of the stack config allowed by this template is acceptable. In most cases, however, it is sufficient to copy an existing study stack config and replace its study identifier with this study’s identifier (see the sketch below).
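As an illustrative sketch of that copy-and-substitute approach (all file names and identifiers below are hypothetical placeholders):

```bash
# Hypothetical sketch: copy an existing study stack config and substitute the study
# identifier. "example-app", "old-study", and "new-study" are placeholders.
cd BridgeDownstream/config/prod/studies
cp example-app-old-study.yaml example-app-new-study.yaml
sed -i 's/old-study/new-study/g' example-app-new-study.yaml
# Review the resulting file, then open a PR against the main branch.
```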
Update the crontab
Currently, we use a bootstrap trigger process to submit unprocessed data on Synapse to the pipeline. This is a script which runs on a cron schedule as defined here: https://github.com/Sage-Bionetworks/BridgeDownstream/blob/main/src/ec2/resources/crontab
In typical cases, it’s sufficient to copy an existing line in the cron file and update the following values to be specific to this study (see the sketch after this list):
The cron schedule – we want to stagger job start times as much as possible so that jobs do not overlap.
The --file-view parameter – Update this to use the Bridge Raw Data View entity view created by the bridge-analytics template tool above.
The --raw-folder-id parameter – Update this to specify the Bridge Raw Data folder for this study.
The --glue-workflow parameter – Update this to reference this study’s Glue workflow.
The --diff-s3-uri parameter – This Parquet dataset may not exist yet, in which case this job will raise an exception. To ensure that this Parquet dataset exists, complete the “Initiate data processing” step below.
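For illustration only, a new cron entry might look roughly like the sketch below; the schedule, user, script path, Synapse IDs, workflow name, and S3 URI are all placeholders, and the real command (including any parameters not listed above) should be copied from an existing line in the crontab linked above.

```
# Hypothetical crontab entry: copy a real line from the existing crontab and change
# only the study-specific values. Every schedule, user, path, ID, and name below is
# a placeholder, and the entry must stay on a single line.
30 2 * * * ec2-user /path/to/bootstrap-trigger-script --file-view syn00000001 --raw-folder-id syn00000002 --glue-workflow example-app-example-study-workflow --diff-s3-uri s3://example-bucket/example-app/example-study/parquet/dataset_example/
```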
There is no automatic deployment of the crontab to the EC2 instance running the cron job – unless you happen to be deploying the EC2 instance from scratch. It’s necessary to write the new crontab to /etc/cron.d/bootstrap-trigger for updates to propagate to the instance itself.
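For example, assuming shell access to the instance and a checkout of the BridgeDownstream repository on it (both assumptions), propagating the update could be as simple as:

```bash
# Hypothetical sketch: copy the updated crontab from a repository checkout on the
# instance into place. Assumes shell access and that the checkout lives in the home directory.
sudo cp ~/BridgeDownstream/src/ec2/resources/crontab /etc/cron.d/bootstrap-trigger
```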
Initiate data processing
Because we don’t necessarily want data to be submitted if the Parquet dataset specified in --diff-s3-uri does not exist (perhaps we incorrectly specified the dataset name and are unknowingly submitting all data in the study project each time the script is invoked), the bootstrap trigger script must be run manually for the first batch of data to ensure that the --diff-s3-uri dataset exists. This can be done by running the same command as specified in the cron file after removing the following parameters (see the sketch after this list):
--diff-s3-uri
--diff-parquet-field
--diff-file-view-field
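As a hedged sketch (the script path and Synapse IDs are placeholders; in practice, copy the actual command from this study’s crontab entry and drop the three parameters above):

```bash
# Hypothetical first manual run: the same command as the cron entry, with the
# --diff-* parameters removed. The script path and all IDs below are placeholders.
/path/to/bootstrap-trigger-script \
    --file-view syn00000001 \
    --raw-folder-id syn00000002 \
    --glue-workflow example-app-example-study-workflow
```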
Once the S3 to JSON workflow completes, the JSON to Parquet workflow can be manually run. This is a good opportunity to:
Verify that workflows and jobs are completing successfully.
Check that data is passing validation.
Ensure that the resulting Parquet datasets can be read using pyarrow and that all data from the study project is represented (see the sketch below).
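A minimal sketch of that pyarrow check, assuming the dataset lives at the S3 URI configured for --diff-s3-uri and that your AWS credentials grant read access (the URI below is a placeholder):

```python
# Minimal sketch: read a study's Parquet dataset with pyarrow and spot-check it.
# The S3 URI is a placeholder; use the dataset location configured for this study.
import pyarrow.dataset as ds

dataset = ds.dataset(
    "s3://example-bucket/example-app/example-study/parquet/dataset_example/",
    format="parquet",
)
table = dataset.to_table()
print(table.num_rows)       # confirm all records from the study project are represented
print(table.schema)         # confirm the expected columns and types are present
print(table.slice(0, 5))    # eyeball a few rows
```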
Once a full run through the pipeline is complete, study data will be automatically processed on a recurring schedule as specified in the crontab in the previous step.