...
- The cron schedule – we want to stagger job start times as much as possible so that jobs do not overlap (an illustrative crontab entry follows this list).
- The `--file-view` parameter – Update this to use the Bridge Raw Data View entity view created when using the bridge-analytics template tool above.
- The `--raw-folder-id` parameter – Update this to specify the Bridge Raw Data folder for this study.
- The `--glue-workflow` parameter.
- The `--diff-s3-uri` parameter – This Parquet dataset may not exist yet, in which case this job will raise an exception. To ensure that this Parquet dataset exists, complete the "Initiate data processing" step below.
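For illustration only, the crontab entries might look something like the sketch below; the script name `bootstrap_trigger.py`, the Synapse IDs, the workflow names, the S3 URIs, the field name, and the log paths are placeholders rather than values from this page. Note the staggered start times so that the two studies' jobs do not run at the same time.

```
# Study A: runs daily at 02:00 UTC
0 2 * * * python3 /home/ec2-user/bootstrap_trigger.py --file-view syn00000001 --raw-folder-id syn00000002 --glue-workflow study-a-workflow --diff-s3-uri s3://example-bucket/parquet/dataset_a --diff-parquet-field recordId >> /home/ec2-user/logs/study-a.log 2>&1

# Study B: staggered to 03:00 UTC so the two jobs do not overlap
0 3 * * * python3 /home/ec2-user/bootstrap_trigger.py --file-view syn00000003 --raw-folder-id syn00000004 --glue-workflow study-b-workflow --diff-s3-uri s3://example-bucket/parquet/dataset_b --diff-parquet-field recordId >> /home/ec2-user/logs/study-b.log 2>&1
```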
**Note:** There is no automatic deployment of the crontab to the EC2 instance running the cron job – unless you happen to be deploying the EC2 instance from scratch. It's necessary to write the new crontab to the EC2 instance manually.
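A minimal sketch of deploying the crontab by hand, assuming SSH access to the instance and that the updated crontab is kept in a file named `crontab` in the working directory (the user name and host name are placeholders):

```
# Copy the updated crontab file to the EC2 instance and install it for the
# user that runs the jobs, then print it back to confirm it took effect.
scp crontab ec2-user@ec2-example-host:/tmp/new-crontab
ssh ec2-user@ec2-example-host 'crontab /tmp/new-crontab && crontab -l'
```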
Initiate data processing
Because we don't necessarily want data to be submitted if the Parquet dataset specified in `--diff-s3-uri` does not exist (perhaps we incorrectly specified the dataset name and are unknowingly submitting all data in the study project each time the script is invoked), the bootstrap trigger script must be run manually for the first batch of data to ensure that the `--diff-s3-uri` dataset exists. This can be done by running the same command specified in the cron file after removing the `--diff-s3-uri` and `--diff-parquet-field` parameters (a sketch follows below). Once the /wiki/spaces/BD/pages/2746613791 completes, the /wiki/spaces/BD/pages/2749759500 can be started manually. This is a good opportunity to verify that jobs are completing successfully, that data is passing /wiki/spaces/BD/pages/2751594608, and that the resulting Parquet datasets can be read using pyarrow. Once one full run-through of the pipeline is complete, study data will be automatically processed on the recurring schedule specified in the crontab from the previous step.
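As a sketch of that bootstrap run, the commands below repeat a hypothetical cron entry with the two diff parameters removed, then spot-check a resulting Parquet dataset with pyarrow; the script name, Synapse IDs, workflow name, and S3 URI are placeholders, and AWS credentials are assumed to be available in the environment.

```
# Bootstrap run: the same command as the cron entry, minus --diff-s3-uri and
# --diff-parquet-field, so the diff Parquet dataset gets created on this first pass.
python3 /home/ec2-user/bootstrap_trigger.py \
    --file-view syn00000001 \
    --raw-folder-id syn00000002 \
    --glue-workflow study-a-workflow

# After the Glue workflow finishes, confirm a resulting Parquet dataset is readable.
python3 -c "import pyarrow.dataset as ds; print(ds.dataset('s3://example-bucket/parquet/dataset_a', format='parquet').head(5))"
```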