TCGA Curation Pipeline Design
This design document is now obsolete
Design Goals
Allow scientists to write the workflow activities in their preferred language
approach: further develop the Synapse R client and the service APIs that it utilizes
Ensure that scientists can use the same code as workflow activities AND also for one-off tasks they want to run on their laptops
approach: use a particular input/output parameter scheme for all R scripts
approach: R scripts have no direct dependencies upon Amazon's Simple Workflow Service, they only depend upon Synapse
Ensure that the workflow is scalable to many nodes running concurrently
approach: use Amazon's Simple Workflow system
Minimize the amount of workflow decision logic needed in non-R code
approach: keep the complicated logic about whether a particular script should run on a particular piece of source data out of Java; instead, pass all source data to every R script and let the script decide whether or not to work on the data
Steps in the TCGA Workflow
Workflow Scaling
Workflow Architecture
Preliminary R Script API
To invoke a script locally:
R --vanilla --args --username 'nicole.deflaux@sagebase.org' --password XXXXX --datasetId 543 --layerId 544 < createMatrix.R
Script workflow output to STDOUT:
blah blah, this is ignored ...
SynapseWorkflowResult_START
{"layerId":560}
SynapseWorkflowResult_END
blah blah, this is ignored too ...
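A minimal sketch of how the workflow harness might extract the script result from captured STDOUT (shown in Python for illustration; the harness itself could be Java, and the function name is hypothetical — only the sentinel markers come from the scheme above):

```python
import json

START = "SynapseWorkflowResult_START"
END = "SynapseWorkflowResult_END"

def parse_workflow_result(stdout_text):
    """Extract the JSON result between the sentinel markers.

    Everything outside the markers is ignored, so scripts are free to
    print arbitrary diagnostics to STDOUT.
    """
    start = stdout_text.index(START) + len(START)
    end = stdout_text.index(END, start)
    return json.loads(stdout_text[start:end])

captured = """blah blah, this is ignored ...
SynapseWorkflowResult_START
{"layerId":560}
SynapseWorkflowResult_END
blah blah, this is ignored too ..."""

result = parse_workflow_result(captured)
```

Because only the delimited region is parsed, scripts need no awareness of the harness beyond printing these two marker lines.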
Details
Identify new or updated TCGA Datasets
Details:
Input: n/a
Poll http://tcga-data.nci.nih.gov/tcga/ for currently available datasets and their last-modified dates
We'll always look for updates within a configurable number of days
Any repeat processing should be safe to do because we will ensure this pipeline is idempotent
But to reduce costs, we should abort repeat work when it is identified further down the pipeline
Note that without integrity constraints on creation operations (e.g., no two datasets can have the same name, no two layers in a dataset can have the same name), we could wind up with duplicate datasets and layers due to race conditions if we do not have our timeouts set correctly for workflow operations.
Compare this list to the TCGA datasets held in Synapse
for each new dataset
create the dataset metadata in Synapse
raise an error upon duplicate dataset
Be sure that the permissions on the dataset are such that the public cannot see it but Sage Scientists can. We cannot redistribute TCGA datasets via the Sage Commons.
crawl the download site to identify the layers
proceed with the workflow for each layer
for each updated layer in an existing dataset
crawl the download site to identify the updated/new layers
proceed with the workflow for the updated/new layer(s)
raise an error if we find multiple Synapse datasets for the TCGA dataset
Just get layer URLs for clinical and array data; sequence data will come later
bcr/ contains clinical info
cgcc/ contains array platform data
Output: one or more of (Synapse dataset id, TCGA layer download URL) or error
Short Term Deliverables:
Focus on just the Colon adenocarcinoma COAD dataset for now
Create the Synapse dataset by hand
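The polling/comparison logic above might look like the following sketch. All data here is illustrative: the two dictionaries stand in for the result of crawling the TCGA site and querying Synapse, and the function name is made up:

```python
import datetime

# Stand-ins for crawl/query results: dataset name -> relevant date.
tcga_datasets = {
    "coad": datetime.date(2011, 5, 1),   # last modified on the TCGA site
    "gbm": datetime.date(2011, 4, 2),
}
synapse_datasets = {
    "coad": datetime.date(2011, 4, 15),  # last time we processed it
}

LOOKBACK_DAYS = 30  # configurable window; repeat work is safe (idempotent)

def datasets_to_process(today):
    """Split TCGA datasets into new ones and ones needing an update pass."""
    cutoff = today - datetime.timedelta(days=LOOKBACK_DAYS)
    new, updated = [], []
    for name, modified in sorted(tcga_datasets.items()):
        if name not in synapse_datasets:
            new.append(name)
        elif modified > synapse_datasets[name] or modified >= cutoff:
            # Modified since our last pass, or within the lookback window;
            # duplicate work further down the pipeline aborts cheaply.
            updated.append(name)
    return new, updated

new, updated = datasets_to_process(datetime.date(2011, 5, 10))
```

The lookback window deliberately errs toward repeat processing, relying on the idempotency guarantee stated above to make that safe.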
Create Synapse record for the source layer from TCGA
Details
Input: (Synapse dataset id, TCGA layer download URL)
Search Synapse to see if this layer already exists
assumption: due to the way TCGA formulates its paths, the path portion of the TCGA layer URL contains sufficient information to determine the type (clinical vs. expression) and the make/model of the platform, and thus to formulate sufficient metadata for the layer
raise an error if we find multiple matching layers
formulate the layer metadata for this layer from the TCGA URL
if the layer exists in Synapse
if all fields of the metadata match
skip to the end of this task
else the metadata does not match
raise an error (our scheme for formulating layer metadata may have changed?)
else the layer does not exist
create new layer in Synapse using the layer metadata
raise an error upon duplicate layer
Output: (Synapse dataset id, Synapse layer id, TCGA layer download URL) or error
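Formulating layer metadata from the URL path could be sketched as follows. The path layout (`.../tumor/<disease>/<center-type>/<center>/<platform>/<data-type>/`) is inferred from the example COAD URL later in this document, and the metadata field names are illustrative, not the real Synapse schema:

```python
from urllib.parse import urlparse

def layer_metadata_from_url(url):
    """Derive candidate layer metadata from a TCGA download URL.

    Assumes the TCGA path convention
    .../tumor/<disease>/<center-type>/<center>/<platform>/<data-type>/
    where center type "bcr" holds clinical data and "cgcc" holds
    array data (per the notes in the dataset-identification step).
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    i = parts.index("tumor")
    disease, center_type, center, platform, data_type = parts[i + 1:i + 6]
    return {
        "disease": disease,
        "type": "clinical" if center_type == "bcr" else "expression",
        "center": center,
        "platform": platform,
        "dataType": data_type,
    }

md = layer_metadata_from_url(
    "http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/"
    "anonymous/tumor/coad/cgcc/unc.edu/agilentg4502a_07_3/transcriptome/"
)
```

If the derived metadata disagrees with an existing Synapse layer, we raise an error rather than silently overwrite, per the step above.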
Download the dataset layer from TCGA to S3
Details:
Input: (Synapse dataset id, Synapse layer id, TCGA layer download URL)
Download the layer MD5 checksum from TCGA first
Get the layer metadata from Synapse to see if we already have an S3 location and MD5 for this layer id
if we have an MD5 from Synapse
if the Synapse MD5 matches the TCGA MD5
skip to the end of this task because we already uploaded this file
else the MD5 does not match
bump the version number in our client-side copy of the layer metadata
Download the layer data file from TCGA
Compute its MD5 and compare it to the MD5 we got from TCGA
if they do not match, we had a download error, return an error (we'll try again later)
Ask Synapse for a pre-signed URL to which to put the layer
Arguments to this call would include
dataset id
layer id
naming convention for TCGA layer S3 URLs
use a subdirectory in the S3 bucket tcga/source for source data from TCGA
perhaps the filename is tcga/source concatenated with the path portion of the TCGA download URL; TCGA has spent a long time thinking about their URL conventions, so we might consider reusing their scheme
Synapse layer version number
NOTE: if something already exists in S3 at that location, it is because the workflow had a failure after the S3 upload step but before the metadata update step. Perhaps be conservative at first and just raise an error; later on, if/when we feel more confident, we can just overwrite it.
look at S3 API to see if there is a create-only flag
Upload the layer to S3
Update the layer metadata in Synapse with the new S3 URL, the new MD5, and (if applicable) the bumped version number
Output: (Synapse dataset id, Synapse layer id) or error
Short Term Deliverables:
Pull down levels 1, 2, and 3 of Agilent expression data from http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/coad/cgcc/unc.edu/agilentg4502a_07_3/transcriptome/; skip all other URLs
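The MD5-based skip/version decision in this step could be sketched as below; the function names and the (upload, bump_version) return convention are inventions of this sketch, not part of Synapse:

```python
import hashlib

def needs_upload(tcga_md5, synapse_md5):
    """Decide whether the layer file must be (re)uploaded to S3.

    Returns (upload, bump_version): skip the upload when Synapse already
    holds a file with the same MD5; bump the layer version when the
    checksum changed on the TCGA side.
    """
    if synapse_md5 is None:
        return True, False   # first time we see this layer
    if synapse_md5 == tcga_md5:
        return False, False  # already uploaded, nothing to do
    return True, True        # source changed: new version

def md5_of_file(path):
    """Compute the MD5 of a downloaded file, streaming to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the locally computed MD5 against the one published by TCGA catches download corruption before anything touches S3 or Synapse.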
Process the TCGA dataset layer
TODO more details here. Generally, we want to create matrices with patient_sample id as the column header and gene/probe id/etc. as the row header.
We will probably have different tasks here depending upon the type of the layer.
Details:
Input: (Synapse dataset id, Synapse source layer id)
Search Synapse for analysis result layers matching this, uniquely identified by
source layer(s) and their versions
analysis method and code version
if this result already exists
skip to the end of this task
TODO currently assuming we want a different layer when the source data changes; an alternative would be to create a new version of this layer
Future enhancement: co-locate this worker with the download worker to save time by skipping the download from S3
Download the layer and any metadata needed to make the matrix
the patient_sample ids are likely stored in a different layer
if so, we would return an error here if that layer is not available yet and retry this task later
Make the matrix
this could fail for a variety of reasons so do this part before creating any objects in Synapse for the new analysis result
Formulate the layer metadata
compute the MD5 for the matrix
create the new layer in Synapse for this analysis result
raise an error upon duplicate layer
Ask Synapse for a pre-signed URL to which to put the matrix layer; arguments to this call would include
dataset id
layer id
TODO what should our naming convention be here?
Upload the matrix to S3
Update the layer metadata in Synapse with the new S3 URL, the new MD5, and (if applicable) the bumped version number
Output: (Synapse dataset id, Synapse analysis result layer id) or error
Deliverables:
Once this step is complete, the layers are visible to Sage Scientists via the Web UI and when searching with the R client.
The layer can be downloaded via the Web UI and the R client.
We might also consider adding one more task to the workflow to just pull a copy of the data from Synapse to Sage's local shared storage
Short Term Deliverables:
a matrix of id to probe for Level 2 expression data
a matrix of id to gene for Level 3 expression data
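The pivot described in this step (patient_sample ids as columns, probe/gene ids as rows) might look like the sketch below. The record format and the probe/sample ids are purely illustrative stand-ins for whatever the per-layer parser actually emits:

```python
def make_matrix(records):
    """Pivot per-sample measurements into a matrix with patient_sample
    ids as column headers and probe ids as row headers.

    `records` is a list of (patient_sample_id, probe_id, value) tuples.
    Missing (sample, probe) combinations become None.
    """
    samples = sorted({r[0] for r in records})
    probes = sorted({r[1] for r in records})
    values = {(s, p): v for s, p, v in records}
    header = ["probe_id"] + samples
    rows = [[p] + [values.get((s, p)) for s in samples] for p in probes]
    return [header] + rows

records = [
    ("TCGA-AA-3556", "A_23_P100001", 7.1),
    ("TCGA-AA-3556", "A_23_P100011", 5.3),
    ("TCGA-AA-3561", "A_23_P100001", 6.8),
]
matrix = make_matrix(records)
```

Level 2 data would use probe ids for the rows and Level 3 would use gene ids, but the pivot itself is the same.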
Formulate notification message
Input: (Synapse dataset id, Synapse analysis result layer id)
query Synapse for dataset and layer metadata to formulate a decent email
query Synapse for dataset followers
Output: one or more of (follower, email body) or error
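A sketch of producing the (follower, email body) pairs; the metadata field names and the body text are illustrative, not the real Synapse schema:

```python
def formulate_notifications(dataset, layer, followers):
    """Build one (follower, email body) pair per dataset follower.

    `dataset` and `layer` are metadata dicts as a Synapse query might
    return them; only the "name" field is used here.
    """
    body = (
        "A new layer is available in Synapse.\n"
        "Dataset: {d}\n"
        "Layer: {l}\n".format(d=dataset["name"], l=layer["name"])
    )
    return [(follower, body) for follower in followers]

notifications = formulate_notifications(
    {"name": "Colon adenocarcinoma COAD"},
    {"name": "Agilent expression, level 2"},
    ["scientist@sagebase.org"],
)
```

Keeping formulation separate from sending lets the later consolidation/dedup step operate on the formulated messages.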
Notify via email all individuals following TCGA datasets of the new layer(s)
Input: one or more of (follower, email body) or error
send the email
later on, if the volume is high, consolidate and dedupe messages
Output: success or error
Next Steps
enable scientists to drop off TCGA raw data scripts for the workflow in a particular location
run the next script upon all previously processed TCGA source data
going forward, this script is in the collection of scripts run for each new raw data file found
enable scientists to direct the course of the workflow via an output parameter naming the next script to run
Technologies to Consider
Languages: a combination of Python, R, and Perl (legacy scripts only) should do the trick. The tasks seem amenable to scripting languages (as opposed to Java), and it is a design goal to keep each task small and understandable.
Workflow
Remember to ensure that each workflow step is idempotent.