Steps in the workflow
Identify new or updated TCGA Datasets
Details:
- Input: n/a
- Poll http://tcga-data.nci.nih.gov/tcga/ for currently available datasets and their last modified date
- We'll always look for updates within a configurable number of days
- Any repeat processing should be safe to do because we will ensure this pipeline is idempotent
- But to reduce costs, we should abort repeat work when it is identified further down the pipeline
- Compare this list to the TCGA datasets held in Synapse
  - raise an error if we find multiple Synapse datasets matching a single TCGA dataset
  - for each new dataset
    - create the dataset metadata in Synapse
      - Be sure that the permissions on the dataset are such that the public cannot see it but Sage Scientists can. We cannot redistribute TCGA datasets via the Sage Commons.
    - crawl the download site to identify the layers
    - proceed with the workflow for each layer
  - for each updated layer in an existing dataset
    - crawl the download site to identify the updated/new layers
    - proceed with the workflow for the updated/new layer(s)
- Just get layer URLs for clinical and array data; sequence data will come later
  - bcr/ has clinical info
  - cgcc/ has array platforms
- Output: one or more of (Synapse dataset id, TCGA layer download URL) or error
Short Term Deliverables:
- Focus on just the Colon adenocarcinoma (COAD) dataset for now
- Create the Synapse dataset by hand
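The comparison step above can be sketched as a pure function. This is an illustration only: the shape of the scraped listing, the function name, and the default lookback window are all assumptions.

```python
from datetime import datetime, timedelta

def classify_datasets(tcga_listing, synapse_names, lookback_days=30, now=None):
    """Split the TCGA listing into new and recently-updated datasets.

    tcga_listing: list of (dataset_name, last_modified datetime) tuples
        scraped from the TCGA download site.
    synapse_names: set of TCGA dataset names already recorded in Synapse.
    lookback_days: the configurable window for picking up updates.
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=lookback_days)
    new, updated = [], []
    for name, last_modified in tcga_listing:
        if name not in synapse_names:
            new.append(name)
        elif last_modified >= cutoff:
            # Re-processing is safe because the pipeline is idempotent, but
            # we only pick up datasets touched within the lookback window.
            updated.append(name)
    return new, updated
```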
Create Synapse record for the source layer from TCGA
Details:
- Input: (Synapse dataset id, TCGA layer download URL)
- Search Synapse to see if this layer already exists
  - assumption: the path portion of the TCGA layer URL carries enough information (given the way TCGA formulates its paths) to determine the type (clinical vs. expression) and the make/model of the platform, so we have sufficient metadata for the layer; we can always add more layer metadata at another step
  - raise an error if we find multiple matching layers
- formulate the layer metadata for this layer from the TCGA URL
- if the layer exists in Synapse
  - if all fields of the metadata match
    - skip to the end of this task
  - else the metadata does not match
    - raise an error (our scheme for formulating layer metadata may have changed?)
- else the layer does not exist
  - create a new layer in Synapse using the layer metadata
- Output: (Synapse dataset id, Synapse layer id, TCGA layer download URL) or error
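Formulating layer metadata from the URL could look like the sketch below. It assumes TCGA's `.../tumor/<disease>/<center-type>/<center>/<platform>/<data-type>/...` path convention; the field positions and metadata key names are assumptions based on the COAD URLs in this document.

```python
from urllib.parse import urlparse

def layer_metadata_from_url(tcga_url):
    """Derive layer metadata from the path portion of a TCGA download URL.

    Relies on the segments that follow "tumor" in TCGA's path convention;
    the exact positions assumed here should be verified against more URLs.
    """
    parts = urlparse(tcga_url).path.strip("/").split("/")
    anchor = parts.index("tumor")  # metadata-bearing segments follow "tumor"
    disease, center_type, center, platform, data_type = parts[anchor + 1:anchor + 6]
    return {
        "disease": disease,
        # bcr/ holds clinical info, cgcc/ holds array platforms
        "type": "clinical" if center_type == "bcr" else "expression",
        "center": center,
        "platform": platform,
        "dataType": data_type,
    }
```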
Download the dataset layer from TCGA to S3
Details:
- Input: (Synapse dataset id, Synapse layer id, TCGA layer download URL)
- Download the layer MD5 checksum from TCGA first
- Get the layer metadata from Synapse to see if we already have an S3 location and MD5 for this layer id
- if we have an MD5 from Synapse
  - if the Synapse MD5 matches the TCGA MD5
    - skip to the end of this task because we already uploaded this file
  - else the MD5 does not match
    - bump the version number in our client-side copy of the layer metadata
- Download the layer data file from TCGA
- Compute its MD5 and compare it to the MD5 we got from TCGA
- if they do not match, we had a download error, return an error (we'll try again later)
- Ask Synapse for a pre-signed URL to which to put the layer
  - Arguments to this call would include:
    - dataset id
    - layer id
    - naming convention for TCGA layer S3 URLs
      - use a subdirectory in the S3 bucket, tcga/source, for source data from TCGA
      - perhaps the filename is tcga/source concatenated with the path portion of the TCGA download URL; TCGA has spent a long time thinking about their URL conventions, so we might consider reusing their scheme
    - Synapse layer version number
  - NOTE: if something already exists in S3 at that location, it is because the workflow failed after the S3 upload step but before the metadata update step; be conservative at first and just raise an error, and later, if/when we feel more confident, we can simply overwrite it
    - look at the S3 API to see if there is a create-only flag
- Upload the layer to S3
- Update the layer metadata in Synapse with the new S3 URL, the new MD5, and perhaps the bumped version number
- Output: (Synapse dataset id, Synapse layer id) or error
Short Term Deliverables:
- Pull down levels 1, 2, and 3 of the Agilent expression data from http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/coad/cgcc/unc.edu/agilentg4502a_07_3/transcriptome/ and skip all other URLs
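The checksum and versioning decisions in this step can be sketched as small pure functions. The function names and the `.v<N>` key suffix are assumptions; the document only says the version number is one argument to the pre-signed-URL call.

```python
import hashlib
from urllib.parse import urlparse

def md5_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large TCGA layers need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def plan_download(synapse_md5, tcga_md5, version):
    """Decide what to do for a layer given the Synapse and TCGA checksums.

    Returns ("skip", version) when Synapse already holds this exact file,
    or ("upload", version) with a bumped version when TCGA has new data.
    """
    if synapse_md5 is not None:
        if synapse_md5 == tcga_md5:
            return "skip", version      # we already uploaded this file
        return "upload", version + 1    # new data from TCGA: bump the version
    return "upload", version            # first time we see this layer

def s3_key_for_layer(tcga_url, version):
    """Reuse TCGA's own path convention under a tcga/source prefix.

    Appending ".v<N>" for the Synapse layer version is an assumption.
    """
    path = urlparse(tcga_url).path.lstrip("/")
    return "tcga/source/%s.v%d" % (path, version)
```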
Process the TCGA dataset layer
TODO: more details here. Generally, we want to create matrices with patient_sample id as the column header and gene/probe id/etc. as the row header.
We will probably have different tasks here depending upon the type of the layer.
Details:
- Input: (Synapse dataset id, Synapse layer id)
- Future enhancement: co-locate this worker with the download worker to save time by skipping the download from S3
- Download the layer and any metadata needed to make the matrix
- the patient_sample ids are stored in a different layer I think
- if that's right, we would return an error here if that layer is not available yet and retry this task later
- Make the matrix
- Ask Synapse for a pre-signed URL to which to put the matrix layer; arguments to this call would include:
  - dataset id
  - layer type (an analysis result)
  - path portion of the source URL? (many source layers may be involved in a particular analysis result)
- Upload it to S3
- Output: (Synapse dataset id, Synapse layer id, S3 url of matrix) or error
Short Term Deliverables:
- a matrix of id to probe for Level 2 expression data
- a matrix of id to gene for Level 3 expression data
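Until the processing details are pinned down, here is a minimal sketch of the pivot, assuming the parsed layer yields (row id, patient_sample id, value) triples; the tab-delimited output and the NA fill value are assumptions.

```python
import csv
import io

def make_matrix(records):
    """Pivot (row_id, sample_id, value) records into a tab-delimited matrix.

    row_id is a probe id (Level 2) or gene symbol (Level 3); sample_id is
    the patient_sample id, which likely comes from a companion layer.
    Missing cells are written as NA.
    """
    samples = sorted({sample for _, sample, _ in records})
    rows = {}
    for row_id, sample, value in records:
        rows.setdefault(row_id, {})[sample] = value
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["id"] + samples)          # patient_sample ids as columns
    for row_id in sorted(rows):                # probes/genes as rows
        writer.writerow([row_id] + [rows[row_id].get(s, "NA") for s in samples])
    return out.getvalue()
```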
Create Synapse record for processed TCGA dataset layer
- Input: (Synapse dataset id, Synapse layer id, S3 url of matrix) or error
- create the new layer metadata for this matrix in synapse
- handle dups
- handle new versions
- Output: (dataset id, matrix layer id) or error
Deliverables:
- Once this step is complete, the layers are visible via the Web UI and searchable with the R client.
- The layer can be downloaded via the Web UI and the R client.
- We might also consider adding one more task to the workflow to just pull a copy of the data from Synapse to Sage's local shared storage
Formulate notification message
- Input: (dataset id, matrix layer id)
- query synapse for dataset and layer metadata to formulate a decent email
- query synapse for dataset followers
- Output: one or more of (follower, email body) or error
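Formulating the messages could look like the sketch below; the metadata field names ("name", "type") and the body template are assumptions about what the Synapse queries return.

```python
def formulate_notifications(dataset, layer, followers):
    """Build one (follower, email body) pair per dataset follower.

    dataset and layer are metadata dicts as returned by the Synapse
    queries; every follower gets the same body for a given layer.
    """
    body = (
        "A new layer is available in Synapse.\n\n"
        "Dataset: %s\n"
        "Layer: %s (%s)\n" % (dataset["name"], layer["name"], layer["type"])
    )
    return [(follower, body) for follower in followers]
```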
Notify via email all individuals following TCGA datasets of the new layer(s)
- Input: one or more of (follower, email body) or error
- send the email
- later on, if the volume is high, consolidate and dedup messages
- Output: success or error
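The later consolidate-and-dedup step could be as simple as the sketch below; the digest separator and function name are assumptions.

```python
def consolidate_and_dedup(messages):
    """Collapse pending (follower, body) pairs into one email per follower.

    Duplicate bodies for the same follower are dropped; distinct bodies
    are joined into a single digest message.
    """
    by_follower = {}
    for follower, body in messages:
        by_follower.setdefault(follower, []).append(body)
    return [
        # dict.fromkeys dedups while preserving the original order
        (follower, "\n---\n".join(dict.fromkeys(bodies)))
        for follower, bodies in by_follower.items()
    ]
```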
Technologies to Consider
Languages: A combination of Python, R, and Perl (legacy scripts only) should do the trick. This work seems amenable to scripting languages (as opposed to Java). It is a design goal to keep each task small and understandable.
Workflow
- AWS's Simple Workflow Service
- Taverna
Remember to ensure that each workflow step is idempotent.