
TCGA Curation Pipeline Design


Steps in the workflow

Identify new or updated TCGA Datasets

Details:

  • Input: n/a
  • Poll http://tcga-data.nci.nih.gov/tcga/ for currently available datasets and their last modified date
    • We'll always look for updates within a configurable number of days
    • Any repeat processing should be safe to do because we will ensure this pipeline is idempotent
    • But to reduce costs, we should abort repeat work when it is identified further down the pipeline
  • Compare this list to the TCGA datasets held in Synapse (the discovery logic is sketched below)
    1. for each new dataset
      1. create the dataset metadata in Synapse
        • Be sure that the permissions on the dataset are such that the public cannot see it but Sage Scientists can. We cannot redistribute TCGA datasets via the Sage Commons.
      2. crawl the download site to identify the layers
      3. proceed with the workflow for each layer
    2. for each updated layer in an existing dataset
      1. crawl the download site to identify the updated/new layers
      2. proceed with the workflow for the updated/new layer(s)
  • Just get layer URLs for clinical and array data; sequence data will come later
    • bcr/ has clinical info
    • cgcc/ has array platform data
  • Output: one or more of (Synapse dataset id, TCGA layer download uri) or error
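
A minimal sketch of the discovery logic, assuming the crawler has already produced two dictionaries (their exact shape is an assumption here, not a settled interface): tcga_listing maps a dataset name to its last-modified date and discovered layer URIs, and synapse_index maps a dataset name to its Synapse dataset id.

    from datetime import datetime, timedelta

    UPDATE_WINDOW_DAYS = 14  # configurable look-back window

    def find_layer_work(tcga_listing, synapse_index, now=None):
        """Yield (synapse_dataset_id_or_None, layer_download_uri) work items.

        A None dataset id means the Synapse dataset still has to be created
        (with non-public, Sage-scientist-only permissions) before its layers
        are processed.
        """
        now = now or datetime.utcnow()
        cutoff = now - timedelta(days=UPDATE_WINDOW_DAYS)
        for name, info in tcga_listing.items():
            if name not in synapse_index:
                # Brand-new dataset: every discovered layer is work.
                for uri in info["layer_uris"]:
                    yield (None, uri)
            elif info["modified"] >= cutoff:
                # Recently updated dataset: re-queueing a layer is safe because
                # downstream steps are idempotent and abort repeat work.
                for uri in info["layer_uris"]:
                    yield (synapse_index[name], uri)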

Short Term Deliverables:

  • Focus on just the Colon adenocarcinoma (COAD) dataset for now
  • Create the Synapse dataset by hand

Download the dataset layers to S3

Details:

  • Input: (Synapse dataset id, TCGA layer download uri)
  • Download the layer MD5 checksum from TCGA first
    • see if we already have a corresponding layer with that checksum stored in Synapse; if so, skip to the end of this step
  • Download the layer data file from TCGA
  • Compute its MD5 and compare it to the checksum downloaded from TCGA; on a mismatch, fail with an error
  • Upload the layer data file to a subdirectory of the S3 bucket; tcga/source holds source data from TCGA
    • the S3 key is tcga/source concatenated with the path portion of the TCGA download URL (see the sketch below)
  • Output: (Synapse dataset id, layer md5, S3 layer uri) or error
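
A sketch of the checksum verification and S3 key construction for this step. The sidecar "<layer uri>.md5" convention for fetching the checksum and the boto3 upload call are assumptions; the key layout follows the tcga/source-plus-URL-path convention described above.

    import hashlib
    import urllib.parse
    import urllib.request

    def download_and_verify(layer_uri, local_path):
        """Download a TCGA layer, verify it against its published MD5, and
        return (md5_hex, s3_key)."""
        # Assumes the checksum is published alongside the archive as <uri>.md5
        # in the usual "<hex>  <filename>" format.
        expected = urllib.request.urlopen(layer_uri + ".md5").read().split()[0].decode()

        md5 = hashlib.md5()
        with urllib.request.urlopen(layer_uri) as src, open(local_path, "wb") as dst:
            for chunk in iter(lambda: src.read(1 << 20), b""):
                md5.update(chunk)
                dst.write(chunk)
        if md5.hexdigest() != expected:
            raise ValueError("MD5 mismatch for %s" % layer_uri)

        # S3 key = tcga/source prefix + path portion of the TCGA download URL.
        s3_key = "tcga/source" + urllib.parse.urlparse(layer_uri).path
        # Upload would then be something like (bucket name is a placeholder):
        #   boto3.client("s3").upload_file(local_path, "sagebio-tcga", s3_key)
        return md5.hexdigest(), s3_key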

Short Term Deliverables:

Create Synapse record for the source Layers from TCGA

Details:

  • Input: (Synapse dataset id, layer md5, S3 layer uri)
  • Search Synapse to see if this layer already exists
  • if the layer exists
    • if the md5sum matches
      • skip to the end of this task
    • else the md5 does not match
      • bump the version number of the layer and set the new S3 URL (the full create-or-version decision is sketched below)
  • else the layer does not exist
    • create new layer metadata in Synapse
  • Output: (Synapse dataset id, Synapse layer id) or error
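
A minimal sketch of the create-or-version decision. The three callables are hypothetical stand-ins for whatever Synapse client calls we end up using; only the branching logic is the point here.

    def record_source_layer(dataset_id, layer_name, layer_md5, s3_uri,
                            find_layer, create_layer, bump_layer_version):
        """Return the Synapse layer id for (dataset_id, layer_name).

        find_layer(dataset_id, layer_name) -> {"id": ..., "md5": ...} or None
        create_layer(dataset_id, layer_name, md5, s3_uri) -> layer id
        bump_layer_version(layer_id, md5, s3_uri) -> layer id (new version)
        """
        existing = find_layer(dataset_id, layer_name)
        if existing is None:
            # Layer not in Synapse yet: create fresh metadata.
            return create_layer(dataset_id, layer_name, layer_md5, s3_uri)
        if existing["md5"] == layer_md5:
            # Same content already recorded: idempotent no-op.
            return existing["id"]
        # Content changed: bump the version and point it at the new S3 location.
        return bump_layer_version(existing["id"], layer_md5, s3_uri)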

Process the TCGA dataset layer

TODO: more details here. But generally we want to create matrices with patient_sample IDs as the column headers and gene/probeId/etc. as the row headers (a sketch of the matrix construction follows the details below).

We will probably have different tasks here depending upon the type of the layer.

Details:

  • Input: (Synapse dataset id, Synapse layer id)
  • Future enhancement: co-locate this worker with the download worker to save time by skipping the download from S3
  • Download the layer and any metadata needed to make the matrix
    • the patient_sample ids are probably stored in a different layer (to be confirmed)
    • if so, return an error here when that layer is not yet available and retry this task later
  • Make the matrix
  • Upload it to S3
  • Output: (Synapse dataset id, Synapse layer id, S3 url of matrix) or error
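
A sketch of the matrix construction, assuming the layer has already been parsed into (patient_sample id, probe/gene id, value) triples; real Level 2/3 files will need format-specific parsing first, and the tab-delimited output format is an assumption.

    import csv

    def write_matrix(records, out_path):
        """records: iterable of (patient_sample_id, row_id, value) triples.
        Writes a tab-delimited matrix with patient_sample ids as column
        headers and probe/gene ids as row headers."""
        values = {}    # (row_id, sample_id) -> value
        samples = {}   # insertion-ordered "sets" of ids seen so far
        rows = {}
        for sample_id, row_id, value in records:
            samples[sample_id] = None
            rows[row_id] = None
            values[(row_id, sample_id)] = value

        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f, delimiter="\t")
            writer.writerow([""] + list(samples))
            for row_id in rows:
                writer.writerow([row_id] + [values.get((row_id, s), "NA") for s in samples])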

Short Term Deliverables:

  • a matrix of patient_sample id by probe for Level 2 expression data
  • a matrix of patient_sample id by gene for Level 3 expression data

Create Synapse record for Processed TCGA dataset layer

  • Input: (Synapse dataset id, Synapse layer id, S3 url of matrix)
  • create the new layer metadata for this matrix in Synapse (see the note below)
    • handle duplicates
    • handle new versions
  • Output: (dataset id, matrix layer id) or error
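
This step can reuse the create-or-version sketch from the source-layer step; for example (the ".matrix" layer-name suffix and the helper names are assumptions):

    matrix_layer_id = record_source_layer(
        dataset_id, source_layer_name + ".matrix", matrix_md5, matrix_s3_url,
        find_layer, create_layer, bump_layer_version)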

Deliverables:

  • Once this step is complete, the layers are visible via the Web UI and in search results from the R client.
  • The layer can be downloaded via the Web UI and the R client.
  • We might also consider adding one more task to the workflow to pull a copy of the data from Synapse to Sage's local shared storage.

Formulate notification message

  • Input: (dataset id, matrix layer id)
  • query Synapse for dataset and layer metadata to formulate a decent email (a templating sketch follows)
  • query Synapse for the dataset's followers
  • Output: one or more of (follower, email body) or error
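
A sketch of the message formulation. The metadata field names and the follower record shape are assumptions about what the Synapse queries will return.

    EMAIL_TEMPLATE = """\
    A new layer is available in Synapse.

    Dataset: {dataset_name} ({dataset_id})
    Layer:   {layer_name} ({layer_id})

    You can view and download it via the Synapse Web UI or the R client.
    """

    def formulate_messages(dataset, layer, followers):
        """Yield (follower_email, email_body) pairs."""
        body = EMAIL_TEMPLATE.format(
            dataset_name=dataset["name"], dataset_id=dataset["id"],
            layer_name=layer["name"], layer_id=layer["id"])
        for follower in followers:
            yield (follower["email"], body)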

Notify via email all individuals following TCGA datasets of the new layer(s)

  • Input: one or more of (follower, email body)
  • send the email (see the sketch below)
  • later on, if the volume is high, consolidate and dedupe messages
  • Output: success or error
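
A sketch of the send step using Python's standard smtplib; the SMTP host, From address, and subject line are placeholders, not decided configuration.

    import smtplib
    from email.mime.text import MIMEText

    SMTP_HOST = "localhost"                  # placeholder relay
    FROM_ADDR = "tcga-pipeline@example.org"  # placeholder sender

    def send_notifications(messages, subject="New TCGA layer in Synapse"):
        """messages: iterable of (follower_email, email_body) pairs."""
        with smtplib.SMTP(SMTP_HOST) as smtp:
            for to_addr, body in messages:
                msg = MIMEText(body)
                msg["Subject"] = subject
                msg["From"] = FROM_ADDR
                msg["To"] = to_addr
                smtp.send_message(msg)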

Technologies to Consider

Languages: A combination of Python, R, and Perl (legacy scripts only) should do the trick. This work seems amenable to scripting languages (as opposed to Java). It is a design goal to keep each task small and understandable.

Workflow

  • AWS's Simple Workflow Service
  • Taverna

Remember to ensure that each workflow step is idempotent.
