TCGA Curation Pipeline Design


Steps in the workflow

Identify new or updated TCGA Datasets

Details:

  • Input: n/a
  • Poll http://tcga-data.nci.nih.gov/tcga/ for currently available datasets and their last-modified dates (see the polling sketch after this list)
    • We'll always look for updates within a configurable number of days; any repeat processing will be aborted further down the pipeline
  • Compare this list to the TCGA datasets held in Synapse
    1. for each new dataset
      1. create the dataset metadata in Synapse
      2. crawl the download site to identify the layers
      3. proceed with the workflow for each layer
    2. for each existing dataset with updated layers
      1. crawl the download site (http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/) to identify the updated/new layers
      2. proceed with the workflow for the updated/new layer(s)
  • Output: (Synapse dataset id, TCGA layer download uri)
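
Below is a hedged sketch of the polling step in Python. The Apache-style directory listing, the date format in the regular expression, and the 7-day lookback constant are all assumptions to be checked against the live site, and the Synapse lookup is left as a stub.

```python
# Hedged sketch (assumptions: Apache-style listing HTML, date format, 7-day window).
import re
import urllib.request
from datetime import datetime, timedelta

TUMOR_ROOT = ("http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/"
              "distro_ftpusers/anonymous/tumor/")
LOOKBACK_DAYS = 7  # the "configurable number of days" from the step above


def list_entries(url):
    """Yield (name, last_modified) pairs scraped from a directory listing page."""
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    # Assumed row format: <a href="coad/">coad/</a>   03-May-2011 14:22   -
    pattern = r'<a href="([^"]+/)">[^<]*</a>\s+(\d{2}-\w{3}-\d{4} \d{2}:\d{2})'
    for name, stamp in re.findall(pattern, html):
        yield name, datetime.strptime(stamp, "%d-%b-%Y %H:%M")


def find_recent_datasets(lookback_days=LOOKBACK_DAYS):
    """Emit (dataset_name, download_uri) for datasets touched within the window."""
    cutoff = datetime.utcnow() - timedelta(days=lookback_days)
    for name, modified in list_entries(TUMOR_ROOT):
        if modified >= cutoff:
            # Here the real pipeline would look up or create the Synapse dataset
            # and hand (synapse_dataset_id, layer_download_uri) to the next step.
            yield name.rstrip("/"), TUMOR_ROOT + name


if __name__ == "__main__":
    for dataset, uri in find_recent_datasets():
        print(dataset, uri)
```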

Short Term Deliverables:

  • Focus on just the Colon adenocarcinoma (COAD) dataset for now
  • Create the Synapse dataset by hand

Download the dataset layers to S3

Details:

  • Input: (Synapse dataset id, TCGA layer download uri)
  • Just pull down clinical and array data; sequence data will come later
    • bcr/ has clinical info
    • cgcc/ has array platform data
  • Get the MD5 checksum first
    • check whether we already have that checksum stored; if so, skip to the end of this step
  • Download the data file
  • Compute its MD5 and compare it to the published checksum
  • Upload the file first and the checksum second to a subdirectory of the S3 bucket, under tcga/source (the area for source data from TCGA); see the sketch after this list
    • the S3 key is tcga/source concatenated with the path portion of the TCGA download URL
    • if a file of that name already exists in S3
  • Output: (Synapse dataset id, S3 layer md5 uri, S3 layer uri)
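
A rough sketch of the download/verify/upload step follows. The bucket name, local scratch path, and the assumption that the published checksum sits in a companion ".md5" file are placeholders; boto3 is used purely for illustration.

```python
# Hedged sketch: bucket name, local temp path, and the ".md5" companion-file
# convention are assumptions; boto3 is used for illustration only.
import hashlib
import urllib.request
from urllib.parse import urlparse

import boto3
from botocore.exceptions import ClientError

BUCKET = "sagebio-tcga"       # hypothetical bucket name
KEY_PREFIX = "tcga/source"    # prefix for source data pulled from TCGA

s3 = boto3.client("s3")


def s3_key_for(layer_uri):
    """tcga/source concatenated with the path portion of the TCGA download URL."""
    return KEY_PREFIX + urlparse(layer_uri).path


def already_uploaded(key):
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError:
        return False


def download_and_upload(layer_uri, md5_uri):
    """Verify the layer against its published MD5 and push it to S3 (idempotently)."""
    published_md5 = urllib.request.urlopen(md5_uri).read().decode().split()[0]
    key = s3_key_for(layer_uri)
    if already_uploaded(key):                 # repeat work is skipped here
        return key
    data = urllib.request.urlopen(layer_uri).read()
    if hashlib.md5(data).hexdigest() != published_md5:
        raise ValueError("checksum mismatch for %s" % layer_uri)
    local = "/tmp/" + key.replace("/", "_")   # scratch copy for upload_file
    with open(local, "wb") as f:
        f.write(data)
    s3.upload_file(local, BUCKET, key)                      # file first
    s3.put_object(Bucket=BUCKET, Key=key + ".md5",          # checksum second
                  Body=published_md5.encode())
    return key
```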

Short Term Deliverables:

Create Synapse records for the source layers from TCGA

Details:

  • Input: (Synapse dataset id, S3 layer md5 uri, S3 layer uri)

Process the dataset layers

TODO: more details here. But generally we want to create matrices with patient_sample id as the column header and gene/probeId/etc. as the row header (see the sketch after the deliverables below).

Short Term Deliverables:

  • a matrix of id to probe for Level 2 expression data
  • a matrix of id to gene for Level 3 expression data
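
A minimal sketch of the matrix-building step using pandas. The long-format column names ("barcode", "probe_id", "value", "gene_symbol") are assumptions; real TCGA Level 2/3 file layouts vary by platform and need to be inspected first.

```python
# Illustrative only: column names below are assumptions; real TCGA Level 2/3
# file layouts vary by platform and must be inspected before reuse.
import pandas as pd


def build_matrix(long_file, row_key="probe_id", col_key="barcode", value="value"):
    """Pivot a long-format layer file into a (probe or gene) x sample matrix."""
    df = pd.read_csv(long_file, sep="\t")
    return df.pivot_table(index=row_key, columns=col_key, values=value)


# Level 2 deliverable: probes as rows, sample barcodes as columns.
# level2 = build_matrix("COAD_level2.tsv")
# Level 3 deliverable: genes as rows instead.
# level3 = build_matrix("COAD_level3.tsv", row_key="gene_symbol")
```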

Submit the processed dataset to Synapse

Be sure that the permissions on the dataset are such that the public cannot see it but Sage Scientists can. We cannot redistribute TCGA datasets via the Sage Commons.
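
One possible way to express this restriction is shown below with the present-day synapseclient Python package (which postdates this design). The principal IDs and dataset id are placeholders and must be swapped for the real public principal and the Sage Scientists team before use.

```python
# Placeholder-heavy sketch using today's synapseclient package (which postdates
# this design doc); the ids below must be replaced with real values.
import synapseclient

PUBLIC_PRINCIPAL_ID = 0        # placeholder: the Synapse "public" principal
SAGE_SCIENTISTS_TEAM_ID = 0    # placeholder: the Sage Scientists team/group

syn = synapseclient.Synapse()
syn.login()                    # credentials come from the Synapse config file

dataset_id = "syn0000000"      # placeholder: the TCGA dataset created earlier

# Drop any public access, then grant read/download to Sage scientists only.
syn.setPermissions(dataset_id, PUBLIC_PRINCIPAL_ID, accessType=[])
syn.setPermissions(dataset_id, SAGE_SCIENTISTS_TEAM_ID,
                   accessType=["READ", "DOWNLOAD"])
```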

Deliverables:

  • Once this step is complete, the layers are visible via the Web UI and in searches from the R client.
  • The layers can be downloaded via the Web UI and the R client.
  • We might also consider adding one more task to the workflow to pull a copy of the data from Synapse to Sage's local shared storage.

Notify via email all individuals following TCGA datasets of the new layer(s)

Technologies to Consider

Languages: A combination of Python, R, and Perl (legacy scripts only) should do the trick. This work seems amenable to scripting languages (as opposed to Java). It is a design goal to keep each task small and understandable.

Workflow

  • AWS's Simple Workflow Service
  • Taverna

Remember to ensure that each workflow step is idempotent.
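
One simple pattern for that is sketched below: key each step's work on a stable identifier (for example the layer's MD5) and record a completion marker, so re-running the workflow over the same inputs is a no-op. The marker directory is purely illustrative; S3 objects or Synapse annotations would serve equally well as the durable store.

```python
# Illustration only: the marker directory is a stand-in for whatever durable
# store the pipeline uses to remember completed work.
import os

MARKER_DIR = "/tmp/tcga-pipeline-markers"   # hypothetical marker location


def run_once(step_name, key, work):
    """Run work() only if (step_name, key) has not already completed."""
    os.makedirs(MARKER_DIR, exist_ok=True)
    marker = os.path.join(MARKER_DIR, "%s.%s" % (step_name, key))
    if os.path.exists(marker):
        return                      # already done; re-runs are a no-op
    work()
    open(marker, "w").close()       # record completion only after success


# e.g. run_once("download", layer_md5, lambda: download_and_upload(uri, md5_uri))
```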
