Steps in the workflow

Identify new or updated TCGA Datasets

Details:

Short Term Deliverables:

Download the dataset layers from TCGA to S3

Details:

Short Term Deliverables:

Create Synapse record for the source layers from TCGA

Details:

Process the TCGA dataset layer

TODO: add more details here. In general, we want to create matrices with the patient_sample ID as the column header and the gene, probe ID, etc. as the row header.

We will probably have different tasks here depending upon the type of the layer.
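
As a rough illustration of the matrix layout described above, the following sketch pivots a list of (patient_sample ID, feature ID, value) records into a matrix with one row per gene/probe and one column per patient sample. The function name `build_matrix`, the record shape, the "NA" placeholder for missing values, and the sample IDs in the usage note are all assumptions made for this example; the real input format will depend on the layer type.

```python
from collections import defaultdict


def build_matrix(records, missing="NA"):
    """Pivot (sample_id, feature_id, value) records into a feature-by-sample matrix.

    Returns a list of rows. The first row is the header: a label for the
    row-header column followed by the sorted patient_sample IDs. Each later
    row starts with a gene/probe ID followed by one value per sample.
    """
    values = defaultdict(dict)  # feature_id -> {sample_id: value}
    samples = set()
    for sample_id, feature_id, value in records:
        values[feature_id][sample_id] = value
        samples.add(sample_id)

    columns = sorted(samples)
    header = ["feature_id"] + columns
    rows = [
        [feature_id] + [values[feature_id].get(s, missing) for s in columns]
        for feature_id in sorted(values)
    ]
    return [header] + rows
```

For example, calling `build_matrix([("sample-A", "TP53", 1.5), ("sample-B", "TP53", 2.0), ("sample-A", "EGFR", 0.3)])` yields a header row of `feature_id, sample-A, sample-B` followed by one row for EGFR and one for TP53, with "NA" where a sample has no value for a feature.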

Details:

Short Term Deliverables:

Create Synapse record for processed TCGA dataset layer

Deliverables:

Formulate notification message

Notify via email all individuals following TCGA datasets of the new layer(s)
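
One possible way to formulate the notification message is sketched below using Python's standard `email.message.EmailMessage`. The function name `build_notification`, the subject wording, and the addresses in the usage note are hypothetical; looking up the followers of a dataset and actually sending the message are deliberately left out.

```python
from email.message import EmailMessage


def build_notification(sender, followers, dataset_name, layer_names):
    """Build (but do not send) an email announcing new layers for a dataset."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = ", ".join(followers)
    message["Subject"] = f"New TCGA layer(s) available for {dataset_name}"

    # One line per new layer so the recipient can see exactly what changed.
    lines = [f"The following layer(s) were added to {dataset_name}:"]
    lines.extend(f"- {name}" for name in layer_names)
    message.set_content("\n".join(lines))
    return message
```

Keeping message construction separate from delivery means the text can be inspected or tested without a mail server, for example `build_notification("noreply@example.org", ["a@example.org"], "Example dataset", ["expression"])`.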

Technologies to Consider

Languages: A combination of Python, R, and Perl (legacy scripts only) should suffice. The workflow seems better suited to scripting languages than to Java. It is a design goal to keep each task small and understandable.

Workflow

Remember to ensure that each workflow step is idempotent.
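
To make the idempotency requirement concrete, here is a minimal sketch of one way a step could be written so that running it twice has the same effect as running it once. It assumes the step's result is a single local file; the helper name `run_step_once` and the `produce` callback are hypothetical, and a step that writes to S3 or Synapse would need an equivalent "does the result already exist?" check against that store.

```python
import os
import tempfile


def run_step_once(output_path, produce):
    """Run produce() only if output_path does not exist yet.

    Returns True if the step did work, False if it was skipped because the
    output was already present. The output is written to a temporary file
    and then renamed into place, so a crash mid-write never leaves a
    partial file that a later run would mistake for a finished result.
    """
    if os.path.exists(output_path):
        return False

    directory = os.path.dirname(output_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(produce())
        os.replace(tmp_path, output_path)  # atomic rename into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

With this shape, retrying a failed or interrupted workflow run is safe: completed steps are skipped and incomplete ones start over from a clean state.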