This design document is now obsolete

Design Goals

  1. Allow scientists to write the workflow activities in their preferred language
  2. Ensure that scientists can use the same code both as workflow activities and for one-off tasks they want to run on their laptops
  3. Ensure that the workflow is scalable to many nodes running concurrently
  4. Minimize the amount of workflow decision logic needed in non-R code

Steps in the TCGA Workflow

Workflow Scaling

Workflow Architecture

Preliminary R Script API

To invoke a script locally:

R -f createMatrix.R --args --username 'nicole.deflaux@sagebase.org' --password XXXXX --datasetId 543 --layerId 544

Script workflow output to STDOUT:

blah blah, this is ignored ...
SynapseWorkflowResult_START
{"layerId":560}
SynapseWorkflowResult_END
blah blah, this is ignored too ...
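
The caller has to pull the JSON result out of everything else the script prints. The following is a minimal sketch of that extraction in Python, assuming the whole STDOUT has already been captured as one string. The function name parse_workflow_result and the error handling are illustrative assumptions, not part of the preliminary API.

```python
import json

START_MARKER = "SynapseWorkflowResult_START"
END_MARKER = "SynapseWorkflowResult_END"


def parse_workflow_result(stdout_text):
    """Return the JSON value printed between the two result markers.

    Text before the start marker and after the end marker is ignored.
    Raises ValueError if either marker is missing (hypothetical policy).
    """
    start = stdout_text.find(START_MARKER)
    if start == -1:
        raise ValueError("start marker not found in script output")
    payload_start = start + len(START_MARKER)
    end = stdout_text.find(END_MARKER, payload_start)
    if end == -1:
        raise ValueError("end marker not found in script output")
    # json.loads tolerates the newlines surrounding the payload.
    return json.loads(stdout_text[payload_start:end])
```

With the sample output above, this would return a dictionary holding the new layerId, which the workflow could pass to the next step.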

Details:

Identify new or updated TCGA Datasets

Details:

Short Term Deliverables:

Create Synapse record for the source layer from TCGA

Details:

Download the dataset layer from TCGA to S3

Details:

Short Term Deliverables:

Process the TCGA dataset layer

TODO: more details here. In general, we want to create matrices with the patient_sample id as the column header and the gene, probeId, etc. as the row header.

We will probably have different tasks here depending upon the type of the layer.
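
The target layout can be sketched as follows. This is only an illustration, assuming the input arrives as long-format records of (feature id, patient_sample id, value). The record shape, the function name build_matrix, the "NA" placeholder, and the sample values are all hypothetical, and real TCGA layers will likely need type-specific parsing first.

```python
def build_matrix(records, missing="NA"):
    """Pivot (feature_id, sample_id, value) records into a list of rows.

    The first row is the column header: an empty corner cell followed by
    the patient_sample ids. Every later row starts with a gene/probe id
    (the row header) followed by one value per patient_sample.
    """
    samples = sorted({sample for _, sample, _ in records})
    features = sorted({feature for feature, _, _ in records})
    values = {(feature, sample): value for feature, sample, value in records}

    rows = [[""] + samples]
    for feature in features:
        row = [feature]
        for sample in samples:
            # Fill gaps so that every row has the same number of columns.
            row.append(values.get((feature, sample), missing))
        rows.append(row)
    return rows


# Made-up values, for illustration only.
example = build_matrix([
    ("gene2", "sampleA", 2.5),
    ("gene1", "sampleA", 1.0),
    ("gene2", "sampleB", 3.1),
])
```

Keeping the pivot separate from the parsing would let each layer type have its own small parsing task while sharing the matrix layout.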

Details:

Deliverables:

Short Term Deliverables:

Formulate notification message

Notify via email all individuals following TCGA datasets of the new layer(s)

Next Steps

Technologies to Consider

Languages: A combination of Python, R, and Perl (legacy scripts only) should suffice. The work seems better suited to scripting languages than to Java. It is a design goal to keep each task small and understandable.

Workflow

Remember to ensure that each workflow step is idempotent.
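
One common way to get idempotency is to record each step's result under a key built from the step name and its input, and to skip the work when that key is already present. The sketch below assumes an in-memory dictionary as the record store. The names run_step_once and fake_download are hypothetical, and a real workflow would need a durable store, for example checking whether the output layer or S3 object already exists.

```python
def run_step_once(completed, step_name, input_id, action):
    """Run action(input_id) unless this (step, input) pair is already recorded.

    `completed` maps (step_name, input_id) to the stored result, so running
    the same step again for the same input returns the earlier result
    without repeating the work.
    """
    key = (step_name, input_id)
    if key in completed:
        return completed[key]
    result = action(input_id)
    completed[key] = result
    return result


calls = []


def fake_download(layer_id):
    # Stand-in for a real step; it records each invocation for the demo.
    calls.append(layer_id)
    return "downloaded-layer-%d" % layer_id


completed = {}
first = run_step_once(completed, "download", 544, fake_download)
second = run_step_once(completed, "download", 544, fake_download)
```

Here the second call returns the stored result and fake_download runs only once, which is the behavior a retried workflow step should have.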