Comment: Migrated to Confluence 5.3

This design document is now obsolete

Design Goals

  1. Allow scientists to write the workflow activities in their preferred language
    • approach: further develop the Synapse R client and the service APIs that it utilizes
  2. Ensure that scientists can use the same code both as workflow activities and for one-off tasks they want to run on their laptops
    • approach: use a particular input/output parameter scheme for all R scripts
    • approach: R scripts have no direct dependencies upon Amazon's Simple Workflow Service; they depend only upon Synapse
  3. Ensure that the workflow is scalable to many nodes running concurrently
    • approach: use Amazon's Simple Workflow Service
  4. Minimize the amount of workflow decision logic needed in non-R code
    • approach: keep the complicated logic about whether a particular script should run on a particular piece of source data out of Java; instead, pass all source data to every R script and let each script decide whether it wants to work on the data
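
The approach in goal 4 can be sketched as follows. This is only an illustration, written in Python rather than R, and every name in it (the `platform` field, `expression_script`, `dispatch`) is a hypothetical stand-in, not part of the actual design. The point is that the dispatcher contains no routing logic: it hands every piece of source data to every script, and each script opts in or out on its own.

```python
from typing import Callable, Dict, List, Optional

Layer = Dict[str, object]
Script = Callable[[Layer], Optional[Dict[str, object]]]


def expression_script(layer: Layer) -> Optional[Dict[str, object]]:
    """Hypothetical script: handles only layers it recognizes, else declines."""
    if layer.get("platform") != "expression":  # assumed metadata field
        return None  # the script itself decides not to work on this data
    return {"handledBy": "expression_script", "layerId": layer["layerId"]}


def dispatch(scripts: List[Script], layer: Layer) -> List[Dict[str, object]]:
    """Pass the same layer to every script and collect results from those that ran."""
    results = []
    for script in scripts:
        result = script(layer)
        if result is not None:
            results.append(result)
    return results
```

With this shape, adding a new script means adding it to the list; the dispatcher never needs to learn which data a script cares about.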

Steps in the TCGA Workflow

[Image]

Workflow Scaling

[Image]

Workflow Architecture

[Image]

Preliminary R Script API

To invoke a script locally:

Rscript createMatrix.R --username 'nicole.deflaux@sagebase.org' --password XXXXX --datasetId 543 --layerId 544
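
Since Python is also under consideration for workflow scripts, here is a minimal sketch of how a Python script might accept the same parameter scheme. The four flag names come from the invocation above; treating the two IDs as integers and making every flag required are assumptions.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the common input parameters shared by all workflow scripts."""
    parser = argparse.ArgumentParser(description="Synapse workflow script (sketch)")
    parser.add_argument("--username", required=True, help="Synapse user name")
    parser.add_argument("--password", required=True, help="Synapse password")
    # Assumption: dataset and layer IDs are integers, as in the example invocation.
    parser.add_argument("--datasetId", type=int, required=True)
    parser.add_argument("--layerId", type=int, required=True)
    return parser.parse_args(argv)
```

Because the function takes an explicit argument list, the same script can be driven from a laptop command line or from a workflow activity without changes.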

Script workflow output to STDOUT:

blah blah, this is ignored ...
SynapseWorkflowResult_START
{"layerId":560}
SynapseWorkflowResult_END
blah blah, this is ignored too ...
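
A caller could recover the result from the captured STDOUT roughly as sketched below. The two marker strings and the JSON payload are taken from the sample output above; the function name and the choice to return `None` when no complete marker pair is found are assumptions.

```python
import json
from typing import Optional

START_MARKER = "SynapseWorkflowResult_START"
END_MARKER = "SynapseWorkflowResult_END"


def extract_workflow_result(stdout: str) -> Optional[dict]:
    """Return the JSON object between the first marker pair, or None if absent."""
    start = stdout.find(START_MARKER)
    if start == -1:
        return None
    payload_start = start + len(START_MARKER)
    end = stdout.find(END_MARKER, payload_start)
    if end == -1:
        return None
    # Everything outside the markers is ignored, as the output format specifies.
    return json.loads(stdout[payload_start:end].strip())
```

Applied to the sample output above, this yields a dictionary with `layerId` set to 560.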

Details

Identify new or updated TCGA Datasets

...

  • Input: one or more of (follower, email body) or error
  • send the email
  • later on, if the volume is high, consolidate and dedup messages
  • Output: success or error
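
The consolidation and deduplication step mentioned above might look like the following sketch. It assumes the input is a sequence of (follower, email body) pairs, as in the Input bullet, and that an exact match on the body is enough to call two messages duplicates; the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def consolidate(messages: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group message bodies by follower, dropping exact duplicates, keeping order."""
    grouped: Dict[str, List[str]] = {}
    for follower, body in messages:
        bodies = grouped.setdefault(follower, [])
        if body not in bodies:  # dedup: skip bodies this follower already has
            bodies.append(body)
    return grouped
```

Each follower would then receive one email containing their list of distinct bodies instead of one email per notification.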

Next Steps

  • enable scientists to drop off TCGA raw data scripts for the workflow in a particular location
    • run the next script upon all previously processed TCGA source data
    • going forward, this script is in the collection of scripts run for each new raw data file found
  • enable scientists to direct the course of the workflow via an output parameter that names the next script to run
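
One way the last item could work is sketched below. The `nextScript` key is a hypothetical name for the output parameter; nothing in this document fixes it. The sketch also assumes the workflow accepts only scripts it already knows about, so a typo in a script's output fails loudly instead of silently ending the workflow.

```python
from typing import Optional, Set


def choose_next_script(result: dict, known_scripts: Set[str]) -> Optional[str]:
    """Return the script named in the result, or None if the result names none."""
    name = result.get("nextScript")  # hypothetical output parameter name
    if name is None:
        return None  # nothing requested; the workflow follows its default course
    if name not in known_scripts:
        raise ValueError(f"unknown next script: {name!r}")
    return name
```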

Technologies to Consider

Languages: A combination of Python, R, and Perl (legacy scripts only) should suffice. The workflow tasks seem better suited to scripting languages than to Java. It is a design goal to keep each task small and understandable.

...

Remember to ensure that each workflow step is idempotent.
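
A minimal sketch of one way to make a step idempotent: record each completed step under a key derived from its inputs, and return the recorded result when the same step is requested again. The in-memory dictionary here stands in for whatever durable store the workflow would really use (an assumption), and `run_once` is a hypothetical helper name.

```python
from typing import Callable, Dict, Hashable


def run_once(store: Dict[Hashable, object], key: Hashable,
             action: Callable[[], object]) -> object:
    """Run action only if no result is recorded for key; otherwise reuse it."""
    if key in store:
        return store[key]  # step already completed; do not redo the work
    result = action()
    store[key] = result  # record completion so a retry becomes a no-op
    return result
```

For example, keying on `("createMatrix", 544)` means that a retried activity for layer 544 returns the earlier result instead of creating a second output layer.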