
Design Goals

  1. Allow scientists to write the workflow activities in their preferred language
    • approach: further develop the Synapse R client and the service APIs that it utilizes
  2. Ensure that scientists can reuse the same code for one-off tasks they want to run on their laptops and also as workflow activities
    • approach: use a particular input/output parameter scheme for all R scripts
    • approach: R scripts have no direct dependencies upon Amazon's Simple Workflow Service
  3. Ensure that the workflow is scalable to many nodes running concurrently
    • approach: use Amazon's Simple Workflow system
  4. Minimize the amount of logic needed in non-R code
    • approach: keep the logic that decides whether a particular script applies to a particular piece of source data out of Java; instead, pass all source data to every R script and let each script decide whether to work on it
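
The "every script sees every dataset and decides for itself" approach, combined with a uniform input/output parameter scheme, might look roughly like the sketch below. It is written in Python purely for illustration (the scripts discussed above are R scripts), and the `--input-dataset` parameter, the `wants` check, and the result fields are all hypothetical names, not part of any existing scheme.

```python
import argparse


def wants(dataset_name):
    """Hypothetical applicability check: this script only handles clinical data."""
    return "clinical" in dataset_name.lower()


def run_activity(argv):
    """Parse the shared input parameters and return the shared output structure."""
    parser = argparse.ArgumentParser(description="Example workflow activity")
    # Hypothetical parameter name; the real scheme is not specified here.
    parser.add_argument("--input-dataset", required=True)
    args = parser.parse_args(argv)

    if not wants(args.input_dataset):
        # The script itself declines the work; the caller needs no special logic.
        return {"status": "skipped", "input": args.input_dataset}

    # Real processing would happen here.
    return {"status": "processed", "input": args.input_dataset}
```

Because the function takes an argument list and has no dependency on the workflow service, the same code could be called from a laptop command line (passing `sys.argv[1:]`) or from a workflow activity wrapper.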

Steps in the workflow

Identify new or updated TCGA Datasets

...

  • Input: one or more of (follower, email body) or error
  • send the email
  • later on, if the volume is high, consolidate and deduplicate messages
  • Output: success or error
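
If consolidation and deduplication become necessary, one possible shape for it is sketched below. It assumes the input is a list of (follower, email body) pairs as described above; the function name `consolidate` and the choice to join bodies with a blank line are illustrative assumptions only.

```python
def consolidate(messages):
    """Group (follower, body) pairs by follower, dropping duplicate bodies.

    Returns a dict mapping each follower to a single combined email body,
    so each follower receives at most one email per batch.
    """
    grouped = {}
    for follower, body in messages:
        bodies = grouped.setdefault(follower, [])
        if body not in bodies:  # dedup while keeping first-seen order
            bodies.append(body)
    return {follower: "\n\n".join(bodies) for follower, bodies in grouped.items()}
```

The actual sending step would then iterate over the returned dict and report success or error per follower.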

Next Steps

  • enable scientists to drop off TCGA raw data scripts for the workflow in a particular location
    • run the dropped-off script on all previously processed TCGA source data
    • going forward, include this script in the collection of scripts run for each new raw data file found
  • enable scientists to direct the course of the workflow by having each script emit an output parameter that names the next script to run
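
As a rough sketch of the last idea, a dispatcher could follow an output parameter from one script to the next. Everything here is hypothetical: the `next_script` key, the `run_chain` helper, and the use of plain Python callables as stand-ins for scripts. The `max_steps` guard is an added assumption to stop scripts that name each other in a loop.

```python
def run_chain(scripts, first, data, max_steps=10):
    """Run scripts in sequence, following each result's 'next_script' output.

    scripts: dict mapping script name to a callable taking the source data
    first:   name of the script to start with
    Returns the list of script names that were run, in order.
    """
    trace = []
    current = first
    while current is not None and len(trace) < max_steps:
        result = scripts[current](data)
        trace.append(current)
        # A script that names no successor ends the chain.
        current = result.get("next_script")
    return trace


def normalize(data):
    # Stand-in script that asks for "summarize" to run next.
    return {"status": "ok", "next_script": "summarize"}


def summarize(data):
    # Stand-in script that ends the chain.
    return {"status": "ok"}
```

With these stand-ins, `run_chain({"normalize": normalize, "summarize": summarize}, "normalize", "dataset")` runs `normalize` and then `summarize`.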

Technologies to Consider

Languages: A combination of Python, R, and Perl (legacy scripts only) should suffice. The workflow tasks seem better suited to scripting languages than to Java. It is a design goal to keep each task small and understandable.

...