Design Goals
- Allow scientists to write the workflow activities in their preferred language
- approach: further develop the Synapse R client and the service APIs that it utilizes
- Ensure that scientists can reuse the same code for one-off tasks they want to run on their laptops and also as workflow activities
- approach: use a particular input/output parameter scheme for all R scripts
- approach: R scripts have no direct dependencies upon Amazon's Simple Workflow Service
- Ensure that the workflow is scalable to many nodes running concurrently
- approach: use Amazon's Simple Workflow system
- Minimize the amount of logic needed in non-R code
- approach: keep the complicated logic about whether a particular script should run on a particular piece of source data out of Java; instead, pass all source data to every R script and let each script decide whether to work on it
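The last approach above can be sketched as follows. This is a minimal illustration in Python rather than R, and every name in it (`offer_to_all`, the `platform` and `name` fields, the two example scripts) is hypothetical and not part of the actual workflow. The point is only that the dispatcher has no routing logic: it offers every dataset to every script, and each script declines by returning `None`.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical shapes: a dataset and a script output are plain string maps.
Dataset = Dict[str, str]
Output = Dict[str, str]
Script = Callable[[Dataset], Optional[Output]]


def expression_script(dataset: Dataset) -> Optional[Output]:
    """Hypothetical script that only works on expression data."""
    if dataset.get("platform") != "expression":
        return None  # decline: the script itself decides it has nothing to do
    return {"script": "expression", "source": dataset["name"]}


def methylation_script(dataset: Dataset) -> Optional[Output]:
    """Hypothetical script that only works on methylation data."""
    if dataset.get("platform") != "methylation":
        return None
    return {"script": "methylation", "source": dataset["name"]}


def offer_to_all(scripts: List[Script], dataset: Dataset) -> List[Output]:
    """Offer the dataset to every script; keep outputs from those that accepted it."""
    results = []
    for script in scripts:
        output = script(dataset)
        if output is not None:
            results.append(output)
    return results
```

Adding a script under this scheme means appending it to the list; the dispatcher itself does not change.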
Steps in the workflow
Identify new or updated TCGA Datasets
...
- Input: one or more of (follower, email body) or error
- send the email
- later on, if the volume is high, consolidate and dedup messages
- Output: success or error
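If consolidation and deduplication become necessary, one possible shape for that step is sketched below. It assumes the input is a list of (follower, email body) pairs as described above; the function name `consolidate` and the choice to join bodies with a blank line are illustrative assumptions, not a decided design.

```python
from typing import Dict, Iterable, List, Tuple


def consolidate(messages: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Merge (follower, body) pairs into one email body per follower.

    Exact duplicate bodies for the same follower are dropped, and the
    remaining bodies keep their original order.
    """
    per_follower: Dict[str, List[str]] = {}
    for follower, body in messages:
        bodies = per_follower.setdefault(follower, [])
        if body not in bodies:  # dedup: skip a body already queued for this follower
            bodies.append(body)
    # One consolidated message per follower, bodies separated by a blank line.
    return {follower: "\n\n".join(bodies) for follower, bodies in per_follower.items()}
```

Each follower then receives a single email per batch, regardless of how many notifications were queued.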
Next Steps
- enable scientists to drop off TCGA raw data scripts for the workflow in a particular location
- run the newly dropped-off script on all previously processed TCGA source data
- going forward, this script is in the collection of scripts run for each new raw data file found
- enable scientists to direct the course of the workflow by having each script emit an output parameter that names the next script to run
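The idea of a script steering the workflow through an output parameter could look roughly like the sketch below. The parameter name `next_script`, the `run_chain` driver, and the `max_steps` guard are all assumptions made for illustration; the actual parameter scheme is not specified here.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical: a script takes the input parameters and returns its output parameters.
Script = Callable[[dict], dict]


def run_chain(scripts: Dict[str, Script], first: str, data: dict,
              max_steps: int = 50) -> List[str]:
    """Run scripts one after another, following each script's 'next_script' output.

    The chain ends when a script emits no 'next_script'. max_steps guards
    against scripts that name each other in a cycle.
    """
    executed: List[str] = []
    name: Optional[str] = first
    while name is not None:
        if len(executed) >= max_steps:
            raise RuntimeError("script chain exceeded max_steps; possible cycle")
        output = scripts[name](data)
        executed.append(name)
        name = output.get("next_script")  # absent key means the chain is done
    return executed
```

In this sketch the driver knows nothing about the order of scripts; the order emerges entirely from what each script outputs.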
Technologies to Consider
Languages: A combination of Python, R, and Perl (for legacy scripts only) should suffice. The workflow tasks seem better suited to scripting languages than to Java. It is a design goal to keep each task small and understandable.
...