Gene Expression Pipeline

The goal of this project is to integrate Synapse and other cloud computing resources to increase reliability, increase performance, and otherwise facilitate the MetaGEO workflows.

Background

MetaGEO is the project led by Brig Mecham to utilize thousands of molecular profiling datasets from repositories like GEO and ArrayExpress.  The home page for the effort is here:

http://sagebionetworks.jira.com/wiki/display/METAGEO

The effort includes:

- Automating the aggregation of 'raw' (i.e. CEL files) microarray data and related metadata

- Study-level normalization of microarray data. This comes in two flavors: (1) Unsupervised normalization is a regression-based process that is completely automatic and omits sample annotation information. (2) Supervised normalization is a refinement in which a human curator may add known factors to the regression, improving the fit.  Supervised normalization may be done iteratively: the curator reviews results, adjusts the model, and reruns the process.  (A minimal sketch of the regression idea follows this list.)

- Probe-set summarization
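
The regression idea behind unsupervised normalization can be sketched in a few lines of R.  This is a minimal illustration, not MetaGEO's actual code: 'expr' is assumed to be a log2 expression matrix (probes x samples) and 'scan_date' a per-sample nuisance factor.

    # Minimal sketch: regress a nuisance factor (e.g. scan date) out of each
    # probe and keep the residuals, re-centered on the probe means.
    # 'expr' (probes x samples) and 'scan_date' are illustrative inputs only.
    normalize_unsupervised <- function(expr, scan_date) {
      design <- model.matrix(~ scan_date)     # nuisance-only design matrix
      fit    <- lm.fit(design, t(expr))       # one regression per probe
      t(fit$residuals) + rowMeans(expr)       # residuals back on the original scale
    }
    # Supervised refinement: a curator adds known biological factors to the
    # design so their signal is preserved while technical effects are removed.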

To date, ~5000 Affymetrix studies have been downloaded from GEO, processed through these steps, and the results saved on the local file system at Sage Bionetworks.

Opportunities for Improvement

  • Make initial data aggregation more reliable.  Currently, when the download scripts fail there is no notification. (The main failure modes are: (1) downloads that don't complete, (2) corrupt zip files, or (3) corrupt CEL files.) Error checking is therefore a manual, tedious process.  It would be helpful to have exception logging, email notification, automatic retries, etc.; a sketch follows this list.  The revised process could be applied (1) to datasets already downloaded from GEO, (2) to ArrayExpress, and (3) to new datasets as they appear on GEO or ArrayExpress.
  • Scanning CEL files for time stamps can break if a study has a mix of file formats.
  • Download metadata from ArrayExpress.  Since metadata is more structured on ArrayExpress than on GEO, there is an opportunity to automate gathering this information along with the molecular profiling data. Per Brig this is out of scope, at least for now.
  • Automate 'unsupervised' normalization. 
  • Upload results to Synapse.  Currently the aggregated data are saved to the local unix file system.   Uploading to Synapse would leverage Synapse's features, including accessibility and searchability of the datasets.
  • Speed up ANOVA-based normalization.  This computationally intensive but parallelizable process benefits from acceleration in two ways: (1) when used in supervised QC it reduces the time spent by the human curator, and (2) when processing large numbers of datasets, acceleration has a significant aggregate impact.
  • Speed up probe-set summarization.  This process, downstream from QC, reaps the same benefits as QC, above.
  • Add novel steps to the processing pipeline and novel target data structures for cross-dataset analyses.
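
As a concrete example of what "automatic retries plus integrity checking" could look like for the first bullet above, here is a hedged R sketch; the URL, expected checksum, and notification hook are placeholders and not part of the current download scripts.

    # Sketch: download with retries, verify an MD5 checksum, and log failures.
    # 'url', 'destfile' and 'expected_md5' are placeholders; an email
    # notification would hang off the final stop() below.
    download_with_retry <- function(url, destfile, expected_md5 = NA, retries = 3) {
      for (attempt in seq_len(retries)) {
        ok <- tryCatch({
          status <- download.file(url, destfile, mode = "wb", quiet = TRUE)
          if (status != 0) stop("non-zero download status")
          if (!is.na(expected_md5) &&
              unname(tools::md5sum(destfile)) != expected_md5) stop("checksum mismatch")
          TRUE
        }, error = function(e) {
          message(sprintf("attempt %d for %s failed: %s",
                          attempt, url, conditionMessage(e)))
          FALSE
        })
        if (ok) return(invisible(destfile))
      }
      stop("giving up on ", url)   # hook for email notification / exception log
    }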

Stages

Our plan is to implement the improvements in the following order:

1) Pipeline the existing unsupervised workflow using Amazon SWF, uploading results to Synapse.

2) Accelerate the ANOVA-based QC process via parallelization, e.g. MapReduce

3) Accelerate probe-set summarization, via parallelization, e.g. MapReduce

4) Incorporate K-means-based module determination, with cross-validation.

5) Create and populate gene expression database schema for cross-study analyses.

Misc Notes

Files downloaded to

/gluster/external-data/metageo/Treatment.Study/GPL<array-pattern>/GSE<study>/...

In GEO, GSM is a prefix for the *sample* identifier.

'makefile' is the shell script to process all the data in a study.  The details of processing GEO data are here:

http://sagebionetworks.jira.com/wiki/display/METAGEO/Building+a+default+instance+of+metaGEO

GEO stores most metadata unstructured in a file header; some structured metadata is also available.

The MetaGenomics Workflow

Background

The process of applying QC algorithms to datasets has been divided into two workflows.  First, an "indexing" workflow crawls a public repository and creates datasets and layers in Synapse corresponding to the data it finds.  (Currently we have crawlers for TCGA and GEO.)  These "raw datasets" are created in the public "Sage Commons Repository" project.  Second, the MetaGenomics workflow passes over the Commons project, runs unsupervised QC, and places the results in the MetaGenomics project.  This segregation of function allows additional public datasets (unassociated with any major data repository) to be created in the Commons project, after which the MetaGenomics QC workflow is automatically applied. 

MetaGenomics logic

The MetaGenomics workflow runs over each data layer in the Sage Commons Repository project.  For each layer it does the following:

  • If the layer type is not "G" or "E" (genetic or expression), skip it and continue to the next layer.
  • Find or create a dataset in the MetaGenomics project having the same name as its parent layer, copying the description, status, and createdBy attributes as well.
  • Determine if the layer has already been processed.  It does this by finding two annotations on the MetaGenomics dataset for the layer, one for the layer's 'modifiedOn' attribute and one for the layer's MD5 checksum.  If either attribute matches that of the source layer, no further processing takes place.
  • If the "number_of_samples" attribute is available in the source layer, then MetaGEO QC is scheduled for a machine size as shown in the table below. Assuming a memory requirement of 40MB/sample (derived empirically for Affymetrix datasets), the machine sizes are given in the third column of the table.  If the number_of_samples attribute is not given, the task is scheduled for a "large" machine.
  • Upon successful completion of QC, a layer in the MetaGenomics dataset called "QCd Data <source layer name> <platform>" is found or created.  The QC output is uploaded to the layer and the 'modifiedOn' and MD5 checksum attributes are added to the parent dataset.

Machine Size | No. Samples    | Server Memory (GB)
small        | 375 or less    | 15
medium       | 376 to 800     | 32
large        | 801 to 1600    | 64
extralarge   | more than 1600 | > 64
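
The scheduling rule is simple enough to write down directly.  Below is a minimal sketch assuming the empirical 40 MB/sample figure and the cut-offs in the table above; the function name is illustrative, not part of the existing workflow code.

    # Sketch of the machine-size rule: the cut-offs follow from ~40 MB per sample
    # (375 * 40 MB ~ 15 GB, 800 * 40 MB ~ 32 GB, 1600 * 40 MB ~ 64 GB).
    # Defaults to "large" when the number_of_samples annotation is absent.
    choose_machine_size <- function(number_of_samples = NA) {
      if (is.na(number_of_samples))       return("large")
      if (number_of_samples <= 375)       "small"       # 15 GB
      else if (number_of_samples <= 800)  "medium"      # 32 GB
      else if (number_of_samples <= 1600) "large"       # 64 GB
      else                                "extralarge"  # > 64 GB
    }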

Installation and running

http://sagebionetworks.jira.com/wiki/display/PLFM/Workflow+Deployment#WorkflowDeployment-MetaGenomicsWorkflowDeployment

MetaGEO as a Workflow

We use the conceptual framework of Amazon's Simple Workflow (SWF) system.  The workflow is a series of activities, and activity boundaries are identified based on:
 - points at which the workflow splits into parallel threads or joins back into a single one
 - operations that are sequential but run on different machines (or that involve humans rather than computers)
 - retries:  operations that can fail or hang indefinitely (e.g. downloading a large file) should be at the start of an activity, or at least should not be preceded within the activity by another operation that is costly, error-prone, or cannot be repeated.  For example, if one 1 TB file is to be downloaded, followed by another, then each download should be its own activity.
 - granularity of tracking:  the workflow system tracks the progress of activities, so the workflow should be split into activities based on the desired level of detail in progress reporting

Activities should be 'idempotent':  accidentally repeating an activity, or accidentally rerunning an activity while the same activity is already running, must not be catastrophic (a sketch follows below).
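
A minimal illustration of the idempotency requirement, assuming a simple marker-file convention that is not part of the actual pipeline: the activity checks for its own completed output before redoing expensive work, so an accidental or concurrent rerun is harmless.

    # Illustrative idempotent activity wrapper: skip work whose output is
    # already marked complete, so an accidental rerun is a no-op.
    # The ".done" marker convention is an assumption for this sketch.
    run_activity <- function(study_id, outfile, do_work) {
      done_marker <- paste0(outfile, ".done")
      if (file.exists(done_marker) && file.exists(outfile)) {
        message("output for ", study_id, " already present; skipping")
        return(invisible(outfile))
      }
      do_work(outfile)           # the expensive, retry-able step
      file.create(done_marker)   # written only after success
      invisible(outfile)
    }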

A workflow 'instance' is the application of the workflow to a single study.

We envision these activities (skeletons are sketched after the list):

1) For a given study ID:

  • create or empty temp folder on server
  • download the .tar.gz file
  • unzip into temp folder on server
  • identify the array pattern and scan timestamp for each file
  • return a mapping of file -> <platform, timestamp>

 ("platform"="array pattern")

2) For each <study, platform>:

  • create or empty the Sweave output folder
  • launch the normalization process (Sweave file), creating results on the file server
  • delete CEL files and any other temp files
  • return <study, platform> -> <processed data, diagnostic data, inference data>

3) For each <study, platform, layer> (layer = processed data, diagnostic data, or inference data):

  • upload to Synapse
  • delete Sweave output files and any other temp files
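
The three activities might look roughly like the R skeletons below.  Every helper named here (geo_raw_url, read_platform_and_timestamp, run_sweave_normalization, upload_layer_to_synapse) is a placeholder for code that does not yet exist; this is a sketch of the decomposition, not an implementation.

    # Skeletons of the three envisioned activities; all helpers named below
    # are placeholders, not existing functions.

    # Activity 1: fetch one study and characterize its files.
    stage_study <- function(gse_id, tmpdir) {
      unlink(tmpdir, recursive = TRUE); dir.create(tmpdir)
      tarball <- file.path(tmpdir, paste0(gse_id, ".tar.gz"))
      download_with_retry(geo_raw_url(gse_id), tarball)   # see the download sketch above
      untar(tarball, exdir = tmpdir)
      cels <- list.files(tmpdir, pattern = "\\.CEL(\\.gz)?$", full.names = TRUE)
      lapply(setNames(cels, basename(cels)), read_platform_and_timestamp)
    }

    # Activity 2: normalize one <study, platform> pair.
    normalize_pair <- function(gse_id, platform, cel_files, outdir) {
      unlink(outdir, recursive = TRUE); dir.create(outdir)
      run_sweave_normalization(cel_files, outdir)  # processed, diagnostic,
      unlink(cel_files)                            # and inference outputs
      list.files(outdir, full.names = TRUE)
    }

    # Activity 3: publish one layer to Synapse and clean up.
    publish_layer <- function(gse_id, platform, layer_file) {
      upload_layer_to_synapse(gse_id, platform, layer_file)
      unlink(layer_file)
    }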

Organization of MetaGEO output in Synapse

Project:  There will be a single MetaGEO project

Dataset:  There will be a 1-1 mapping from GEO "studies" (GSE #'s) to Synapse datasets.  A dataset may have annotations drawn from GEO, like abstract, summary, etc.
Layers:  The MetaGEO workflow is applied to each <GSE, platform> pair, resulting in two layers: (1) an R object containing auto-QCed expression data and related diagnostic information, and (2) a sample annotation layer (akin to a clinical endpoint layer).  Since these two layers are produced for each platform (i.e. microarray pattern) in the study, there may be more than two layers per dataset.   Each layer will have the platform as an annotation, as well as references to the code used to generate the layer.
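
For one GEO study, the intended Synapse layout could be pictured as the nested structure below; the entity names, IDs, and annotation keys are examples only, not a fixed schema.

    # Example only: the intended shape of one MetaGEO entry in Synapse.
    # Entity names and annotation keys are illustrative, not a fixed schema.
    metageo_entry <- list(
      project = "MetaGEO",
      dataset = list(
        name        = "GSE12345",            # 1-1 with the GEO study
        annotations = list(abstract = "...", summary = "..."),
        layers = list(
          list(name     = "QCd expression (GPL570)",     # R object: QCed data
               platform = "GPL570",                      # plus diagnostics
               code_ref = "link to generating code"),
          list(name     = "Sample annotations (GPL570)", # clinical-endpoint-
               platform = "GPL570")                      # style layer
        )
      )
    )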
 
When MetaGEO is re-run, we will leverage Synapse's versioning capability to generate new versions of the preexisting layers.
 
MetaGEO's output includes a "Gene Information Content" (GIC) object which is a cross-dataset summary of the variation level of each probe set.  As such, this object is not the child of any single dataset, but rather it is the child of the MetaGEO project.  To properly categorize the object, Synapse would have to allow a layer to be a child of a project.

Lessons Learned

These are some of the 'lessons learned' from the first pass at running the unsupervised workflow and uploading the results to Synapse:
- The general approach we ended up taking was to create (1) an 'initiator' to 'crawl' GEO and start one job per identified dataset; (2) an 'executor' that did everything for one dataset.  In particular we did not decompose execution into workflow steps.

- The scientist (Brig) was able to write both the crawler and the execution code, in R and Perl.

- Synapse is more convenient than the SWF console for reviewing the status of a dataset.  (Status is pushed to annotations.  Note:  Need a way to filter datasets by their annotations.)

- During development/testing, need a way to run 'just one job' (or a small number of representative jobs) as an integration test.

- During debugging, need to easily run the components interactively.

- Need better error reporting from R (i.e. a stack trace rather than just an 'execution halted' message).  Might be accomplished by replacing Rscript with RJava.  Another idea is to integrate a call to 'traceback()' with R's end-of-session shutdown.
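
One generic way to get that stack trace from Rscript (not something currently wired into the pipeline) is to install an error handler that prints a traceback and exits non-zero:

    # Generic R idiom: on any uncaught error in a non-interactive (Rscript)
    # session, print a traceback and exit non-zero instead of a bare
    # "Execution halted".
    if (!interactive()) {
      options(error = function() {
        traceback(2)                              # skip the handler frame
        quit(save = "no", status = 1, runLast = FALSE)
      })
    }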

- Needed flexible input data, i.e. the ability to change the data passed from the initiator to the executor, without adding new parameters to the workflow infrastructure.

- Running out of RAM or disk space killed our activity executor, even when we tried to 'catch' errors and continue with the next job.

- Didn't attempt to spawn multiple EC2 instances, but this clearly should be done in the future.

- Don't put code snippets in Synapse unless there's a good reason, e.g. the code has dataset-specific variations or it facilitates interactive use.  Other R code that's not part of the generic workflow framework ought to live in a package.    Open question:  where should code that's workflow-specific, yet not dataset-specific, live?

- Code entities should be kept simple and modular, because they can get confusing, especially if they start referencing each other.

- It was hard. Why?

  • Synapse is under development
  • Requirements evolved over time
  • The run-debug cycle is long for such large jobs

- Problems with SWF forced us to bypass that framework and use a single (Java app) thread, which has greatly reduced the speed at which we churn through data sets.

- Used the new Synapse 'API Key' functionality to avoid session timeouts and to avoid putting the user's password on the worker node.

Info to send back to the SWF team

PR Blurbs

"We are starting to look into our PR related work. As part of that, I would like to get quotes from our key private beta customers. It would be good if you could summarize your experience in using SWF in the form of a quote or two. The quote(s) should address the following questions: How are you using SWF? How has it benefited you and/or how do you think you will benefit from it? Where possible, it would be great to quantify the benefits (e.g. saved you X in development time or enabled you to go to market with a use case in half the time, etc.)"

  • We used SWF to develop a pipeline to analyze more than 10,000 genomic data sets. SWF made it trivial to leverage the power of the Amazon cloud, and what used to take weeks now runs in days.

Feedback and Feature Requests