Distributed Computation Strategy
Goals of this effort
- Evaluate technologies
- Enable specific, real use cases
- Create artifacts (software installations, scripts, guides, live demos) which enable future applications
NON-Goals
- Reinvent the wheel (start a software engineering effort)
- Exhaustively evaluate all available technologies
- Try to create a one-size-fits-all solution
Use Cases
- Elias' randomized simulation: requires 10,000 runs of elastic net, lasso, and ridge regression, each on slightly different data.
- In Sock's prediction pipeline: very similar to Elias' use case. Parallelization can be over a) each predictive model (as in Elias' case), b) each bootstrap run, or c) each cross-validation fold.
- Roche collaboration: a Matlab-based Bayesian network analysis that is computationally intensive because it explores a large parameter space (~1000 variations).
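All three use cases share one shape: many independent runs that differ only in their inputs (data perturbation, model, bootstrap sample, fold, or parameter setting). The following is a minimal Python sketch of that pattern, not code from any of the projects above. The model names, penalties, synthetic data, and the `run_task` / `run_all` helpers are all assumptions made for illustration, and a one-dimensional ridge fit stands in for the real elastic net, lasso, and ridge models.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical model names mapped to ridge penalties (illustration only).
MODELS = {"ridge_weak": 0.1, "ridge_strong": 10.0}


def run_task(task):
    """Fit one model on one resampled data set; depends only on its arguments."""
    model, bootstrap_id = task
    rng = random.Random(bootstrap_id)  # per-task seed keeps each run reproducible
    xs = [rng.gauss(0, 1) for _ in range(200)]
    ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]  # synthetic data, true slope 2
    penalty = MODELS[model]
    # Closed-form one-dimensional ridge estimate (no intercept).
    slope = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)
    return model, bootstrap_id, slope


def run_all(tasks, map_fn=map):
    """Apply run_task to every task; map_fn selects the execution back-end."""
    return list(map_fn(run_task, tasks))


# One task per (model, bootstrap run) pair; cross-validation folds or
# parameter settings could be added as further axes of the product.
tasks = list(product(MODELS, range(5)))

serial = run_all(tasks)  # built-in map: runs in the current process
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = run_all(tasks, map_fn=pool.map)  # same tasks, pooled back-end
```

Because each task carries its own seed and shares no state, the serial and pooled runs return identical results, which is the property that makes this kind of workload easy to distribute. Threads are used here only to keep the sketch portable; for CPU-bound fits a process pool or a cluster back-end would take the place of `pool.map`.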
Approach
Initially select two use cases, one R-based and one non-R. For each, select a likely technology and develop a working solution.
Emphasis on iterative development (first get a working system, then add more features) and on capturing "lessons learned" along the way.
Consider how to make computation open and reproducible (integrate with Synapse).
Technologies
(not meant to be exhaustive or prioritized):
R-Based
- Revolution Analytics 'foreach' (supports multiple back-ends, including AWS EC2)
- Google foreach-GCompute integration
- Parallel R
- The BioC group has set up an AWS CloudFormation template: http://www.bioconductor.org/help/bioconductor-cloud-ami/#parallel
- http://www.bioconductor.org/help/course-materials/2006/BioC2006/labs/mmorgan/ParallelR.pdf
- Chris Bare put together some documentation for Interactive distributed computing in R
- Segue (R wrapper around AWS MapReduce, http://code.google.com/p/segue/)
- RHIPE (R wrapper around Hadoop, http://www.datadr.org/)
- much, much more: http://cran.r-project.org/web/views/HighPerformanceComputing.html
Generic
- StarCluster / Sun Grid Engine: http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/using-spot-instances-cluster.html