...
Automatically configuring the nodes for arbitrary requirements;
Resources
Sources of Cloud Computing Resources
AWS
Google Big Compute
It should be easy to move the solution among different computing resources as business needs change. The solution should tolerate a moderately high failure rate in worker nodes; it is very likely that we will want to use AWS spot instances to reduce cost, since many of our use cases do not require immediate results.
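The fault-tolerance requirement largely reduces to making each job idempotent and cheap to resubmit when a worker (e.g. a reclaimed spot instance) disappears. A minimal sketch of that pattern in R; run_with_retry, job and fun are hypothetical names, not part of any existing code:

    # Re-run a job if its worker dies mid-flight; jobs must therefore be idempotent.
    run_with_retry <- function(job, fun, max_attempts = 3) {
      for (attempt in seq_len(max_attempts)) {
        result <- tryCatch(fun(job), error = function(e) NULL)
        if (!is.null(result)) return(result)
      }
      stop("job failed after ", max_attempts, " attempts")
    }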
Process Initialization (allocate nodes, run generic 'worker' process on each node)
Sun Grid Engine
MapReduce/Hadoop
Job Assignment / Monitoring (create queue of jobs, assign jobs to nodes; see the sketch after this list)
AWS Simple Workflow
AWS Simple Queue Service
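Whichever service we pick, the underlying pattern is the same: allocate a pool of workers, then hand each queued job to whichever worker is idle. A minimal sketch of that pattern using only base R's parallel package (job_list and run_job are hypothetical placeholders):

    library(parallel)

    cl <- makeCluster(4)   # local workers here; on EC2 this could be a vector of node hostnames
    job_list <- lapply(1:100, function(i) list(id = i, seed = i))
    run_job  <- function(job) { set.seed(job$seed); mean(rnorm(1000)) }

    # clusterApplyLB is load-balanced: each idle worker pulls the next unassigned job,
    # which is the same queue-of-jobs idea as SQS/SWF, just in-process.
    results <- clusterApplyLB(cl, job_list, run_job)
    stopCluster(cl)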
...
- Elias' randomized simulation. Requires 10,000 runs of elastic net, lasso, and ridge, each on slightly different data.
- In Sock's prediction pipeline. Very similar to Elias' use case. Parallelization can be over: a) each predictive model (as in Elias' case); b) each bootstrap run; c) each cross-validation fold. (See the sketch after this list.)
- Roche Collaboration: a Bayesian network analysis that is computationally intensive because it explores a large parameter space.
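To make the use cases concrete, here is a rough sketch of what one independent job in Elias' or In Sock's workload might look like. The data (x, y), the run count, and one_run are hypothetical placeholders; glmnet's alpha parameter selects ridge, elastic net, or lasso:

    library(glmnet)

    set.seed(1)
    x <- matrix(rnorm(200 * 50), nrow = 200)   # placeholder data
    y <- rnorm(200)

    # One job: fit ridge (alpha = 0), elastic net (alpha = 0.5) and lasso (alpha = 1)
    # on a bootstrap resample and return each model's best cross-validated error.
    one_run <- function(run_id) {
      idx <- sample(nrow(x), replace = TRUE)
      sapply(c(ridge = 0, enet = 0.5, lasso = 1),
             function(a) min(cv.glmnet(x[idx, ], y[idx], alpha = a)$cvm))
    }

    results <- lapply(1:20, one_run)   # ~10,000 runs in the real use case

Because every run is independent, the infrastructure only needs to farm these calls out to worker nodes and collect the results.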
Solutions to explore
- IPython (on Amazon). Larsson says this allows parallelization in Python in the same way we are trying to design into BigR, and that it is already set up to run on Amazon using StarCluster.
- Revolution foreach (on Amazon). Chris Bare brings up a good point: have we explored whether Revolution's foreach package can run on Amazon? That seems like the first place they would implement it, and someone has likely already gotten it working. (Note: from http://blog.revolutionanalytics.com/2009/07/simple-scalable-parallel-computing-in-r.html, "it also allows iterations of foreach loops to run on separate machines on a cluster, or in a cloud environment like Amazon EC2".) See the sketch after this list for what a foreach version of our jobs might look like.
- Looks like there are tons of offerings for the R language: http://cran.r-project.org/web/views/HighPerformanceComputing.html
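For reference, a foreach version of the same placeholder workload from the sketch above (reusing x and y). The only Amazon-specific piece would be the registered backend, e.g. a cluster backend over a StarCluster-provisioned set of nodes instead of local cores:

    library(foreach)
    library(doParallel)

    registerDoParallel(cores = 4)   # on EC2, register a cluster backend here instead

    # Each iteration is one independent fit; the registered backend decides where it runs.
    cv_err <- foreach(run = 1:20, .combine = rbind, .packages = "glmnet") %dopar% {
      idx <- sample(nrow(x), replace = TRUE)   # bootstrap resample
      sapply(c(ridge = 0, enet = 0.5, lasso = 1),
             function(a) min(cv.glmnet(x[idx, ], y[idx], alpha = a)$cvm))
    }

    stopImplicitCluster()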