Distributed Computation Strategy

 

Goals of this effort

Evaluate technologies

Enable specific, real use cases

Create artifacts (software installations, scripts, guides, live demos) which enable future applications

 

NON-Goals

Reinvent the wheel (start a software engineering effort)

Exhaustively evaluate all available technologies

Try to create a one-size-fits-all solution

 

Use Cases

  1. Elias' randomized simulation. Requires 10,000 runs of elastic net, lasso, and ridge regression, each on slightly different data.
  2. In Sock's prediction pipeline. Very similar to Elias' use case. Parallelization can be over: a) each predictive model (as in Elias' case); b) each bootstrap run; c) each cross-validation fold.
  3. Roche Collaboration: a MATLAB-based Bayesian Network analysis which is computationally intensive because it explores a large region of parameter space (~1000 variations).
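All three use cases consist of many independent runs, so each run can be written as a pure function of a run index and handed to whatever mapper is available. The following is only a minimal sketch of that shape, not anyone's actual pipeline: the function names, the synthetic data, and the one-variable ridge fit are all illustrative assumptions.

```python
import random
from typing import Callable, Iterable, List, Sequence


def ridge_slope(xs: Sequence[float], ys: Sequence[float], penalty: float) -> float:
    """Closed-form slope of a one-variable ridge regression without intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


def run_one(seed: int) -> float:
    """One independent run: build a slightly different data set, fit, return the slope."""
    rng = random.Random(seed)  # per-run seed keeps every run reproducible
    xs = [i / 10.0 for i in range(1, 51)]
    ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]  # hypothetical data: slope 2.0 plus noise
    return ridge_slope(xs, ys, penalty=1.0)


def run_all(seeds: Iterable[int], map_fn: Callable = map) -> List[float]:
    """Apply run_one to every seed; map_fn decides where the work executes."""
    return list(map_fn(run_one, seeds))


if __name__ == "__main__":
    slopes = run_all(range(100))  # serial by default
    print(len(slopes), min(slopes), max(slopes))
```

Because `run_one` depends only on its seed, the serial `map` can be swapped for a parallel one without touching the model code, for example the `map` method of a `concurrent.futures.ProcessPoolExecutor` on a single machine, or a cluster scheduler's equivalent.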

 

Approach

Initially select two use cases, one R-based and one non-R. For each, select a likely technology and develop a working solution.

Emphasis on iterative development (first get a working system, then add more features) and capturing "lessons learned" along the way.

Consider how to make computation open and reproducible (integrate with Synapse).

 

Technologies

(not meant to be exhaustive or prioritized):

R-Based

Generic

StarCluster / Sun Grid Engine: http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/using-spot-instances-cluster.html
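With a grid-engine scheduler, a parameter sweep is commonly submitted as an array job (for example `qsub -t 1-1000`), where each task reads its own index from the `SGE_TASK_ID` environment variable. The sketch below shows one way a worker script could turn that index into a parameter combination. The grid itself (`alpha`, `beta`, `gamma`) is entirely hypothetical and stands in for whatever parameters a real analysis would vary.

```python
import itertools
import os

# Hypothetical parameter grid: 10 x 10 x 10 = 1000 combinations.
GRID = {
    "alpha": [round(0.1 * i, 1) for i in range(1, 11)],
    "beta": list(range(10)),
    "gamma": [10 ** k for k in range(10)],
}


def params_for_task(task_id: int, grid: dict = GRID) -> dict:
    """Return the parameter combination for a 1-based array task ID."""
    names = sorted(grid)  # fixed order so every task sees the same numbering
    combos = list(itertools.product(*(grid[name] for name in names)))
    if not 1 <= task_id <= len(combos):
        raise ValueError(f"task_id must be in 1..{len(combos)}, got {task_id}")
    return dict(zip(names, combos[task_id - 1]))


if __name__ == "__main__":
    raw = os.environ.get("SGE_TASK_ID", "1")
    task_id = int(raw) if raw.isdigit() else 1  # fall back to task 1 outside an array job
    print(task_id, params_for_task(task_id))
```

Keeping the index-to-parameters mapping in one deterministic function means any single task can be rerun by its ID, which helps when only a few of the variations fail.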