
Distributed Compute Jobs

Problem:

How to run multiple jobs in parallel on a common data set. Code and results should be open, transparent, and reproducible.

Out of scope (for now):

Complex workflows (e.g. combining intermediate results from distributed jobs);

Automatically configuring the nodes for arbitrary requirements;

 

Resources

Sources of Cloud Computing Resources

AWS

Google Big Compute

It should be easy to move the solution among different computing resources as business needs change. The solution should tolerate moderately high failure rates in worker nodes; it is highly likely that we would want to use AWS spot instances to reduce cost, as many of our use cases do not require immediate results.

Process Initialization (allocate nodes, run generic 'worker' process on each node)

Sun Grid Engine

MapReduce/Hadoop

Job Assignment / Monitoring (create queue of jobs, assign jobs to nodes)

AWS Simple Workflow

AWS Simple Queue Service

Approach

Phase 1 approach:

- Use StarCluster to create a Sun Grid Engine (SGE) cluster.

- Put data and code on the NFS file system on the cluster.

- Write SGE job files for the jobs; each job runs the code and writes its results to the NFS file system.
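A minimal sketch (in R) of the script each SGE job file might invoke. It assumes the common data set and a results directory live on the NFS mount; the paths and the run_model() function are placeholders, and the job index is assumed to arrive as a command-line argument (e.g. from $SGE_TASK_ID in an array job).

    #!/usr/bin/env Rscript
    # Phase 1 per-job sketch: read the shared data from NFS, do one unit of
    # work, and write the result back to NFS. Paths and run_model() are
    # placeholders; the job index is passed on the command line.

    args <- commandArgs(trailingOnly = TRUE)
    job_id <- as.integer(args[1])

    data <- readRDS("/shared/input/common_dataset.rds")   # common data set on NFS

    run_model <- function(data, seed) {
      set.seed(seed)
      list(seed = seed, coef = rnorm(5))   # dummy stand-in for the real analysis
    }

    result <- run_model(data, seed = job_id)

    saveRDS(result, sprintf("/shared/results/result_%04d.rds", job_id))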

Phase 2 approach:

- Use StarCluster to create a Sun Grid Engine (SGE) cluster.

- Create a Synapse dataset with two locations: (1) S3 and (2) the NFS file system on the cluster.

- Write SGE job files for the jobs; each job runs the code and sends its results to Synapse.

- Push the job files to Synapse for future reference.
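For Phase 2, the tail of the same per-job script could push the result file to Synapse as well. This is only a sketch: the synapseClient calls (synapseLogin, File, synStore) and the parent entity ID are assumptions about the client API and project layout, not a confirmed recipe.

    # Phase 2 sketch: store the per-job result in Synapse in addition to NFS.
    # synapseLogin(), File(), synStore() and the parent ID are assumptions
    # about the Synapse R client; adjust to the client version actually in use.
    library(synapseClient)

    synapseLogin()   # credential handling is an open issue (see subsequent phases)

    out_path <- sprintf("/shared/results/result_%04d.rds", job_id)
    saveRDS(result, out_path)

    synStore(File(out_path, parentId = "syn0000000"))   # placeholder Synapse folder ID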

Subsequent phases will tackle these issues:

- Pull code from Synapse.

- Pass user credentials without putting them in files (see the sketch below).

- Move the queue to AWS SWF or SQS.

- Capture provenance in Synapse and expand to multi-step analysis workflows.
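On the credentials item, one sketch (not a decided approach) is to have the submission layer set environment variables that the job script reads at runtime; the variable names below are placeholders, and the commented-out login call is a hypothetical illustration only.

    # Credentials sketch: read them from the environment instead of a file.
    # SYNAPSE_USERNAME / SYNAPSE_API_KEY are placeholder variable names.
    user <- Sys.getenv("SYNAPSE_USERNAME")
    key  <- Sys.getenv("SYNAPSE_API_KEY")
    if (user == "" || key == "") stop("Synapse credentials not set in the environment")
    # synapseLogin(username = user, apiKey = key)   # hypothetical call; depends on the client API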

 

Requirements (or at least desired initial functionality)

  1. Master/slave-type architecture to collect results from distributed computations into an object that can be used in subsequent computations. The Revolution foreach package meets this requirement by collecting results into an R list. Traditional batch submission systems do not meet this requirement without additional engineering: the results of each job may be written to a separate text file, and those files then need to be aggregated by a separate program, which becomes cumbersome.
  2. Code should run in serial (for normal interactive computing) and in parallel with as little modification to the user's typical workstream as possible. Again, Revolution foreach meets this requirement: parallelization only requires changing %do% to %dopar% (or running the same %dopar% code with or without a registered parallel backend), as illustrated in the sketch below. Traditional batch submission systems require a significantly different workstream and code modifications to run in parallel.
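A minimal R sketch of requirements 1 and 2 together, assuming the foreach and doParallel packages: the loop body is a dummy computation, and the only change between serial and parallel execution is the registered backend and the %do%/%dopar% operator.

    # Same loop, serial vs. parallel; results come back as an R list either way.
    library(foreach)
    library(doParallel)

    # Serial: normal interactive computing
    serial_results <- foreach(i = 1:10) %do% {
      set.seed(i)
      mean(rnorm(1000))   # dummy unit of work
    }

    # Parallel: register a backend and switch %do% to %dopar%
    cl <- makeCluster(4)
    registerDoParallel(cl)
    parallel_results <- foreach(i = 1:10) %dopar% {
      set.seed(i)
      mean(rnorm(1000))
    }
    stopCluster(cl)

    str(serial_results)   # a plain list, ready for downstream computation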

Out of scope for initial functionality (though desirable in the future)

Inter-node communication – e.g. the reduce step in MapReduce. It is sufficient to assume jobs are embarrassingly parallel for initial functionality.

Driving use cases to implement parallelization

  1. Elias' randomized simulation. Requires 10,000 runs of elastic net, lasso, and ridge, each using slightly different data.
  2. In Sock's prediction pipeline. Very similar to Elias' use case. Parallelization can be over: (a) each predictive model (as in Elias' case); (b) each bootstrap run; or (c) each cross-validation fold (see the sketch after this list).
  3. Roche Collaboration: a Bayesian network analysis that is computationally intensive because it explores a large parameter space.
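To make option (b) from use case 2 concrete, here is a sketch that parallelizes over bootstrap resamples; the data frame and the lm() fit are placeholders for the real prediction pipeline.

    # Sketch of option (b): one foreach iteration per bootstrap resample.
    library(foreach)
    library(doParallel)

    d <- data.frame(y = rnorm(200), x = rnorm(200))   # placeholder data set

    cl <- makeCluster(4)
    registerDoParallel(cl)

    boot_coefs <- foreach(b = 1:1000, .combine = rbind) %dopar% {
      idx <- sample(nrow(d), replace = TRUE)   # draw one bootstrap resample
      coef(lm(y ~ x, data = d[idx, ]))         # placeholder for the real model fit
    }

    stopCluster(cl)

    apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))   # bootstrap intervals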

Solutions to explore

  1. IPython (on Amazon). Larsson says this allows parallelization in Python the same way we are trying to design into BigR. He says this is already set up to run on Amazon using StarCluster.
  2. Revolution foreach (on Amazon). Chris Bare brings up a good point – have we explored whether Revolution's foreach package can run on Amazon? It seems likely that this is the first place they would implement it and that someone has already gotten it working (see the backend sketch below). (Note: from http://blog.revolutionanalytics.com/2009/07/simple-scalable-parallel-computing-in-r.html, "it also allows iterations of foreach loops to run on separate machines on a cluster, or in a cloud environment like Amazon EC2".)
    1. Looks like there are tons of offerings for the R language: http://cran.r-project.org/web/views/HighPerformanceComputing.html
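Following up on item 2, a sketch of pointing foreach at multiple cluster nodes using the snow and doSNOW packages over a socket cluster; the hostnames are placeholders, and passwordless SSH between nodes (as a StarCluster SGE cluster provides) is assumed.

    # Sketch: register a foreach backend spanning worker nodes via snow sockets.
    library(snow)
    library(doSNOW)
    library(foreach)

    nodes <- c("node001", "node001", "node002", "node002")   # one entry per worker process; placeholder hostnames
    cl <- makeCluster(nodes, type = "SOCK")
    registerDoSNOW(cl)

    res <- foreach(i = 1:100) %dopar% sqrt(i)   # trivial loop to confirm work spreads across nodes

    stopCluster(cl)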