Distributed Compute Jobs


Problem:

How to run multiple jobs in parallel on a common data set. Code and results should be open, transparent, and reproducible.

Out of scope (for now):

Complex workflows (e.g. combining intermediate results from distributed jobs);

Automatically configuring the nodes for arbitrary requirements;

 

Resources

Computing Resources

- AWS

- Google Big Compute

Process Initialization

- Sun Grid Engine

- MapReduce/Hadoop

Job Assignment / Monitoring

- AWS Simple Workflow

- AWS Simple Queue Service

 

 

Approach

Phase 1 approach:

- Use StarCluster to create a Sun Grid Engine (SGE) cluster.

- Put data and code on the NFS file system shared across the cluster.

- Write SGE job files for the jobs; each job runs the code and writes its results back to the NFS file system (a rough sketch of such a job's R code follows below).
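
As a rough illustration of the Phase 1 job code, the sketch below is an R script that each SGE array task could run: it reads the shared data from NFS, fits one model, and writes its result back to NFS. The paths (/shared/data, /shared/results), the data file name, and the lm() call are placeholders rather than an agreed design; only the use of the SGE_TASK_ID environment variable to distinguish tasks is standard SGE behavior.

    # worker.R -- run by each SGE array task.
    # Paths and the model call below are illustrative placeholders.

    task_id <- as.integer(Sys.getenv("SGE_TASK_ID", unset = "1"))

    data_dir    <- "/shared/data"     # NFS location of the common data set (assumed)
    results_dir <- "/shared/results"  # NFS location for per-task output (assumed)

    # Load the common data set once per task.
    dat <- readRDS(file.path(data_dir, "common_dataset.rds"))

    # Each task perturbs the analysis via its task id, e.g. a different random seed.
    set.seed(task_id)
    fit <- lm(y ~ ., data = dat[sample(nrow(dat), replace = TRUE), ])

    # Write this task's result to its own file on the NFS file system.
    saveRDS(fit, file.path(results_dir, sprintf("result_%04d.rds", task_id)))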

Phase 2 approach:

- Use StarCluster to create a Sun Grid Engine (SGE) cluster.

- Create a Synapse dataset with two locations: (1) S3 and (2) the NFS file system on the cluster.

- Write SGE job files for the jobs; each job runs the code and sends its results to Synapse (a sketch of the upload step follows below).

- Push the job files to Synapse for future reference.
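
The Synapse client API is not specified on this page, so purely as an illustration of "each job sends its results to Synapse", the sketch below uses the present-day synapser R package (synLogin, File, synStore). The results folder ID (syn123456) and the file path are made-up placeholders.

    # Illustrative only: push one task's result file to Synapse.
    library(synapser)

    synLogin()  # assumes credentials are available, e.g. via the Synapse config file

    task_id     <- as.integer(Sys.getenv("SGE_TASK_ID", unset = "1"))
    result_path <- sprintf("/shared/results/result_%04d.rds", task_id)

    # Store the result as a File entity under a (hypothetical) results folder.
    synStore(File(result_path, parent = "syn123456"))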

Subsequent phases will tackle these issues:

- Pull code from Synapse.

- Pass user credentials to the jobs without writing them to files.

- Move the job queue to AWS SWF or SQS.

 

Requirements (or at least desired initial functionality)

  1. A master/slave architecture that collects the results of a distributed computation into an object usable in subsequent computations. Revolution's foreach package meets this requirement by collecting results into an R list. Traditional batch submission systems do not meet it without additional engineering: each job typically writes its results to a separate text file, and those files must then be aggregated by a separate program, which becomes cumbersome.
  2. Code should run in serial (for normal interactive computing) and in parallel with as little modification to the user's typical workstream as possible. Again, Revolution foreach meets this requirement: parallelization only requires changing %do% to %dopar% (or running the same %dopar% code with or without a registered parallel backend); see the sketch after this list. Traditional batch submission systems require a significantly different workstream and code modifications to run in parallel.
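
To make requirement 2 concrete, here is a minimal foreach sketch: the loop body is identical in the serial and parallel versions; only the operator (%do% vs. %dopar%) and the registered backend change. The doParallel backend with 4 local workers is just one example, not a decision about which backend to adopt.

    library(foreach)
    library(doParallel)

    # Serial: runs in the current R session and returns an R list of results.
    serial_results <- foreach(i = 1:10) %do% {
      sqrt(i)
    }

    # Parallel: the same loop body, spread across 4 local workers.
    registerDoParallel(cores = 4)
    parallel_results <- foreach(i = 1:10) %dopar% {
      sqrt(i)
    }

    identical(serial_results, parallel_results)  # TRUE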

Out of scope for initial functionality (though desirable in the future)

Inter-node communication (e.g. the reduce step in MapReduce). It is sufficient to assume jobs are embarrassingly parallel for initial functionality.

Driving use cases to implement parallelization

  1. Elias' randomized simulation. Requires 10,000 runs of elastic net, lasso, and ridge regression, each using slightly different data (see the sketch after this list).
  2. In Sock's prediction pipeline. Very similar to Elias' use case. Parallelization can be over a) each predictive model (as in Elias' case), b) each bootstrap run, or c) each cross-validation fold.
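
A rough sketch of how Elias' use case could map onto foreach: each iteration perturbs the data and fits elastic net, lasso, and ridge models via glmnet (alpha = 0.5, 1, and 0 respectively). The toy data, the number of runs, and the use of cv.glmnet are placeholders standing in for the real simulation.

    library(foreach)
    library(doParallel)
    library(glmnet)

    registerDoParallel(cores = 4)

    # Toy data standing in for the real data set.
    set.seed(1)
    x <- matrix(rnorm(200 * 20), nrow = 200)
    y <- rnorm(200)

    n_runs <- 100  # the real use case would use 10,000

    results <- foreach(run = 1:n_runs, .packages = "glmnet") %dopar% {
      # Slightly different data on each run (here, a bootstrap resample).
      idx <- sample(nrow(x), replace = TRUE)
      lapply(c(elastic_net = 0.5, lasso = 1, ridge = 0),
             function(a) cv.glmnet(x[idx, ], y[idx], alpha = a))
    }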

Solutions to explore

  1. IPython (on Amazon). Larsson says this allows parallelization in Python in the same way we are trying to design into BigR. He says it is already set up to run on Amazon using StarCluster.
  2. Revolution foreach (on Amazon). Chris Bare raises a good point: have we explored whether Revolution's foreach package can run on Amazon? That seems like the first place they would implement it, and likely someone has already gotten it working (one possible setup is sketched below).
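
On the question of running foreach across Amazon nodes, one possibility (offered only as an untested sketch) is a socket cluster spanning the StarCluster nodes via the snow/doSNOW backend. The node names below are hypothetical; in practice they would come from the cluster configuration.

    library(foreach)
    library(doSNOW)

    # Hypothetical StarCluster node names.
    nodes <- c("master", "node001", "node002")

    cl <- makeCluster(nodes, type = "SOCK")  # socket cluster over the listed hosts
    registerDoSNOW(cl)

    results <- foreach(i = 1:10) %dopar% sqrt(i)

    stopCluster(cl)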