
Problem:

How to run multiple jobs in parallel on a common data set. Code and results should be open, transparent, and reproducible.

Out of scope (for now):

Complex workflows (e.g., combining intermediate results from distributed jobs)

Automatically configuring the nodes for arbitrary requirements

 

Resources

Computing Resources

- AWS

- Google Big Compute

Process Initialization

- Sun Grid Engine

- MapReduce/Hadoop

Job Assignment / Monitoring

- AWS Simple Workflow

- AWS Simple Queue Service

 

 

Approach

Phase 1 approach:

- Use StarCluster to create a Sun Grid Engine (SGE) cluster.

- Put the data and code on an NFS file system on the cluster.

- Write SGE job files for the jobs; each job runs the code and writes its results to the NFS file system (a sketch of job-file generation and submission follows this list).
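
As a rough sketch only (not a committed implementation), Phase 1 job submission could look like the following. The directory layout under /home/sgeadmin, the analysis script run_analysis.R, and the job-naming scheme are placeholders invented for illustration; only qsub and the SGE job-file directives come from the approach above.

    import subprocess
    from pathlib import Path

    # Hypothetical layout on the cluster's shared NFS mount (placeholders).
    NFS_DATA = Path("/home/sgeadmin/data")        # common data set
    NFS_RESULTS = Path("/home/sgeadmin/results")  # per-job output
    ANALYSIS = "/home/sgeadmin/code/run_analysis.R"  # placeholder analysis script

    JOB_TEMPLATE = """#!/bin/bash
    #$ -N {job_name}
    #$ -cwd
    #$ -o {results_dir}/{job_name}.out
    #$ -e {results_dir}/{job_name}.err
    Rscript {analysis} {input_file} {results_dir}/{job_name}.csv
    """

    def submit_job(job_name: str, input_file: Path) -> None:
        """Write an SGE job file for one input and submit it with qsub."""
        job_file = NFS_RESULTS / f"{job_name}.sge"
        job_file.write_text(JOB_TEMPLATE.format(
            job_name=job_name,
            results_dir=NFS_RESULTS,
            analysis=ANALYSIS,
            input_file=input_file,
        ))
        subprocess.run(["qsub", str(job_file)], check=True)

    if __name__ == "__main__":
        # One job per input file in the shared data directory.
        for input_file in sorted(NFS_DATA.glob("*.txt")):
            submit_job(f"job_{input_file.stem}", input_file)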

Phase 2 approach:

- Use StarCluster to create a Sun Grid Engine (SGE) cluster.

- Create a Synapse dataset with two locations: (1) S3 and (2) the NFS file system on the cluster.

- Write SGE job files for the jobs; each job runs the code and sends its results to Synapse (a sketch of the upload step follows this list).

- Push the job files themselves to Synapse for future reference.
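
A sketch of the "send results to Synapse" step, assuming the Synapse Python client (synapseclient) is installed on the nodes. The folder ID syn0000000 and the file-naming convention are placeholders; syn.login() with no arguments relies on credentials already cached on the node, which is exactly the limitation a later phase is meant to remove.

    import synapseclient
    from synapseclient import File

    # Placeholder Synapse ID; the real ID comes from the project/dataset created for this work.
    RESULTS_FOLDER_ID = "syn0000000"

    syn = synapseclient.Synapse()
    syn.login()  # uses credentials cached on the node; see the credentials item below

    def push_result(local_path: str, job_name: str) -> None:
        """Upload one job's result file to Synapse so results stay open and reproducible."""
        entity = File(local_path, name=f"{job_name}-results", parent=RESULTS_FOLDER_ID)
        syn.store(entity)

    # Example: called at the end of each SGE job, or in a sweep over the results directory.
    # push_result("/home/sgeadmin/results/job_01.csv", "job_01")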

Subsequent phases will tackle these issues:

- Pull code from Synapse.

- Pass user credentials without putting them in files.

- Move the job queue to AWS SWF or SQS (a sketch of an SQS-based queue, with credentials read from the environment, follows this list).
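
One possible shape for the SQS-based queue, which also keeps credentials out of files, sketched with boto3. The environment-variable names and the message format are assumptions for illustration, not decisions that have been made.

    import json
    import os
    from typing import Optional

    import boto3

    # AWS credentials come from the environment or an IAM role attached to the node,
    # not from files baked into the cluster image. JOB_QUEUE_URL is a placeholder name.
    QUEUE_URL = os.environ["JOB_QUEUE_URL"]

    sqs = boto3.client("sqs", region_name=os.environ.get("AWS_REGION", "us-east-1"))

    def enqueue_job(job_spec: dict) -> None:
        """Publish one job description for a worker node to pick up."""
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(job_spec))

    def next_job() -> Optional[dict]:
        """Fetch and delete one job description, or return None if the queue is empty."""
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=10)
        messages = resp.get("Messages", [])
        if not messages:
            return None
        msg = messages[0]
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        return json.loads(msg["Body"])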

 

 
