
Problem:

How to run multiple jobs in parallel on a common data set. Code and results should be open, transparent, and reproducible.

Out of scope (for now):

Complex workflows (e.g., combining intermediate results from distributed jobs)

Automatically configuring the nodes for arbitrary requirements

 

Resources

Computing Resources

- AWS

- Google Big Compute

Process Initialization

- Sun Grid Engine

- MapReduce/Hadoop

Job Assignment / Monitoring

- AWS Simple Workflow

- AWS Simple Queue Service

 

 

Approach

Phase 1 approach:

- Use StarCluster to create a Sun Grid Engine (SGE) cluster.

- Put the data and code on an NFS file system on the cluster.

- Write SGE job files for the jobs; each job runs the code and writes its results to the NFS file system (a sketch of job-file generation and submission follows this list).
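
As a rough sketch only (not a committed implementation), Phase 1 job submission could look like the following. The directory layout under /home/sgeadmin, the analysis script run_analysis.R, and the job-naming scheme are placeholders invented for illustration; only qsub and the SGE job-file directives come from the approach above.

    import subprocess
    from pathlib import Path

    # Hypothetical layout on the cluster's shared NFS mount (placeholders).
    NFS_DATA = Path("/home/sgeadmin/data")        # common data set
    NFS_RESULTS = Path("/home/sgeadmin/results")  # per-job output
    ANALYSIS = "/home/sgeadmin/code/run_analysis.R"  # placeholder analysis script

    JOB_TEMPLATE = """#!/bin/bash
    #$ -N {job_name}
    #$ -cwd
    #$ -o {results_dir}/{job_name}.out
    #$ -e {results_dir}/{job_name}.err
    Rscript {analysis} {input_file} {results_dir}/{job_name}.csv
    """

    def submit_job(job_name: str, input_file: Path) -> None:
        """Write an SGE job file for one input and submit it with qsub."""
        job_file = NFS_RESULTS / f"{job_name}.sge"
        job_file.write_text(JOB_TEMPLATE.format(
            job_name=job_name,
            results_dir=NFS_RESULTS,
            analysis=ANALYSIS,
            input_file=input_file,
        ))
        subprocess.run(["qsub", str(job_file)], check=True)

    if __name__ == "__main__":
        # One job per input file in the shared data directory.
        for input_file in sorted(NFS_DATA.glob("*.txt")):
            submit_job(f"job_{input_file.stem}", input_file)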

Phase 2 approach:

- Use StarCluster to create a Sun Grid Engine (SGE) cluster.

- Create a Synapse dataset with two locations: (1) S3 and (2) the NFS file system on the cluster.

- Write SGE job files for the jobs; each job runs the code and sends its results to Synapse (a sketch of the upload step follows this list).

- Push the job files themselves to Synapse for future reference.
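
A sketch of the "send results to Synapse" step, assuming the Synapse Python client (synapseclient) is installed on the nodes. The folder ID syn0000000 and the file-naming convention are placeholders; syn.login() with no arguments relies on credentials already cached on the node, which is exactly the limitation a later phase is meant to remove.

    import synapseclient
    from synapseclient import File

    # Placeholder Synapse ID; the real ID comes from the project/dataset created for this work.
    RESULTS_FOLDER_ID = "syn0000000"

    syn = synapseclient.Synapse()
    syn.login()  # uses credentials cached on the node; see the credentials item below

    def push_result(local_path: str, job_name: str) -> None:
        """Upload one job's result file to Synapse so results stay open and reproducible."""
        entity = File(local_path, name=f"{job_name}-results", parent=RESULTS_FOLDER_ID)
        syn.store(entity)

    # Example: called at the end of each SGE job, or in a sweep over the results directory.
    # push_result("/home/sgeadmin/results/job_01.csv", "job_01")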

Subsequent phases will tackle these issues:

- Pull code from Synapse.

- Pass user credentials without putting them in files.

- Move the job queue to AWS SWF or SQS (a sketch of an SQS-based queue, with credentials read from the environment, follows this list).
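
One possible shape for the SQS-based queue, which also keeps credentials out of files, sketched with boto3. The environment-variable names and the message format are assumptions for illustration, not decisions that have been made.

    import json
    import os
    from typing import Optional

    import boto3

    # AWS credentials come from the environment or an IAM role attached to the node,
    # not from files baked into the cluster image. JOB_QUEUE_URL is a placeholder name.
    QUEUE_URL = os.environ["JOB_QUEUE_URL"]

    sqs = boto3.client("sqs", region_name=os.environ.get("AWS_REGION", "us-east-1"))

    def enqueue_job(job_spec: dict) -> None:
        """Publish one job description for a worker node to pick up."""
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(job_spec))

    def next_job() -> Optional[dict]:
        """Fetch and delete one job description, or return None if the queue is empty."""
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=10)
        messages = resp.get("Messages", [])
        if not messages:
            return None
        msg = messages[0]
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        return json.loads(msg["Body"])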

 

 
