Goal: run multiple jobs in parallel on a common data set, with code and results kept open, transparent, and reproducible.
Further challenges:
- Complex workflows (e.g., combining intermediate results from distributed jobs)
- Automatically configuring the nodes for arbitrary requirements
Relevant technologies:
- AWS
- Google Big Compute
- Sun Grid Engine
- MapReduce/Hadoop
- AWS Simple Workflow
- AWS Simple Queue Service
Initial approach, using NFS only:
- Use StarCluster to create a Sun Grid Engine (SGE) cluster.
- Put the data and code on the cluster's NFS file system.
- Write SGE job files for the jobs; each job runs the code and writes its results back to the NFS file system (a sketch of this step follows this list).
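A minimal sketch of the job-file step, assuming a cluster started with `starcluster start mycluster` and a shared NFS home directory; the data, code, and result paths (and the analyze.py script) are hypothetical placeholders. It generates one SGE job file per chunk of the data set and submits each with qsub:

```python
import subprocess
from pathlib import Path

# Hypothetical locations on the cluster's shared NFS file system.
NFS_DATA = "/home/sgeadmin/data/common_dataset.csv"
NFS_CODE = "/home/sgeadmin/code/analyze.py"
NFS_RESULTS = "/home/sgeadmin/results"

# SGE job file template: the #$ lines are SGE directives setting the
# job name, working directory, and stdout/stderr locations.
JOB_TEMPLATE = """#!/bin/bash
#$ -N job_{job_id}
#$ -cwd
#$ -o {results}/job_{job_id}.out
#$ -e {results}/job_{job_id}.err

python {code} --input {data} --chunk {job_id} --output {results}/job_{job_id}.csv
"""

def submit_jobs(n_jobs):
    for job_id in range(n_jobs):
        script = Path(f"job_{job_id}.sh")
        script.write_text(JOB_TEMPLATE.format(
            job_id=job_id, data=NFS_DATA, code=NFS_CODE, results=NFS_RESULTS))
        subprocess.check_call(["qsub", str(script)])  # hand the job to SGE

if __name__ == "__main__":
    submit_jobs(10)  # e.g. ten parallel jobs over chunks of the data set
```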
Second approach, integrating Synapse:
- Use StarCluster to create a Sun Grid Engine (SGE) cluster.
- Create a Synapse dataset with two locations: (1) S3 and (2) the NFS file system on the cluster.
- Write SGE job files for the jobs; each job runs the code and sends its results to Synapse (see the sketch after this list).
- Push the job files themselves to Synapse for future reference.
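A sketch of the "send results to Synapse" step, assuming the Synapse Python client (synapseclient); the project ID syn12345 and the file names are placeholders, and credentials are read from the client's config file rather than hard-coded:

```python
import synapseclient
from synapseclient import File

# Credentials come from the Synapse config file (~/.synapseConfig),
# not from the job script itself.
syn = synapseclient.Synapse()
syn.login()

# Upload one job's result file into a (placeholder) Synapse project.
result = syn.store(File("results/job_0.csv", parent="syn12345"))
print("result stored as", result.id)

# Push the job file itself to Synapse for future reference.
job_file = syn.store(File("job_0.sh", parent="syn12345"))
print("job file stored as", job_file.id)
```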
Future enhancements:
- Pull the code from Synapse instead of staging it on NFS.
- Pass user credentials to jobs without putting them in files.
- Move the job queue to AWS SWF or SQS (an SQS sketch follows this list).
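A sketch of the last item, assuming boto3 and SQS as the queue; the queue name and message format are assumptions. A producer enqueues one message per job, and each cluster node runs the worker loop until the queue drains:

```python
import json
import boto3

sqs = boto3.client("sqs")

# Create (or look up) a queue to hold job descriptions.
queue_url = sqs.create_queue(QueueName="analysis-jobs")["QueueUrl"]

# Producer: enqueue one message per job instead of writing SGE job files.
for job_id in range(10):
    sqs.send_message(QueueUrl=queue_url,
                     MessageBody=json.dumps({"job_id": job_id, "chunk": job_id}))

# Worker (runs on each cluster node): pull jobs until the queue is empty.
while True:
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)  # long polling
    messages = resp.get("Messages", [])
    if not messages:
        break
    msg = messages[0]
    job = json.loads(msg["Body"])
    print("running job", job["job_id"])  # run the analysis for this chunk here
    # Delete only after the job succeeds, so failed jobs reappear in the queue.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```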