Problem:
How to run multiple jobs in parallel on a common data set. Code and results should be open, transparent, and reproducible.
Out of scope (for now):
- Complex workflows (e.g., combining intermediate results from distributed jobs)
- Automatically configuring the nodes for arbitrary requirements
Resources
Computing Resources
- AWS
- Google Big Compute
Process Initialization
- Sun Grid Engine
- MapReduce/Hadoop
Job Assignment / Monitoring
- AWS Simple Workflow
- AWS Simple Queue Service
Approach
Phase 1 approach:
- Use StarCluster to create a Sun Grid Engine (SGE) cluster.
- Put data and code on an NFS file system on the cluster.
- Write SGE job files for the jobs; each job runs the code and writes its results back to the NFS file system (a sketch follows this list).
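A minimal sketch of the last step, assuming Python on the cluster; all paths, the analysis script name (analyze.py), and the job parameters are hypothetical placeholders. It generates one SGE job script per job and submits each with qsub; every job runs the shared code on the shared data and writes its result to the NFS file system.

import os
import subprocess

# Hypothetical locations on the cluster's shared NFS file system.
NFS_DATA = "/home/sgeadmin/data/dataset.csv"
NFS_CODE = "/home/sgeadmin/code/analyze.py"
NFS_RESULTS = "/home/sgeadmin/results"
JOB_DIR = "/home/sgeadmin/jobs"

def write_job_file(job_id):
    # One SGE job script per job; each runs the shared code on the shared
    # data and writes its result back to the NFS file system.
    lines = [
        "#!/bin/bash",
        "#$ -N job_%d" % job_id,
        "#$ -cwd",
        "#$ -o %s/job_%d.out" % (NFS_RESULTS, job_id),
        "#$ -e %s/job_%d.err" % (NFS_RESULTS, job_id),
        "python %s --input %s --chunk %d --output %s/result_%d.txt"
        % (NFS_CODE, NFS_DATA, job_id, NFS_RESULTS, job_id),
    ]
    job_file = os.path.join(JOB_DIR, "job_%d.sh" % job_id)
    with open(job_file, "w") as f:
        f.write("\n".join(lines) + "\n")
    return job_file

def submit_jobs(num_jobs):
    os.makedirs(JOB_DIR, exist_ok=True)
    os.makedirs(NFS_RESULTS, exist_ok=True)
    for job_id in range(num_jobs):
        # qsub is available on the SGE master node that StarCluster sets up.
        subprocess.check_call(["qsub", write_job_file(job_id)])

if __name__ == "__main__":
    submit_jobs(10)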
Phase 2 approach:
- Use StarCluster to create a Sun Grid Engine (SGE) cluster.
- Create a Synapse dataset with two locations: (1) S3 and (2) the NFS file system on the cluster.
- Write SGE job files for the jobs; each job runs the code and sends its results to Synapse (see the sketch after this list).
- Push the job files themselves to Synapse for future reference.
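A minimal sketch of the "send results to Synapse" step, assuming the Synapse Python client (synapseclient); the result path, parent entity ID (syn0000000), and job name are placeholders, and the two-location dataset setup itself is not shown.

import synapseclient

def push_result_to_synapse(result_path, parent_id, job_name):
    # Credentials are picked up from the Synapse config/cache here; a later
    # phase deals with passing them to jobs without writing them to files.
    syn = synapseclient.Synapse()
    syn.login()

    # Upload the job's result file under a Synapse project or folder.
    # "syn0000000" below is a placeholder for the real parent entity ID.
    result = synapseclient.File(result_path, parent=parent_id,
                                name="%s result" % job_name)
    result = syn.store(result)
    return result.id

if __name__ == "__main__":
    push_result_to_synapse("/home/sgeadmin/results/result_0.txt",
                           "syn0000000", "job_0")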
Subsequent phases will tackle these issues:
- Pull code from Synapse.
- Pass user credentials to jobs without putting them in files.
- Move the job queue to AWS Simple Workflow (SWF) or Simple Queue Service (SQS); a sketch of the last two points follows this list.
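A rough sketch of the last two points, assuming boto3 for SQS and a synapseclient login that accepts an auth token: jobs are enqueued as SQS messages, and each worker reads its Synapse credential from an environment variable (the name SYNAPSE_AUTH_TOKEN is an assumption) instead of from a file on NFS; the queue URL is a placeholder.

import os
import boto3
import synapseclient

# Placeholder queue URL; replace with the project's real SQS queue.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/job-queue"

def enqueue_jobs(job_ids):
    # Each message body is just a job identifier.
    sqs = boto3.client("sqs")
    for job_id in job_ids:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=str(job_id))

def worker():
    # Credentials come from the environment rather than from files.
    syn = synapseclient.Synapse()
    syn.login(authToken=os.environ["SYNAPSE_AUTH_TOKEN"])

    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=10)
        messages = resp.get("Messages", [])
        if not messages:
            break
        msg = messages[0]
        job_id = msg["Body"]
        # ... run the analysis for job_id and push its results to Synapse ...
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])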