In this example, we do not use MapReduce to its full potential: we use it only to run jobs in parallel, one job per chromosome. The phase algorithm from UW writes its output to local files instead of stdout, so the mapper uploads those files to S3 itself.
Mapper
~>cat phaseMapper.sh
#!/bin/sh

RESULT_BUCKET=s3://sagetest-YourUsername/results

while read S3_INPUT_FILE; do
    echo input to process ${S3_INPUT_FILE} 1>&2

    # For debugging purposes, print out the files cached for us
    ls -la 1>&2

    # Parse the s3 file path to get the file name
    LOCAL_INPUT_FILE=$(echo ${S3_INPUT_FILE} | perl -pe 'if (/^((s3[n]?):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$/) {print "$6\n"};' | head -1)

    # Download the file from S3
    echo hadoop fs -get ${S3_INPUT_FILE} ${LOCAL_INPUT_FILE} 1>&2
    hadoop fs -get ${S3_INPUT_FILE} ${LOCAL_INPUT_FILE} 1>&2

    # Run phase processing
    ./phase ${LOCAL_INPUT_FILE} ${LOCAL_INPUT_FILE}_out 100 1 100

    # Upload the output files
    ls -la ${LOCAL_INPUT_FILE}*_out* 1>&2
    for f in ${LOCAL_INPUT_FILE}*_out*
    do
        echo hadoop fs -put $f ${RESULT_BUCKET}/$LOCAL_INPUT_FILE/$f 1>&2
        hadoop fs -put $f ${RESULT_BUCKET}/$LOCAL_INPUT_FILE/$f 1>&2
    done
    echo processed ${S3_INPUT_FILE} 1>&2
    echo 1>&2
    echo 1>&2
done

exit 0
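The Perl one-liner in the mapper only needs the last component of the S3 path. As a minimal sketch, assuming every input line is a plain S3 URI with no query string or fragment, POSIX parameter expansion yields the same file name. The function name `s3_file_name` is hypothetical and not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper (not in the original mapper):
# print the last path component of an S3 URI.
s3_file_name() {
    # ${1##*/} removes the longest leading match of "*/",
    # leaving only the text after the final slash.
    printf '%s\n' "${1##*/}"
}

s3_file_name "s3://sagetest-YourUsername/input/ProSM_chrom_MT.phase.inp"
```

Inside the mapper this could stand in for the perl pipeline, for example LOCAL_INPUT_FILE=$(s3_file_name "${S3_INPUT_FILE}"), provided the assumption about the URI shape holds.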
Upload it to S3 via the AWS console or s3curl:
/work/platform/bin/s3curl.pl --id $USER --put phaseMapper.sh https://s3.amazonaws.com/sagetest-$USER/scripts/phaseMapper.sh
Reducer
~>cat echoReducer.sh
#!/bin/sh

while read LINE; do
    echo ${LINE} 1>&2
    echo ${LINE}
done

exit 0
Upload it to S3 via the AWS console or s3curl:
/work/platform/bin/s3curl.pl --id $USER --put echoReducer.sh https://s3.amazonaws.com/sagetest-$USER/scripts/echoReducer.sh
Input
~>cat phaseInput.txt
s3://sagetest-YourUsername/input/ProSM_chrom_MT.phase.inp
... many more files, one per chromosome
Upload it to S3 via the AWS console or s3curl:
/work/platform/bin/s3curl.pl --id $USER --put phaseInput.txt https://s3.amazonaws.com/sagetest-$USER/input/phaseInput.txt
Also upload all the data files referenced in phaseInput.txt to the location specified in that file.
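Because the input list must name exactly the files that were uploaded, it can help to generate it from the local data directory rather than typing it by hand. The following is a sketch only: the function name `list_phase_inputs`, the local directory layout, and the assumption that every data file ends in `.phase.inp` are illustrative and not taken from the original setup.

```shell
#!/bin/sh
# Hypothetical helper: print one S3 URI per local *.phase.inp file.
#   $1 = local directory holding the data files (assumed layout)
#   $2 = S3 prefix the data files are uploaded to
list_phase_inputs() {
    for f in "$1"/*.phase.inp; do
        # If nothing matched, the glob stays unexpanded; skip it.
        [ -e "$f" ] || continue
        # ${f##*/} keeps only the file name, dropping the local directory.
        printf '%s/%s\n' "$2" "${f##*/}"
    done
}
```

For example, `list_phase_inputs ./data s3://sagetest-YourUsername/input > phaseInput.txt` would write one line per chromosome file found in a local `./data` directory. The same loop could wrap the s3curl command shown above to upload each data file.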
Run the MapReduce Job
Job Configuration
~>cat phase.json
[
  {
    "Name": "MapReduce Step 1: Run Phase",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
      "Args": [
        "-input", "s3n://sagetest-YourUsername/input/phaseInput.txt",
        "-output", "s3n://sagetest-YourUsername/output/phaseTry1",
        "-mapper", "s3n://sagetest-YourUsername/scripts/phaseMapper.sh",
        "-reducer", "s3n://sagetest-YourUsername/scripts/echoReducer.sh",
        "-cacheFile", "s3n://sagetest-YourUsername/scripts/phase#phase",
        "-jobconf", "mapred.map.tasks=1",
        "-jobconf", "mapred.reduce.tasks=1"
      ]
    }
  }
]
Start the MapReduce cluster
~>/work/platform/bin/elastic-mapreduce-cli/elastic-mapreduce --credentials ~/$USER-credentials.json --create --num-instances=1 --master-instance-type=m1.large --json phase.json --name phaseTry1
Created job flow j-GA47B7VD991Q
Check on the job status
If something is misconfigured, it will fail in a minute or two. Check on the job status and make sure it is running.
~>/work/platform/bin/elastic-mapreduce-cli/elastic-mapreduce --credentials ~/$USER-credentials.json --list --jobflow j-GA47B7VD991Q
j-GA47B7VD991Q RUNNING ec2-174-129-134-200.compute-1.amazonaws.com filesysTry1
   RUNNING MapReduce Step 1: Run Phase
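When checking repeatedly, it can be convenient to pull just the job flow state out of the listing. The sketch below assumes the `--list` output keeps the layout shown above, with the job flow id in the first column and its state in the second. The function name `jobflow_state` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `--list` output on stdin and print the state
# of the job flow whose id is passed as $1.
jobflow_state() {
    # Match the line whose first field is the job flow id, print the
    # second field (the state), and stop at the first match.
    awk -v id="$1" '$1 == id { print $2; exit }'
}
```

Piping the `--list --jobflow j-GA47B7VD991Q` command above into `jobflow_state j-GA47B7VD991Q` would then print only the state, such as RUNNING.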
If there were any errors, make corrections and resubmit the job step:
~>/work/platform/bin/elastic-mapreduce-cli/elastic-mapreduce --credentials ~/$USER-credentials.json --json phase.json --jobflow j-GA47B7VD991Q
Added jobflow steps
Get your results
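Look in your S3 bucket for the results. The mapper uploads the phase output files under s3://sagetest-YourUsername/results/, in one folder per input file, and the streaming job writes the reducer output to s3n://sagetest-YourUsername/output/phaseTry1.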