In this example, we are not using MapReduce to its full potential. We are only using it to run jobs in parallel, one job for each chromosome.
Mapper
~>cat phaseMapper.sh
#!/bin/sh

RESULT_BUCKET=s3://sagetest-YourUsername/results

while read S3_INPUT_FILE; do
    echo input to process ${S3_INPUT_FILE} 1>&2

    # For debugging purposes, print out the files cached for us
    ls -la 1>&2

    # Parse the S3 file path to get the file name
    LOCAL_INPUT_FILE=$(echo ${S3_INPUT_FILE} | perl -pe 'if (/^((s3[n]?):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$/) {print "$6\n"};' | head -1)

    # Download the file from S3
    echo hadoop fs -get ${S3_INPUT_FILE} ${LOCAL_INPUT_FILE} 1>&2
    hadoop fs -get ${S3_INPUT_FILE} ${LOCAL_INPUT_FILE} 1>&2

    # Run phase processing
    ./phase ${LOCAL_INPUT_FILE} ${LOCAL_INPUT_FILE}_out 100 1 100

    # Upload the output files
    ls -la ${LOCAL_INPUT_FILE}*_out 1>&2
    for f in ${LOCAL_INPUT_FILE}*_out
    do
        echo hadoop fs -put $f ${RESULT_BUCKET}/$LOCAL_INPUT_FILE/$f 1>&2
        hadoop fs -put $f ${RESULT_BUCKET}/$LOCAL_INPUT_FILE/$f 1>&2
    done

    echo processed ${S3_INPUT_FILE} 1>&2
    echo 1>&2
    echo 1>&2
done
exit 0
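Before uploading, it can help to mark the script executable and run a quick shell syntax check locally (a minimal sanity check only; the hadoop and ./phase commands inside the script exist only on the cluster nodes, so the script itself cannot be run on your workstation):

chmod +x phaseMapper.sh
# -n parses the script without executing it, so missing cluster-side commands don't matter
sh -n phaseMapper.sh && echo "syntax OK"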
Upload it to S3 via the AWS console or s3curl:
/work/platform/bin/s3curl.pl --id $USER --put phaseMapper.sh https://s3.amazonaws.com/sagetest-$USER/scripts/phaseMapper.sh
Reducer
~>cat echoReducer.sh
#!/bin/sh

while read LINE; do
    echo ${LINE} 1>&2
    echo ${LINE}
done
exit 0
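Since this reducer simply copies each input line to both stdout and stderr, it is easy to smoke-test locally before uploading (a quick check, assuming the script is in your current directory):

chmod +x echoReducer.sh
# The test line should appear twice: once on stdout, once on stderr
echo "s3://sagetest-YourUsername/input/ProSM_chrom_MT.phase.inp" | ./echoReducer.sh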
Upload it to S3 via the AWS console or s3curl:
/work/platform/bin/s3curl.pl --id $USER --put echoReducer.sh https://s3.amazonaws.com/sagetest-$USER/scripts/echoReducer.sh
Input
~>cat phaseInput.txt
s3://sagetest-YourUsername/input/ProSM_chrom_MT.phase.inp
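Each line of this file becomes one input record for the mappers, so running more chromosomes in parallel just means adding more lines. A hypothetical multi-chromosome input file (the additional file names here are made up for illustration) might look like:

s3://sagetest-YourUsername/input/ProSM_chrom_1.phase.inp
s3://sagetest-YourUsername/input/ProSM_chrom_2.phase.inp
s3://sagetest-YourUsername/input/ProSM_chrom_MT.phase.inp

If you do this, you will likely also want to raise mapred.map.tasks in phase.json (and consider more instances) so the files are actually processed in parallel rather than by a single mapper.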
Upload it to S3 via the AWS console or s3curl:
/work/platform/bin/s3curl.pl --id $USER --put phaseInput.txt https://s3.amazonaws.com/sagetest-$USER/input/phaseInput.txt
Also upload all the data files referenced in phaseInput.txt to the location specified in that file.
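For example, the single data file referenced above could be pushed with the same s3curl pattern (adjust the local file name or path to wherever your PHASE input actually lives):

/work/platform/bin/s3curl.pl --id $USER --put ProSM_chrom_MT.phase.inp https://s3.amazonaws.com/sagetest-$USER/input/ProSM_chrom_MT.phase.inp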
Run the MapReduce Job
Job Configuration
~>cat phase.json
[
  {
    "Name": "MapReduce Step 1: Run Phase",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
      "Args": [
        "-input", "s3n://sagetest-YourUsername/input/phaseInput.txt",
        "-output", "s3n://sagetest-YourUsername/output/phaseTryNumber1",
        "-mapper", "s3n://sagetest-YourUsername/scripts/phaseMapper.sh",
        "-reducer", "s3n://sagetest-YourUsername/scripts/echoReducer.sh",
        "-cacheFile", "s3n://sagetest-YourUsername/scripts/phase#phase",
        "-jobconf", "mapred.map.tasks=1",
        "-jobconf", "mapred.reduce.tasks=1"
      ]
    }
  }
]
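The -cacheFile argument distributes the phase binary (uploaded separately to s3n://sagetest-YourUsername/scripts/phase) to each worker node under the local name phase, which is why the mapper can call ./phase. Also note that a malformed phase.json, such as a trailing comma in the Args list, is a common cause of the step failing immediately; one quick way to validate the file before submitting, assuming Python is available on your workstation:

# Pretty-prints the JSON on success, or reports the offending line on failure
python -m json.tool phase.json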
Start the MapReduce cluster
~>/work/platform/bin/elastic-mapreduce-cli/elastic-mapreduce --create --alive --credentials ~/$USER-credentials.json --num-instances=1 --master-instance-type=m1.large --name phaseTry1 --json phase.json
Created job flow j-GA47B7VD991Q
Check on the job status
If something is misconfigured, the job will fail within a minute or two, so check the job status and make sure the step is running.
~>/work/platform/bin/elastic-mapreduce-cli/elastic-mapreduce --credentials ~/$USER-credentials.json --jobflow j-GA47B7VD991Q --list
j-GA47B7VD991Q     RUNNING   ec2-174-129-134-200.compute-1.amazonaws.com   phaseTry1
   RUNNING   MapReduce Step 1: Run Phase
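If a step fails and the status listing alone is not informative enough, you can usually SSH to the master node shown above and read the Hadoop logs directly (a sketch; the key pair is whichever one your EMR setup uses, and exact log locations vary by EMR AMI version):

# Log in to the master node as the hadoop user (key pair name here is illustrative)
ssh -i ~/your-ec2-keypair.pem hadoop@ec2-174-129-134-200.compute-1.amazonaws.com
# On the master, step and task logs typically live under /mnt/var/log/hadoop
ls /mnt/var/log/hadoop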
If there were any errors, make corrections and resubmit the job step:
~>/work/platform/bin/elastic-mapreduce-cli/elastic-mapreduce --credentials ~/$USER-credentials.json --jobflow j-GA47B7VD991Q --json phase.json
Added jobflow steps
Get your results
Look in your S3 bucket for the results: the PHASE output files are uploaded by the mapper under the results/ prefix, and the streaming job's own output (the reducer's copy of the input lines) lands under output/phaseTryNumber1/.
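One way to check from the command line is to list a prefix with s3curl (bucket listing is a signed GET with a prefix query). Also remember to terminate the job flow when you are finished; it was started with --alive, so it keeps running, and billing, until explicitly shut down:

# List everything under the results/ prefix of your bucket
/work/platform/bin/s3curl.pl --id $USER -- "https://s3.amazonaws.com/sagetest-$USER?prefix=results/"

# Shut the cluster down once the results have been copied out
/work/platform/bin/elastic-mapreduce-cli/elastic-mapreduce --credentials ~/$USER-credentials.json --terminate --jobflow j-GA47B7VD991Q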