Note that Brian upgraded the Elastic MapReduce client on the shared servers, but the wiki has not been updated yet; you need to "load" the module, and from then on it is in your PATH.

...

has moved. See the updated instructions on Computation Examples to get it into your PATH.

Word Count in R

The following example in R performs MapReduce on a large input corpus and counts the number of times each word occurs in the input.
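Conceptually this is the classic streaming word count: the mapper emits one word per line, the shuffle phase groups identical words, and the reducer counts each group. The same data flow can be sketched locally with a shell pipeline (the sample text below is made up, not part of the real corpus):

```shell
# Local stand-in for the job's map/shuffle/reduce stages:
#   tr   -> map: emit one word per line
#   sort -> shuffle: group identical words together
#   uniq -> reduce: count each group
printf 'the quick fox\nthe lazy dog\n' | tr -s ' ' '\n' | sort | uniq -c
```

Here "the" occurs twice, so its count is 2; every other word counts 1. The cluster performs the same three stages, just distributed across hosts.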

...

  1. Start your MapReduce cluster. When you are trying out new jobs for the first time, specifying --alive will keep your hosts alive while you work through any bugs. In general, though, you do not want to run jobs with --alive, because you will need to remember to shut the hosts down explicitly (elastic-mapreduce --terminate --jobflow $YOUR_JOB_ID) when the job is done.
    Code Block
    ~/WordCount>/work/platform/bin/elastic-mapreduce-cli/elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --create --master-instance-type=m1.small \
    --slave-instance-type=m1.small --num-instances=3 --enable-debugging --bootstrap-action s3://sagebio-$USER/scripts/bootstrapLatestR.sh --name RWordCount --alive
    
    Created job flow j-1H8GKG5L6WAB4
    
    ~/WordCount>/work/platform/bin/elastic-mapreduce-cli/elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --list
    j-1H8GKG5L6WAB4     STARTING                                                         RWordCount
       PENDING        Setup Hadoop Debugging
    
  2. Note that j-1H8GKG5L6WAB4 is $YOUR_JOB_ID. You can set your YOUR_JOB_ID variable with the following command (but use the value output by the command above):
    Code Block
      export YOUR_JOB_ID=j-1H8GKG5L6WAB4
  3. Look around on the AWS Console:
  4. See your new job listed in the Elastic MapReduce tab
  5. See the individual hosts listed in the EC2 tab
  6. Create your job step file
    Code Block
    ~/WordCount>cat wordCount.json
    [
      {
        "Name": "R Word Count MapReduce Step 1: small input file",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
          "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
          "Args": [
            "-input", "s3n://sagebio-ndeflaux/input/AnInputFile.txt",
            "-output", "s3n://sagebio-ndeflaux/output/wordCountTry1",
            "-mapper", "s3n://sagebio-ndeflaux/scripts/mapper.R",
            "-reducer", "s3n://sagebio-ndeflaux/scripts/reducer.R"
          ]
        }
      },
      {
        "Name": "R Word Count MapReduce Step 2: lots of input",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
          "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
          "Args": [
            "-input", "s3://elasticmapreduce/samples/wordcount/input",
            "-output", "s3n://sagebio-ndeflaux/output/wordCountTry2",
            "-mapper", "s3n://sagebio-ndeflaux/scripts/mapper.R",
            "-reducer", "s3n://sagebio-ndeflaux/scripts/reducer.R"
          ]
        }
      }
    ]
    
  7. Add the steps to your jobflow
    Code Block
    ~/WordCount>/work/platform/bin/elastic-mapreduce-cli/elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --json wordCount.json --jobflow $YOUR_JOB_ID
    Added jobflow steps
    
  8. Check progress by "Debugging" your job flow in the AWS Console
  9. When your jobs are done, look for your output in your S3 bucket
  10. Bonus points: there is a bug in the reducer script. Can you look at the debugging output for the job and determine what to fix in the script so that the second job runs to completion?
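Rather than copying the job flow ID by hand in step 2, it can be parsed out of the --list output. A minimal sketch, using the sample listing from step 1 as canned input (in practice you would pipe the live elastic-mapreduce --list output instead):

```shell
# Sample line captured from `elastic-mapreduce --list` in step 1 above.
list_output='j-1H8GKG5L6WAB4     STARTING                                                         RWordCount'

# The job flow ID is the first whitespace-separated field on the line
# that names our job flow (RWordCount).
YOUR_JOB_ID=$(printf '%s\n' "$list_output" | awk '/RWordCount/ {print $1}')
echo "$YOUR_JOB_ID"   # prints j-1H8GKG5L6WAB4
```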
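A malformed step file only fails after you have submitted it to the cluster, so it is worth validating the JSON locally before step 7. One quick check, assuming a Python 3 interpreter is available on the shared servers:

```shell
# python3 -m json.tool exits non-zero (and prints the parse error)
# when the file is not valid JSON -- e.g. a stray trailing comma
# inside an Args array.
python3 -m json.tool wordCount.json > /dev/null && echo "wordCount.json is valid JSON"
```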

...