Word Count In R

The following example in R performs MapReduce on a large input corpus and counts the number of times each word occurs in the input.

Create the bootstrap script

The following script will download and install the latest version of R on each of your Elastic MapReduce hosts. (The default version of R is very old.)

Name this script bootstrapLatestR.sh. It should contain the following code:

Script source: http://sagebionetworks.jira.com/source/browse/~raw,r=HEAD/PLFM/users/deflaux/scripts/EMR/rWordCountExample/bootstrapLatestR.sh

...

What is going on in this script?

Create the mapper script

The following script reads its input line by line from STDIN and outputs each word it finds with a count of 1.

...

Script source: http://sagebionetworks.jira.com/source/browse/~raw,r=HEAD/PLFM/users/deflaux/scripts/EMR/rWordCountExample/mapper.R
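
The mapper source is embedded by reference rather than shown inline. To make the output format concrete, here is a minimal shell sketch of a word-count mapper. It is not the mapper.R from the repository: the function name map_words is hypothetical, and splitting on whitespace is an assumption based on the description above.

```shell
# Hypothetical stand-in for a word-count mapper; NOT the repository's mapper.R.
# Reads lines on STDIN and prints "word<TAB>1" for every whitespace-delimited token.
map_words() {
    awk '{ for (i = 1; i <= NF; i++) printf "%s\t1\n", $i }'
}

# Example: two input lines become five "word<TAB>1" records.
printf 'Jack and Jill\nwent up\n' | map_words
```

Each token is emitted as-is, so "And" and "and" remain distinct keys and trailing punctuation stays attached to the word.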

Create the reducer script

The following script will aggregate the count for each word found and output the final results.

...

Code Block

#!/usr/bin/env Rscript

# Strip leading and trailing spaces from a line.
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

# Split a "word<TAB>count" line into its word and its integer count.
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}

# Environment used as a hash table mapping each word to its running total.
env <- new.env(hash = TRUE)

# Read the mapper output from STDIN one line at a time and accumulate counts.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)

# Emit one "word<TAB>total" line per distinct word.
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

Create a small input file for testing

Name this file AnInputFile.txt. It should contain the following text:

Code Block

Jack and Jill went up the hill
To fetch a pail of water.
Jack fell down and broke his crown,
And Jill came tumbling after.
Up Jack got, and home did trot,
As fast as he could caper,
To old Dame Dob, who patched his nob
With vinegar and brown paper.

Sanity check -> run it locally

First make your R scripts executable:

...

Code Block

~>cat AnInputFile.txt | ./mapper.R | sort | ./reducer.R
a	1
after.	1
and	4
And	1
as	1
...
who	1
With	1
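
In the local pipeline, sort stands in for the shuffle-and-sort phase that Hadoop streaming runs between the mapper and the reducer. If you want an independent check of the counts, something like the following sketch, which uses only standard Unix tools, should produce the same word totals. The function name count_words is hypothetical. The sketch assumes words are separated by whitespace and counted case-sensitively with punctuation attached, as in the output above.

```shell
# Hypothetical cross-check: count whitespace-delimited tokens with standard tools.
# Prints "word<TAB>count" lines to mirror the reducer's output format.
count_words() {
    tr -s '[:space:]' '\n' |   # one token per line
        sed '/^$/d' |          # drop empty lines left by leading whitespace
        LC_ALL=C sort |        # group identical tokens (byte order, deterministic)
        uniq -c |              # prefix each distinct token with its count
        awk '{ printf "%s\t%s\n", $2, $1 }'
}

# Example on two lines of the test input.
printf 'Jack and Jill went up the hill\nJack fell down and broke his crown,\n' | count_words
```

Because the sort uses LC_ALL=C, the order of the lines may differ from what your locale's sort produces. Only the per-word totals are meant to be compared.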

Upload your scripts and input file to S3

You can use the AWS Console or s3curl to upload your files.

...

Code Block

~/WordCount>/work/platform/bin/s3curl.pl --id $USER --put mapper.R https://s3.amazonaws.com/sagebio-$USER/scripts/mapper.R
~/WordCount>/work/platform/bin/s3curl.pl --id $USER --put reducer.R https://s3.amazonaws.com/sagebio-$USER/scripts/reducer.R
~/WordCount>/work/platform/bin/s3curl.pl --id $USER --put bootstrapLatestR.sh https://s3.amazonaws.com/sagebio-$USER/scripts/bootstrapLatestR.sh
~/WordCount>/work/platform/bin/s3curl.pl --id $USER --put AnInputFile.txt https://s3.amazonaws.com/sagebio-$USER/input/AnInputFile.txt

How to run it on Elastic MapReduce

Start your MapReduce cluster. When you are trying out new jobs for the first time, specifying --alive keeps your hosts alive while you work through any bugs. In general, though, you do not want to run jobs with --alive, because you must then remember to shut the hosts down explicitly when the job is done.

Code Block

~/WordCount>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --create --master-instance-type=m1.small \
--slave-instance-type=m1.small --num-instances=3 --enable-debugging --bootstrap-action s3://sagebio-$USER/scripts/bootstrapLatestR.sh --name RWordCount --alive

Created job flow j-1H8GKG5L6WAB4

~/WordCount>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --list
j-1H8GKG5L6WAB4     STARTING                                                         RWordCount
   PENDING        Setup Hadoop Debugging

Note that j-1H8GKG5L6WAB4 is $YOUR_JOB_ID in the commands that follow. Set the variable with the command below, substituting the job flow ID printed by your own --create command:

Code Block

export YOUR_JOB_ID=j-1H8GKG5L6WAB4

Look around on the AWS Console:
See your new job listed in the Elastic MapReduce tab
See the individual hosts listed in the EC2 tab
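
If you would rather not copy the job flow ID by hand, a small shell helper can pull it out of the "Created job flow" line. This is a sketch only: extract_job_id is a hypothetical name, and it assumes the command prints the line in exactly the format shown above.

```shell
# Hypothetical helper: extract the job flow ID from a "Created job flow <id>" line.
# Prints nothing if no such line is present on STDIN.
extract_job_id() {
    sed -n 's/^Created job flow \(j-[A-Z0-9][A-Z0-9]*\).*$/\1/p'
}

# Example using the line from the session above.
YOUR_JOB_ID=$(echo 'Created job flow j-1H8GKG5L6WAB4' | extract_job_id)
export YOUR_JOB_ID
echo "$YOUR_JOB_ID"
```

In practice you would pipe the output of your own --create command into extract_job_id in place of the echo.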

Create your job step file

Code Block

~/WordCount>cat wordCount.json
[
  {
    "Name": "R Word Count MapReduce Step 1: small input file",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
      "Args": [
        "-input", "s3n://sagebio-ndeflaux/input/AnInputFile.txt",
        "-output", "s3n://sagebio-ndeflaux/output/wordCountTry1",
        "-mapper", "s3n://sagebio-ndeflaux/scripts/mapper.R",
        "-reducer", "s3n://sagebio-ndeflaux/scripts/reducer.R"
      ]
    }
  },
  {
    "Name": "R Word Count MapReduce Step 2: lots of input",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
      "Args": [
        "-input", "s3://elasticmapreduce/samples/wordcount/input",
        "-output", "s3n://sagebio-ndeflaux/output/wordCountTry2",
        "-mapper", "s3n://sagebio-ndeflaux/scripts/mapper.R",
        "-reducer", "s3n://sagebio-ndeflaux/scripts/reducer.R"
      ]
    }
  }
]
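
The step file above hard-codes the sagebio-ndeflaux bucket, whereas the upload commands use sagebio-$USER. One way to avoid editing the bucket name by hand is to generate the file from a template. The following is a sketch only: make_step_file is a hypothetical helper, it emits just the first step, and it assumes your bucket follows the sagebio-<user> naming used in the upload commands.

```shell
# Hypothetical helper: print a one-step job file for the bucket sagebio-<user>.
# $1 is the user-name suffix of the bucket; the bucket naming is an assumption.
make_step_file() {
    bucket="sagebio-$1"
    cat <<EOF
[
  {
    "Name": "R Word Count MapReduce Step 1: small input file",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
      "Args": [
        "-input", "s3n://$bucket/input/AnInputFile.txt",
        "-output", "s3n://$bucket/output/wordCountTry1",
        "-mapper", "s3n://$bucket/scripts/mapper.R",
        "-reducer", "s3n://$bucket/scripts/reducer.R"
      ]
    }
  }
]
EOF
}

# Example: print the step file for the current user (falls back to a placeholder).
make_step_file "${USER:-someuser}"
```

Redirect the output to wordCount.json to pass it to the --json option.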

Add the steps to your jobflow

Code Block

~/WordCount>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --json wordCount.json --jobflow $YOUR_JOB_ID
Added jobflow steps

Check progress by "Debugging" your job flow
When your jobs are done, look for your output in your S3 bucket
Bonus points: there is a bug in the reducer script. Can you look at the debugging output for the job and determine what to fix in the script so that the second job runs to completion?

What next?

Take a look at the Elastic MapReduce FAQ
Take a look at the other Computation Examples
