
Word Count In R

The following example in R performs MapReduce on a large input corpus and counts the number of times each word occurs in the input.

Create the bootstrap script

The following script downloads and installs the latest version of R on each of your Elastic MapReduce hosts. (The default version of R is very old.) Save it as bootstrapLatestR.sh, the name used in the upload and job commands below.

...

Code Block

#!/bin/bash

###########################################
# How to get the latest version of R onto Elastic MapReduce
# http://www.r-bloggers.com/bootstrapping-the-latest-r-into-amazon-elastic-map-reduce/
###########################################

# Perform the debian R upgrade
echo "deb http://cran.fhcrc.org/bin/linux/debian lenny-cran/" | sudo tee -a /etc/apt/sources.list
sudo apt-get update
sudo apt-get -t lenny-cran install --yes --force-yes r-base r-base-dev

Create the mapper script

The following script reads its input line by line from STDIN and writes each word it finds to STDOUT, followed by a tab and a count of 1. Save it as mapper.R.

...

Code Block

#!/usr/bin/env Rscript

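# Strip leading and trailing whitespace (spaces and tabs) so no empty word is emitted
trimWhiteSpace <- function(line) gsub("^[[:space:]]+|[[:space:]]+$", "", line)
# Split a line into words on runs of whitespace
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")
# Read one line at a time until the end of the input
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    ## can also be done as cat(paste(words, "\t1\n", sep=""), sep="")
    # Emit "<word><TAB>1" for every word on the line
    for (w in words)
        cat(w, "\t1\n", sep="")
}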

close(con)
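
Hadoop streaming treats everything before the first tab on a line as the key, so it can be worth confirming that the mapper emits well-formed lines before running a job. The following is a minimal sketch of such a check. The function name check_mapper_output is hypothetical, and the check assumes the mapper writes exactly one word, a tab, and the count 1 per line, as the script above does.

```shell
#!/bin/bash

# check_mapper_output is a hypothetical helper, not part of the original scripts.
# It reads mapper output on STDIN and returns 0 only if every line has the
# form "<word><TAB>1" with a non-empty word.
check_mapper_output() {
    awk -F'\t' '
        NF != 2 || $1 == "" || $2 != "1" { bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}

# Example with hand-written lines in the mapper's output format.
printf 'Jack\t1\nand\t1\nJill\t1\n' | check_mapper_output && echo "mapper output looks OK"
```

Once mapper.R is executable, the same check can be applied to the real script with cat AnInputFile.txt | ./mapper.R | check_mapper_output.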

Create the reducer script

The following script reads the tab-separated word/count pairs from STDIN, sums the counts for each word, and writes the totals to STDOUT. Save it as reducer.R.

...

Code Block

#!/usr/bin/env Rscript

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
    
env <- new.env(hash = TRUE)

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    fields <- splitLine(line)  # named "fields" to avoid masking base::split
    word <- fields$word
    count <- fields$count
    # Add this count to the running total for the word
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    } else {
        assign(word, count, envir = env)
    }
}
close(con)

# Print "<word><TAB><total>" for every word seen
for (w in ls(env, all.names = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

Create a small input file for testing

Create a file named AnInputFile.txt containing the following text:

Code Block

Jack and Jill went up the hill
To fetch a pail of water.
Jack fell down and broke his crown,
And Jill came tumbling after.
Up Jack got, and home did trot,
As fast as he could caper,
To old Dame Dob, who patched his nob
With vinegar and brown paper.

Sanity check -> run it locally

The command line to run it, followed by its (abbreviated) output:

Code Block

~>cat AnInputFile.txt | ./mapper.R | sort | ./reducer.R
a	1
after.	1
and	4
And	1
as	1
As...
who	1
With
broke	1
brown	1
...
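
As an independent cross-check, the same counts can be produced with standard Unix tools and compared against the R pipeline. This is only a sketch: the function name count_words is hypothetical, and it assumes, like mapper.R, that words are separated by whitespace and that case and punctuation are kept.

```shell
#!/bin/bash

# count_words is a hypothetical helper, not part of the original scripts.
# It reads text on STDIN and prints "<word><TAB><count>" lines, splitting on
# whitespace only, so "And" and "and" (or "after." and "after") stay distinct.
count_words() {
    tr -s '[:space:]' '\n' |      # one word per line
        awk 'NF' |                # drop any empty line
        LC_ALL=C sort |           # group identical words together
        uniq -c |                 # prefix each word with its count
        awk '{ print $2 "\t" $1 }'
}

# Example on the first and third lines of the test input.
printf 'Jack and Jill went up the hill\nJack fell down and broke his crown,\n' | count_words
```

To compare with the R scripts, write both results to files sorted the same way, for example by piping the reducer output through LC_ALL=C sort, and run diff on the two files. If the two approaches agree, diff should report no differences.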

Upload your scripts and input file to S3

You can use the AWS Console or s3curl to upload your files.

s3curl example:

Code Block


~/WordCount>/work/platform/bin/s3curl.pl --id $USER --put mapper.R https://s3.amazonaws.com/sagebio-$USER/scripts/mapper.R
~/WordCount>/work/platform/bin/s3curl.pl --id $USER --put reducer.R https://s3.amazonaws.com/sagebio-$USER/scripts/reducer.R
~/WordCount>/work/platform/bin/s3curl.pl --id $USER --put bootstrapLatestR.sh https://s3.amazonaws.com/sagebio-$USER/scripts/bootstrapLatestR.sh
~/WordCount>/work/platform/bin/s3curl.pl --id $USER --put AnInputFile.txt https://s3.amazonaws.com/sagebio-$USER/input/AnInputFile.txt
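
The four uploads differ only in the file name and the folder it goes to, so they can be generated in a loop. The sketch below only prints the s3curl commands so they can be reviewed before being run. The function name print_uploads is hypothetical; the s3curl path, the --id and --put options, and the sagebio-$USER bucket naming are taken from the example above.

```shell
#!/bin/bash

# print_uploads is a hypothetical helper; it prints, but does not run, one
# s3curl command per file. Usage: print_uploads <user>
print_uploads() {
    user=$1
    s3curl=/work/platform/bin/s3curl.pl
    base="https://s3.amazonaws.com/sagebio-$user"

    # The three scripts go under scripts/, the test input under input/.
    for f in mapper.R reducer.R bootstrapLatestR.sh; do
        echo "$s3curl --id $user --put $f $base/scripts/$f"
    done
    echo "$s3curl --id $user --put AnInputFile.txt $base/input/AnInputFile.txt"
}

# "someuser" is a placeholder; pass "$USER" to match the example above.
print_uploads someuser
```

After checking the printed lines, they can be executed by piping them to a shell, for example print_uploads "$USER" | sh, run from the directory that holds the four files.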


How to run it on Elastic MapReduce

Replace YOUR_BUCKET in the command below with the name of your own bucket, for example sagebio-$USER as in the s3curl example above.

Code Block

./elastic-mapreduce --create --stream \
--bootstrap-action s3://YOUR_BUCKET/scripts/bootstrapLatestR.sh \
--mapper s3://YOUR_BUCKET/scripts/mapper.R \
--reducer s3://YOUR_BUCKET/scripts/reducer.R \
--input s3://YOUR_BUCKET/input/AnInputFile.txt \
--output s3://YOUR_BUCKET/output/try1 \
--name try1
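
Hadoop normally refuses to start a job whose output location already exists, so each rerun needs a new --output path. The sketch below prints the command above for a given bucket and run name, so that try2, try3, and so on can be generated without retyping it. The function name emr_command is hypothetical; every option it prints comes from the command above.

```shell
#!/bin/bash

# emr_command is a hypothetical helper; it prints, but does not run, the
# elastic-mapreduce command. Usage: emr_command <bucket> <run name>
emr_command() {
    bucket=$1
    run=$2
    # Every line except the last ends with a backslash continuation.
    printf '%s \\\n' \
        "./elastic-mapreduce --create --stream" \
        "--bootstrap-action s3://$bucket/scripts/bootstrapLatestR.sh" \
        "--mapper s3://$bucket/scripts/mapper.R" \
        "--reducer s3://$bucket/scripts/reducer.R" \
        "--input s3://$bucket/input/AnInputFile.txt" \
        "--output s3://$bucket/output/$run"
    printf '%s\n' "--name $run"
}

# Print the command for a second run; YOUR_BUCKET is still a placeholder.
emr_command YOUR_BUCKET try2
```

The run name is used for both --output and --name, which keeps the job name and its output folder in step, as in the try1 example.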

Version	Old Version 1	New Version 2
Changes made by	Nicole Deflaux	Nicole Deflaux
Saved on	Jun 30, 2011	Jun 30, 2011
