Word Count In R
The following example uses R with Hadoop streaming to perform MapReduce over an input corpus, counting the number of times each word occurs in the input.
Create the bootstrap script
The following script will download and install the latest version of R on each of your Elastic MapReduce hosts. (The default version of R is very old.)
Name this script bootstrapLatestR.sh; it should contain the following code:
#!/bin/bash
###########################################
# How to get the latest version of R onto Elastic MapReduce
# http://www.r-bloggers.com/bootstrapping-the-latest-r-into-amazon-elastic-map-reduce/
###########################################

# Perform the Debian R upgrade
echo "deb http://cran.fhcrc.org/bin/linux/debian lenny-cran/" | sudo tee -a /etc/apt/sources.list
sudo apt-get update
sudo apt-get -t lenny-cran install --yes --force-yes r-base r-base-dev
Create the mapper script
The following script reads the input line by line from STDIN and outputs each word it finds with a count of 1.
Name this script mapper.R; it should contain the following code:
#!/usr/bin/env Rscript

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    ## can also be done as:
    ## cat(paste(words, "\t1\n", sep=""), sep="")
    for (w in words)
        cat(w, "\t1\n", sep="")
}
close(con)
Create the reducer script
The following script will aggregate the count for each word found and output the final results.
Name this script reducer.R; it should contain the following code:
#!/usr/bin/env Rscript

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}

env <- new.env(hash = TRUE)

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    } else {
        assign(word, count, envir = env)
    }
}
close(con)

for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")
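The reducer's tab-separated aggregation can be sanity-checked against a one-line awk equivalent. This is purely an illustrative cross-check of the same logic, not part of the EMR job:

```shell
# Sum the counts per word (field 1 = word, field 2 = count), then sort
# with a fixed byte-wise collation so the output order is reproducible.
printf 'and\t1\nJack\t1\nand\t1\n' \
  | awk -F'\t' '{ counts[$1] += $2 } END { for (w in counts) print w "\t" counts[w] }' \
  | LC_ALL=C sort
```

Like the R reducer's environment, awk's `counts` array is a hash map keyed by word, so duplicate keys accumulate regardless of input order.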
Create a small input file for testing
Name this file AnInputFile.txt; it should contain the following text:
Jack and Jill went up the hill
To fetch a pail of water.
Jack fell down and broke his crown,
And Jill came tumbling after.

Up Jack got, and home did trot,
As fast as he could caper,
To old Dame Dob, who patched his nob
With vinegar and brown paper.
Sanity check -> run it locally
The command line to run it, and its output:
~>cat AnInputFile.txt | ./mapper.R | sort | ./reducer.R
a	1
after.	1
and	4
And	1
as	1
...
who	1
With	1
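The `sort` between the two stages stands in for Hadoop's shuffle-and-sort phase, which groups all of a key's values together before they reach a reducer. (This particular reducer accumulates counts in a hash, so unsorted input would also work locally, but sorting mirrors what happens on the cluster.) A minimal illustration of the grouping, using LC_ALL=C for a reproducible byte-wise order:

```shell
# Duplicate keys emitted by the mapper end up adjacent after sorting,
# so a reducer can process one key's records as a contiguous run.
printf 'and\t1\nJack\t1\nand\t1\n' | LC_ALL=C sort
```

Note that collation is locale-dependent: the sample output above interleaves "and" and "And" because the default locale sorts case-insensitively, whereas LC_ALL=C sorts all uppercase letters before lowercase.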
Upload your scripts and input file to S3
You can use the AWS Console or s3curl to upload your files.
s3curl example:
~/WordCount>/work/platform/bin/s3curl.pl --id $USER --put mapper.R https://s3.amazonaws.com/sagebio-$USER/scripts/mapper.R
~/WordCount>/work/platform/bin/s3curl.pl --id $USER --put reducer.R https://s3.amazonaws.com/sagebio-$USER/scripts/reducer.R
~/WordCount>/work/platform/bin/s3curl.pl --id $USER --put bootstrapLatestR.sh https://s3.amazonaws.com/sagebio-$USER/scripts/bootstrapLatestR.sh
~/WordCount>/work/platform/bin/s3curl.pl --id $USER --put AnInputFile.txt https://s3.amazonaws.com/sagebio-$USER/input/AnInputFile.txt
How to run it on Elastic MapReduce
./elastic-mapreduce --create --stream \
  --bootstrap-action s3://YOUR_BUCKET/scripts/bootstrapLatestR.sh \
  --mapper s3://YOUR_BUCKET/scripts/mapper.R \
  --reducer s3://YOUR_BUCKET/scripts/reducer.R \
  --input s3://YOUR_BUCKET/input/AnInputFile.txt \
  --output s3://YOUR_BUCKET/output/try1 \
  --name try1
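When the job completes, the results land in s3://YOUR_BUCKET/output/try1 as one part-NNNNN file per reducer rather than a single file. After downloading those files (for example with s3curl, as above), they can be merged into one sorted word-count list locally. A sketch, with illustrative file names and counts standing in for real job output:

```shell
# Simulate two downloaded reducer output files (contents are illustrative).
printf 'and\t4\nJack\t2\n' > part-00000
printf 'Jill\t2\nwater.\t1\n' > part-00001

# Merge them into a single sorted word-count list.
LC_ALL=C sort part-00000 part-00001 > wordcounts.txt
cat wordcounts.txt
```

Because each word is handled by exactly one reducer, no key appears in more than one part file, and a plain sorted concatenation is a complete result.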