...

What is going on in this script?

...

Code Block
#!/usr/bin/env Rscript
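## Hadoop Streaming mapper: read lines of text from stdin and emit "<word><tab>1" for every word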

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
   

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    ## can also be done as cat(paste(words, "\t1\n", sep=""), sep="")
    for (w in words)
        cat(w, "\t1\n", sep="")
}

close(con)
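
Before running anything on Elastic MapReduce, you can sanity-check the mapper by itself at the shell. A quick local test might look like the following (this assumes mapper.R is in the current directory and has been made executable, e.g. with chmod +x mapper.R):

Code Block
~/WordCount>echo "  Knock knock who is there  " | ./mapper.R
Knock   1
knock   1
who     1
is      1
there   1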

...

Code Block
#!/usr/bin/env Rscript
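## Hadoop Streaming reducer: sum the counts for each word and write "<word><tab><total>" to stdout
## (an R environment created with new.env is used as a hash map from word to running count)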

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
   

env <- new.env(hash = TRUE)

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)

for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

...

Code Block
~>cat AnInputFile.txt | ./mapper.R | sort | ./reducer.R 
a       1
after.  1
and     4
And     1
as      1
...
who     1
With    1

...

  1. Start your MapReduce cluster. When you are trying out new jobs for the first time, specifying --alive will keep your hosts running while you work through any bugs. In general, though, you do not want to run jobs with --alive, because you will then need to remember to shut the hosts down explicitly when the job is done (a termination example appears after this list).
    Code Block
    ~/WordCount>/work/platform/bin/elastic-mapreduce-cli/elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --create --master-instance-type=m1.small \
    --slave-instance-type=m1.small --num-instances=3 --enable-debugging --bootstrap-action s3://sagebio-$USER/scripts/bootstrapLatestR.sh --name RWordCount --alive
    
    Created job flow j-1H8GKG5L6WAB4
    
    ~/WordCount>/work/platform/bin/elastic-mapreduce-cli/elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --list
    j-1H8GKG5L6WAB4     STARTING                                                         RWordCount
       PENDING        Setup Hadoop Debugging   
    
  2. Look around on the AWS Console:
    • See your new job listed in the Elastic MapReduce tab
    • See the individual hosts listed in the EC2 tab
  3. Create your job step file
    Code Block
    ~/WordCount>cat wordCount.json
    [
      {
        "Name": "R Word Count MapReduce Step 1: small input file",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
          "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
          "Args": [
            "-input", "s3n://sagebio-ndeflaux/input/AnInputFile.txt",
            "-output", "s3n://sagebio-ndeflaux/output/wordCountTry1",
            "-mapper", "s3n://sagebio-ndeflaux/scripts/mapper.R",
            "-reducer", "s3n://sagebio-ndeflaux/scripts/reducer.R"
          ]
        }
      },
      {
        "Name": "R Word Count MapReduce Step 2: lots of input",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
          "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
          "Args": [
            "-input", "s3://elasticmapreduce/samples/wordcount/input",
            "-output", "s3n://sagebio-ndeflaux/output/wordCountTry2",
            "-mapper", "s3n://sagebio-ndeflaux/scripts/mapper.R",
            "-reducer", "s3n://sagebio-ndeflaux/scripts/reducer.R"
          ]
        }
      }
    ]
    
  4. Add the steps to your jobflow
    Code Block
    ~/WordCount>/work/platform/bin/elastic-mapreduce-cli/elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --json wordCount.json --jobflow j-1H8GKG5L6WAB4
    Added jobflow steps
    
  5. Check progress by "Debugging" your job flow (you can also check the job flow's state from the command line, as shown after this list)
  6. When your jobs are done, look for your output in your S3 bucket
  7. Bonus points: there is a bug in the reducer script. Can you look at the debugging output for the job and determine what to fix in the script so that the second job runs to completion?
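
If you prefer the command line to the AWS Console, the same elastic-mapreduce client can report the state of your job flows. The --list flag is the same one used above when the cluster was created, so the output format matches what you saw there:

Code Block
~/WordCount>/work/platform/bin/elastic-mapreduce-cli/elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --list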

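Once the steps have completed and you started the cluster with --alive, remember to shut the hosts down yourself so you are not billed for an idle cluster. A minimal sketch with the same client is shown below; --terminate and --jobflow are flags of the classic elastic-mapreduce client, so confirm them against --help on your installation:

Code Block
~/WordCount>/work/platform/bin/elastic-mapreduce-cli/elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --terminate --jobflow j-1H8GKG5L6WAB4
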
...