Computing squares in R

The following example uses the RHadoop packages to run a MapReduce job over the sequence 1:10, computing the square of each element. It's a tiny example taken from https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial. That tutorial also has more complicated examples, such as logistic regression and k-means.

Create the bootstrap scripts

Bootstrap the latest version of R

The following script will download and install the latest version of R on each of your Elastic MapReduce hosts. (The default version of R is very old.)

Name this script bootstrapLatestR.sh. It should contain the following code:
hello world

What is going on in this script?

Bootstrap RHadoop

The following script will download and install several packages needed for RHadoop.

Name this script bootstrapRHadoop.sh. It should contain the following code:
hello world

Upload your scripts to S3

You can use the AWS Console or s3curl to upload your files.

s3curl example:

~/RHadoopExample>/work/platform/bin/s3curl.pl --id $USER --put bootstrapLatestR.sh https://s3.amazonaws.com/sagebio-$USER/scripts/bootstrapLatestR.sh
~/RHadoopExample>/work/platform/bin/s3curl.pl --id $USER --put bootstrapRHadoop.sh https://s3.amazonaws.com/sagebio-$USER/scripts/bootstrapRHadoop.sh
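Since the two uploads differ only in the file name, they can be generated from a loop. The sketch below only prints the commands so they can be checked first. It assumes the same s3curl.pl path, --id value, and bucket layout as the example above; the function name print_upload_commands is made up for illustration.

```shell
# Hypothetical helper: print one s3curl upload command per bootstrap script.
# The s3curl.pl path and bucket URL are copied from the example above.
print_upload_commands() {
  bucket_url="https://s3.amazonaws.com/sagebio-$USER/scripts"
  for script in bootstrapLatestR.sh bootstrapRHadoop.sh; do
    echo "/work/platform/bin/s3curl.pl --id $USER --put $script $bucket_url/$script"
  done
}

print_upload_commands
```

Once the printed commands look right, they can be run by piping the output to sh (print_upload_commands | sh).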

How to run it on Elastic MapReduce

Start your Hadoop cluster

~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --create \
--master-instance-type=m1.small --slave-instance-type=m1.small \
--num-instances=1 --enable-debugging \
--bootstrap-action s3://sagebio-$USER/scripts/bootstrapLatestR.sh \
--bootstrap-action s3://sagebio-$USER/scripts/bootstrapRHadoop.sh \
--name rmrTry1 --alive

Created job flow j-79VXH9Z07ECL
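
The job flow ID printed here is needed again for the --ssh and --terminate commands later on, so it can be handy to keep it in a shell variable. A minimal sketch, assuming the output has the same "Created job flow <id>" form as shown above (the variable names are made up for illustration):

```shell
# Sample output line copied from above; in a real session this would be the
# captured output of the elastic-mapreduce --create command.
create_output="Created job flow j-79VXH9Z07ECL"

# Strip everything up to and including the last space, leaving only the ID.
JOBFLOW_ID="${create_output##* }"
echo "$JOBFLOW_ID"
```

The later commands could then be given --jobflow "$JOBFLOW_ID" instead of a pasted ID.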

SSH to the Hadoop master

~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --ssh --jobflow j-79VXH9Z07ECL

Set JAVA_HOME and start R

hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ export JAVA_HOME=/usr/lib/jvm/java-6-sun/jre
hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ R

R version 2.14.0 (2011-10-31)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: i486-pc-linux-gnu (32-bit)

Initialize RHadoop

> Sys.setenv(HADOOP_HOME="/home/hadoop", HADOOP_CONF="/home/hadoop/conf", JAVA_HOME="/usr/lib/jvm/java-6-sun/jre"); library(rmr); library(rhdfs);  hdfs.init();

Loading required package: RJSONIO
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
Loading required package: rJava
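
The same three variables can also be exported in the shell before starting R, since R inherits the environment of the shell that launches it and Sys.getenv() will see them. This is only a sketch that reuses the paths from the Sys.setenv() call above; it assumes your cluster uses the same locations.

```shell
# Paths copied from the Sys.setenv() call above; adjust if your cluster differs.
export HADOOP_HOME=/home/hadoop
export HADOOP_CONF=/home/hadoop/conf
export JAVA_HOME=/usr/lib/jvm/java-6-sun/jre

# Print the values for a quick check before starting R.
env | grep -E '^(HADOOP_HOME|HADOOP_CONF|JAVA_HOME)='
```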

Send your input to HDFS

> small.ints = to.dfs(1:10);

Run a Hadoop job

You can run one or more jobs in a session.

> out = mapreduce(input = small.ints, map = function(k,v) keyval(k, k^2))

packageJobJar: [/tmp/Rtmpbaa6dV/rhstr.map63284ca9, /tmp/Rtmpbaa6dV/rmrParentEnv, /tmp/Rtmpbaa6dV/rmrLocalEnv, /mnt/var/lib/hadoop/tmp/hadoop-unjar2859463891039338350/] [] /tmp/streamjob1543774456515588690.jar tmpDir=null
11/11/08 03:21:18 INFO mapred.JobClient: Default number of map tasks: 2
11/11/08 03:21:18 INFO mapred.JobClient: Default number of reduce tasks: 1
11/11/08 03:21:19 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
11/11/08 03:21:19 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2334756312e0012cac793f12f4151bdaa1b4b1bb]
11/11/08 03:21:19 INFO mapred.FileInputFormat: Total input paths to process : 1
11/11/08 03:21:20 INFO streaming.StreamJob: getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
11/11/08 03:21:20 INFO streaming.StreamJob: Running job: job_201111080311_0001
11/11/08 03:21:20 INFO streaming.StreamJob: To kill this job, run:
11/11/08 03:21:20 INFO streaming.StreamJob: /home/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=ip-10-114-89-121.ec2.internal:9001 -kill job_201111080311_0001
11/11/08 03:21:20 INFO streaming.StreamJob: Tracking URL: http://ip-10-114-89-121.ec2.internal:9100/jobdetails.jsp?jobid=job_201111080311_0001
11/11/08 03:21:21 INFO streaming.StreamJob:  map 0%  reduce 0%
11/11/08 03:21:35 INFO streaming.StreamJob:  map 50%  reduce 0%
11/11/08 03:21:38 INFO streaming.StreamJob:  map 100%  reduce 0%
11/11/08 03:21:50 INFO streaming.StreamJob:  map 100%  reduce 100%
11/11/08 03:21:53 INFO streaming.StreamJob: Job complete: job_201111080311_0001
11/11/08 03:21:53 INFO streaming.StreamJob: Output: /tmp/Rtmpbaa6dV/file6caa3721
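
The log shows the job running through Hadoop streaming, and the mapreduce() call supplies only a map function. As a rough local analogy (this is not what rmr actually executes), the same key-to-square mapping can be imitated with a shell pipeline that reads one integer per line and emits a tab-separated key and value. This can help when checking the expected result without a cluster. The function name square_mapper is made up for illustration.

```shell
# Local stand-in for the map step: for each input key k, emit "k<TAB>k^2".
square_mapper() {
  awk '{ printf "%d\t%d\n", $1, $1 * $1 }'
}

# Feed the same keys as to.dfs(1:10), one per line.
seq 1 10 | square_mapper
```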

Get your output from HDFS

> from.dfs(out)

[[1]]
[[1]]$key
[1] 1

[[1]]$val
[1] 1

attr(,"keyval")
[1] TRUE

[[2]]
[[2]]$key
[1] 2

[[2]]$val
[1] 4

attr(,"keyval")
[1] TRUE

[[3]]
[[3]]$key
[1] 3

[[3]]$val
[1] 9

attr(,"keyval")
[1] TRUE

[[4]]
[[4]]$key
[1] 4

[[4]]$val
[1] 16

attr(,"keyval")
[1] TRUE

[[5]]
[[5]]$key
[1] 5

[[5]]$val
[1] 25

attr(,"keyval")
[1] TRUE

[[6]]
[[6]]$key
[1] 6

[[6]]$val
[1] 36

attr(,"keyval")
[1] TRUE

[[7]]
[[7]]$key
[1] 7

[[7]]$val
[1] 49

attr(,"keyval")
[1] TRUE

[[8]]
[[8]]$key
[1] 8

[[8]]$val
[1] 64

attr(,"keyval")
[1] TRUE

[[9]]
[[9]]$key
[1] 9

[[9]]$val
[1] 81

attr(,"keyval")
[1] TRUE

[[10]]
[[10]]$key
[1] 10

[[10]]$val
[1] 100

attr(,"keyval")
[1] TRUE

Stop your Hadoop cluster

Quit R, exit the SSH session, and stop the cluster:

> q()
Save workspace image? [y/n/c]: n
hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ exit
logout
Connection to ec2-107-20-108-57.compute-1.amazonaws.com closed.
~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --terminate --jobflow j-79VXH9Z07ECL
Terminated job flow j-79VXH9Z07ECL

What next?

Take a look at the Elastic MapReduce FAQ for how to SCP files to the Hadoop master host.
Take a look at the other Computation Examples.

Getting Started with RHadoop on Elastic Map Reduce