Computing squares in R
The following example uses the RHadoop packages to run a MapReduce job over the sequence 1:10, computing the square of each value. It's a small example taken from https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial; that tutorial also includes more involved examples such as logistic regression and K-means.
Create the bootstrap scripts
Bootstrap the latest version of R
The following script will download and install the latest version of R on each of your Elastic MapReduce hosts. (The default version of R is very old.)
Download the script bootstrapLatestR.sh.
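The original script is not reproduced here, but based on the posts linked below, a bootstrap action of this kind would add the CRAN Debian repository and install a newer R from it. The following is only a sketch: the mirror URL, suite name (`lenny-cran`, matching the Debian Lenny AMIs EMR used at the time), and signing-key ID are assumptions to verify against the linked posts.

```shell
#!/bin/bash
# Hypothetical sketch of bootstrapLatestR.sh -- not the original script.
# Adds the CRAN Debian repository so apt can see a newer R than the stock
# Lenny package, then installs it. Mirror, suite, and key ID are assumptions.
set -e

# Point apt at the CRAN Debian archive (suite name is an assumption).
echo "deb http://cran.r-project.org/bin/linux/debian lenny-cran/" \
  | sudo tee -a /etc/apt/sources.list

# Import the archive signing key (key ID is an assumption; check the CRAN
# Debian instructions for the current one).
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 381BA480 || true

sudo apt-get update
sudo apt-get -y --force-yes install r-base r-base-dev
```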
What is going on in this script?
http://www.r-bloggers.com/bootstrapping-the-latest-r-into-amazon-elastic-map-reduce/
http://stackoverflow.com/questions/4473123/script-to-load-latest-r-onto-a-fresh-debian-machine
Bootstrap RHadoop
The following script will download and install several packages needed for RHadoop.
Download the script bootstrapRHadoop.sh.
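Again, the original script is not reproduced here. A sketch of what it would need to do: install the CRAN dependencies that the R session below reports loading (RJSONIO, itertools, digest, rJava), then install the rmr and rhdfs packages themselves from the RHadoop project. The download URLs and tarball versions below are placeholders; get the current ones from https://github.com/RevolutionAnalytics/RHadoop.

```shell
#!/bin/bash
# Hypothetical sketch of bootstrapRHadoop.sh -- not the original script.
# Dependency list comes from the "Loading required package" lines in the
# R session later in this page; tarball URLs/versions are placeholders.
set -e
export JAVA_HOME=/usr/lib/jvm/java-6-sun/jre

# Install CRAN dependencies (iterators is pulled in automatically by itertools).
sudo R --no-save <<'EOF'
install.packages(c("RJSONIO", "itertools", "digest", "rJava"),
                 repos = "http://cran.r-project.org")
EOF

# Install rmr and rhdfs (URLs and version numbers are assumptions).
wget https://github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.0.tar.gz
wget https://github.com/downloads/RevolutionAnalytics/RHadoop/rhdfs_1.0.tar.gz
sudo R CMD INSTALL rmr_1.0.tar.gz
sudo R CMD INSTALL rhdfs_1.0.tar.gz
```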
Upload your scripts to S3
You can use the AWS Console or s3curl to upload your files.
s3curl example:
~/RHadoopExample>/work/platform/bin/s3curl.pl --id $USER --put bootstrapLatestR.sh https://s3.amazonaws.com/sagebio-$USER/scripts/bootstrapLatestR.sh
~/RHadoopExample>/work/platform/bin/s3curl.pl --id $USER --put bootstrapRHadoop.sh https://s3.amazonaws.com/sagebio-$USER/scripts/bootstrapRHadoop.sh
How to run it on Elastic MapReduce
Start your Hadoop cluster
~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --create \
--master-instance-type=m1.small --slave-instance-type=m1.small \
--num-instances=1 --enable-debugging \
--bootstrap-action s3://sagebio-$USER/scripts/bootstrapLatestR.sh \
--bootstrap-action s3://sagebio-$USER/scripts/bootstrapRHadoop.sh \
--name rmrTry1 --alive
Created job flow j-79VXH9Z07ECL
SSH to the Hadoop master
~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --ssh --jobflow j-79VXH9Z07ECL
ssh -i /home/ndeflaux/.ssh/SageKeyPair.pem hadoop@ec2-107-20-44-27.compute-1.amazonaws.com
Linux domU-12-31-39-04-08-C8 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686
--------------------------------------------------------------------------------
Welcome to Amazon Elastic MapReduce running Hadoop and Debian/Lenny.
Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
/mnt/var/log/hadoop/steps for diagnosing step failures.
The Hadoop UI can be accessed via the following commands:
JobTracker lynx http://localhost:9100/
NameNode lynx http://localhost:9101/
--------------------------------------------------------------------------------
hadoop@domU-12-31-39-04-08-C8:~$
Set JAVA_HOME and start R
hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ export JAVA_HOME=/usr/lib/jvm/java-6-sun/jre
hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ R
R version 2.14.0 (2011-10-31)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: i486-pc-linux-gnu (32-bit)
Initialize RHadoop
> Sys.setenv(HADOOP_HOME="/home/hadoop", HADOOP_CONF="/home/hadoop/conf", JAVA_HOME="/usr/lib/jvm/java-6-sun/jre"); library(rmr); library(rhdfs); hdfs.init();
Loading required package: RJSONIO
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
Loading required package: rJava
Send your input to HDFS
> small.ints = to.dfs(1:10);
Run a Hadoop job
You can run one or more jobs in a session.
> out = mapreduce(input = small.ints, map = function(k,v) keyval(k, k^2))
packageJobJar: [/tmp/Rtmpbaa6dV/rhstr.map63284ca9, /tmp/Rtmpbaa6dV/rmrParentEnv, /tmp/Rtmpbaa6dV/rmrLocalEnv, /mnt/var/lib/hadoop/tmp/hadoop-unjar2859463891039338350/] [] /tmp/streamjob1543774456515588690.jar tmpDir=null
11/11/08 03:21:18 INFO mapred.JobClient: Default number of map tasks: 2
11/11/08 03:21:18 INFO mapred.JobClient: Default number of reduce tasks: 1
11/11/08 03:21:19 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
11/11/08 03:21:19 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2334756312e0012cac793f12f4151bdaa1b4b1bb]
11/11/08 03:21:19 INFO mapred.FileInputFormat: Total input paths to process : 1
11/11/08 03:21:20 INFO streaming.StreamJob: getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
11/11/08 03:21:20 INFO streaming.StreamJob: Running job: job_201111080311_0001
11/11/08 03:21:20 INFO streaming.StreamJob: To kill this job, run:
11/11/08 03:21:20 INFO streaming.StreamJob: /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=ip-10-114-89-121.ec2.internal:9001 -kill job_201111080311_0001
11/11/08 03:21:20 INFO streaming.StreamJob: Tracking URL: http://ip-10-114-89-121.ec2.internal:9100/jobdetails.jsp?jobid=job_201111080311_0001
11/11/08 03:21:21 INFO streaming.StreamJob: map 0% reduce 0%
11/11/08 03:21:35 INFO streaming.StreamJob: map 50% reduce 0%
11/11/08 03:21:38 INFO streaming.StreamJob: map 100% reduce 0%
11/11/08 03:21:50 INFO streaming.StreamJob: map 100% reduce 100%
11/11/08 03:21:53 INFO streaming.StreamJob: Job complete: job_201111080311_0001
11/11/08 03:21:53 INFO streaming.StreamJob: Output: /tmp/Rtmpbaa6dV/file6caa3721
Get your output from HDFS
> from.dfs(out)
[[1]]
[[1]]$key
[1] 1
[[1]]$val
[1] 1
attr(,"keyval")
[1] TRUE
[[2]]
[[2]]$key
[1] 2
[[2]]$val
[1] 4
attr(,"keyval")
[1] TRUE
[[3]]
[[3]]$key
[1] 3
[[3]]$val
[1] 9
attr(,"keyval")
[1] TRUE
[[4]]
[[4]]$key
[1] 4
[[4]]$val
[1] 16
attr(,"keyval")
[1] TRUE
[[5]]
[[5]]$key
[1] 5
[[5]]$val
[1] 25
attr(,"keyval")
[1] TRUE
[[6]]
[[6]]$key
[1] 6
[[6]]$val
[1] 36
attr(,"keyval")
[1] TRUE
[[7]]
[[7]]$key
[1] 7
[[7]]$val
[1] 49
attr(,"keyval")
[1] TRUE
[[8]]
[[8]]$key
[1] 8
[[8]]$val
[1] 64
attr(,"keyval")
[1] TRUE
[[9]]
[[9]]$key
[1] 9
[[9]]$val
[1] 81
attr(,"keyval")
[1] TRUE
[[10]]
[[10]]$key
[1] 10
[[10]]$val
[1] 100
attr(,"keyval")
[1] TRUE
Stop your Hadoop cluster
Quit R, exit the ssh session, and terminate the cluster:
> q()
Save workspace image? [y/n/c]: n
hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ exit
logout
Connection to ec2-107-20-108-57.compute-1.amazonaws.com closed.
~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --terminate --jobflow j-79VXH9Z07ECL
Terminated job flow j-79VXH9Z07ECL
What next?
Try the more complicated examples such as Logistic Regression and K-means in https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial.
Take a look at the Elastic MapReduce FAQ for how to SCP files to the Hadoop master host.
Take a look at the other Computation Examples