Computing squares in R

...

The following script will download and install the latest version of R on each of your Elastic MapReduce hosts. (The default version of R is very old.)

Name this script bootstrapLatestR.sh. It should contain the code found at:

http://sagebionetworks.jira.com/source/browse/~raw,r=HEAD/PLFM/users/deflaux/scripts/EMR/rWordCountExample/bootstrapLatestR.sh

...

What is going on in this script?

...

The following script will download and install several packages needed for RHadoop.

Name this script bootstrapRHadoop.sh. It should contain the code found at:

http://sagebionetworks.jira.com/source/browse/~raw,r=HEAD/PLFM/users/deflaux/scripts/EMR/rmrExample/bootstrapRHadoop.sh

...

Upload your scripts to S3

...

How to run it on Elastic MapReduce

Start your Hadoop cluster

Code Block


~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --create \
--master-instance-type=m1.small --slave-instance-type=m1.small \
--num-instances=1 --enable-debugging \
--bootstrap-action s3://sagebio-$USER/scripts/bootstrapLatestR.sh \
--bootstrap-action s3://sagebio-$USER/scripts/bootstrapRHadoop.sh \
--name rmrTry1 --alive

Created job flow j-79VXH9Z07ECL
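
The job flow ID printed by `--create` is needed again for the `--ssh` and `--terminate` commands. The following is a minimal sketch of capturing it in a shell variable, assuming the output keeps the `Created job flow <id>` form shown above. The helper name `extract_jobflow_id` is hypothetical and not part of the elastic-mapreduce CLI, and the demonstration parses the sample line rather than contacting a live cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of the elastic-mapreduce CLI): read command
# output on stdin and print the job flow ID from the "Created job flow" line.
extract_jobflow_id() {
    # Print the last field of the first matching line, then stop reading.
    awk '/^Created job flow / { print $NF; exit }'
}

# Demonstration using the sample output line instead of a real --create call.
JOBFLOW=$(printf '%s\n' 'Created job flow j-79VXH9Z07ECL' | extract_jobflow_id)
echo "$JOBFLOW"
```

With a real cluster, the output of the `--create` command could be piped into the helper in the same way, and `$JOBFLOW` passed to `--jobflow` in later commands.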

SSH to the Hadoop master

Code Block


~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --ssh --jobflow j-79VXH9Z07ECL
ssh -i /home/ndeflaux/.ssh/SageKeyPair.pem hadoop@ec2-107-20-44-27.compute-1.amazonaws.com 
Linux domU-12-31-39-04-08-C8 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686
--------------------------------------------------------------------------------

Welcome to Amazon Elastic MapReduce running Hadoop and Debian/Lenny.
 
Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
/mnt/var/log/hadoop/steps for diagnosing step failures.

The Hadoop UI can be accessed via the following commands: 

  JobTracker    lynx http://localhost:9100/
  NameNode      lynx http://localhost:9101/
 
--------------------------------------------------------------------------------
hadoop@domU-12-31-39-04-08-C8:~$

Set JAVA_HOME and start R

Code Block


hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ export JAVA_HOME=/usr/lib/jvm/java-6-sun/jre
hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ R

R version 2.14.0 (2011-10-31)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: i486-pc-linux-gnu (32-bit)
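
Before starting R, it can be useful to confirm that `JAVA_HOME` is actually set and points at a directory. The following is a small sketch of such a check. The function name `check_java_home` is an assumption made for illustration, and the check only tests that the directory exists, not that it holds a working Java installation.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeed only if JAVA_HOME is set and
# names an existing directory. It does not validate the Java install itself.
check_java_home() {
    [ -n "${JAVA_HOME:-}" ] && [ -d "$JAVA_HOME" ]
}

# Report the result for the current shell environment.
if check_java_home; then
    echo "JAVA_HOME looks usable: $JAVA_HOME"
else
    echo "JAVA_HOME is unset or is not a directory" >&2
fi
```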

Initialize RHadoop

Code Block


> Sys.setenv(HADOOP_HOME="/home/hadoop", HADOOP_CONF="/home/hadoop/conf", JAVA_HOME="/usr/lib/jvm/java-6-sun/jre"); library(rmr); library(rhdfs);  hdfs.init();

Loading required package: RJSONIO
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
Loading required package: rJava

Send your input to HDFS

Code Block
> small.ints = to.dfs(1:10);

Run a Hadoop job

You can run one or more jobs in a session.

Code Block


> out = mapreduce(input = small.ints, map = function(k,v) keyval(k, k^2))

packageJobJar: [/tmp/Rtmpbaa6dV/rhstr.map63284ca9, /tmp/Rtmpbaa6dV/rmrParentEnv, /tmp/Rtmpbaa6dV/rmrLocalEnv, /mnt/var/lib/hadoop/tmp/hadoop-unjar2859463891039338350/] [] /tmp/streamjob1543774456515588690.jar tmpDir=null
11/11/08 03:21:18 INFO mapred.JobClient: Default number of map tasks: 2
11/11/08 03:21:18 INFO mapred.JobClient: Default number of reduce tasks: 1
11/11/08 03:21:19 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
11/11/08 03:21:19 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2334756312e0012cac793f12f4151bdaa1b4b1bb]
11/11/08 03:21:19 INFO mapred.FileInputFormat: Total input paths to process : 1
11/11/08 03:21:20 INFO streaming.StreamJob: getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
11/11/08 03:21:20 INFO streaming.StreamJob: Running job: job_201111080311_0001
11/11/08 03:21:20 INFO streaming.StreamJob: To kill this job, run:
11/11/08 03:21:20 INFO streaming.StreamJob: /home/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=ip-10-114-89-121.ec2.internal:9001 -kill job_201111080311_0001
11/11/08 03:21:20 INFO streaming.StreamJob: Tracking URL: http://ip-10-114-89-121.ec2.internal:9100/jobdetails.jsp?jobid=job_201111080311_0001
11/11/08 03:21:21 INFO streaming.StreamJob:  map 0%  reduce 0%
11/11/08 03:21:35 INFO streaming.StreamJob:  map 50%  reduce 0%
11/11/08 03:21:38 INFO streaming.StreamJob:  map 100%  reduce 0%
11/11/08 03:21:50 INFO streaming.StreamJob:  map 100%  reduce 100%
11/11/08 03:21:53 INFO streaming.StreamJob: Job complete: job_201111080311_0001
11/11/08 03:21:53 INFO streaming.StreamJob: Output: /tmp/Rtmpbaa6dV/file6caa3721
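
The log above shows the job running through Hadoop streaming, with the map function emitting each key together with its square. As a rough local analogue, the sketch below imitates that map step in plain shell. It assumes one integer key per input line and tab-separated output, which is a simplification chosen for illustration and not necessarily how rmr serializes its records. The function name `square_mapper` is hypothetical.

```shell
#!/bin/sh
# Local imitation of the map step keyval(k, k^2): read one integer key per
# line and emit "key<TAB>key squared". This does not use rmr or Hadoop.
square_mapper() {
    awk '{ printf "%d\t%d\n", $1, $1 * $1 }'
}

# Feed the same keys that to.dfs(1:10) provides: the integers 1 through 10.
seq 1 10 | square_mapper
```

Running this prints ten lines, one per key, which can be compared by eye against the values returned by `from.dfs(out)`.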

Get your output from HDFS

Code Block


> from.dfs(out)

[[1]]
[[1]]$key
[1] 1

[[1]]$val
[1] 1

attr(,"keyval")
[1] TRUE

[[2]]
[[2]]$key
[1] 2

[[2]]$val
[1] 4

attr(,"keyval")
[1] TRUE

[[3]]
[[3]]$key
[1] 3

[[3]]$val
[1] 9

attr(,"keyval")
[1] TRUE

[[4]]
[[4]]$key
[1] 4

[[4]]$val
[1] 16

attr(,"keyval")
[1] TRUE

[[5]]
[[5]]$key
[1] 5

[[5]]$val
[1] 25

attr(,"keyval")
[1] TRUE

[[6]]
[[6]]$key
[1] 6

[[6]]$val
[1] 36

attr(,"keyval")
[1] TRUE

[[7]]
[[7]]$key
[1] 7

[[7]]$val
[1] 49

attr(,"keyval")
[1] TRUE

[[8]]
[[8]]$key
[1] 8

[[8]]$val
[1] 64

attr(,"keyval")
[1] TRUE

[[9]]
[[9]]$key
[1] 9

[[9]]$val
[1] 81

attr(,"keyval")
[1] TRUE

[[10]]
[[10]]$key
[1] 10

[[10]]$val
[1] 100

attr(,"keyval")
[1] TRUE

Stop your Hadoop cluster

Quit R, exit the SSH session, and stop the cluster:

Code Block

> q()
Save workspace image? [y/n/c]: n
hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ exit
logout
Connection to ec2-107-20-108-57.compute-1.amazonaws.com closed.
~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --terminate --jobflow j-79VXH9Z07ECL
Terminated job flow j-79VXH9Z07ECL

What next?

Try the more complicated examples, such as logistic regression and k-means, in the RHadoop tutorial: https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial
See the Elastic MapReduce FAQ for how to SCP files to the Hadoop master host.
Take a look at the other Computation Examples.
