How to run it on Elastic MapReduce

Start your Hadoop cluster

Code Block


~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --create --master-instance-type=m1.small --slave-instance-type=m1.small --num-instances=1 --enable-debugging --bootstrap-action s3://sagebio-$USER/scripts/bootstrapLatestR.sh --bootstrap-action s3://sagebio-$USER/scripts/bootstrapRHadoop.sh --name rmrTry1 --alive

Created job flow j-79VXH9Z07ECL
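
The job flow ID printed by --create is needed again for the SSH step and when stopping the cluster. A minimal sketch for capturing it, assuming the output keeps the "Created job flow <id>" form shown above; extract_jobflow_id is a hypothetical helper name, not part of the elastic-mapreduce client:

```shell
# Hypothetical helper: read the output of "elastic-mapreduce --create" on
# stdin and print the job flow ID (fourth word of the "Created job flow" line).
extract_jobflow_id() {
    awk '/^Created job flow / { print $4 }'
}

# Example using the output shown above; in practice, pipe the --create
# command into the helper instead of echo.
JOBFLOW_ID=$(echo "Created job flow j-79VXH9Z07ECL" | extract_jobflow_id)
echo "$JOBFLOW_ID"
```

The variable can then stand in for the literal ID in later commands, for example --jobflow "$JOBFLOW_ID".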

SSH to the Hadoop master

Code Block
~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --ssh --jobflow j-79VXH9Z07ECL

Set JAVA_HOME and start R

Code Block

hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ export JAVA_HOME=/usr/lib/jvm/java-6-sun/jre
hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ R

R version 2.14.0 (2011-10-31)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: i486-pc-linux-gnu (32-bit)
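
Before starting R, it can help to confirm that JAVA_HOME points at a directory that actually contains a Java runtime, since the R session loads rJava in the next step. A small sketch, assuming a JRE layout with an executable bin/java; check_java_home is a hypothetical helper, not something installed by the bootstrap scripts:

```shell
# Hypothetical helper: succeed only if the given directory contains an
# executable bin/java, as a JRE directory would.
check_java_home() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

if check_java_home "${JAVA_HOME:-}"; then
    echo "JAVA_HOME looks usable: $JAVA_HOME"
else
    echo "JAVA_HOME is unset or has no bin/java" >&2
fi
```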

Initialize RHadoop

Code Block


> Sys.setenv(HADOOP_HOME="/home/hadoop", HADOOP_CONF="/home/hadoop/conf", JAVA_HOME="/usr/lib/jvm/java-6-sun/jre"); library(rmr); library(rhdfs);  hdfs.init();
Loading required package: RJSONIO
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
Loading required package: rJava

Send your input to HDFS

Code Block
> small.ints = to.dfs(1:10);

Run a Hadoop job

You can run one or more jobs in a session.

Code Block


> out = mapreduce(input = small.ints, map = function(k,v) keyval(k, k^2))

packageJobJar: [/tmp/Rtmpbaa6dV/rhstr.map63284ca9, /tmp/Rtmpbaa6dV/rmrParentEnv, /tmp/Rtmpbaa6dV/rmrLocalEnv, /mnt/var/lib/hadoop/tmp/hadoop-unjar2859463891039338350/] [] /tmp/streamjob1543774456515588690.jar tmpDir=null
11/11/08 03:21:18 INFO mapred.JobClient: Default number of map tasks: 2
11/11/08 03:21:18 INFO mapred.JobClient: Default number of reduce tasks: 1
11/11/08 03:21:19 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
11/11/08 03:21:19 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2334756312e0012cac793f12f4151bdaa1b4b1bb]
11/11/08 03:21:19 INFO mapred.FileInputFormat: Total input paths to process : 1
11/11/08 03:21:20 INFO streaming.StreamJob: getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
11/11/08 03:21:20 INFO streaming.StreamJob: Running job: job_201111080311_0001
11/11/08 03:21:20 INFO streaming.StreamJob: To kill this job, run:
11/11/08 03:21:20 INFO streaming.StreamJob: /home/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=ip-10-114-89-121.ec2.internal:9001 -kill job_201111080311_0001
11/11/08 03:21:20 INFO streaming.StreamJob: Tracking URL: http://ip-10-114-89-121.ec2.internal:9100/jobdetails.jsp?jobid=job_201111080311_0001
11/11/08 03:21:21 INFO streaming.StreamJob:  map 0%  reduce 0%
11/11/08 03:21:35 INFO streaming.StreamJob:  map 50%  reduce 0%
11/11/08 03:21:38 INFO streaming.StreamJob:  map 100%  reduce 0%
11/11/08 03:21:50 INFO streaming.StreamJob:  map 100%  reduce 100%
11/11/08 03:21:53 INFO streaming.StreamJob: Job complete: job_201111080311_0001
11/11/08 03:21:53 INFO streaming.StreamJob: Output: /tmp/Rtmpbaa6dV/file6caa3721

Get your output from HDFS

Code Block


> from.dfs(out)
[[1]]
[[1]]$key
[1] 1

[[1]]$val
[1] 1

attr(,"keyval")
[1] TRUE

[[2]]
[[2]]$key
[1] 2

[[2]]$val
[1] 4

attr(,"keyval")
[1] TRUE

[[3]]
[[3]]$key
[1] 3

[[3]]$val
[1] 9

attr(,"keyval")
[1] TRUE

[[4]]
[[4]]$key
[1] 4

[[4]]$val
[1] 16

attr(,"keyval")
[1] TRUE

[[5]]
[[5]]$key
[1] 5

[[5]]$val
[1] 25

attr(,"keyval")
[1] TRUE

[[6]]
[[6]]$key
[1] 6

[[6]]$val
[1] 36

attr(,"keyval")
[1] TRUE

[[7]]
[[7]]$key
[1] 7

[[7]]$val
[1] 49

attr(,"keyval")
[1] TRUE

[[8]]
[[8]]$key
[1] 8

[[8]]$val
[1] 64

attr(,"keyval")
[1] TRUE

[[9]]
[[9]]$key
[1] 9

[[9]]$val
[1] 81

attr(,"keyval")
[1] TRUE

[[10]]
[[10]]$key
[1] 10

[[10]]$val
[1] 100

attr(,"keyval")
[1] TRUE
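
As a quick sanity check on the output above, the same ten key/value pairs can be reproduced locally without Hadoop. This is only a sketch of what the map function computes (each key paired with its square), printed in the tab-separated key/value form that Hadoop streaming uses by default for text; it is not how rmr runs the job, and square_pairs is a hypothetical name:

```shell
# Emit "key<TAB>key^2" for the integers 1..10, mirroring
# map = function(k,v) keyval(k, k^2) from the job above.
square_pairs() {
    i=1
    while [ "$i" -le 10 ]; do
        printf '%s\t%s\n' "$i" "$((i * i))"
        i=$((i + 1))
    done
}

square_pairs
```

Each line should correspond to one $key / $val pair in the from.dfs listing.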

Stop your Hadoop cluster

Quit R, exit the SSH session, and terminate the cluster.
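
Because the job flow was created with --alive, it keeps running after the R session ends and has to be terminated explicitly. A sketch of that step, assuming the same credentials file and the job flow ID from the create step; terminate_jobflow and the DRY_RUN variable are hypothetical conveniences, while --terminate and --jobflow are options of the elastic-mapreduce command-line client:

```shell
# Hypothetical helper: terminate a job flow by ID.
# With DRY_RUN=1 it only prints the command instead of running it.
terminate_jobflow() {
    jobflow_id="$1"
    credentials="$HOME/.ssh/$USER-credentials.json"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "elastic-mapreduce --credentials $credentials --terminate --jobflow $jobflow_id"
    else
        elastic-mapreduce --credentials "$credentials" --terminate --jobflow "$jobflow_id"
    fi
}

# Preview the command for the job flow created above.
DRY_RUN=1 terminate_jobflow j-79VXH9Z07ECL
```

Dropping DRY_RUN=1 runs the command for real, so check the job flow ID first.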

See the Elastic MapReduce FAQ for how to SCP files to the Hadoop master host.
See also the other Computation Examples.
