Computing squares in R
...
The following script will download and install the latest version of R on each of your Elastic MapReduce hosts. (The default version of R is very old.)
Download this script and name it bootstrapLatestR.sh. It should contain the following code:
Iframe |
---|
src | http://sagebionetworks.jira.com/source/browse/~raw,r=HEAD/PLFM/users/deflaux/scripts/EMR/rWordCountExample/bootstrapLatestR.sh |
---|
style | height:250px;width:80%; |
---|
|
...
What is going on in this script?
...
The following script will download and install several packages needed for RHadoop.
Download this script and name it bootstrapRHadoop.sh. It should contain the following code:
Iframe |
---|
src | http://sagebionetworks.jira.com/source/browse/~raw,r=HEAD/PLFM/users/deflaux/scripts/EMR/rmrExample/bootstrapRHadoop.sh |
---|
style | height:250px;width:80%; |
---|
|
...
Upload your scripts to S3
...
How to run it on Elastic MapReduce
Start your Hadoop cluster
Code Block |
---|
~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --create \
--master-instance-type=m1.small --slave-instance-type=m1.small \
--num-instances=1 --enable-debugging \
--bootstrap-action s3://sagebio-$USER/scripts/bootstrapLatestR.sh \
--bootstrap-action s3://sagebio-$USER/scripts/bootstrapRHadoop.sh \
--name rmrTry1 --alive
Created job flow j-79VXH9Z07ECL
|
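The job flow ID printed by `--create` is needed again for the `--ssh` and `--terminate` commands later on this page. A minimal sketch of capturing it in a shell variable follows. The helper name `extract_jobflow_id` is hypothetical and not part of the elastic-mapreduce CLI, and the sketch assumes the output line has the form shown above.

```shell
# extract_jobflow_id is a hypothetical helper, not part of the CLI.
# It prints the last field of a "Created job flow ..." line and prints
# nothing for any other input.
extract_jobflow_id() {
    printf '%s\n' "$1" | awk '/^Created job flow/ { print $NF }'
}

# Sample output line copied from the session above.
create_output='Created job flow j-79VXH9Z07ECL'

JOBFLOW=$(extract_jobflow_id "$create_output")
echo "$JOBFLOW"
```

In practice the sample string would be replaced by the captured output of the `--create` command, and `$JOBFLOW` passed to `--jobflow` in the later steps.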
SSH to the Hadoop master
Code Block |
---|
~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --ssh --jobflow j-79VXH9Z07ECL
ssh -i /home/ndeflaux/.ssh/SageKeyPair.pem hadoop@ec2-107-20-44-27.compute-1.amazonaws.com
Linux domU-12-31-39-04-08-C8 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686
--------------------------------------------------------------------------------
Welcome to Amazon Elastic MapReduce running Hadoop and Debian/Lenny.
Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
/mnt/var/log/hadoop/steps for diagnosing step failures.
The Hadoop UI can be accessed via the following commands:
JobTracker lynx http://localhost:9100/
NameNode lynx http://localhost:9101/
--------------------------------------------------------------------------------
hadoop@domU-12-31-39-04-08-C8:~$
|
Set JAVA_HOME and start R
Code Block |
---|
hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ export JAVA_HOME=/usr/lib/jvm/java-6-sun/jre
hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ R
R version 2.14.0 (2011-10-31)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: i486-pc-linux-gnu (32-bit)
|
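The Java path may differ between machine images, so it can help to confirm that the directory exists before exporting it. The following is a sketch only. `check_java_home` is a hypothetical helper, and it assumes that a usable JRE directory contains an executable `bin/java`.

```shell
# check_java_home is a hypothetical helper: it succeeds only if the given
# directory exists and contains an executable bin/java.
check_java_home() {
    [ -d "$1" ] && [ -x "$1/bin/java" ]
}

# Path used in the session above.
candidate=/usr/lib/jvm/java-6-sun/jre

if check_java_home "$candidate"; then
    export JAVA_HOME="$candidate"
    echo "JAVA_HOME set to $JAVA_HOME"
else
    echo "No Java runtime found at $candidate" >&2
fi
```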
Initialize RHadoop
Code Block |
---|
> Sys.setenv(HADOOP_HOME="/home/hadoop", HADOOP_CONF="/home/hadoop/conf", JAVA_HOME="/usr/lib/jvm/java-6-sun/jre"); library(rmr); library(rhdfs); hdfs.init();
Loading required package: RJSONIO
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
Loading required package: rJava
|
Send your input to HDFS
Code Block |
---|
> small.ints = to.dfs(1:10);
|
Run a Hadoop job
You can run one or more jobs in a session.
Code Block |
---|
> out = mapreduce(input = small.ints, map = function(k,v) keyval(k, k^2))
packageJobJar: [/tmp/Rtmpbaa6dV/rhstr.map63284ca9, /tmp/Rtmpbaa6dV/rmrParentEnv, /tmp/Rtmpbaa6dV/rmrLocalEnv, /mnt/var/lib/hadoop/tmp/hadoop-unjar2859463891039338350/] [] /tmp/streamjob1543774456515588690.jar tmpDir=null
11/11/08 03:21:18 INFO mapred.JobClient: Default number of map tasks: 2
11/11/08 03:21:18 INFO mapred.JobClient: Default number of reduce tasks: 1
11/11/08 03:21:19 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
11/11/08 03:21:19 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2334756312e0012cac793f12f4151bdaa1b4b1bb]
11/11/08 03:21:19 INFO mapred.FileInputFormat: Total input paths to process : 1
11/11/08 03:21:20 INFO streaming.StreamJob: getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
11/11/08 03:21:20 INFO streaming.StreamJob: Running job: job_201111080311_0001
11/11/08 03:21:20 INFO streaming.StreamJob: To kill this job, run:
11/11/08 03:21:20 INFO streaming.StreamJob: /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=ip-10-114-89-121.ec2.internal:9001 -kill job_201111080311_0001
11/11/08 03:21:20 INFO streaming.StreamJob: Tracking URL: http://ip-10-114-89-121.ec2.internal:9100/jobdetails.jsp?jobid=job_201111080311_0001
11/11/08 03:21:21 INFO streaming.StreamJob: map 0% reduce 0%
11/11/08 03:21:35 INFO streaming.StreamJob: map 50% reduce 0%
11/11/08 03:21:38 INFO streaming.StreamJob: map 100% reduce 0%
11/11/08 03:21:50 INFO streaming.StreamJob: map 100% reduce 100%
11/11/08 03:21:53 INFO streaming.StreamJob: Job complete: job_201111080311_0001
11/11/08 03:21:53 INFO streaming.StreamJob: Output: /tmp/Rtmpbaa6dV/file6caa3721
|
Get your output from HDFS
Code Block |
---|
> from.dfs(out)
[[1]]
[[1]]$key
[1] 1
[[1]]$val
[1] 1
attr(,"keyval")
[1] TRUE
[[2]]
[[2]]$key
[1] 2
[[2]]$val
[1] 4
attr(,"keyval")
[1] TRUE
[[3]]
[[3]]$key
[1] 3
[[3]]$val
[1] 9
attr(,"keyval")
[1] TRUE
[[4]]
[[4]]$key
[1] 4
[[4]]$val
[1] 16
attr(,"keyval")
[1] TRUE
[[5]]
[[5]]$key
[1] 5
[[5]]$val
[1] 25
attr(,"keyval")
[1] TRUE
[[6]]
[[6]]$key
[1] 6
[[6]]$val
[1] 36
attr(,"keyval")
[1] TRUE
[[7]]
[[7]]$key
[1] 7
[[7]]$val
[1] 49
attr(,"keyval")
[1] TRUE
[[8]]
[[8]]$key
[1] 8
[[8]]$val
[1] 64
attr(,"keyval")
[1] TRUE
[[9]]
[[9]]$key
[1] 9
[[9]]$val
[1] 81
attr(,"keyval")
[1] TRUE
[[10]]
[[10]]$key
[1] 10
[[10]]$val
[1] 100
attr(,"keyval")
[1] TRUE
|
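To sanity-check the result without a cluster, the same key and value pairs can be reproduced locally. The sketch below is not part of RHadoop. `squares` is a hypothetical shell function that mirrors the map step `keyval(k, k^2)` for the integers 1 through 10.

```shell
# squares is a hypothetical helper: for each integer k from 1 to N it prints
# "k k^2" on one line, mirroring the map function keyval(k, k^2).
squares() {
    for k in $(seq 1 "$1"); do
        echo "$k $((k * k))"
    done
}

# Same input range as to.dfs(1:10).
squares 10
```

Each printed line should match one `$key` and `$val` pair in the `from.dfs(out)` listing above.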
Stop your Hadoop cluster
Quit R, exit the SSH session, and stop the cluster:
Code Block |
---|
> q()
Save workspace image? [y/n/c]: n
hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ exit
logout
Connection to ec2-107-20-108-57.compute-1.amazonaws.com closed.
~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --terminate --jobflow j-79VXH9Z07ECL
Terminated job flow j-79VXH9Z07ECL
|
What next?