...
The following example uses the RHadoop packages to run a MapReduce job over the sequence 1:10, computing the square of each value. It is a small example taken from
https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial. Note that the tutorial also contains more involved examples, such as logistic regression and k-means.
...
- Start your Hadoop cluster
Code Block
~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --create \
    --master-instance-type=m1.small --slave-instance-type=m1.small \
    --num-instances=1 --enable-debugging \
    --bootstrap-action s3://sagebio-$USER/scripts/bootstrapLatestR.sh \
    --bootstrap-action s3://sagebio-$USER/scripts/bootstrapRHadoop.sh \
    --name rmrTry1 --alive
Created job flow j-79VXH9Z07ECL
- SSH to the Hadoop master
Code Block
~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --ssh --jobflow j-79VXH9Z07ECL
- Set JAVA_HOME and start R
Code Block
hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ export JAVA_HOME=/usr/lib/jvm/java-6-sun/jre
hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ R

R version 2.14.0 (2011-10-31)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: i486-pc-linux-gnu (32-bit)
- Initialize RHadoop
Code Block
> Sys.setenv(HADOOP_HOME="/home/hadoop", HADOOP_CONF="/home/hadoop/conf", JAVA_HOME="/usr/lib/jvm/java-6-sun/jre")
> library(rmr)
> library(rhdfs)
> hdfs.init()
Loading required package: RJSONIO
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
Loading required package: rJava
- Send your input to HDFS
Code Block
> small.ints = to.dfs(1:10)
- Run a Hadoop job
Code Block
> out = mapreduce(input = small.ints, map = function(k,v) keyval(k, k^2))
packageJobJar: [/tmp/Rtmpbaa6dV/rhstr.map63284ca9, /tmp/Rtmpbaa6dV/rmrParentEnv, /tmp/Rtmpbaa6dV/rmrLocalEnv, /mnt/var/lib/hadoop/tmp/hadoop-unjar2859463891039338350/] [] /tmp/streamjob1543774456515588690.jar tmpDir=null
11/11/08 03:21:18 INFO mapred.JobClient: Default number of map tasks: 2
11/11/08 03:21:18 INFO mapred.JobClient: Default number of reduce tasks: 1
11/11/08 03:21:19 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
11/11/08 03:21:19 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2334756312e0012cac793f12f4151bdaa1b4b1bb]
11/11/08 03:21:19 INFO mapred.FileInputFormat: Total input paths to process : 1
11/11/08 03:21:20 INFO streaming.StreamJob: getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
11/11/08 03:21:20 INFO streaming.StreamJob: Running job: job_201111080311_0001
11/11/08 03:21:20 INFO streaming.StreamJob: To kill this job, run:
11/11/08 03:21:20 INFO streaming.StreamJob: /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=ip-10-114-89-121.ec2.internal:9001 -kill job_201111080311_0001
11/11/08 03:21:20 INFO streaming.StreamJob: Tracking URL: http://ip-10-114-89-121.ec2.internal:9100/jobdetails.jsp?jobid=job_201111080311_0001
11/11/08 03:21:21 INFO streaming.StreamJob: map 0% reduce 0%
11/11/08 03:21:35 INFO streaming.StreamJob: map 50% reduce 0%
11/11/08 03:21:38 INFO streaming.StreamJob: map 100% reduce 0%
11/11/08 03:21:50 INFO streaming.StreamJob: map 100% reduce 100%
11/11/08 03:21:53 INFO streaming.StreamJob: Job complete: job_201111080311_0001
11/11/08 03:21:53 INFO streaming.StreamJob: Output: /tmp/Rtmpbaa6dV/file6caa3721
- Get your output from HDFS
Code Block
> from.dfs(out)
[[1]]
[[1]]$key
[1] 1

[[1]]$val
[1] 1

attr(,"keyval")
[1] TRUE

[[2]]
[[2]]$key
[1] 2

[[2]]$val
[1] 4

attr(,"keyval")
[1] TRUE

[[3]]
[[3]]$key
[1] 3

[[3]]$val
[1] 9

attr(,"keyval")
[1] TRUE

[[4]]
[[4]]$key
[1] 4

[[4]]$val
[1] 16

attr(,"keyval")
[1] TRUE

[[5]]
[[5]]$key
[1] 5

[[5]]$val
[1] 25

attr(,"keyval")
[1] TRUE

[[6]]
[[6]]$key
[1] 6

[[6]]$val
[1] 36

attr(,"keyval")
[1] TRUE

[[7]]
[[7]]$key
[1] 7

[[7]]$val
[1] 49

attr(,"keyval")
[1] TRUE

[[8]]
[[8]]$key
[1] 8

[[8]]$val
[1] 64

attr(,"keyval")
[1] TRUE

[[9]]
[[9]]$key
[1] 9

[[9]]$val
[1] 81

attr(,"keyval")
[1] TRUE

[[10]]
[[10]]$key
[1] 10

[[10]]$val
[1] 100

attr(,"keyval")
[1] TRUE
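As a sanity check, the values returned by from.dfs can be reproduced in plain R without a cluster; this sketch only illustrates what the map function computes (keyval(k, k^2) pairs each key with its square) and how a reduce step could aggregate the values. It does not use the rmr package itself.

```r
# Same computation as the map step above, run locally:
# each element of 1:10 is squared.
squares <- sapply(1:10, function(k) k^2)
print(squares)
# [1]   1   4   9  16  25  36  49  64  81 100

# A reducer could then aggregate the mapped values,
# for example summing the squares:
sum(squares)
# [1] 385
```

If the distributed job returns something different from this, the cluster or the RHadoop installation is misconfigured.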
- Stop your Hadoop cluster
Quit R, exit the SSH session, and terminate the job flow.
Code Block
> q()
Save workspace image? [y/n/c]: n
hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ exit
logout
Connection to ec2-107-20-108-57.compute-1.amazonaws.com closed.
~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --terminate --jobflow j-79VXH9Z07ECL
Terminated job flow j-79VXH9Z07ECL
What next?
- Take a look at the Elastic MapReduce FAQ
- Take a look at the other Computation Examples