
Getting Started with RHadoop on Elastic Map Reduce


Computing squares in R

The following example uses the RHadoop packages to run a MapReduce job over the sequence 1:10, computing the square of each value. It is a deliberately tiny example taken from
https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial. That tutorial also includes more involved examples, such as logistic regression and k-means.
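
For comparison, the same computation in plain R, with no cluster involved, is a one-liner; the MapReduce job below computes exactly this, just distributed across Hadoop:

sapply(1:10, function(x) x^2)   # 1 4 9 16 25 36 49 64 81 100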

Create the bootstrap scripts

Bootstrap the latest version of R

The following script will download and install the latest version of R on each of your Elastic MapReduce hosts. (The default version of R is very old.)

Name this script bootstrapLatestR.sh; it should contain code along the lines of the sketch below.
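What follows is a minimal sketch, not the original listing: it assumes the EMR AMI is Debian-based and installs R from the CRAN apt archive (the repository line and suite name are assumptions; adjust them for your AMI).

#!/bin/bash
# Minimal sketch of a bootstrap action that installs a recent R.
# Assumption: the EMR AMI is Debian-based and the lenny-cran suite
# matches its release; adjust for your AMI.
set -e
echo "deb http://cran.r-project.org/bin/linux/debian lenny-cran/" | sudo tee -a /etc/apt/sources.list
sudo apt-get update
# --force-yes because the CRAN archive signing key is not installed. (Assumption.)
sudo apt-get install -y --force-yes r-base r-base-dev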

What is going on in this script?
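
In the sketch above, the echo line registers a CRAN package repository with apt, apt-get update refreshes the package index so that repository becomes visible, and the final apt-get install pulls in a current r-base (the R interpreter itself) together with r-base-dev (the compilers and headers needed to build add-on packages from source, which the RHadoop bootstrap below depends on).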

Bootstrap RHadoop

The following script downloads and installs the R packages that RHadoop depends on.

Name this script bootstrapRHadoop.sh; it should contain code along the lines of the sketch below.
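Again, this is a minimal sketch rather than the original listing. The dependency list comes from the packages that rmr and rhdfs actually load in step 4 below (RJSONIO, itertools, iterators, digest, rJava); the tarball URLs for rmr and rhdfs are placeholders, not the real download locations.

#!/bin/bash
# Minimal sketch of a bootstrap action that installs RHadoop (rmr, rhdfs)
# and its dependencies.
set -e
# CRAN dependencies, as seen loading in step 4 below. iterators is
# pulled in automatically as a dependency of itertools.
sudo Rscript -e 'install.packages(c("RJSONIO", "itertools", "digest", "rJava"), repos = "http://cran.r-project.org")'
# rmr and rhdfs are installed from source tarballs; these URLs are
# placeholders, not the real download locations. (Assumption.)
wget http://example.com/rmr_1.0.tar.gz
wget http://example.com/rhdfs_1.0.tar.gz
sudo R CMD INSTALL rmr_1.0.tar.gz
sudo R CMD INSTALL rhdfs_1.0.tar.gz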

Upload your scripts to S3

You can use the AWS Console or s3curl to upload your files.

s3curl example:

~/RHadoopExample>/work/platform/bin/s3curl.pl --id $USER --put bootstrapLatestR.sh https://s3.amazonaws.com/sagebio-$USER/scripts/bootstrapLatestR.sh
~/RHadoopExample>/work/platform/bin/s3curl.pl --id $USER --put bootstrapRHadoop.sh https://s3.amazonaws.com/sagebio-$USER/scripts/bootstrapRHadoop.sh

How to run it on Elastic MapReduce

  1. Start your Hadoop cluster
    ~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --create --master-instance-type=m1.small --slave-instance-type=m1.small --num-instances=1 --enable-debugging --bootstrap-action s3://sagebio-$USER/scripts/bootstrapLatestR.sh --bootstrap-action s3://sagebio-$USER/scripts/bootstrapRHadoop.sh --name rmrTry1 --alive
    
    Created job flow j-79VXH9Z07ECL
    
  2. SSH to the Hadoop master
    ~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --ssh --jobflow j-79VXH9Z07ECL
    
  3. Set JAVA_HOME and start R
    hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ export JAVA_HOME=/usr/lib/jvm/java-6-sun/jre
    hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ R
    
    R version 2.14.0 (2011-10-31)
    Copyright (C) 2011 The R Foundation for Statistical Computing
    ISBN 3-900051-07-0
    Platform: i486-pc-linux-gnu (32-bit)
    
  4. Initialize RHadoop
    > Sys.setenv(HADOOP_HOME="/home/hadoop", HADOOP_CONF="/home/hadoop/conf", JAVA_HOME="/usr/lib/jvm/java-6-sun/jre"); library(rmr); library(rhdfs);  hdfs.init();
    Loading required package: RJSONIO
    Loading required package: itertools
    Loading required package: iterators
    Loading required package: digest
    Loading required package: rJava
    
  5. Send your input to HDFS
    > small.ints = to.dfs(1:10);
    
  6. Run a Hadoop job
    > out = mapreduce(input = small.ints, map = function(k,v) keyval(k, k^2))
    
    packageJobJar: [/tmp/Rtmpbaa6dV/rhstr.map63284ca9, /tmp/Rtmpbaa6dV/rmrParentEnv, /tmp/Rtmpbaa6dV/rmrLocalEnv, /mnt/var/lib/hadoop/tmp/hadoop-unjar2859463891039338350/] [] /tmp/streamjob1543774456515588690.jar tmpDir=null
    11/11/08 03:21:18 INFO mapred.JobClient: Default number of map tasks: 2
    11/11/08 03:21:18 INFO mapred.JobClient: Default number of reduce tasks: 1
    11/11/08 03:21:19 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
    11/11/08 03:21:19 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2334756312e0012cac793f12f4151bdaa1b4b1bb]
    11/11/08 03:21:19 INFO mapred.FileInputFormat: Total input paths to process : 1
    11/11/08 03:21:20 INFO streaming.StreamJob: getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
    11/11/08 03:21:20 INFO streaming.StreamJob: Running job: job_201111080311_0001
    11/11/08 03:21:20 INFO streaming.StreamJob: To kill this job, run:
    11/11/08 03:21:20 INFO streaming.StreamJob: /home/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=ip-10-114-89-121.ec2.internal:9001 -kill job_201111080311_0001
    11/11/08 03:21:20 INFO streaming.StreamJob: Tracking URL: http://ip-10-114-89-121.ec2.internal:9100/jobdetails.jsp?jobid=job_201111080311_0001
    11/11/08 03:21:21 INFO streaming.StreamJob:  map 0%  reduce 0%
    11/11/08 03:21:35 INFO streaming.StreamJob:  map 50%  reduce 0%
    11/11/08 03:21:38 INFO streaming.StreamJob:  map 100%  reduce 0%
    11/11/08 03:21:50 INFO streaming.StreamJob:  map 100%  reduce 100%
    11/11/08 03:21:53 INFO streaming.StreamJob: Job complete: job_201111080311_0001
    11/11/08 03:21:53 INFO streaming.StreamJob: Output: /tmp/Rtmpbaa6dV/file6caa3721
  7. Get your output from HDFS
    > from.dfs(out)
    [[1]]
    [[1]]$key
    [1] 1
    
    [[1]]$val
    [1] 1
    
    attr(,"keyval")
    [1] TRUE
    
    [[2]]
    [[2]]$key
    [1] 2
    
    [[2]]$val
    [1] 4
    
    attr(,"keyval")
    [1] TRUE
    
    [[3]]
    [[3]]$key
    [1] 3
    
    [[3]]$val
    [1] 9
    
    attr(,"keyval")
    [1] TRUE
    
    [[4]]
    [[4]]$key
    [1] 4
    
    [[4]]$val
    [1] 16
    
    attr(,"keyval")
    [1] TRUE
    
    [[5]]
    [[5]]$key
    [1] 5
    
    [[5]]$val
    [1] 25
    
    attr(,"keyval")
    [1] TRUE
    
    [[6]]
    [[6]]$key
    [1] 6
    
    [[6]]$val
    [1] 36
    
    attr(,"keyval")
    [1] TRUE
    
    [[7]]
    [[7]]$key
    [1] 7
    
    [[7]]$val
    [1] 49
    
    attr(,"keyval")
    [1] TRUE
    
    [[8]]
    [[8]]$key
    [1] 8
    
    [[8]]$val
    [1] 64
    
    attr(,"keyval")
    [1] TRUE
    
    [[9]]
    [[9]]$key
    [1] 9
    
    [[9]]$val
    [1] 81
    
    attr(,"keyval")
    [1] TRUE
    
    [[10]]
    [[10]]$key
    [1] 10
    
    [[10]]$val
    [1] 100
    
    attr(,"keyval")
    [1] TRUE    
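    
    from.dfs returns a list of key/value pairs rather than a plain vector. To collect the squares into an ordinary numeric vector, something like the following works (a usage sketch, assuming the list-of-keyval structure shown above):
    
    sapply(from.dfs(out), function(kv) kv$val)   # gives 1 4 9 16 25 36 49 64 81 100 for the run shown above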
    

What next?
