
The following example uses the RHadoop packages to perform MapReduce on the sequence 1:10, computing the square of each value. It is a small example taken from
https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial. Note that the tutorial also includes more complicated examples, such as logistic regression and k-means.
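
To see what the job computes before involving Hadoop, here is the same calculation in plain R (a minimal local sketch, not part of the original tutorial): for each k in 1:10, the map function will emit the key/value pair (k, k^2).

Code Block

# Local equivalent of the MapReduce job below: emit (k, k^2) for each k in 1:10
squares <- lapply(1:10, function(k) list(key = k, val = k^2))
squares[[3]]$val   # 9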


  1. Start your Hadoop cluster
    Code Block
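    # Launch a one-node EMR job flow. The two bootstrap actions install the
    # latest R and the RHadoop packages; --alive keeps the cluster running
    # until it is explicitly terminated.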
    ~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --create --master-instance-type=m1.small --slave-instance-type=m1.small --num-instances=1 --enable-debugging --bootstrap-action s3://sagebio-$USER/scripts/bootstrapLatestR.sh --bootstrap-action s3://sagebio-$USER/scripts/bootstrapRHadoop.sh --name rmrTry1 --alive
    
    Created job flow j-79VXH9Z07ECL
    
  2. SSH to the Hadoop master
    Code Block
    ~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --ssh --jobflow j-79VXH9Z07ECL
    
  3. Set JAVA_HOME and start R
    Code Block
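    # rJava (required by the RHadoop packages loaded below) needs JAVA_HOME
    # set before R starts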
    hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ export JAVA_HOME=/usr/lib/jvm/java-6-sun/jre
    hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ R
    
    R version 2.14.0 (2011-10-31)
    Copyright (C) 2011 The R Foundation for Statistical Computing
    ISBN 3-900051-07-0
    Platform: i486-pc-linux-gnu (32-bit)
    
  4. Initialize RHadoop
    Code Block
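    # Point the RHadoop packages at the Hadoop installation, load them,
    # and initialize the connection to HDFS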
    > Sys.setenv(HADOOP_HOME="/home/hadoop", HADOOP_CONF="/home/hadoop/conf", JAVA_HOME="/usr/lib/jvm/java-6-sun/jre"); library(rmr); library(rhdfs);  hdfs.init();
    Loading required package: RJSONIO
    Loading required package: itertools
    Loading required package: iterators
    Loading required package: digest
    Loading required package: rJava
    
  5. Send your input to HDFS
    Code Block
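    # to.dfs() writes an R object to a temporary file on HDFS and returns a handle to it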
    > small.ints = to.dfs(1:10);
    
  6. Run a Hadoop job
    Code Block
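    # map is called with the input key/value pairs; keyval(k, k^2) emits each
    # key with its square (no reduce function is needed for this job)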
    > out = mapreduce(input = small.ints, map = function(k,v) keyval(k, k^2))
    
    packageJobJar: [/tmp/Rtmpbaa6dV/rhstr.map63284ca9, /tmp/Rtmpbaa6dV/rmrParentEnv, /tmp/Rtmpbaa6dV/rmrLocalEnv, /mnt/var/lib/hadoop/tmp/hadoop-unjar2859463891039338350/] [] /tmp/streamjob1543774456515588690.jar tmpDir=null
    11/11/08 03:21:18 INFO mapred.JobClient: Default number of map tasks: 2
    11/11/08 03:21:18 INFO mapred.JobClient: Default number of reduce tasks: 1
    11/11/08 03:21:19 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
    11/11/08 03:21:19 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 2334756312e0012cac793f12f4151bdaa1b4b1bb]
    11/11/08 03:21:19 INFO mapred.FileInputFormat: Total input paths to process : 1
    11/11/08 03:21:20 INFO streaming.StreamJob: getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
    11/11/08 03:21:20 INFO streaming.StreamJob: Running job: job_201111080311_0001
    11/11/08 03:21:20 INFO streaming.StreamJob: To kill this job, run:
    11/11/08 03:21:20 INFO streaming.StreamJob: /home/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=ip-10-114-89-121.ec2.internal:9001 -kill job_201111080311_0001
    11/11/08 03:21:20 INFO streaming.StreamJob: Tracking URL: http://ip-10-114-89-121.ec2.internal:9100/jobdetails.jsp?jobid=job_201111080311_0001
    11/11/08 03:21:21 INFO streaming.StreamJob:  map 0%  reduce 0%
    11/11/08 03:21:35 INFO streaming.StreamJob:  map 50%  reduce 0%
    11/11/08 03:21:38 INFO streaming.StreamJob:  map 100%  reduce 0%
    11/11/08 03:21:50 INFO streaming.StreamJob:  map 100%  reduce 100%
    11/11/08 03:21:53 INFO streaming.StreamJob: Job complete: job_201111080311_0001
    11/11/08 03:21:53 INFO streaming.StreamJob: Output: /tmp/Rtmpbaa6dV/file6caa3721
  7. Get your output from HDFS (a sketch for flattening this list into a data frame follows after these steps)
    Code Block
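    # from.dfs() reads the job output back from HDFS into an R list of key/value pairs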
    > from.dfs(out)
    [[1]]
    [[1]]$key
    [1] 1
    
    [[1]]$val
    [1] 1
    
    attr(,"keyval")
    [1] TRUE
    
    [[2]]
    [[2]]$key
    [1] 2
    
    [[2]]$val
    [1] 4
    
    attr(,"keyval")
    [1] TRUE

    [[3]]
    [[3]]$key
    [1] 3
    
    [[3]]$val
    [1] 9
    
    attr(,"keyval")
    [1] TRUE
    
    [[4]]
    [[4]]$key
    [1] 4
    
    [[4]]$val
    [1] 16
    
    attr(,"keyval")
    [1] TRUE
    
    [[5]]
    [[5]]$key
    [1] 5
    
    [[5]]$val
    [1] 25
    
    attr(,"keyval")
    [1] TRUE
    
    [[6]]
    [[6]]$key
    [1] 6
    
    [[6]]$val
    [1] 36
    
    attr(,"keyval")
    [1] TRUE

    [[7]]
    [[7]]$key
    [1] 7
    
    [[7]]$val
    [1] 49
    
    attr(,"keyval")
    [1] TRUE
    
    [[8]]
    [[8]]$key
    [1] 8
    
    [[8]]$val
    [1] 64
    
    attr(,"keyval")
    [1] TRUE
    
    [[9]]
    [[9]]$key
    [1] 9
    
    [[9]]$val
    [1] 81
    
    attr(,"keyval")
    [1] TRUE
    
    [[10]]
    [[10]]$key
    [1] 10
    
    [[10]]$val
    [1] 100
    
    attr(,"keyval")
    [1] TRUE    
    
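The output above is a list of key/value pairs. To work with it as ordinary R data, it can be flattened into a data frame (a minimal sketch, assuming the list-of-keyval structure shown in step 7):

Code Block

# Flatten the list of keyval pairs returned by from.dfs() into a data frame
result <- from.dfs(out)
squares <- data.frame(key = sapply(result, function(kv) kv$key),
                      val = sapply(result, function(kv) kv$val))
squares   # keys 1:10 alongside their squares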

Stop your Hadoop cluster

Quit R, exit SSH, and terminate the cluster.

Code Block

> q()
Save workspace image? [y/n/c]: n
hadoop@ip-10-114-89-121:/mnt/var/log/bootstrap-actions$ exit
logout
Connection to ec2-107-20-108-57.compute-1.amazonaws.com closed.
~>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --terminate --jobflow j-79VXH9Z07ECL
Terminated job flow j-79VXH9Z07ECL

What next?