Elastic MapReduce FAQ

Elastic MapReduce Troubleshooting

The job keeps timing out

If your mapper or reducer can go more than an hour without writing anything to stdout, Hadoop will treat the task as hung and time it out. Do one or more of the following.

How to change the timeout

This only matters if your mapper or reducer can take more than an hour to write anything to stdout.

Job Configuration Settings:

  • "-jobconf", "mapred.task.timeout=604800000",
  • mapred.task.timeout is expressed in milliseconds
  • the example above is one week's worth of milliseconds
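
As a sketch of where this setting goes in a streaming job (the streaming jar path, bucket, and script names below are illustrative assumptions, not taken from this page):

# Raise the task timeout to one week for a streaming step
# (adjust the jar path to wherever hadoop-streaming lives on your AMI)
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -jobconf mapred.task.timeout=604800000 \
  -input s3n://mybucket/input \
  -output s3n://mybucket/output \
  -mapper mapper.R \
  -reducer reducer.R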

How to run a background task to generate some fake output to stdout

See the Phase Algorithm on Elastic MapReduce example, where the mapper script spawns a background task to write periodic output to stdout and stderr:

#!/bin/sh

# Spawn a background keepalive task that reports progress every 5 minutes until ./timetostop exists
perl -e 'while(! -e "./timetostop") { print "keepalive\n"; print STDERR "reporter:status:keepalive\n"; sleep 300; }' &

# RUN YOUR REAL MAPPER HERE

# Tell the background keepalive task to exit
touch ./timetostop

The job keeps getting killed

How to check memory usage

Look at your EC2 instances in the AWS Console. There are performance graphs for CPU, memory, disk I/O, and network I/O utilization.

Use the --alive option when you start the EMR job so that the hosts stay alive after MapReduce stops; that way you can still look at their memory usage data from before the tasks were killed.
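
For example, a minimal sketch of creating a debug-friendly job flow (the name, instance count, and instance type are arbitrary examples):

# --alive keeps the hosts running after MapReduce stops so the graphs
# and log files are still available for inspection
elastic-mapreduce --create --alive --name "debug-jobflow" \
  --num-instances 5 --instance-type m1.large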

How to configure Hadoop for memory intensive tasks

Use the following bootstrap action to configure Hadoop for a memory-intensive workload:

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
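
A sketch of attaching that bootstrap action when the job flow is created (the instance type and count are arbitrary examples):

# Bootstrap actions run on every node before Hadoop starts
elastic-mapreduce --create --alive \
  --instance-type m2.xlarge --num-instances 5 \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive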

How to configure a larger swap on the machine

Use a bootstrap action like the following to configure a larger swap file:

#!/bin/bash
SWAP_SIZE_MEGABYTES=10000 #10 GB of swap space
sudo dd if=/dev/zero of=/mnt/swapfile bs=1024 \
  count=$(($SWAP_SIZE_MEGABYTES*1024))
sudo /sbin/mkswap /mnt/swapfile
sudo /sbin/swapon /mnt/swapfile
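
Assuming you upload that script to your own S3 bucket (the bucket and file name below are placeholders), it is attached the same way as the stock bootstrap actions:

# add-swap.sh is the swap script above, uploaded to your own bucket
elastic-mapreduce --create --alive \
  --bootstrap-action s3://mybucket/bootstrap/add-swap.sh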

Use a machine with more RAM

See the High Memory instances here: http://aws.amazon.com/ec2/instance-types/
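
As a sketch, the instance type can be set per role when the job flow is created (example types only; check elastic-mapreduce --help for the exact flags your CLI version supports):

# Modest master node, high-memory slaves
elastic-mapreduce --create --alive \
  --master-instance-type m1.large \
  --slave-instance-type m2.4xlarge --num-instances 10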

The tasks are not balanced the right way across the machines

How to limit the number of concurrently running mappers and/or reducers per slave

Job Configuration Settings:

  • mapred.tasktracker.map.tasks.maximum
  • mapred.tasktracker.reduce.tasks.maximum
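
These settings control how many map and reduce slots each slave's TaskTracker offers; because the TaskTracker reads them at startup, one way to apply them is at cluster creation time via the configure-hadoop bootstrap action. The -m key=value argument form below is an assumption to verify against the EMR documentation for your AMI version:

# Assumption: configure-hadoop accepts -m key=value pairs for mapred-site settings
elastic-mapreduce --create --alive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.tasktracker.map.tasks.maximum=2,-m,mapred.tasktracker.reduce.tasks.maximum=1"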

My problem is not covered here

  1. Search in the AWS Elastic MapReduce User Forum and post a question
  2. Take a look at the AWS documentation for Elastic MapReduce

Elastic MapReduce FAQ

How to run a Mapper-only job

Job Configuration Settings:

  • "-jobconf", "mapred.reduce.tasks=0",

http://hadoop.apache.org/common/docs/r0.20.0/streaming.html#Mapper-Only+Jobs
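
A minimal streaming sketch (the jar path and bucket names are placeholders) that skips the reduce phase entirely:

# With zero reduce tasks, mapper output goes straight to the output location
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -jobconf mapred.reduce.tasks=0 \
  -input s3n://mybucket/input \
  -output s3n://mybucket/output \
  -mapper mapper.R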

How to test a MapReduce Job locally

  1. Pipe a sample input file through your mapper, a sort, and your reducer from the command line:
    ~>cat AnInputFile.txt | ./mapper.R | sort | ./reducer.R
    a       1
    after.  1
    and     4
    And     1
    as      1
    As      1
    broke   1
    brown   1
    ...
    

How to install the latest version of R on all hosts

Run a bootstrap-action script. See Getting Started with R on Elastic Map Reduce for that script.

How to transfer files from your job to S3

  • hadoop fs -put file.txt s3n://mybucket/file.txt
  • Note that this will not overwrite file.txt if it already exists in S3
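
If the destination does already exist and you want to replace it, one sketch of a workaround is to remove the old object first (double-check the path before deleting):

# Remove any existing copy, then upload
hadoop fs -rm s3n://mybucket/file.txt
hadoop fs -put file.txt s3n://mybucket/file.txt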

How to transfer additional files to all hosts

Option 1: Use the --cache or --cacheArchive options of the EMR CLI. Note that the CLI cannot handle multiple --cache parameters; if you need more than one, use a JSON job config instead.

Option 2: In your job configuration file

"-cacheFile", "s3n://sagetestemr/scripts/phase#phase",
"-cacheArchive", "s3n://sagetestemr/scripts/myArchive.tar.gz#myDestinationDirectory",

How to SSH to the hosts

Check that the firewall settings will allow SSH connections

By default the master instance will allow SSH connections but the slave instances will not. Add a rule to allow SSH connections from the Fred Hutch network (140.107.0.0/16), for example as sketched below.
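
One way to add that rule from the command line, assuming the EC2 API tools are installed and that your slaves use the default ElasticMapReduce-slave security group (verify the group name in the AWS Console):

# Open port 22 to the Fred Hutch network for the slave security group
ec2-authorize ElasticMapReduce-slave -P tcp -p 22 -s 140.107.0.0/16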

Connecting from Linux

You can ssh to the master host:

~/>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --jobflow j-2N7A7GDJJ64HX --ssh
Thu Jun 30 20:32:51 UTC 2011 INFO Jobflow is in state STARTING, waiting....

To ssh to the slave hosts you need to get their public DNS hostnames from the AWS Console:

ssh -i SageKeyPair.pem hadoop@<the ec2 hadoop slave host>

For screen shots see EC2 docs

Connecting from Windows using PuTTY

For screen shots see EC2 docs

Windows users can also connect using PuTTY or WinSCP; however, you will first need to create a PuTTY private key file using puttygen.exe.
Here is how to create the private key file:

  1. Run the 'puttygen.exe' tool
  2. Select the 'Load' button from the UI.
  3. From the file dialog, select your KeyPair file (e.g. SageKeyPair.pem)
  4. A popup dialog should tell you the key file was imported successfully and to save it using 'Save private key'
  5. Select 'Save private key' and give it a name such as SageKeyPair.ppk to create the PuTTY private key file.

Once you have a PuTTY private key file you can use it to connect to your host using PuTTY or WinSCP.
To connect with WinSCP:

  1. Set the host name, and keep the default port (22). Note: Make sure port 22 is open on the box you are connecting to.
  2. Set the user name to hadoop
  3. Select the '...' button under 'Private Key File' and select the .ppk file you created above.
  4. Select 'Login'

Where are the log and job files?

  • Hadoop is installed in /home/hadoop.
  • Log files are in /mnt/var/log/hadoop.
  • Check /mnt/var/log/hadoop/steps for diagnosing step failures.
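
For example, on the master node each step's logs sit in a numbered subdirectory (the step number below is illustrative):

# Controller, stdout, stderr, and syslog for the first step
ls /mnt/var/log/hadoop/steps/1/
less /mnt/var/log/hadoop/steps/1/syslog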