Elastic MapReduce FAQ

This content is archived.

Elastic MapReduce Troubleshooting

The job keeps timing out

If your mapper or reducer can go more than an hour without writing anything to stdout, Hadoop assumes the task has hung and times it out. Do one or more of the following.

How to change the timeout

This is only important if your mapper or reducer can take more than an hour to write anything to stdout.

Job Configuration Settings:

  • "-jobconf", "mapred.task.timeout=604800000",

  • mapred.task.timeout is expressed in milliseconds

  • the example above is one week's worth of milliseconds (7 × 24 × 60 × 60 × 1000 = 604,800,000)
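For context, the setting above is just one more pair in a streaming step's argument list. A sketch in the same quoted-argument style (the bucket and script names are placeholders, not from the original):

```json
"-input",   "s3n://mybucket/input",
"-mapper",  "s3n://mybucket/scripts/mapper.R",
"-reducer", "s3n://mybucket/scripts/reducer.R",
"-output",  "s3n://mybucket/output",
"-jobconf", "mapred.task.timeout=604800000",
```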

How to run a background task to generate some fake output to stdout

See the Phase Algorithm on Elastic MapReduce example, in which the mapper script spawns a background task that periodically writes output to stdout and stderr:

#!/bin/sh

# Spawn a background keepalive task
perl -e 'while(! -e "./timetostop") { print "keepalive\n"; print STDERR "reporter:status:keepalive\n"; sleep 300; }' &

# RUN YOUR REAL MAPPER HERE

# Tell the background keepalive task to exit
touch ./timetostop

The job keeps getting killed

How to check memory usage

Look at your EC2 instances in the AWS Console. There are performance graphs for CPU, memory, disk I/O, and network I/O utilization.

Use the --alive option when you start the EMR job so that the hosts stay alive after MapReduce stops; that way you can still look at their memory usage data from before the tasks were killed.

How to configure Hadoop for memory intensive tasks

Use the following bootstrap action to configure Hadoop for a high-memory use case

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive

How to configure a larger swap on the machine

Use a bootstrap action to configure a larger swap:

#!/bin/bash

SWAP_SIZE_MEGABYTES=10000  # 10 GB of swap space

sudo dd if=/dev/zero of=/mnt/swapfile bs=1024 \
    count=$(($SWAP_SIZE_MEGABYTES*1024))
sudo /sbin/mkswap /mnt/swapfile
sudo /sbin/swapon /mnt/swapfile
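After the bootstrap action has run, you can SSH to a host and confirm the swap is active. These are standard Linux diagnostic commands; the expected values assume the 10 GB swapfile from the script above:

```shell
free -m       # the "Swap:" row should show roughly 10000 MB total
swapon -s     # should list /mnt/swapfile
```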


Use a machine with more RAM

See the High Memory instances here: http://aws.amazon.com/ec2/instance-types/

The tasks are not balanced correctly across the machines

How to limit the number of concurrently running mappers and/or reducers per slave

Job Configuration Settings:

  • mapred.tasktracker.map.tasks.maximum

  • mapred.tasktracker.reduce.tasks.maximum
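These maxima can also be set cluster-wide at startup rather than per job. A sketch using EMR's standard configure-hadoop bootstrap action (the -m flag is assumed here to write to mapred-site.xml, and the values 2 and 1 are illustrative, not recommendations):

```shell
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.tasktracker.map.tasks.maximum=2,-m,mapred.tasktracker.reduce.tasks.maximum=1"
```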

My problem is not covered here

  1. Search the AWS Elastic MapReduce User Forum and post a question

  2. Take a look at the AWS documentation for Elastic MapReduce

Elastic MapReduce FAQ

How to run a Mapper-only job

Job Configuration Settings:

  • "-jobconf", "mapred.reduce.tasks=0",

http://hadoop.apache.org/common/docs/r0.20.0/streaming.html#Mapper-Only+Jobs

How to test a MapReduce Job locally

Pipe a sample input file through your mapper, sort, and reducer from the command line:

    ~> cat AnInputFile.txt | ./mapper.R | sort | ./reducer.R
    a 1
    after. 1
    and 4
    And 1
    as 1
    As 1
    broke 1
    brown 1
    ...
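You can sanity-check the pipeline shape the same way without R or Hadoop by swapping in tiny stand-ins for mapper.R and reducer.R. A minimal word-count sketch (these shell functions are illustrative stand-ins, not the real scripts):

```shell
#!/bin/sh
# Stand-in mapper: emit "<word> 1" for every word on stdin
mapper() { tr -s '[:space:]' '\n' | sed '/^$/d' | sed 's/$/ 1/'; }

# Stand-in reducer: sum the counts for each word (input arrives sorted by word)
reducer() { awk '{ count[$1] += $2 } END { for (w in count) print w, count[w] }' | sort; }

# Same shape as: cat AnInputFile.txt | ./mapper.R | sort | ./reducer.R
printf 'the quick brown fox the\n' | mapper | sort | reducer
# brown 1
# fox 1
# quick 1
# the 2
```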

How to install the latest version of R on all hosts

Run a bootstrap-action script. See Getting Started with R on Elastic Map Reduce for that script.

How to transfer files from your job to S3

  • hadoop fs -put file.txt s3n://mybucket/file.txt

  • Note that this will not overwrite file.txt if it already exists in S3
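Since the put fails rather than overwrites, it can help to check first. hadoop fs -test -e returns exit status 0 when the path exists, so a copy-if-absent one-liner can be sketched as (bucket and file names are placeholders):

```shell
hadoop fs -test -e s3n://mybucket/file.txt || hadoop fs -put file.txt s3n://mybucket/file.txt
```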

How to transfer additional files to all hosts

Option 1: Use the --cache or --cacheArchive options of the EMR CLI. Note that the CLI cannot handle multiple --cache parameters; if you need more than one, use the JSON job configuration instead.

Option 2: In your job configuration file

"-cacheFile", "s3n://sagetestemr/scripts/phase#phase", "-cacheArchive", "s3n://sagetestemr/scripts/myArchive.tar.gz#myDestinationDirectory",


How to SSH to the hosts

Check that the firewall settings will allow SSH connections

By default the master instance allows SSH connections but the slave instances do not. Add a rule to allow SSH connections from the Fred Hutch network (140.107.0.0/16).

Connecting from Linux

You can ssh to the master host:

~/> elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --jobflow j-2N7A7GDJJ64HX --ssh
Thu Jun 30 20:32:51 UTC 2011 INFO Jobflow is in state STARTING, waiting....

To ssh to slave hosts you need to get their public DNS hostnames from the AWS Console:

ssh -i SageKeyPair.pem hadoop@<the ec2 hadoop slave host>

For screenshots, see the EC2 docs

Connecting from Windows using Putty

For screenshots, see the EC2 docs