Elastic Map Reduce Troubleshooting
The job keeps timing out
If your mapper or reducer can run for more than an hour without writing anything to stdout, Hadoop will kill the task as timed out. Do one or more of the following.
How to change the timeout
This is only important if your mapper or reducer can go more than an hour without writing anything to stdout.
Job Configuration Settings:
"-jobconf", "mapred.task.timeout=604800000",
mapred.task.timeout is expressed in milliseconds; the example above is one week's worth of milliseconds.
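As a sanity check, that value can be derived in the shell. This is a throwaway sketch, not part of any EMR configuration:

```shell
#!/bin/sh
# One week expressed in milliseconds, matching the jobconf value above:
# 7 days * 24 hours * 60 minutes * 60 seconds * 1000 ms
TIMEOUT_MS=$((7 * 24 * 60 * 60 * 1000))
echo "mapred.task.timeout=$TIMEOUT_MS"   # prints mapred.task.timeout=604800000
```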
How to run a background task to generate some fake output to stdout
See the Phase Algorithm on Elastic MapReduce example, where the mapper script spawns a background task that periodically writes to stdout and stderr:
#!/bin/sh
# Spawn a background task
perl -e 'while(! -e "./timetostop") { print "keepalive\n"; print STDERR "reporter:status:keepalive\n"; sleep 300; }' &
# ... run your real mapper here ...
# Tell the background keepalive task to exit
touch ./timetostop
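You can see the keepalive pattern work without a cluster by running a shortened version locally. The one-second interval and the sleep standing in for the real mapper are our own test values:

```shell
#!/bin/sh
# Local dry run of the keepalive pattern: short interval, fake mapper.
rm -f ./timetostop
perl -e 'while(! -e "./timetostop") { print "keepalive\n"; sleep 1; }' &
sleep 2              # stand-in for the real mapper's work
touch ./timetostop   # tell the background keepalive task to exit
wait                 # reap the background process
rm -f ./timetostop
```

You should see at least one "keepalive" line before the script exits.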
The job keeps getting killed
How to check memory usage
Look at your EC2 instances in the AWS Console. There are performance graphs for cpu, memory, disk I/O, and network I/O utilization.
Use the --alive option when you start the EMR job so that the hosts stay alive after MapReduce stops and you can still look at their memory usage data from before the tasks were killed.
How to configure Hadoop for memory intensive tasks
Use the following bootstrap action to configure Hadoop for a high-memory use case
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
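For context, a complete launch using this bootstrap action might look like the following sketch. The job name, instance type, and instance count are placeholders; check the EMR CLI documentation for the exact flags your version supports:

```shell
elastic-mapreduce --create --alive --name "memory-heavy-job" \
  --num-instances 4 --instance-type m2.2xlarge \
  --bootstrap-action \
    s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
```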
How to configure a larger swap on the machine
Use a bootstrap action to configure larger swap
#!/bin/bash
SWAP_SIZE_MEGABYTES=10000  # about 10 GB of swap space
# Create the swap file, restrict its permissions, then enable it
sudo dd if=/dev/zero of=/mnt/swapfile bs=1024 \
    count=$(($SWAP_SIZE_MEGABYTES*1024))
sudo chmod 600 /mnt/swapfile
sudo /sbin/mkswap /mnt/swapfile
sudo /sbin/swapon /mnt/swapfile
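Once the script is saved to your own S3 bucket, point the job flow at it when you create it. The bucket and path below are placeholders:

```shell
elastic-mapreduce --create --alive \
  --bootstrap-action s3://mybucket/bootstrap/add-swap.sh
```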
Resources:
Use a machine with more RAM
See the High Memory instances here: http://aws.amazon.com/ec2/instance-types/
The tasks are not balanced evenly across the machines
How to limit the number of concurrently running mappers and/or reducers per slave
Job Configuration Settings:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
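Because these settings configure the tasktracker daemon, one way to apply them is at cluster startup. A sketch, assuming the standard configure-hadoop bootstrap action; the values 4 and 2 are examples only:

```shell
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.tasktracker.map.tasks.maximum=4,-m,mapred.tasktracker.reduce.tasks.maximum=2"
```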
My problem is not covered here
Search the AWS Elastic MapReduce User Forum and, if needed, post a question
Take a look at the AWS documentation for Elastic MapReduce
Elastic Map Reduce FAQ
How to run a Mapper-only job
Job Configuration Settings:
"-jobconf", "mapred.reduce.tasks=0",
http://hadoop.apache.org/common/docs/r0.20.0/streaming.html#Mapper-Only+Jobs
How to test a MapReduce Job locally
The command line to run it
~> cat AnInputFile.txt | ./mapper.R | sort | ./reducer.R
a 1
after. 1
and 4
And 1
as 1
As 1
broke 1
brown 1
...
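The same pattern can be tried end to end with throwaway scripts. Here awk stands in for mapper.R and reducer.R in a word-count sketch; nothing about it is EMR-specific:

```shell
#!/bin/sh
# Tiny local stand-in for the mapper | sort | reducer pipeline.
printf 'a b a\nb a\n' > AnInputFile.txt
cat AnInputFile.txt \
  | awk '{ for (i = 1; i <= NF; i++) print $i "\t1" }' \
  | sort \
  | awk -F'\t' '{ sum[$1] += $2 } END { for (w in sum) print w, sum[w] }' \
  | sort
rm -f AnInputFile.txt
# Expected output:
#   a 3
#   b 2
```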
How to install the latest version of R on all hosts
Run a bootstrap-action script. See Getting Started with R on Elastic Map Reduce for that script.
How to transfer files from your job to S3
hadoop fs -put file.txt s3n://mybucket/file.txt
Note that this will not overwrite file.txt if it already exists in S3.
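If you do want to replace an existing object, one approach is to delete it first. A sketch; the bucket and path are placeholders, and `hadoop fs -rm` will complain if the file is absent:

```shell
hadoop fs -rm s3n://mybucket/file.txt
hadoop fs -put file.txt s3n://mybucket/file.txt
```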
How to transfer additional files to all hosts
Option 1: You can use the --cache or --cacheArchive options for the EMR CLI. Note that the CLI cannot handle multiple --cache parameters, so use a JSON job config instead if you need more than one.
Option 2: In your job configuration file
"-cacheFile", "s3n://sagetestemr/scripts/phase#phase",
"-cacheArchive", "s3n://sagetestemr/scripts/myArchive.tar.gz#myDestinationDirectory",
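Inside the mapper, a file shipped with -cacheFile shows up in the task's working directory under the name after the "#". A minimal sketch: the `./phase` name matches the example above, while the existence guard is our own addition:

```shell
#!/bin/sh
# Use a file distributed via -cacheFile: it is linked into the
# task's working directory under the name after the "#".
if [ -x ./phase ]; then
  ./phase              # run the cached executable
else
  echo "cached file ./phase not found" >&2
fi
```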
How to SSH to the hosts
Check that the firewall settings will allow SSH connections
By default the master instance will allow SSH connections but the slave instances will not. Add a rule to allow SSH connections from the Fred Hutch network (140.107.0.0/16).
Connecting from Linux
You can ssh to the master host:
~/>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --jobflow j-2N7A7GDJJ64HX --ssh
Thu Jun 30 20:32:51 UTC 2011 INFO Jobflow is in state STARTING, waiting....
To ssh to slave hosts you need to get the public DNS hostnames from the AWS Console
ssh -i SageKeyPair.pem hadoop@<the ec2 hadoop slave host>
For screen shots see EC2 docs
Connecting from Windows using Putty
For screen shots see EC2 docs