Elastic MapReduce Troubleshooting
The job keeps timing out
If your job takes more than an hour to write anything to stdout, you'll need to do one or more of the following things.
How to change the timeout
This is only important if your mapper or reducer can take more than an hour to write anything to stdout
Job Configuration Settings:
"-jobconf", "mapred.task.timeout=604800000",
- mapred.task.timeout is expressed in milliseconds
- the example above is one week's worth of milliseconds (7 days × 24 hours × 3600 seconds × 1000 = 604,800,000)
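For example, the setting can be passed when creating a streaming job flow with the elastic-mapreduce CLI (a sketch; the bucket, script, input, and output paths are placeholders):
elastic-mapreduce --create --stream \
  --mapper s3n://mybucket/scripts/mapper.R \
  --reducer s3n://mybucket/scripts/reducer.R \
  --input s3n://mybucket/input \
  --output s3n://mybucket/output \
  --jobconf mapred.task.timeout=604800000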
How to run a background task to generate some fake output to stdout
See the Phase Algorithm example where the mapper script spawns a background task to generate some output to stdout and stderr
#!/bin/sh

# Spawn a background keepalive task
perl -e 'while(! -e "./timetostop") { print "keepalive\n"; print STDERR "reporter:status:keepalive\n"; sleep 300; }' &

# RUN YOUR REAL MAPPER HERE

# Tell the background keepalive task to exit
touch ./timetostop
The job keeps getting killed
How to check memory usage
Look at your EC2 instances in the AWS Console. There are performance graphs for CPU, memory, disk I/O, and network I/O utilization.
Use the --alive option when you start the EMR job so that the hosts stay alive when MapReduce stops and you can still look at their memory usage data from before they were killed.
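For example (a sketch; the job flow name and instance count are placeholders):
elastic-mapreduce --create --alive --name "memory debugging" --num-instances 4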
How to configure Hadoop for memory intensive tasks
Use the following bootstrap action to configure Hadoop for a high-memory use case
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
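For example, when creating the job flow (a sketch; other options omitted):
elastic-mapreduce --create --alive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive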
How to configure a larger swap on the machine
Use a bootstrap action to configure larger swap
#!/bin/bash

SWAP_SIZE_MEGABYTES=10000  # 10 GB of swap space
sudo dd if=/dev/zero of=/mnt/swapfile bs=1024 \
  count=$(($SWAP_SIZE_MEGABYTES*1024))
sudo /sbin/mkswap /mnt/swapfile
sudo /sbin/swapon /mnt/swapfile
Resources:
- http://tech.backtype.com/swap-space-on-ec2
- http://support.rightscale.com/06-FAQs/FAQ_0023_-_How_can_I_create_a_swap_partition_in_EC2%3F
Use a machine with more RAM
See the High Memory instances here: http://aws.amazon.com/ec2/instance-types/
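For example (a sketch; the instance type and count are placeholders, m2.2xlarge being one of the High Memory types):
elastic-mapreduce --create --alive \
  --instance-type m2.2xlarge --num-instances 4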
The tasks are not balanced properly across the machines
How to limit the number of concurrently running mappers and/or reducers per slave
Job Configuration Settings:
- mapred.tasktracker.map.tasks.maximum
- mapred.tasktracker.reduce.tasks.maximum
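For example, following the same -jobconf pattern used above (the values are illustrative; depending on the Hadoop version, these tasktracker settings may instead need to be applied cluster-wide, e.g. via a bootstrap action):
"-jobconf", "mapred.tasktracker.map.tasks.maximum=2",
"-jobconf", "mapred.tasktracker.reduce.tasks.maximum=1",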
My problem is not covered here
- Search in the AWS Elastic MapReduce User Forum and post a question
- Take a look at the AWS documentation for Elastic MapReduce
Elastic MapReduce FAQ
How to run a Mapper-only job
Job Configuration Settings:
- "-jobconf", "mapred.reduce.tasks=0",
http://hadoop.apache.org/common/docs/r0.20.0/streaming.html#Mapper-Only+Jobs
How to test a MapReduce Job locally
- The command line to run it
~>cat AnInputFile.txt | ./mapper.R | sort | ./reducer.R
a 1
after. 1
and 4
And 1
as 1
As 1
broke 1
brown 1
...
How to install the latest version of R on all hosts
Run a bootstrap-action script. See A Simple Example of an R MapReduce Job for that script.
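A minimal sketch of what such a bootstrap-action script could look like, assuming a Debian-based EMR AMI (the actual script in that example may differ):
#!/bin/bash
# Sketch: install R from the distribution's package repositories
sudo apt-get update
sudo apt-get -y install r-base r-base-dev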
How to transfer files from your job to S3
hadoop fs -put file.txt s3n://mybucket/file.txt
- Note that this will not overwrite file.txt if it already exists in S3
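If the destination already exists, one option is to remove it first (a sketch; the bucket and path are placeholders):
hadoop fs -rm s3n://mybucket/file.txt
hadoop fs -put file.txt s3n://mybucket/file.txt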
How to transfer additional files to all hosts
Option 1: You can use the --cache or --cacheArchive options for the EMR CLI. Note that the CLI cannot handle multiple --cache parameters, so use a JSON job config instead if you need to do that.
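As a sketch (the bucket and paths are placeholders; check the exact spelling of the cache options against your CLI version):
elastic-mapreduce --create --stream \
  --mapper s3n://mybucket/scripts/mapper.R \
  --reducer s3n://mybucket/scripts/reducer.R \
  --input s3n://mybucket/input \
  --output s3n://mybucket/output \
  --cache s3n://mybucket/scripts/phase#phase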
Option 2: In your job configuration file
"-cacheFile", "s3n://sagetestemr/scripts/phase#phase", "-cacheArchive", "s3n://sagetestemr/scripts/myArchive.tar.gz#myDestinationDirectory",
How to SSH to the hosts
Check that the firewall settings will allow SSH connections
By default the master instance will allow SSH connections but the slave instances will not. Add a rule to allow SSH connections from the Fred Hutch network 140.107.0.0/16
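For example, with the EC2 command line tools (a sketch; the slave security group is typically named ElasticMapReduce-slave, but the name may vary):
ec2-authorize ElasticMapReduce-slave -P tcp -p 22 -s 140.107.0.0/16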
Connecting from Linux
You can ssh to the master host:
~/>elastic-mapreduce --credentials ~/.ssh/$USER-credentials.json --jobflow j-2N7A7GDJJ64HX --ssh
Thu Jun 30 20:32:51 UTC 2011 INFO Jobflow is in state STARTING, waiting....
To ssh to slave hosts you need to get the public DNS hostnames from the AWS Console
ssh -i SageKeyPair.pem hadoop@<the ec2 hadoop slave host>
For screen shots see EC2 docs
Connecting from Windows using PuTTY
For screen shots see EC2 docs
Windows users can also connect using PuTTY or WinSCP; however, you will first need to create a PuTTY private key file using puttygen.exe
Here is how to create the private key file:
- Run the 'puttygen.exe' tool
- Select the 'Load' button from the UI.
- From the file dialog, select your KeyPair file (e.g. SageKeyPair.pem)
- A popup dialog should tell you the key file was imported successfully and to save it using "Save Private Key"
- Select 'Save Private Key' and give it a name such as SageKeyPair.ppk to create the PuTTY private key file.
Once you have a PuTTY private key file you can use it to connect to your host using PuTTY or WinSCP.
To connect with WinSCP:
- Set the host name, and keep the default port (22). Note: Make sure port 22 is open on the box you are connecting to.
- Set the user name to hadoop
- Select the '...' button under 'Private Key File' and select the .ppk file you created above.
- Select 'Login'
Where are the log and job files?
- Hadoop is installed in /home/hadoop.
- Log files are in /mnt/var/log/hadoop.
- Check /mnt/var/log/hadoop/steps for diagnosing step failures.
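For example, to inspect the logs for the first step (a sketch; the exact file names can vary by Hadoop version):
ls /mnt/var/log/hadoop/steps/1
cat /mnt/var/log/hadoop/steps/1/syslog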