...
- Write your job configuration. Note that you must change the output location each time you run the job: Hadoop will refuse to run if the output directory already exists!
Code Block
~> cat phase.json
[
  {
    "Name": "MapReduce Step 1: Run Phase",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
      "Args": [
        "-input", "s3n://sagetest-YourUsername/input/phaseInput.txt",
        "-output", "s3n://sagetest-YourUsername/output/phaseTry1",
        "-mapper", "s3n://sagetest-YourUsername/scripts/phaseMapper.sh",
        "-cacheFile", "s3n://sagetest-YourUsername/scripts/phase#phase",
        "-jobconf", "mapred.reduce.tasks=0",
        "-jobconf", "mapred.task.timeout=604800000"
      ]
    }
  }
]
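A couple of notes on this configuration: "mapred.reduce.tasks=0" makes this a map-only job, so the mapper output is written directly to the output location, and "mapred.task.timeout=604800000" works out to 7 days in milliseconds (7 x 24 x 60 x 60 x 1000), overriding Hadoop's 10-minute default so long-running phase tasks are not killed. If you edit this file by hand, watch out for trailing commas; the Args list must not end with one. A quick way to catch such syntax errors, assuming Python is available where you edit the file, is to run it through Python's built-in JSON validator:
Code Block
~> python -m json.tool phase.json
If the file parses, json.tool pretty-prints it; otherwise it reports the location of the first syntax error.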
- Put it on one of the shared servers (sodo, ballard, or belltown), for example as sketched below.
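A minimal sketch of copying the file over, assuming you have SSH access to the shared servers and that the destination path is up to you (both are assumptions about your setup):
Code Block
~> scp phase.json YourUsername@sodo:~/phase.json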
If you find that your mapper tasks are not being balanced evenly across your fleet, you can add lines like the following to the Args array of your job configuration:
Code Block
"-jobconf", "mapred.map.tasks=26",
"-jobconf", "mapred.tasktracker.map.tasks.maximum=2",
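Here "mapred.map.tasks" suggests a total number of map tasks to Hadoop (it is a hint, not a hard limit), while "mapred.tasktracker.map.tasks.maximum" caps how many map tasks each TaskTracker node runs concurrently. The values shown assume a hypothetical 13-node fleet: 13 nodes x 2 slots per node = 26 map tasks running at once. Adjust both numbers to match your own cluster size.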
Start the MapReduce cluster
...