...
- Write your job configuration. Note that you must change the output location each time you run the job: Hadoop will refuse to run if the output directory already exists!
Code Block
~> cat phase.json
[
  {
    "Name": "MapReduce Step 1: Run Phase",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
      "Args": [
        "-input", "s3n://sagetest-YourUsername/input/phaseInput.txt",
        "-output", "s3n://sagetest-YourUsername/output/phaseTry1",
        "-mapper", "s3n://sagetest-YourUsername/scripts/phaseMapper.sh",
        "-cacheFile", "s3n://sagetest-YourUsername/scripts/phase#phase",
        "-jobconf", "mapred.reduce.tasks=0",
        "-jobconf", "mapred.task.timeout=604800000"
      ]
    }
  }
]
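A couple of notes on this configuration: "mapred.reduce.tasks=0" makes this a map-only job, so the mapper output is written directly to the output location, and "mapred.task.timeout=604800000" works out to 7 days in milliseconds (7 x 24 x 60 x 60 x 1000), overriding Hadoop's 10-minute default so long-running phase tasks are not killed. If you edit this file by hand, watch out for trailing commas; the Args list must not end with one. A quick way to catch such syntax errors, assuming Python is available where you edit the file, is to run it through Python's built-in JSON validator:
Code Block
~> python -m json.tool phase.json
If the file parses, json.tool pretty-prints it; otherwise it reports the location of the first syntax error.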
- Put it on one of the shared servers (sodo, ballard, or belltown), for example as sketched below.
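A minimal sketch of copying the file over, assuming you have SSH access to the shared servers and that the destination path is up to you (both are assumptions about your setup):
Code Block
~> scp phase.json YourUsername@sodo:~/phase.json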
If you find that your mapper tasks are not being balanced evenly across your fleet, you can add lines like the following to the Args array of your job configuration:
Code Block
"-jobconf", "mapred.map.tasks=26",
"-jobconf", "mapred.tasktracker.map.tasks.maximum=2",
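Here "mapred.map.tasks" suggests a total number of map tasks to Hadoop (it is a hint, not a hard limit), while "mapred.tasktracker.map.tasks.maximum" caps how many map tasks each TaskTracker node runs concurrently. The values shown assume a hypothetical 13-node fleet: 13 nodes x 2 slots per node = 26 map tasks running at once. Adjust both numbers to match your own cluster size.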
Start the MapReduce cluster
...