...

  1. Write your job configuration. Note that you must change the output location each time you run the job; Hadoop will fail the step if the output path already exists.
    Code Block
    ~>cat phase.json
    [
        {
            "Name": "MapReduce Step 1: Run Phase",
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {
                "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
                "Args": [
                    "-input",     "s3n://sagetest-YourUsername/input/phaseInput.txt",
                    "-output",    "s3n://sagetest-YourUsername/output/phaseTry1",
                    "-mapper",    "s3n://sagetest-YourUsername/scripts/phaseMapper.sh",
                    "-cacheFile", "s3n://sagetest-YourUsername/scripts/phase#phase",
                    "-jobconf",   "mapred.reduce.tasks=0",
                    "-jobconf",   "mapred.task.timeout=604800000"
                ]
            }
        }
    ]
    
  2. Put it on one of the shared servers sodo/ballard/belltown.
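Since Hadoop will not overwrite an existing output directory, one way to avoid editing the config by hand before every run is to stamp the output path with the current date and time. A minimal sketch (the bucket and path names follow the example config above, and are placeholders until you substitute your own username):

```shell
# Build a fresh S3 output path for each run, e.g. .../output/phase-20240101-120000
STAMP=$(date +%Y%m%d-%H%M%S)
OUTPUT="s3n://sagetest-YourUsername/output/phase-${STAMP}"
echo "${OUTPUT}"
# Then point the -output argument in phase.json at ${OUTPUT} before submitting the job.
```
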

If you find that your mapper tasks are not being balanced evenly across your fleet, you can add lines like the following to the Args list in your job config:

Code Block

                    "-jobconf", "mapred.map.tasks=26", 
                    "-jobconf", "mapred.tasktracker.map.tasks.maximum=2",

Start the MapReduce cluster

...