
I used the code below to run the word-count Hadoop job. WordCountDriver runs when I launch it from inside Eclipse with the Hadoop Eclipse plugin. It also runs from the command line when I package the mapper and reducer classes into a jar and drop that jar on the classpath.

However, it fails when I run it from the command line with the mapper and reducer supplied as plain class files rather than inside a jar, even though both classes are on the classpath. Does Hadoop have some restriction against accepting mapper and reducer classes as plain class files? Is creating a jar always mandatory?

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

    public static final String HADOOP_ROOT_DIR = "hdfs://universe:54310/app/hadoop/tmp";

    static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final Text word = new Text();
        private final IntWritable one = new IntWritable(1);

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // emit (token, 1) for every whitespace-delimited token in the line
            StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // sum the counts for each word
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    @Override
    public int run(String[] args) throws Exception {

        Configuration conf = getConf();
        conf.set("mapred.job.tracker", "universe:54311");

        Job job = new Job(conf, "Word Count");

        // specify output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // specify input and output dirs
        FileInputFormat.addInputPath(job, new Path(HADOOP_ROOT_DIR + "/input"));
        FileOutputFormat.setOutputPath(job, new Path(HADOOP_ROOT_DIR + "/output"));

        // specify a mapper, a reducer, and a combiner
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setCombinerClass(WordCountReducer.class);

        // tell Hadoop which jar to ship to the cluster
        job.setJarByClass(WordCountDriver.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCountDriver(), args);
        System.exit(res);
    }
}

2 Answers


It's not entirely clear which classpath you're referring to, but in the end, if you're running on a remote Hadoop cluster, you need to provide all classes in a JAR file that is sent to Hadoop during the hadoop jar execution. The classpath of your local program is irrelevant.
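For reference, a typical packaging-and-submission sequence looks something like this (the jar and directory names here are purely illustrative):

    # bundle the compiled classes, then submit; the jar is shipped to the cluster
    jar cf wordcount.jar -C classes/ .
    hadoop jar wordcount.jar WordCountDriver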

It probably works locally because in that case you are running a Hadoop instance inside your local process, so it happens to find the classes on your local program's classpath.


1 Comment

My driver class is local, and Hadoop is set up as a one-node cluster.

Adding classes to the Hadoop classpath makes them available client-side (i.e. to your driver), as the sketch below shows.
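For illustration (the path is hypothetical), this makes plain class files visible to the client-side JVM, but the cluster's task JVMs still cannot load them:

    # client-side only; the cluster's task JVMs never see this classpath
    export HADOOP_CLASSPATH=/path/to/classes
    hadoop WordCountDriver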

Your mapper and reducer classes need to be available cluster-wide. To make this easier on Hadoop, you bundle them into a jar and either point the job at it with Job.setJarByClass(..) or add them to the job classpath using the -libjars option handled by the GenericOptionsParser, for example:
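A sketch of that invocation, with hypothetical jar names: because your driver goes through ToolRunner, GenericOptionsParser strips -libjars before run(..) sees the arguments, and the listed jars are shipped to the cluster along with the job:

    # driver.jar holds WordCountDriver; mr-classes.jar carries the mapper and reducer
    hadoop jar driver.jar WordCountDriver -libjars mr-classes.jar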
