Is there any way to set and (later) get a custom configuration object in Hadoop, during Map/Reduce?
For example, assume an application that preprocesses a large file and dynamically determines some characteristics of that file. Furthermore, assume that those characteristics are stored in a custom Java object (e.g., a Properties object, but not exclusively, since some of them may not be strings) and are subsequently needed by each of the map and reduce tasks.
How could the application "propagate" this configuration so that each mapper and reducer function can access it when needed?
One approach could be to use the set(String, String) method of the JobConf class, passing the configuration object serialized as a JSON string via the second parameter. However, this may be too much of a hack, and the appropriate JobConf instance would then have to be accessed by each Mapper and Reducer anyway (e.g., following an approach like the one suggested in an earlier question).
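For reference, a rough sketch of what that JobConf-based approach might look like with the old mapred API is shown below. The FileCharacteristics class, the "myapp.file.characteristics" key, and the use of Jackson for JSON (de)serialization are just placeholders for illustration, not part of any Hadoop API:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    import com.fasterxml.jackson.databind.ObjectMapper;

    public class CharacteristicsJobSketch {

        private static final String CONF_KEY = "myapp.file.characteristics";

        // Hypothetical object holding the characteristics determined during preprocessing.
        public static class FileCharacteristics {
            public long recordCount;
            public double averageRecordLength;
        }

        // Driver side: serialize the object and stash it in the JobConf as a string.
        public static void submit(FileCharacteristics characteristics) throws IOException {
            JobConf conf = new JobConf(CharacteristicsJobSketch.class);
            conf.set(CONF_KEY, new ObjectMapper().writeValueAsString(characteristics));
            // ... set input/output paths, mapper/reducer classes, etc. ...
            JobClient.runJob(conf);
        }

        // Mapper side: deserialize it back in configure(), which receives the JobConf.
        public static class MyMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {

            private FileCharacteristics characteristics;

            @Override
            public void configure(JobConf job) {
                try {
                    characteristics = new ObjectMapper()
                            .readValue(job.get(CONF_KEY), FileCharacteristics.class);
                } catch (IOException e) {
                    throw new RuntimeException("Cannot deserialize characteristics", e);
                }
            }

            @Override
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                // ... use this.characteristics while processing each record ...
            }
        }
    }

This works, but it forces every non-string value through a string round-trip and couples all mappers and reducers to the serialization format, which is what makes it feel like a hack.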