I am new to Hadoop and am trying to use its streaming feature with a mapper and reducer written in Python. The problem is that my original input file contains sequences of lines that are to be identified by a mapper. If I let Hadoop split the input file, it might do so in the middle of a sequence, and that sequence would then not be detected. So I was thinking about splitting the files manually. This would also break some sequences, so in addition I would provide an alternative split whose files overlap the "first" split. That way I would not lose any sequences.

I will be running the following command described in this article:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
-file /home/hduser/mapper.py    -mapper /home/hduser/mapper.py \
-file /home/hduser/reducer.py   -reducer /home/hduser/reducer.py \
-input /user/hduser/seq_files/* -output /user/hduser/output_files

I plan to put my input files (i.e. the files generated by the manual splits) in /user/hduser/seq_files/.

The questions are:

  1. How do I configure Hadoop to take each input file and send it to a mapper as it is?

  2. If the number of input files is greater than the number of nodes, will all the files still be mapped? Thanks.

1 Answer

There are a number of issues to consider here.

  1. The map part of map/reduce requires that all the data you need to map a line resides on that line. Trying to work around this is very bad practice and may be considered a smell that you are doing something wrong.
  2. Hadoop only splits input files that are splittable, such as bz2-compressed or uncompressed files. Gzip files, for instance, do not get split.
  3. If you are analysing sequences, or "things that require a particular order to make sense", this is typically done in a reducer, since the data streamed to it is always sorted on the Hadoop sort key (and this is why you map the key out). See the sketch below.
  4. The reducers receive a partitioned dataset from the mappers after it has been sorted. To avoid separating information that all needs to go to the same reducer to be interpreted, use the Hadoop partitioning key.

Note that all the links point to the same page, just different chapters. In general, I think reading that page from top to bottom will give you a much firmer notion of how to use Hadoop in a streaming fashion.
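
To make point 3 concrete, here is a minimal, hypothetical sketch of a streaming mapper and reducer in Python. It assumes an input format of "<seq_id> <position> <payload>" per line (an assumption on my part, adjust the parsing to your data); the mapper emits the sequence id as the sort key, and the reducer stitches each sequence back together from the sorted stream:

#!/usr/bin/env python
# mapper.py (sketch): emit the sequence id as the Hadoop sort key
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # assumed format: "<seq_id> <position> <payload>"
    seq_id, position, payload = line.split(" ", 2)
    # key\tvalue -- Hadoop streaming sorts on the key before the reduce phase
    print("%s\t%s %s" % (seq_id, position, payload))

#!/usr/bin/env python
# reducer.py (sketch): lines arrive sorted by key, so all pieces of one
# sequence are contiguous and can be reassembled here
import sys

def emit(seq_id, pieces):
    if seq_id is not None:
        # order the collected pieces by position and join their payloads
        pieces.sort(key=lambda p: int(p[0]))
        print("%s\t%s" % (seq_id, "".join(p[1] for p in pieces)))

current_id, pieces = None, []
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    position, payload = value.split(" ", 1)
    if key != current_id:
        emit(current_id, pieces)
        current_id, pieces = key, []
    pieces.append((position, payload))
emit(current_id, pieces)

With a pair along these lines it does not matter where Hadoop splits the input: every piece of a sequence carries its own id, so the sort groups the pieces back together (and, with the partitioning key, they all end up at the same reducer).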

Bonus tip: If you want to do map/reduce with Python, I can recommend looking at Apache Spark for Python, which runs on Hadoop but is a whole lot faster. It also lets you use the IPython console for developing your map/reduce algorithms, which speeds up development tremendously.
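
For a rough feel of what the same job could look like in PySpark, here is a short sketch (assumptions: the same hypothetical "<seq_id> <position> <payload>" line format as above, and an output path of /user/hduser/output_files_spark chosen purely for illustration):

# sketch of the sequence reassembly as a PySpark job
from pyspark import SparkContext

sc = SparkContext(appName="sequence-reassembly")

lines = sc.textFile("hdfs:///user/hduser/seq_files")

# build (seq_id, (position, payload)) pairs, then gather and order all pieces
# of a sequence, regardless of which input split they came from
sequences = (lines.map(lambda l: l.split(" ", 2))
                  .map(lambda f: (f[0], (int(f[1]), f[2])))
                  .groupByKey()
                  .mapValues(lambda ps: "".join(p for _, p in sorted(ps))))

sequences.saveAsTextFile("hdfs:///user/hduser/output_files_spark")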

4 Comments

Thanks, this is brilliant! It is a much better idea, indeed, to take care of the order in the reducer. One question here: the information about line ordering, such as the line number, should probably still be passed into the mapper, otherwise that info would be lost. I wonder if I should manually add that line number as a key in front of each line in the file (i.e. manually preprocess the input file), or whether Hadoop's streaming mechanism can automatically assign a "line number" key to each line before passing it to the reducer? Thanks.
@jazzblue, The mapper's only job is to change the input line so it fits your needs for sorting, partitioning and reducing. There is no automagical way that anything from the original data gets transferred to the reducer; if you want it, you have to output it yourself in your mapper code (see the sketch below these comments).
Thanks. One more thought: if I still want to have control over how I split the input file, I could, for example, make a JSON list out of each block of lines (partition) and put each list into a new input file as a single line. This way the whole block would reside on one line and would be sent to the same mapper. I wonder if this kind of JSON manipulation is an acceptable practice or whether it could have some drawbacks? Thanks.
@jazzblue, JSON is very slow to deserialize; I would suggest you keep away from it. Also, each line should be atomic, meaning that if you split it up any further it stops making sense. Gluing lots of lines together into one is the job of the reducer, and that is where it should be done; trying to do it manually before the mapper defeats the purpose of map/reduce. Keep in mind that you can run two streaming jobs in Hadoop, where one's output is the input of the next. That way you first sort out your dataset the way you now describe, and then do your actual reduction.
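
A minimal sketch of a mapper that puts an explicit ordering key in front of every line, as discussed above. Two assumptions: Hadoop streaming exposes the input file name through an environment variable (mapreduce_map_input_file in newer versions, map_input_file in older ones), and the counter is only per input split, so it is a true line number only if the file is not split further:

#!/usr/bin/env python
# mapper.py (sketch): tag every line with an explicit ordering key, since
# nothing from the original data reaches the reducer unless the mapper emits it
import os
import sys

# name of the file this split came from; the exact variable name depends on
# the Hadoop version (assumed: mapreduce_map_input_file / map_input_file)
input_file = os.environ.get("mapreduce_map_input_file",
                            os.environ.get("map_input_file", "unknown"))

for line_no, line in enumerate(sys.stdin):
    # key = "<file>:<zero-padded line number within this split>", value = the line
    print("%s:%010d\t%s" % (input_file, line_no, line.rstrip("\n")))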
