I am new to Hadoop and am trying to use its streaming feature with a mapper and reducer written in Python. The problem is that my original input file contains sequences of lines that the mapper must identify. If I let Hadoop split the input file, it might cut a sequence in the middle, and that sequence would go undetected. So I was thinking about splitting the file manually. Since that would also break some sequences, I would additionally provide an alternative split whose files overlap the files of the first split. That way I will not lose any sequences.
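To make the idea concrete, here is a rough sketch of the manual split I have in mind. It writes the input as fixed-size chunks, plus a second "overlapping" set of chunks shifted by half a chunk, so any sequence broken by the first split is whole in the second. The chunk size and file-naming scheme here are placeholders, not anything Hadoop requires:

```python
import os

CHUNK_LINES = 1000  # lines per split file (arbitrary choice)

def write_chunks(lines, out_dir, prefix, offset=0):
    """Write `lines` (starting at `offset`) into files of CHUNK_LINES lines each."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, start in enumerate(range(offset, len(lines), CHUNK_LINES)):
        path = os.path.join(out_dir, "%s_%05d.txt" % (prefix, i))
        with open(path, "w") as f:
            f.writelines(lines[start:start + CHUNK_LINES])
        paths.append(path)
    return paths

def split_with_overlap(in_path, out_dir):
    with open(in_path) as f:
        lines = f.readlines()
    # primary split: chunks [0, N), [N, 2N), ...
    primary = write_chunks(lines, out_dir, "split_a", offset=0)
    # alternative split shifted by half a chunk: [N/2, 3N/2), ...
    secondary = write_chunks(lines, out_dir, "split_b", offset=CHUNK_LINES // 2)
    return primary + secondary
```

Sequences detected in both splits would be duplicated, so the reducer would have to de-duplicate them.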
I will be running the following command described in this article:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
-file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
-file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
-input /user/hduser/seq_files/* -output /user/hduser/output_files
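For reference, my mapper.py follows the usual streaming contract: read raw lines from stdin, write tab-separated key/value pairs to stdout. The sequence-detection logic shown below (blank-line-delimited blocks, keyed by their first line) is only a stand-in for my real logic:

```python
#!/usr/bin/env python
import sys

def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Emit one `key<TAB>value` line per detected block of consecutive
    non-blank lines (placeholder for the real sequence detection)."""
    block = []
    for line in stdin:
        line = line.rstrip("\n")
        if line:
            block.append(line)
        elif block:
            # end of a block: emit its first line as key, its length as value
            stdout.write("%s\t%d\n" % (block[0], len(block)))
            block = []
    if block:  # flush the last block at end of input
        stdout.write("%s\t%d\n" % (block[0], len(block)))

if __name__ == "__main__":
    run_mapper()
```

The point is that a streaming mapper only ever sees a stream of lines on stdin, which is why the file-splitting question below matters to me.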
I plan to put my input files (i.e., the files generated by the manual splits) in /user/hduser/seq_files/.
The questions are:

1. How do I configure Hadoop to take each input file and pass it, unsplit, to a mapper?
2. If the number of input files is greater than the number of nodes, will all of the files still be mapped?

Thanks.