I have a Hadoop Streaming job. The job uses a Python script which imports another Python script. The command works fine from the command line but fails under Hadoop Streaming. Here is my Hadoop Streaming command:

hadoop jar $streamingJar \
    -D mapreduce.map.memory.mb=4096 \
    -files preprocess.py,parse.py \
    -input $input \
    -output $output \
    -mapper "python parse.py" \
    -reducer NONE

And here is the first line of parse.py:

from preprocess import normalize_large_text, normalize_small_text

When I run the command through Hadoop Streaming, I see the following output in the logs:

Traceback (most recent call last):
  File "preprocess.py", line 1, in <module>
    from preprocess import normalize_large_text, normalize_small_text, normalize_skill_cluster
ImportError: No module named preprocess

My understanding is that Hadoop puts all the shipped files into the same working directory on each task node. If that's true, I don't see how the import could fail. Does anyone know what's going on?
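One way I can think of to check this assumption is to ship a throwaway mapper that just dumps what the task actually sees (debug_mapper.py is a made-up name for this sketch, not part of my real job):

import os
import sys

# Report the task's working directory, its contents, and Python's
# import path; the job's output then doubles as the debug report.
sys.stdout.write("cwd: %s\n" % os.getcwd())
sys.stdout.write("contents: %s\n" % sorted(os.listdir(".")))
sys.stdout.write("sys.path: %s\n" % sys.path)

I would ship it with -files debug_mapper.py and run it with -mapper "python debug_mapper.py".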

Thanks

1 Answer


You need to put the scripts in the same directory and ship that directory with the -files flag:

hadoop jar $streamingJar \
    -D mapreduce.map.memory.mb=4096 \
    -files python_files \
    -input $input \
    -output $output \
    -mapper "python python_files/parse.py" \
    -reducer NONE
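With that layout, parse.py sits in python_files/ next to preprocess.py, so Python resolves the import from the script's own directory. For reference, here is a minimal sketch of parse.py as a streaming mapper; the question only shows the import line, so the mapper body below is an assumption:

# parse.py -- minimal streaming-mapper sketch; it lives in python_files/
# next to preprocess.py, so the import below resolves from the script's
# own directory.
import sys

from preprocess import normalize_large_text, normalize_small_text

for line in sys.stdin:
    text = line.rstrip("\n")
    # How the normalizers are applied is an assumption; this just emits
    # one normalized line per input record.
    sys.stdout.write(normalize_large_text(text) + "\n")

You can sanity-check it outside Hadoop with cat sample.txt | python python_files/parse.py; if the import fails there too, the problem is the directory layout rather than streaming.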
