I have a Hadoop Streaming job. The job uses a Python script which imports another Python script. The command works fine from the command line but fails under Hadoop Streaming. Here is my Hadoop Streaming command:

hadoop jar $streamingJar \
    -D mapreduce.map.memory.mb=4096 \
    -files preprocess.py,parse.py \
    -input $input \
    -output $output \
    -mapper "python parse.py" \
    -reducer NONE

And here is the first line of parse.py:

from preprocess import normalize_large_text, normalize_small_text

When I run the command through Hadoop Streaming, I see the following output in the logs:

Traceback (most recent call last):
  File "preprocess.py", line 1, in <module>
    from preprocess import normalize_large_text, normalize_small_text, normalize_skill_cluster
ImportError: No module named preprocess

My understanding is that Hadoop puts all the shipped files into the same working directory on each task node. If that's true, I don't see how the import could fail. Does anyone know what's going on?
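One way I can think of to check this assumption is to ship a throwaway mapper that just dumps what the task actually sees (debug_mapper.py is a made-up name for this sketch, not part of my real job):

import os
import sys

# Report the task's working directory, its contents, and Python's
# import path; the job's output then doubles as the debug report.
sys.stdout.write("cwd: %s\n" % os.getcwd())
sys.stdout.write("contents: %s\n" % sorted(os.listdir(".")))
sys.stdout.write("sys.path: %s\n" % sys.path)

I would ship it with -files debug_mapper.py and run it with -mapper "python debug_mapper.py".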

Thanks

1 Answer


You need to put the scripts in the same directory and ship that directory with the -files flag:

hadoop jar $streamingJar \
    -D mapreduce.map.memory.mb=4096 \
    -files python_files \
    -input $input \
    -output $output \
    -mapper "python python_files/parse.py" \
    -reducer NONE
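With that layout, parse.py sits in python_files/ next to preprocess.py, so Python resolves the import from the script's own directory. For reference, here is a minimal sketch of parse.py as a streaming mapper; the question only shows the import line, so the mapper body below is an assumption:

# parse.py -- minimal streaming-mapper sketch; it lives in python_files/
# next to preprocess.py, so the import below resolves from the script's
# own directory.
import sys

from preprocess import normalize_large_text, normalize_small_text

for line in sys.stdin:
    text = line.rstrip("\n")
    # How the normalizers are applied is an assumption; this just emits
    # one normalized line per input record.
    sys.stdout.write(normalize_large_text(text) + "\n")

You can sanity-check it outside Hadoop with cat sample.txt | python python_files/parse.py; if the import fails there too, the problem is the directory layout rather than streaming.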
