
I have two small Python scripts:

CountWordOccurence_mapper.py

#!/usr/bin/env python
import sys

#print(sys.argv[1])

text = sys.argv[1]

wordCount = text.count(sys.argv[2])

#print (sys.argv[2],wordCount)
print '%s\t%s' % (sys.argv[2], wordCount)

PrintWordCount_reducer.py

#!/usr/bin/env python
import sys

finalCount = 0

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t')

    count=int(count)

    finalCount += count

    print(word,finalCount)

I executed them as follows:

$ ./CountWordOccurence_mapper.py \
    "I am a Honda customer 100%.. 94 Accord ex 96 Accord exV6 98 Accord exv6 cpe 2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower motor blown 2months after the warranty expired. Sad $600 didn't last very long." \
    "Accord" \
     | /home/hadoopranch/omkar/PrintWordCount_reducer.py
('Accord', 4)

As seen, my objective is a simple one: count the number of occurrences of the supplied word (in this case, Accord) in the given text.

Now I intend to do the same using Hadoop Streaming. The text file on HDFS (partial) is:

"message" : "I am a Honda customer 100%.. 94 Accord ex   96 Accord exV6  98 Accord exv6 cpe  2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower   motor blown 2months after the warranty expired. Sad $600 didn't last very long."
"message" : "I am an angry Honda owner! In 2009 I bought a new Honda Civic and have taken great care of it.  Yesterday   I tried to start it unsuccessfully.  After hours at the auto mechanics  it was found that there was a glitch in the electric/computer system.  The news was disappointing enough (and expensive)    but to find out the problem is basically a defect/common problem with the year/make/model I purchased is awful.  When I bought a NEW Honda I thought I bought quality.  I was wrong! Will Honda step up?"

I modified CountWordOccurence_mapper.py:

#!/usr/bin/env python
import sys

for text in sys.stdin:  

    wordCount = text.count(sys.argv[1])

    print '%s\t%s' % (sys.argv[1], wordCount)

My first confusion was how to send the word to be counted (e.g. "Accord" or "Honda") as an argument to the mapper; the -cmdenv name=value option only confused me. I still went ahead and executed the following command:

$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar \
 -input /random_data/honda_service_within_warranty.txt \
 -output /random_op/cnt.txt \
 -file /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py \
 -mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord" \
 -file /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py \
 -reducer /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py

As expected, the job failed and I got the following error:

Traceback (most recent call last):
  File "/tmp/hadoop-hduser/mapred/local/taskTracker/hduser/jobcache/job_201304232210_0007/attempt_201304232210_0007_m_000001_3/work/./CountWordOccurence_mapper.py", line 6, in <module>
    wordCount = text.count(sys.argv[1])
IndexError: list index out of range
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

Please correct the syntactical and other basic mistakes I have made.

Thanks and regards!

2 Answers


I think your problem lies in the following portion of your command-line invocation:

-mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord"

I think your assumption here is that the string "Accord" is being passed as the first argument to the mapper. I'm pretty sure this isn't the case; in fact, the string "Accord" is most probably being ignored by the streaming driver entry point class (StreamJob.java).

To fix this, you'll want to go back to using the -cmdenv parameter and then extract this key/value pair in your Python code (I'm not a Python programmer, but I'm sure a quick Google search will point you towards the snippet you need).
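For reference, the extraction side is short in Python. A minimal sketch, assuming the pair was passed as -cmdenv wordToBeCounted=Accord (the variable name here is just an example, not something Hadoop defines):

```python
import os

# Hadoop Streaming exports each -cmdenv name=value pair into the task's
# environment, so the mapper reads it back via os.environ.
# "wordToBeCounted" is a hypothetical name chosen for this example.
word = os.environ.get('wordToBeCounted', 'Accord')

# Count occurrences in a sample line, exactly as the original mapper does.
sample = "94 Accord ex 96 Accord exV6 98 Accord exv6 cpe 2003 Accord"
print('%s\t%s' % (word, sample.count(word)))
```

Using os.environ.get with a fallback (rather than os.environ['...']) avoids a KeyError if the variable was not exported.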


1 Comment

I conveniently forgot to make changes to my Python file :| Thanks to your answer, it dawned on me :D

Actually, I had tried with -cmdenv wordToBeCounted="Accord", but the issue was with my Python file: I forgot to change it to read the value "Accord" from the environment variable (and NOT from the argument array). I'm attaching the code for CountWordOccurence_mapper.py in case anyone would like to use it for reference:

#!/usr/bin/env python
import sys
import os

wordToBeCounted = os.environ['wordToBeCounted']

for text in sys.stdin:  

    wordCount = text.count(wordToBeCounted)

    print '%s\t%s' % (wordToBeCounted,wordCount)
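For completeness, the corresponding driver invocation then carries the word via -cmdenv instead of a trailing argument. This is a sketch based on my original command (same paths and jar); it is not copied from a verified run:

```shell
$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar \
  -cmdenv wordToBeCounted=Accord \
  -input /random_data/honda_service_within_warranty.txt \
  -output /random_op/cnt.txt \
  -file /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py \
  -mapper CountWordOccurence_mapper.py \
  -file /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py \
  -reducer PrintWordCount_reducer.py
```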

Thanks and regards!

