
In connection with my earlier question, when I give the command,

filePath = sc.textFile("/user/cloudera/input/Hin*/datafile.txt")
filePath.collect()

part of the data has '\xa0' prefixed to every word, while the rest of the data doesn't have that special character. I am attaching two pictures, one with '\xa0' and another without it. The content shown in both pictures belongs to the same file; only some of the data from that file is read this way by Spark. I have checked the original data file in HDFS, and it is problem-free.

I feel that it has something to do with encoding. I tried using the replace option in flatMap, e.g. flatMap(lambda line: line.replace(u'\xa0', ' ').split(" ")) and flatMap(lambda line: line.replace(u'\xa0', u' ').split(" ")), but neither worked for me. This question might sound dumb, but I am a newbie at Apache Spark and need some assistance to overcome this problem.
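For context, '\xa0' is the Unicode non-breaking space (U+00A0), which in a UTF-8 file is stored as the two bytes b'\xc2\xa0'. A minimal standalone sketch (plain Python, no Spark) showing the character and the replacement; note that replace only matches once the text has been decoded to a Unicode string:

```python
# '\xa0' is the Unicode non-breaking space (U+00A0). In a UTF-8 file
# it is stored as the two bytes b'\xc2\xa0', which is why it can show
# up glued to the front of a word when the file mixes non-breaking
# and plain spaces between words.
raw = b"\xc2\xa0word"                   # UTF-8 bytes: NBSP + "word"

decoded = raw.decode("utf-8")           # the NBSP survives decoding
cleaned = decoded.replace("\xa0", " ")  # replace works on the decoded text

print(repr(decoded), repr(cleaned))
```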

Can anyone please help me? Thanks in advance.

1 Answer


Check the encoding of your file. When you use sc.textFile, Spark expects a UTF-8 encoded file. One solution is to read your file with sc.binaryFiles and then apply the expected decoding yourself.

sc.binaryFiles creates a key/value RDD where the key is the path to the file and the value is its content as bytes. If you need to keep only the text, apply a decoding function:

filePath = sc.binaryFiles("/user/cloudera/input/Hin*/datafile.txt")
filePath.map(lambda x: x[1].decode('utf-8'))  # or another encoding, depending on your file
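Putting the answer together with the word count discussed in the comments, here is a hedged sketch. The path and the UTF-8 assumption come from the question; `decode_record`, `tokenize`, and `count_words` are illustrative helper names, not Spark API:

```python
from operator import add


def decode_record(kv):
    """Decode one (path, bytes) pair produced by sc.binaryFiles."""
    _, raw = kv
    return raw.decode("utf-8")  # assumption: the file really is UTF-8


def tokenize(content):
    """Replace non-breaking spaces, lowercase, and split into words."""
    return content.replace("\xa0", " ").lower().split()


def count_words(sc, pattern):
    # One record per matched file: (path, file contents as bytes).
    pairs = sc.binaryFiles(pattern)
    return (pairs.map(decode_record)
                 .flatMap(tokenize)
                 .map(lambda w: (w, 1))
                 .reduceByKey(add))
```

A driver would then call something like `count_words(sc, "/user/cloudera/input/Hin*/datafile.txt").collect()`; filtering against a predefined word list, as in the comments, can be added as a `.filter(...)` step before the `.map(lambda w: (w, 1))`.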

15 Comments

Like filePath = sc.binaryFiles("/user/cloudera/Hin*/datafile.txt")? Then where should I apply the encoding method?
Can you please elaborate?
Where should I add your map function? filePath.flatMap(lambda line: line.split(" ")).filter(lambda w: w.lower() in words).map(lambda word: (word, 1)).reduceByKey(add) I want to search for predefined words and count their number of occurrences. This is the link of my earlier question.
@JohnDeer you start with the decode function, as I wrote it. Then you apply your count.
Like, first 2 lines will be your code, followed by filePath.flatMap(lambda line: line.split(" ")).filter(lambda w: w.lower() in words).map(lambda word: (word, 1)).reduceByKey(add) will be 3rd line and collect() in the end, right?
