
In connection with my earlier question, when I give the command,

filePath = sc.textFile("/user/cloudera/input/Hin*/datafile.txt")
filePath.collect()

part of the data has '\xa0' prefixed to every word, while the rest of the data doesn't have that special character. I am attaching two pictures, one with '\xa0' and another without it. The content shown in both pictures belongs to the same file; only some of the data from that file is read this way by Spark. I have checked the original data file in HDFS, and it is problem-free.

I feel that it has something to do with encoding. I tried using the replace option in flatMap, e.g. flatMap(lambda line: line.replace(u'\xa0', ' ').split(" ")) and flatMap(lambda line: line.replace(u'\xa0', u' ').split(" ")), but neither worked for me. This question might sound dumb, but I am a newbie at Apache Spark and need some assistance to overcome this problem.
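For context, '\xa0' is the Unicode non-breaking space (U+00A0), which in a UTF-8 file is stored as the two bytes b'\xc2\xa0'. A minimal standalone sketch (plain Python, no Spark) showing the character and the replacement; note that replace only matches once the text has been decoded to a Unicode string:

```python
# '\xa0' is the Unicode non-breaking space (U+00A0). In a UTF-8 file
# it is stored as the two bytes b'\xc2\xa0', which is why it can show
# up glued to the front of a word when the file mixes non-breaking
# and plain spaces between words.
raw = b"\xc2\xa0word"                   # UTF-8 bytes: NBSP + "word"

decoded = raw.decode("utf-8")           # the NBSP survives decoding
cleaned = decoded.replace("\xa0", " ")  # replace works on the decoded text

print(repr(decoded), repr(cleaned))
```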

Can anyone please help me? Thanks in advance.

1 Answer


Check the encoding of your file. When you use sc.textFile, Spark expects a UTF-8 encoded file. One solution is to read your file with sc.binaryFiles and then apply the expected decoding yourself.

sc.binaryFiles creates a key/value RDD where the key is the path to the file and the value is its content as bytes. If you need to keep only the text, apply a decoding function:

filePath = sc.binaryFiles("/user/cloudera/input/Hin*/datafile.txt")
filePath.map(lambda x: x[1].decode('utf-8'))  # or another encoding, depending on your file
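Putting the answer together with the word count discussed in the comments, here is a hedged sketch. The path and the UTF-8 assumption come from the question; `decode_record`, `tokenize`, and `count_words` are illustrative helper names, not Spark API:

```python
from operator import add


def decode_record(kv):
    """Decode one (path, bytes) pair produced by sc.binaryFiles."""
    _, raw = kv
    return raw.decode("utf-8")  # assumption: the file really is UTF-8


def tokenize(content):
    """Replace non-breaking spaces, lowercase, and split into words."""
    return content.replace("\xa0", " ").lower().split()


def count_words(sc, pattern):
    # One record per matched file: (path, file contents as bytes).
    pairs = sc.binaryFiles(pattern)
    return (pairs.map(decode_record)
                 .flatMap(tokenize)
                 .map(lambda w: (w, 1))
                 .reduceByKey(add))
```

A driver would then call something like `count_words(sc, "/user/cloudera/input/Hin*/datafile.txt").collect()`; filtering against a predefined word list, as in the comments, can be added as a `.filter(...)` step before the `.map(lambda w: (w, 1))`.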

15 Comments

Like filePath = sc.binaryFiles("/user/cloudera/Hin*/datafile.txt")? Then where should I apply the encoding method?
Can you please elaborate?
Where should I add your map function? filePath.flatMap(lambda line: line.split(" ")).filter(lambda w: w.lower() in words).map(lambda word: (word, 1)).reduceByKey(add) I want to search for predefined words and count their number of occurrences. This is the link of my earlier question.
@JohnDeer you start with the decode function, as I wrote it. Then you apply your count.
Like, first 2 lines will be your code, followed by filePath.flatMap(lambda line: line.split(" ")).filter(lambda w: w.lower() in words).map(lambda word: (word, 1)).reduceByKey(add) will be 3rd line and collect() in the end, right?
