I am writing a Python Spark utility to read files and do some transformations. The files hold a large amount of data (up to 12 GB). I use sc.textFile to create an RDD, and the logic passes each line of the RDD to a map function, which in turn splits the line on "," and runs some data transformations (changing field values based on a mapping).
Sample line from the file:

0014164,02,031270,09,1,,0,0,0000000000,134314,Mobile,ce87862158eb0dff3023e16850f0417a-cs31,584e2cd63057b7ed,Privé,Gossip
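For reference, the read/transform path looks roughly like this (the paths and transform_fields are placeholders for my actual values):

def transform_fields(v):
    # stand-in for my mapping logic; v is the list of fields
    return v

def process(line):
    v = line.split(",")
    return ",".join(transform_fields(v))

rdd = sc.textFile("/data/input")               # placeholder path
rdd.map(process).saveAsTextFile("/data/output")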
Because of values like "Privé" I get a UnicodeDecodeError. I tried the following to handle this value:
if isinstance(v[12], basestring):
    v[12] = v[12].encode('utf8')
else:
    v[12] = unicode(v[12]).encode('utf8')
but when I write the data back out, this field ends up as 'Priv�'. On Linux, the file command reports the source file as "ISO-8859 text, with very long lines, with CRLF line terminators".
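I suspect the root cause is that in ISO-8859-1 the é is the single byte 0xE9, which is not valid UTF-8 on its own, so decoding the line as UTF-8 (which, as far as I understand, sc.textFile does by default) fails or substitutes the replacement character. In a plain Python 2 shell:

>>> 'Priv\xe9'.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 4: unexpected end of data
>>> 'Priv\xe9'.decode('iso-8859-1')
u'Priv\xe9'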
Could someone let me know the right way in Spark to read/write files with mixed encodings?
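What I was about to try is reading raw bytes and decoding/encoding explicitly, along these lines (just a sketch; use_unicode=False is the textFile flag that returns raw str instead of UTF-8-decoded unicode, and process is the placeholder function from the sketch above):

rdd = sc.textFile("/data/input", use_unicode=False)    # raw bytes, no implicit UTF-8 decode
decoded = rdd.map(lambda line: line.decode('iso-8859-1'))
decoded.map(process) \
       .map(lambda line: line.encode('utf-8')) \
       .saveAsTextFile("/data/output")

Is that the idiomatic approach, or is there a better way?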