
I have a CSV file with UTF-16 LE encoding.

I am able to parse the data using the code below (Spark 2.4.5):

df = spark.read \
    .schema('`my_id` string') \
    .option('sep', '\t') \
    .option('header', 'true') \
    .option('encoding', 'UTF-16') \
    .csv(my_path)

The Source data looks like this

my_id

123
456
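(For reference, an equivalent UTF-16 LE sample file can be created in plain Python to reproduce this; the filename below is just an example, and with a single column the tab separator never actually appears in the data.)

```python
# Create a minimal sample file in UTF-16 LE (no BOM).
# The filename is illustrative only.
content = "my_id\n123\n456\n"
with open("sample_utf16le.csv", "w", encoding="utf-16-le", newline="") as f:
    f.write(content)

# Sanity check: every character, including '\n', occupies two bytes.
with open("sample_utf16le.csv", "rb") as f:
    raw = f.read()
assert raw == content.encode("utf-16-le")
```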

When using df.show() or writing the data to Parquet with df.repartition(1).write.mode('append').format('parquet').save(my_target_path), I get the output below:

my_id
�
123�
456�

Opening the raw file in Notepad++, I can see the following (note: Notepad++ detected the encoding as UCS-2 LE BOM):

(screenshot: raw file in Notepad++)

When I open the file in VS Code, it detects UTF-16 LE:

(screenshot: file opened in VS Code)

Question: Is it possible to use native spark.read.csv() and avoid the additional characters that get added at the end of each line?

Comments:
  • What do you mean by "the Spark dataframe returns"? Did you use .show() to print it out? .show() might have used an incorrect encoding. Commented Dec 5, 2020 at 12:01
  • I have updated the question to clarify how the data is shown/written out. Commented Dec 5, 2020 at 12:38

1 Answer


I found a resolution after some more digging: enabling multiLine resolved the issue, and the data is now parsed without the extra characters.

df = spark.read \
    .schema('`my_id` string') \
    .option('sep', '\t') \
    .option('header', 'true') \
    .option('encoding', 'UTF-16') \
    .option('multiLine', 'true') \
    .csv(my_path)

There are two Spark issues that helped my analysis:

  1. SPARK-32961 - "For the issue itself, I am almost 100% sure we can't fix with multiLine disabled"
  2. SPARK-32965
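Those issues point at record splitting: with multiLine disabled, records are split on the newline byte before the bytes are decoded, while in UTF-16 LE the newline is the two-byte sequence b'\n\x00'. A plain-Python sketch of that splitting behavior (my understanding of the failure mode, not Spark's actual code):

```python
# Sketch (plain Python, not Spark): splitting UTF-16 LE bytes on the
# single byte b'\n' strands the newline's second byte (b'\x00') at the
# start of the next record, so later records decode into garbage.
data = "my_id\n123\n456\n".encode("utf-16-le")
parts = data.split(b"\n")
print(parts)

# parts[0] decodes cleanly, but every later part has an odd byte count
# and begins with the orphaned b'\x00' from the previous newline.
for part in parts[:-1]:
    print(part.decode("utf-16-le", errors="replace"))
```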