
I have a CSV file with UTF-16 LE encoding.

I am able to parse the data using the code below (Spark 2.4.5):

df = spark.read \
    .schema('`my_id` string') \
    .option('sep', '\t') \
    .option('header', 'true') \
    .option('encoding', 'UTF-16') \
    .csv(my_path)

The Source data looks like this

my_id

123
456
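(For reference, an equivalent UTF-16 LE sample file can be created in plain Python to reproduce this; the filename below is just an example, and with a single column the tab separator never actually appears in the data.)

```python
# Create a minimal sample file in UTF-16 LE (no BOM).
# The filename is illustrative only.
content = "my_id\n123\n456\n"
with open("sample_utf16le.csv", "w", encoding="utf-16-le", newline="") as f:
    f.write(content)

# Sanity check: every character, including '\n', occupies two bytes.
with open("sample_utf16le.csv", "rb") as f:
    raw = f.read()
assert raw == content.encode("utf-16-le")
```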

When using df.show() or writing the data to Parquet with df.repartition(1).write.mode('append').format('parquet').save(my_target_path), I get the output below:

my_id
�
123�
456�

Opening the raw file in Notepad++, I can see the following (note: Notepad++ detected the encoding as UCS-2 LE BOM):

(screenshot: raw file in Notepad++)

When I open the file in VS Code, it detects UTF-16 LE:

(screenshot: file opened in VS Code)

Question: Is it possible to use native spark.read.csv() and avoid the additional characters that get added at the end of each line?

Comments:
  • What do you mean by "the Spark dataframe returns"? Did you use .show() to print it out? .show() might have used an incorrect encoding. Commented Dec 5, 2020 at 12:01
  • I have updated the question to clarify how the data is shown/written out. Commented Dec 5, 2020 at 12:38

1 Answer


I found a resolution after some more digging: enabling multiLine resolved the issue, and the data is now parsed without the extra characters.

df = spark.read \
    .schema('`my_id` string') \
    .option('sep', '\t') \
    .option('header', 'true') \
    .option('encoding', 'UTF-16') \
    .option('multiLine', 'true') \
    .csv(my_path)

There are two Spark issues that helped my analysis:

  1. SPARK-32961 - "For the issue itself, I am almost 100% sure we can't fix with multiLine disabled"
  2. SPARK-32965
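Those issues point at record splitting: with multiLine disabled, records are split on the newline byte before the bytes are decoded, while in UTF-16 LE the newline is the two-byte sequence b'\n\x00'. A plain-Python sketch of that splitting behavior (my understanding of the failure mode, not Spark's actual code):

```python
# Sketch (plain Python, not Spark): splitting UTF-16 LE bytes on the
# single byte b'\n' strands the newline's second byte (b'\x00') at the
# start of the next record, so later records decode into garbage.
data = "my_id\n123\n456\n".encode("utf-16-le")
parts = data.split(b"\n")
print(parts)

# parts[0] decodes cleanly, but every later part has an odd byte count
# and begins with the orphaned b'\x00' from the previous newline.
for part in parts[:-1]:
    print(part.decode("utf-16-le", errors="replace"))
```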