
How can I apply UTF-8 encoding properly when writing a DataFrame to a CSV file in Spark 2 with Scala? I am using this:

df.repartition(1).write.mode(SaveMode.Overwrite)
  .format("csv")
  .option("header", true)
  .option("delimiter", "|")
  .save(Path)

It is not working: for example, é is replaced with weird strings.

Thank you.

  • UTF-8 is the default encoding used by Spark. Commented Oct 21, 2019 at 8:41
  • @Shaido Why am I getting weird characters in the output then? I checked my DataFrame in spark-shell and it looks fine. Commented Oct 21, 2019 at 8:42
  • Can you post images of your shell output and the file contents for better understanding? Commented Oct 21, 2019 at 9:54
  • Try setting the encoding option explicitly to UTF-8, even though that is the default when the option is unset. Perhaps Spark is running with a different locale. Commented Oct 21, 2019 at 9:59
  • I mean .option("encoding", "UTF-8"). Commented Oct 21, 2019 at 10:17
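The "weird strings" described in the question are classic mojibake: the file is written as UTF-8, but the program reading it (a text editor, Excel, etc.) decodes the bytes with a single-byte charset such as Latin-1. A minimal plain-Scala sketch (no Spark required) reproduces the symptom; the charset names come from the standard `java.nio.charset` library:

```scala
import java.nio.charset.StandardCharsets

object MojibakeDemo extends App {
  // "é" encoded as UTF-8 occupies two bytes: 0xC3 0xA9.
  val utf8Bytes = "é".getBytes(StandardCharsets.UTF_8)

  // Decoding those same bytes as ISO-8859-1 (Latin-1) maps each byte to
  // its own character, producing the two-character string "Ã©" -- the same
  // corruption seen when a UTF-8 CSV is opened with the wrong charset.
  val mojibake = new String(utf8Bytes, StandardCharsets.ISO_8859_1)

  println(s"bytes: ${utf8Bytes.length}, decoded wrongly: $mojibake") // "Ã©"
}
```

So the corruption can happen either at write time (Spark writing with a non-UTF-8 charset) or at read time (the viewer decoding UTF-8 bytes wrongly); checking both sides narrows down the fix.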

1 Answer


As @Hristo Iliev suggested, I needed to set the UTF-8 encoding explicitly:

df.repartition(1).write.mode(SaveMode.Overwrite)
  .format("csv")
  .option("header", true)
  .option("encoding", "UTF-8")
  .option("delimiter", "|")
  .save(Path)

