
Is there a way to set the encoding in SparkConf? I'm building a Java application with Spark that processes Arabic data. When I run it in the dev environment with the Spark master set to local[*], the data is processed correctly. However, when I build the JAR and submit it to the Spark cluster, the data comes out garbled, as if it needs re-encoding.
I used:

--conf spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8

as a configuration option in spark-submit, but it didn't work.
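One thing worth checking (a sketch, not a confirmed fix): `spark.driver.extraJavaOptions` only affects the driver JVM, while on a cluster the data is actually read by executor JVMs, so the equivalent executor option may also need to be set. The class name and JAR below are placeholders:

```shell
# Set -Dfile.encoding=UTF-8 on BOTH the driver and the executors.
# Quoting the --conf values matters, especially on Windows.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8" \
  --conf "spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8" \
  --class com.example.MyApp \
  myapp.jar
```

`spark.executor.extraJavaOptions` is a standard Spark configuration key; whether it resolves this particular symptom depends on where the mis-decoding actually happens.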
OS: Windows 10
Java: 1.8.0_131
Spark: 2.1.0

1 Answer

For reading textual data, Spark uses the underlying Hadoop InputFormat, which assumes UTF-8 encoding. If your data is actually UTF-8, then it should be read correctly. If not, you will need to convert it before passing it to Spark.
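If the source files turn out not to be UTF-8 (Arabic text on Windows is often in Windows-1256), one way to convert them before Spark reads them is to decode the raw bytes with their actual charset and re-encode as UTF-8. A minimal sketch, assuming the source charset is Windows-1256 (the class and method names here are illustrative, not part of any API):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Recode {

    // Decode raw bytes using their actual charset, then re-encode as UTF-8
    // so Spark's Hadoop-based text readers interpret them correctly.
    static byte[] toUtf8(byte[] raw, Charset source) {
        String text = new String(raw, source);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate a file saved in the Arabic Windows codepage.
        Charset cp1256 = Charset.forName("windows-1256");
        byte[] raw = "مرحبا".getBytes(cp1256);

        byte[] utf8 = Recode.toUtf8(raw, cp1256);
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

The same idea scales to whole files: read them as bytes (for example via `Files.readAllBytes` or Spark's `binaryFiles`), apply the conversion, and write UTF-8 output for Spark to consume.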

Handling other character encodings has been raised as an issue (SPARK-1849) but has been marked as "Won't Fix".

It is odd that your data works in a local job but not in a cluster job. You may need to provide further details before anyone here can help: for example, what OS are the cluster and your client node running, and how do you know the problem is one of encoding?


1 Comment

Thanks for your response. I updated my question with the environment details.
