
Is there a way to set the encoding in SparkConf? I'm building a Java application with Spark that processes Arabic data. When I run it in the dev environment with the Spark master set to local[*], the data is processed correctly. However, when I build the JAR and submit it to the Spark cluster, the data comes out garbled, as if it needs re-encoding.
I used:

--conf spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8

as a configuration option in spark-submit, but it didn't work.
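One thing worth checking (a sketch, not a confirmed fix): `spark.driver.extraJavaOptions` only affects the driver JVM, while on a cluster the data is actually read by executor JVMs, so the equivalent executor option may also need to be set. The class name and JAR below are placeholders:

```shell
# Set -Dfile.encoding=UTF-8 on BOTH the driver and the executors.
# Quoting the --conf values matters, especially on Windows.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8" \
  --conf "spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8" \
  --class com.example.MyApp \
  myapp.jar
```

`spark.executor.extraJavaOptions` is a standard Spark configuration key; whether it resolves this particular symptom depends on where the mis-decoding actually happens.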
OS: Windows 10
Java: 1.8.0_131
Spark: 2.1.0

1 Answer

For reading textual data, Spark uses the underlying Hadoop InputFormat, which assumes UTF-8 encoding. If your data is actually UTF-8, then it should be read correctly. If not, you will need to convert it before passing it to Spark.
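If the source files turn out not to be UTF-8 (Arabic text on Windows is often in Windows-1256), one way to convert them before Spark reads them is to decode the raw bytes with their actual charset and re-encode as UTF-8. A minimal sketch, assuming the source charset is Windows-1256 (the class and method names here are illustrative, not part of any API):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Recode {

    // Decode raw bytes using their actual charset, then re-encode as UTF-8
    // so Spark's Hadoop-based text readers interpret them correctly.
    static byte[] toUtf8(byte[] raw, Charset source) {
        String text = new String(raw, source);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate a file saved in the Arabic Windows codepage.
        Charset cp1256 = Charset.forName("windows-1256");
        byte[] raw = "مرحبا".getBytes(cp1256);

        byte[] utf8 = Recode.toUtf8(raw, cp1256);
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

The same idea scales to whole files: read them as bytes (for example via `Files.readAllBytes` or Spark's `binaryFiles`), apply the conversion, and write UTF-8 output for Spark to consume.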

Handling other character encodings has been raised as an issue (SPARK-1849) but has been marked as "Won't Fix".

It is odd that your data works in a local job but not in a cluster job. You may need to provide further details before anyone here can help: for example, what OS are the cluster and your client node running, and how do you know the problem is one of encoding?


1 Comment

Thanks for your response. I updated my question with the environment details.
