
I am trying to set the proper encoding while saving a CSV compressed file using pyspark.

Here is my test:

# read main tabular data
sp_df = spark.read.csv(file_path, header=True, sep=';', encoding='cp1252')
sp_df.show(5)
+----------+---------+--------+---------+------+
|      Date|     Zone|   Duree|     Type|Volume|
+----------+---------+--------+---------+------+
|2019-01-16|010010000| 30min3h|Etrangers|   684|
|2019-01-16|010010000| 30min3h| Français| 21771|
|2019-01-16|010010000|Inf30min|Etrangers|  7497|
|2019-01-16|010010000|Inf30min| Français| 74852|
|2019-01-16|010010000|   Sup3h|Etrangers|   429|
+----------+---------+--------+---------+------+
only showing top 5 rows

We can see that the data was interpreted correctly using the CP1252 encoding. The problem is that when I save the data as a gzipped CSV with the CP1252 encoding and read it back, the special characters are not decoded correctly:

# Save Data
sp_df.repartition(5, 'Zone').write.option('encoding', 'cp1252').csv(output_path, mode='overwrite', sep=';', compression='gzip')

# read saved data
spark.read.csv(os.path.join(output_path, '*.csv.gz'), header=True, sep=';', encoding='cp1252').show()
+----------+---------+--------+---------+------+
|      Date|     Zone|   Duree|     Type|Volume|
+----------+---------+--------+---------+------+
|2019-01-16|010070000| 30min3h|Etrangers|  1584|
|2019-01-16|010070000| 30min3h|Français| 18662|
|2019-01-16|010070000|Inf30min|Etrangers| 12327|
|2019-01-16|010070000|Inf30min|Français| 30368|
|2019-01-16|010070000|   Sup3h|Etrangers|   453|
+----------+---------+--------+---------+------+
only showing top 5 rows

Any ideas? I am using Spark 2.3.
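One way to diagnose this independently of Spark is to look at the raw bytes: if the writer silently fell back to UTF-8, accented characters become multi-byte sequences instead of single cp1252 bytes, and reading them back as cp1252 produces exactly this kind of mojibake. A minimal stdlib sketch of the mechanism (the sample value is illustrative, not taken from the actual part files):

```python
import gzip

# 'ç' is the two bytes 0xC3 0xA7 in UTF-8, but the single byte 0xE7 in cp1252.
utf8_bytes = "Français".encode("utf-8")
cp1252_bytes = "Français".encode("cp1252")

assert b"\xc3\xa7" in utf8_bytes   # two-byte UTF-8 sequence
assert b"\xe7" in cp1252_bytes     # single cp1252 byte

# Round-trip through gzip, then decode with the *wrong* charset to
# reproduce the garbling seen when UTF-8 data is read back as cp1252.
blob = gzip.compress(utf8_bytes)
garbled = gzip.decompress(blob).decode("cp1252")
print(garbled)  # 'FranÃ§ais'
```

If the strings coming back look like `FranÃ§ais`, the file on disk is UTF-8 regardless of what was passed to the writer.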

1 Answer

According to the official documentation, encoding is an option you should pass directly to the csv method, the same way you do for read:

sp_df.repartition(5, 'Zone').write.option('encoding', 'cp1252').csv(output_path, mode='overwrite', sep=';', compression='gzip')

should become

sp_df.repartition(5, 'Zone').write.csv(output_path, mode='overwrite', sep=';', compression='gzip', encoding='cp1252')

As you wrote it, the option is overridden by the csv method's default argument encoding=None, which falls back to UTF-8 encoding.


2 Comments

Thanks for your reply! I get this error when putting the encoding parameter inside the csv() method: TypeError: csv() got an unexpected keyword argument 'encoding'. And when I try sp_df.repartition(5, 'Zone').write.option('sep', ';').option('encoding', 'cp1252').mode('overwrite').format('com.databricks.spark.csv').option('codec', 'org.apache.hadoop.io.compress.GzipCodec').save(output_path, header='true'), the default UTF-8 encoding is still applied when saving the file...
@oso_ted There is obviously a problem with your csv method, because you should not get the TypeError you describe: encoding is a legitimate argument for this method. Do you still get the wrong encoding without using compression?
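If, as the comment above suggests, csv() rejects the encoding keyword on Spark 2.3 (writer-side charset support for CSV was only added in a later Spark release, if I recall correctly), one pragmatic workaround is to let Spark write its default UTF-8 output and transcode the gzipped part files afterwards. A stdlib sketch, assuming UTF-8 input and illustrative file names; not the answer's recommended approach:

```python
import glob
import gzip
import os

def reencode_gzip_csv(directory, src="utf-8", dst="cp1252"):
    """Re-encode every .csv.gz part file in `directory` in place.

    Workaround sketch for a writer that only emits UTF-8:
    decompress, transcode, recompress. The glob pattern mirrors
    Spark's part-file naming but is illustrative.
    """
    for path in glob.glob(os.path.join(directory, "*.csv.gz")):
        with gzip.open(path, "rt", encoding=src) as fh:
            text = fh.read()
        with gzip.open(path, "wt", encoding=dst) as fh:
            fh.write(text)
```

Note this pulls each part file through the driver's local filesystem, so it only fits output small enough to post-process outside Spark.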
