I am trying to set the proper encoding while saving a CSV compressed file using pyspark.
Here is my test:
# read main tabular data
sp_df = spark.read.csv(file_path, header=True, sep=';', encoding='cp1252')
sp_df.show(5)
+----------+---------+--------+---------+------+
| Date| Zone| Duree| Type|Volume|
+----------+---------+--------+---------+------+
|2019-01-16|010010000| 30min3h|Etrangers| 684|
|2019-01-16|010010000| 30min3h| Français| 21771|
|2019-01-16|010010000|Inf30min|Etrangers| 7497|
|2019-01-16|010010000|Inf30min| Français| 74852|
|2019-01-16|010010000| Sup3h|Etrangers| 429|
+----------+---------+--------+---------+------+
only showing top 5 rows
We can see that the data was interpreted correctly using the cp1252 encoding. The problem is that when I save the data as a gzipped CSV with the cp1252 encoding and read it back, the special characters are not decoded correctly:
# Save Data
sp_df.repartition(5, 'Zone').write.option('encoding', 'cp1252').csv(output_path, mode='overwrite', sep=';', compression='gzip')
# read saved data
spark.read.csv(os.path.join(output_path, '*.csv.gz'), header=True, sep=';', encoding='cp1252').show()
+----------+---------+--------+---------+------+
| Date| Zone| Duree| Type|Volume|
+----------+---------+--------+---------+------+
|2019-01-16|010070000| 30min3h|Etrangers| 1584|
|2019-01-16|010070000| 30min3h|FranÃ§ais| 18662|
|2019-01-16|010070000|Inf30min|Etrangers| 12327|
|2019-01-16|010070000|Inf30min|FranÃ§ais| 30368|
|2019-01-16|010070000| Sup3h|Etrangers| 453|
+----------+---------+--------+---------+------+
only showing top 5 rows
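The garbling pattern looks consistent with the writer emitting UTF-8 bytes regardless of the encoding option, which then get decoded as cp1252 on read. A minimal sketch of that round trip in plain Python (no Spark involved), to show the byte patterns:

```python
# If the CSV writer emits UTF-8 despite the 'encoding' option, decoding
# those bytes back as cp1252 mangles every non-ASCII character.
original = "Français"

utf8_bytes = original.encode("utf-8")      # b'Fran\xc3\xa7ais'
cp1252_bytes = original.encode("cp1252")   # b'Fran\xe7ais'

# UTF-8 bytes read back as cp1252 -> mojibake
print(utf8_bytes.decode("cp1252"))    # FranÃ§ais

# cp1252 bytes read back as cp1252 -> intact
print(cp1252_bytes.decode("cp1252"))  # Français
```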
Any ideas? I am using Spark 2.3.
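In case it helps diagnose, here is how one could check what actually landed on disk by gunzipping a part file and inspecting the raw bytes. This is a self-contained sketch: the sample file simulates a part file written in UTF-8, which is my assumption about what the writer did; point `part` at a real `part-*.csv.gz` to test the actual output.

```python
import gzip
import os
import tempfile

# Simulate a part file written in UTF-8 (assumption about the writer's
# behavior); replace `part` with a real part file to check your output.
tmpdir = tempfile.mkdtemp()
part = os.path.join(tmpdir, "part-00000.csv.gz")
with gzip.open(part, "wt", encoding="utf-8", newline="") as f:
    f.write("2019-01-16;010070000;30min3h;Français;18662\n")

# Decompress and look at the raw bytes, with no charset decoding at all.
with gzip.open(part, "rb") as f:
    head = f.read()

# b'\xc3\xa7' means the 'ç' was written as UTF-8;
# a cp1252 file would contain the single byte b'\xe7' instead.
print(b"Fran\xc3\xa7ais" in head)  # True  -> UTF-8 on disk
print(b"Fran\xe7ais" in head)      # False -> not cp1252
```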