Select encoding while writing a CSV file in pypark

Question

I am trying to set the proper encoding while saving a CSV compressed file using pyspark.

Here my test:

# read main tabular data
sp_df = spark.read.csv(file_path, header=True, sep=';', encoding='cp1252')
sp_df.show(5)
+----------+---------+--------+---------+------+
|      Date|     Zone|   Duree|     Type|Volume|
+----------+---------+--------+---------+------+
|2019-01-16|010010000| 30min3h|Etrangers|   684|
|2019-01-16|010010000| 30min3h| Français| 21771|
|2019-01-16|010010000|Inf30min|Etrangers|  7497|
|2019-01-16|010010000|Inf30min| Français| 74852|
|2019-01-16|010010000|   Sup3h|Etrangers|   429|
+----------+---------+--------+---------+------+
only showing top 5 rows

We can see that the data was properly interpreted by using the encoding CP1252. The problem is that when I save the data in a CSV gzipped file using the CP1252 encoding and checking it back, the special characters are not well decoded:

# Save Data
sp_df.repartition(5, 'Zone').write.option('encoding', 'cp1252').csv(output_path, mode='overwrite', sep=';', compression='gzip')

# read saved data
spark.read.csv(os.path.join(output_path, '*.csv.gz'), header=True, sep=';', encoding='cp1252').show()
+----------+---------+--------+---------+------+
|      Date|     Zone|   Duree|     Type|Volume|
+----------+---------+--------+---------+------+
|2019-01-16|010070000| 30min3h|Etrangers|  1584|
|2019-01-16|010070000| 30min3h|FranÃ§ais| 18662|
|2019-01-16|010070000|Inf30min|Etrangers| 12327|
|2019-01-16|010070000|Inf30min|FranÃ§ais| 30368|
|2019-01-16|010070000|   Sup3h|Etrangers|   453|
+----------+---------+--------+---------+------+
only showing top 5 rows

Any ideas? I am using spark 2.3

Steven · Accepted Answer · 2021-10-05 16:20:30Z

2

According to official documentation, encoding is an option you should put directly in the csv method the same way you use it for read.

sp_df.repartition(5, 'Zone').write.option('encoding', 'cp1252').csv(output_path, mode='overwrite', sep=';', compression='gzip')

to become

sp_df.repartition(5, 'Zone').write.csv(output_path, mode='overwrite', sep=';', compression='gzip', encoding='cp1252')

The way you wrote it is overwritten by the default argument of the csv method encoding=None which leads to a UTF-8 encoding.

edited Oct 5, 2021 at 16:20

answered Feb 4, 2019 at 12:51

Steven

15.4k7 gold badges49 silver badges80 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Oscar Mike Over a year ago

Thanks for your reply! I have this error while putting the encoding parameters inside the csv() method: TypeError: csv() got an unexpected keyword argument 'encoding' And when I try the method:

sp_df.repartition(5, 'Zone').write.option('sep', ';').option('encoding', 'cp1252').mode('overwrite').format('com.databricks.spark.csv').option('codec', 'org.apache.hadoop.io.compress.GzipCodec').save(output_path, header = 'true')

I still have the default 'utf-8' encoding applied while saving the file...

Steven Over a year ago

@oso_ted there is obviously a problem with your csv method because you should not receive the error TypeError you describe. encoding is a legit argument for this method. Do you still get the same encoding without using the compression ?

Collectives™ on Stack Overflow

Select encoding while writing a CSV file in pypark

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related