Wrong encoding when reading csv file with pyspark

Question

For my course in university, I run pyspark-notebook docker image

docker pull jupyter/pyspark-notebook
docker run -it --rm -p 8888:8888 -v /path/to/my/working/directory:/home/jovyan/work jupyter/pyspark-notebook

And then run next python code

import pyspark 
from pyspark.sql import SparkSession
from pyspark.sql.types import *

sc = pyspark.SparkContext('local[*]')
spark = SparkSession(sc)
spark

listings_df = spark.read.csv("listings.csv", header=True, mode='DROPMALFORMED') 
# adding encoding="utf8" to the line above doesn't help also
listings_df.printSchema()

The problem appears during reading a file. It seems that spark reads my file incorrectly (possibly because of encodings problem?) and after reading listings_df has 16494 lines, while the correct number of lines is 16478 (checked with pandas.read_csv()). You can see that something definitely is broken also by running

listings_df.groupBy("room_type").count().show()

which gives next output

+---------------+-----+
|      room_type|count|
+---------------+-----+
|            169|    1|
|        4.88612|    1|
|        4.90075|    1|
|    Shared room|   44|
|             35|    1|
|            187|    1|
|           null|   16|
|             70|    1|
|             27|    1|
|             75|    1|
|     Hotel room|  109|
|            198|    1|
|             60|    1|
|            280|    1|
|Entire home/apt|12818|
|            220|    1|
|            190|    1|
|            156|    1|
|            450|    1|
|        4.88865|    1|
+---------------+-----+
only showing top 20 rows

while real room_type values are only ['Private room', 'Entire home/apt', 'Hotel room', 'Shared room'].

Spark info which might be useful:

SparkSession - in-memory

SparkContext

Spark UI

Version
v3.1.2
Master
local[*]
AppName
pyspark-shell

And encoding of the file

!file listings.csv
listings.csv: UTF-8 Unicode text

listings.csv is an Airbnb statistics csv file downloaded from here

All run & drive code I've also uploaded to Colab

It seems that spark reads my file incorrectly… Disagree. The file itself is malformed at IDs [6113864, 10725464, 11233751, 13678506, 15713489, 18623118, 19013423, 27394704, 28892595, 29705321, 29750219, 30758748, 33980766, 34070600, 41442089, 47488890]. (There are some unwanted line breaks). Maybe your IDs could differ as the file source appears unsafe… — JosefZ
– JosefZ, Commented Sep 11, 2021 at 13:56
BTW, those unwanted line breaks are probably home.unicode.org/U+2029 is used to represents a line break within a paragraph, aka "shift-Enter" or "soft" line break. And file encoding is definitely UTF-8. — JosefZ
– JosefZ, Commented Sep 11, 2021 at 14:18

Kafels · Accepted Answer · 2021-09-11 15:08:28Z

2

There are two things that I've found:

Some lines have quotes to escape (escape='"')
Also @JosefZ has mentioned about unwanted line breaks (multiLine=True)

That's how you must read it:

input_df = spark.read.csv(path, header=True, multiLine=True, escape='"')

output_df = input_df.groupBy("room_type").count()
output_df.show()
+---------------+-----+
|      room_type|count|
+---------------+-----+
|    Shared room|   44|
|     Hotel room|  110|
|Entire home/apt|12829|
|   Private room| 3495|
+---------------+-----+

answered Sep 11, 2021 at 15:08

Kafels

4,0891 gold badge18 silver badges33 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Mario · Accepted Answer · 2023-01-12 15:25:28Z

0

I think encoding the file from here should solve the problem. So you add encoding="utf8" to your tuple of the variable listings_df.

As shown below:

listings_df = spark.read.csv("listings.csv", encoding="utf8", header=True, mode='DROPMALFORMED')

edited Jan 12, 2023 at 15:25

Mario

2,0844 gold badges34 silver badges81 bronze badges

answered Sep 11, 2021 at 11:51

user16282816

1 Comment

maria Over a year ago

Tried but this didn't help. I've updated the question with this info

Collectives™ on Stack Overflow

Wrong encoding when reading csv file with pyspark

2 Answers 2

Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related