
For a university course, I run the jupyter/pyspark-notebook Docker image:

docker pull jupyter/pyspark-notebook
docker run -it --rm -p 8888:8888 -v /path/to/my/working/directory:/home/jovyan/work jupyter/pyspark-notebook

and then run the following Python code:

import pyspark 
from pyspark.sql import SparkSession
from pyspark.sql.types import *

sc = pyspark.SparkContext('local[*]')
spark = SparkSession(sc)
spark

listings_df = spark.read.csv("listings.csv", header=True, mode='DROPMALFORMED') 
# adding encoding="utf8" to the line above doesn't help either
listings_df.printSchema()

The problem appears when reading the file. It seems that Spark reads my file incorrectly (possibly because of an encoding problem?): after reading, listings_df has 16494 rows, while the correct number is 16478 (checked with pandas.read_csv()). You can also see that something is definitely broken by running

listings_df.groupBy("room_type").count().show()

which gives the following output:

+---------------+-----+
|      room_type|count|
+---------------+-----+
|            169|    1|
|        4.88612|    1|
|        4.90075|    1|
|    Shared room|   44|
|             35|    1|
|            187|    1|
|           null|   16|
|             70|    1|
|             27|    1|
|             75|    1|
|     Hotel room|  109|
|            198|    1|
|             60|    1|
|            280|    1|
|Entire home/apt|12818|
|            220|    1|
|            190|    1|
|            156|    1|
|            450|    1|
|        4.88865|    1|
+---------------+-----+
only showing top 20 rows

while the real room_type values are only ['Private room', 'Entire home/apt', 'Hotel room', 'Shared room'].
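The row-count mismatch fits this picture: without multiLine=True, Spark splits records on physical newlines, so a line break embedded inside a quoted field turns one logical record into two broken rows. A minimal sketch with Python's standard csv module (synthetic data, not the real listings.csv) shows how a physical line count and a logical record count diverge:

```python
import csv
import io

# Synthetic CSV: the quoted description field of row 1 contains an
# embedded line break, so that one logical record spans two physical lines.
raw = (
    'id,room_type,description\n'
    '1,Private room,"has a\nbalcony"\n'
    '2,Shared room,simple\n'
)

physical_lines = raw.count('\n')                           # what a naive line split sees
logical_records = len(list(csv.reader(io.StringIO(raw))))  # header + 2 data rows

print(physical_lines)   # 4
print(logical_records)  # 3
```

A parser that is aware of quoting (like csv.reader, or Spark with multiLine=True) keeps the record together; a plain line split does not, which is exactly the 16494 vs 16478 discrepancy in miniature.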

Spark info which might be useful:

SparkSession - in-memory

SparkContext
Version: v3.1.2
Master: local[*]
AppName: pyspark-shell

And the encoding of the file:

!file listings.csv
listings.csv: UTF-8 Unicode text

listings.csv is an Airbnb statistics csv file downloaded from here

I've also uploaded all the code and its output to Colab.

  • It seems that spark reads my file incorrectly… Disagree. The file itself is malformed at IDs [6113864, 10725464, 11233751, 13678506, 15713489, 18623118, 19013423, 27394704, 28892595, 29705321, 29750219, 30758748, 33980766, 34070600, 41442089, 47488890]. (There are some unwanted line breaks.) Your IDs may differ, since the file source does not appear to be stable. Commented Sep 11, 2021 at 13:56
  • BTW, those unwanted line breaks are probably U+2029 (home.unicode.org/U+2029), the PARAGRAPH SEPARATOR used to represent a line break within a paragraph, aka "shift-Enter" or a "soft" line break. And the file encoding is definitely UTF-8. Commented Sep 11, 2021 at 14:18
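If you want to check for those U+2029 paragraph separators yourself, a small sketch like this works (the sample string here is synthetic; on the real file you would read listings.csv with encoding='utf8' and scan its contents the same way):

```python
# Locate U+2029 (PARAGRAPH SEPARATOR) characters in a text.
# Synthetic sample standing in for the contents of listings.csv.
text = "Nice flat\u2029with a view,Entire home/apt\nCosy room,Private room\n"

positions = [i for i, ch in enumerate(text) if ch == "\u2029"]
print(positions)  # [9]
```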

2 Answers


There are two things that I've found:

  1. Some fields contain quotes that need to be escaped (escape='"')
  2. As @JosefZ mentioned, there are unwanted line breaks (multiLine=True)

This is how you should read it:

input_df = spark.read.csv(path, header=True, multiLine=True, escape='"')

output_df = input_df.groupBy("room_type").count()
output_df.show()
+---------------+-----+
|      room_type|count|
+---------------+-----+
|    Shared room|   44|
|     Hotel room|  110|
|Entire home/apt|12829|
|   Private room| 3495|
+---------------+-----+
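These two options correspond to standard RFC 4180 CSV rules, which Python's built-in csv module follows as well: a doubled quote inside a quoted field is a literal quote (what escape='"' tells Spark), and a record may span several physical lines (what multiLine=True enables). A synthetic sketch exercising both patterns at once:

```python
import csv
import io

# Synthetic record combining both problem patterns: a doubled quote ("")
# inside a quoted field, plus an embedded line break in the same field.
raw = 'id,name,room_type\n7,"The ""Loft""\nDowntown",Entire home/apt\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])  # ['7', 'The "Loft"\nDowntown', 'Entire home/apt']
```

A quote-aware parser yields one clean record with the quote and line break preserved inside the field, which is what Spark produces once both options are set.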

I think specifying the file's encoding should solve the problem: add encoding="utf8" to the spark.read.csv() call that creates listings_df.

As shown below:

listings_df = spark.read.csv("listings.csv", encoding="utf8", header=True, mode='DROPMALFORMED')

1 Comment

I tried this, but it didn't help. I've updated the question with this info.
