For my university course, I run the jupyter/pyspark-notebook Docker image:
docker pull jupyter/pyspark-notebook
docker run -it --rm -p 8888:8888 -v /path/to/my/working/directory:/home/jovyan/work jupyter/pyspark-notebook
And then run the following Python code:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
sc = pyspark.SparkContext('local[*]')
spark = SparkSession(sc)
spark
listings_df = spark.read.csv("listings.csv", header=True, mode='DROPMALFORMED')
# adding encoding="utf8" to the line above doesn't help also
listings_df.printSchema()
The problem appears when reading the file. Spark seems to parse it incorrectly (possibly because of an encoding problem?): after reading, listings_df has 16494 rows, while the correct number is 16478 (checked with pandas.read_csv()). You can also see that something is definitely broken by running
listings_df.groupBy("room_type").count().show()
which gives the following output:
+---------------+-----+
| room_type|count|
+---------------+-----+
| 169| 1|
| 4.88612| 1|
| 4.90075| 1|
| Shared room| 44|
| 35| 1|
| 187| 1|
| null| 16|
| 70| 1|
| 27| 1|
| 75| 1|
| Hotel room| 109|
| 198| 1|
| 60| 1|
| 280| 1|
|Entire home/apt|12818|
| 220| 1|
| 190| 1|
| 156| 1|
| 450| 1|
| 4.88865| 1|
+---------------+-----+
only showing top 20 rows
while the real room_type values are only ['Private room', 'Entire home/apt', 'Hotel room', 'Shared room'].
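One plausible cause (an assumption on my part, not something I have confirmed for this file): the listings CSV contains quoted fields with embedded newlines, and a line-based reader then sees more records than a proper CSV parser does, with later columns (price, ratings) sliding into room_type. A minimal, Spark-free sketch of that effect using Python's csv module on a toy string:

```python
import csv
import io

# Toy CSV mimicking the suspected property of listings.csv:
# one quoted field ("name") contains an embedded newline.
data = (
    'id,name,room_type\n'
    '1,"Cozy flat\nnear center",Private room\n'
    '2,Loft,Entire home/apt\n'
)

physical_lines = data.count("\n")                   # what a line-based reader sees
logical_rows = list(csv.reader(io.StringIO(data)))  # what a real CSV parser sees

print(physical_lines)     # 4 physical lines
print(len(logical_rows))  # 3 logical rows (header + 2 records)
```

If this is what is happening, the torn-off half of a record would explain numeric values like 169 or 4.88612 showing up in the room_type column. Spark's CSV reader has a multiLine option (spark.read.csv(..., multiLine=True)) intended for exactly this case, though I haven't verified that it fixes this particular file.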
Spark info that might be useful:
SparkSession - in-memory
SparkContext
Version: v3.1.2
Master: local[*]
AppName: pyspark-shell
And the encoding of the file:
!file listings.csv
listings.csv: UTF-8 Unicode text
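To cross-check the encoding from Python as well (a quick sanity check, assuming the file fits in memory; the helper name is mine):

```python
from pathlib import Path

def is_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# is_utf8("listings.csv") should return True, matching `file`'s report.
```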
listings.csv is an Airbnb statistics CSV file downloaded from here.
I've also uploaded all of the code needed to reproduce this to Colab.