I just switched from Pandas to PySpark DataFrames and found that printing the same column from the same CSV gives different values in PySpark (many of them null). Here's an example. Using Pandas:
import pandas as pd

df_pandas = pd.read_csv("crime.csv", low_memory=False)
print(df_pandas["CRIMEID"].head(5))
Output:
1321797
1344185
1181882
1182632
1195867
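(A quick way to confirm the pandas side really has no missing IDs would be something like this — just a sketch on the same df_pandas as above:)

# Count missing CRIMEID values in the pandas DataFrame
print(df_pandas["CRIMEID"].isna().sum())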
Whereas using a PySpark DataFrame:
df_spark = sqlContext.read.format('csv').options(header='true', inferSchema='true').load('crime.csv')
df_spark.select("CRIMEID").show(5)
Output:
+-------+
|CRIMEID|
+-------+
|1321797|
| null|
| null|
|1344185|
| null|
+-------+
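For reference, here is roughly how the inferred schema and the extent of the nulls could be checked on the Spark side (a minimal sketch; I'm assuming a SparkSession-based reader behaves the same as the sqlContext reader above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("crime-csv-check").getOrCreate()

# Same read as above, just via the SparkSession reader
df_check = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("crime.csv"))

# What type did inferSchema assign to CRIMEID?
df_check.printSchema()

# How many rows are there in total, and how many have a null CRIMEID?
print(df_check.count())
print(df_check.filter(F.col("CRIMEID").isNull()).count())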
I haven't dropped any null rows either. Could somebody explain why this happens? I would really appreciate some help.