I just switched from pandas to PySpark DataFrames and found that printing the same column in PySpark gives wrong values. Here's an example.

Using pandas:

import pandas as pd

df_pandas = pd.read_csv("crime.csv", low_memory=False)
print(df_pandas["CRIMEID"].head(5))

Output:

1321797
1344185
1181882
1182632
1195867

Whereas using a PySpark DataFrame:

df_spark = sqlContext.read.format('csv').options(header='true', inferSchema='true').load('crime.csv')
df_spark.select("CRIMEID").show(5)

Output:

+-------+
|CRIMEID|
+-------+
|1321797|
|   null|
|   null|
|1344185|
|   null|
+-------+

I haven't dropped any null rows either. Could somebody explain why this happens? I would really appreciate some help.

1 Answer

Here's what is happening:

  • When you read a CSV in pandas, the order of the records is preserved. Since pandas is not distributed and holds everything in memory, that order doesn't change when you call the 'head' method on the pandas dataframe. The output you get is therefore in the same order as the rows in the CSV.
  • A Spark dataframe, on the other hand, also reads an ordered file (e.g. a CSV) in order, but the data is split into partitions across the cluster. Unless you sort explicitly, Spark does not guarantee the order in which rows come back from an action method like 'show', so you may see the records in a different order.

In a distributed framework like Spark, where data is divided and distributed across the cluster, data is commonly shuffled between nodes, so row order is not something to rely on unless you sort explicitly.

So to sum it up, Spark is not giving you wrong values. It is returning the records in an order that differs from what you get from pandas.
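One way to check this explanation on your own data is to compare the two columns as unordered collections rather than row by row. The snippet below is a minimal sketch in plain Python. The helper name compare_ignoring_order and the toy ID lists are made up for illustration and are not part of pandas or PySpark.

```python
from collections import Counter


def compare_ignoring_order(left, right):
    """Compare two collections of values as multisets, ignoring row order.

    Hypothetical helper for illustration. None stands for a null value.
    """
    left_counts = Counter(left)
    right_counts = Counter(right)
    return {
        # True only if every value occurs the same number of times on both sides
        "same_values": left_counts == right_counts,
        # Values that occur more often on the left than on the right
        "only_in_left": left_counts - right_counts,
        # Values that occur more often on the right than on the left
        "only_in_right": right_counts - left_counts,
    }


# Toy data standing in for the CRIMEID column read by each library.
pandas_ids = [1321797, 1344185, 1181882]
spark_ids = [1181882, 1321797, 1344185]  # same values, different order

report = compare_ignoring_order(pandas_ids, spark_ids)
print(report["same_values"])
```

With real data you could pass something like df_pandas["CRIMEID"].tolist() and [row["CRIMEID"] for row in df_spark.select("CRIMEID").collect()], assuming the column fits in driver memory and that pandas missing values are converted to None first. If the report lists extra None entries on the Spark side, ordering alone would not account for the difference.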

1 Comment

Thank you @Abhishek for a great explanation.