I just switched from Pandas to PySpark DataFrames and found that printing the same column from the same CSV gives different values in PySpark (many of them null). Here's an example. Using Pandas:
import pandas as pd

df_pandas = pd.read_csv("crime.csv", low_memory=False)
print(df_pandas["CRIMEID"].head(5))
Output:
1321797
1344185
1181882
1182632
1195867
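(A quick way to confirm the pandas side really has no missing IDs would be something like this — just a sketch on the same df_pandas as above:)

# Count missing CRIMEID values in the pandas DataFrame
print(df_pandas["CRIMEID"].isna().sum())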
Whereas using a PySpark DataFrame:
df_spark = sqlContext.read.format('csv').options(header='true', inferSchema='true').load('crime.csv')
df_spark.select("CRIMEID").show(5)
Output:
+-------+
|CRIMEID|
+-------+
|1321797|
| null|
| null|
|1344185|
| null|
+-------+
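For reference, here is roughly how the inferred schema and the extent of the nulls could be checked on the Spark side (a minimal sketch; I'm assuming a SparkSession-based reader behaves the same as the sqlContext reader above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("crime-csv-check").getOrCreate()

# Same read as above, just via the SparkSession reader
df_check = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("crime.csv"))

# What type did inferSchema assign to CRIMEID?
df_check.printSchema()

# How many rows are there in total, and how many have a null CRIMEID?
print(df_check.count())
print(df_check.filter(F.col("CRIMEID").isNull()).count())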
I haven't dropped any null rows either. Could somebody explain why this happens? I would really appreciate some help.