
I am using pyspark==2.3.1. I performed data preprocessing with pandas and now want to port my preprocessing function from pandas to PySpark. However, when reading the CSV file with PySpark, many values become null in a column that actually contains data. If I then perform any operation on this DataFrame, it swaps values between columns. I also tried different versions of PySpark. Please let me know what I am doing wrong. Thanks

Result from pyspark:

[screenshot: PySpark DataFrame output]

The column "property_type" shows null, but the actual data has values there instead of null.

CSV file: [screenshot of the source CSV]

But PySpark works fine with small datasets: [screenshot]

2 Answers


In our case we faced a similar issue. Things you need to check:

  1. Check whether your data has " (double quotes); PySpark can mis-handle them while splitting fields, and the data then looks malformed.
  2. Check whether your CSV data has multiline values. We handled this situation with the following configuration:

spark.read.options(header=True, inferSchema=True, escape='"').option("multiline",'true').csv(schema_file_location)
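The mechanism behind both checks can be reproduced with Python's standard csv module: a quoted field containing the delimiter or an embedded newline spans more than one physical line, so a reader that splits on raw lines (as a CSV reader without multiline support does) breaks the record apart and shifts columns — which matches the nulls and swapped values described in the question. A minimal sketch (the sample data is illustrative):

```python
import csv
import io

# A record whose "property_type" field contains a comma and an embedded
# newline -- legal CSV, but it spans two physical lines in the file.
raw = 'id,property_type,price\n1,"Flat, 2nd\nfloor",100\n2,House,200\n'

# Naive line splitting (no quote/multiline handling): the quoted field
# is broken across rows and the columns shift out of alignment.
naive_rows = [line.split(",") for line in raw.splitlines()]
print(naive_rows[1])  # ['1', '"Flat', ' 2nd'] -- malformed

# A proper CSV parser honours the quotes and reassembles the record.
good_rows = list(csv.reader(io.StringIO(raw)))
print(good_rows[1])  # ['1', 'Flat, 2nd\nfloor', '100']
```

The `escape='"'` and `multiline` options in the answer above switch Spark's CSV reader to the second behaviour.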


1 Comment

Thanks, @Praveen Kumar, it worked for me. Really appreciated.

Are you limited to the CSV file format? Try Parquet. Just save your DataFrame in pandas with .to_parquet() instead of .to_csv(). Spark works with this format really well.

1 Comment

Hey @Huvi, yes, I am limited to the CSV file format. I tried .to_parquet() but it didn't work for me; the answer above did. Thanks for your answer, I appreciate it.
