
I am using pyspark==2.3.1. I performed data preprocessing with pandas and now want to port my preprocessing function from pandas to PySpark. However, when reading the CSV file with PySpark, many values become null in a column that actually contains data. If I then perform any operation on this DataFrame, it swaps values between columns. I also tried different versions of PySpark. Please let me know what I am doing wrong. Thanks

Result from pyspark:

[screenshot: PySpark DataFrame output]

The column "property_type" shows null, but the actual data has values there instead of null.

CSV file: [screenshot of the source CSV]

But PySpark works fine with small datasets: [screenshot]

2 Answers


In our case we faced a similar issue. Things you need to check:

  1. Check whether your data has " (double quotes); PySpark can mis-handle them while splitting fields, and the data then looks malformed.
  2. Check whether your CSV data has multiline values. We handled this situation with the following configuration:

spark.read.options(header=True, inferSchema=True, escape='"').option("multiline",'true').csv(schema_file_location)
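The mechanism behind both checks can be reproduced with Python's standard csv module: a quoted field containing the delimiter or an embedded newline spans more than one physical line, so a reader that splits on raw lines (as a CSV reader without multiline support does) breaks the record apart and shifts columns — which matches the nulls and swapped values described in the question. A minimal sketch (the sample data is illustrative):

```python
import csv
import io

# A record whose "property_type" field contains a comma and an embedded
# newline -- legal CSV, but it spans two physical lines in the file.
raw = 'id,property_type,price\n1,"Flat, 2nd\nfloor",100\n2,House,200\n'

# Naive line splitting (no quote/multiline handling): the quoted field
# is broken across rows and the columns shift out of alignment.
naive_rows = [line.split(",") for line in raw.splitlines()]
print(naive_rows[1])  # ['1', '"Flat', ' 2nd'] -- malformed

# A proper CSV parser honours the quotes and reassembles the record.
good_rows = list(csv.reader(io.StringIO(raw)))
print(good_rows[1])  # ['1', 'Flat, 2nd\nfloor', '100']
```

The `escape='"'` and `multiline` options in the answer above switch Spark's CSV reader to the second behaviour.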


1 Comment

Thanks, @Praveen Kumar, it worked for me. Really appreciated.

Are you limited to the CSV file format? Try Parquet. Just save your DataFrame in pandas with .to_parquet() instead of .to_csv(). Spark works with this format really well.

1 Comment

Hey @Huvi, yes, I am limited to the CSV file format. I tried .to_parquet() but it didn't work for me; the answer above did. Thanks for your answer, I appreciate it.
