I am reading a CSV file with pandas into a two-column dataframe, and then trying to convert it to a Spark dataframe. The code for the conversion is:

from pyspark.sql import SQLContext

# sc is an existing SparkContext
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(df)
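
The CSV is read beforehand with pandas; a minimal sketch of that step (the filename here is just illustrative):

import pandas as pd

# Read the two-column CSV into a pandas dataframe (hypothetical filename):
df = pd.read_csv("companies.csv")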

The dataframe:

print(df) 

gives this:

                                      Name                                           Category
0   EDSJOBLIST apply at www.edsjoblist.com  ['biotechnology', 'clinical', 'diagnostic', 'd...
1                   Power Direct Marketing  ['advertising', 'analytics', 'brand positionin...
2       CHA Hollywood Medical Center, L.P.  ['general medical and surgical hospital', 'hea...
3                        JING JING GOURMET                                               [nan]
4             TRUE LIFE KINGDOM MINISTRIES                          ['religious organization']
5                              fasterproms                                  ['microsoft .net']
6                              STEREO ZONE  ['accessory', 'audio', 'car audio', 'chrome', ...
7       SAN FRANCISCO NEUROLOGICAL SOCIETY                                               [nan]
8                              Fl Advisors  ['comprehensive financial planning', 'financia...
9                           Fortunatus LLC  ['bottle', 'bottling', 'charitable', 'dna', 'f...
10                              TREADS LLC                           ['retail', 'wholesaling']

The conversion fails with an error. Can anyone help me with this?

  • Your pandas dataframe columns have different types, as the error suggests. Commented Jul 3, 2018 at 17:20
  • It would be helpful if you edit your question and include the output of print(df.dtypes) (sketched just below) and a small sample of your data. Commented Jul 3, 2018 at 17:33
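
For reference, the type check suggested above looks like this; for a frame like the one shown, where the columns hold strings and lists, both would typically report object:

# Inspect the dtypes pandas inferred for each column:
print(df.dtypes)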

1 Answer

Spark can have difficulty inferring a schema from pandas object-dtype columns. A potential workaround is to convert everything to strings first:

sdf = sqlCtx.createDataFrame(df.astype(str))

One consequence of this is that everything, including nan, will be converted to the string "nan". You will need to handle these values properly and cast the columns back to their appropriate types.

For instance, if you had a column "colA" with floating point values, you can use something like the following to convert the string "nan" to a null:

from pyspark.sql.functions import col, when

# Values equal to the string "nan" fail the condition; with no otherwise()
# clause they become null, while everything else is cast to float.
sdf = sdf.withColumn("colA", when(col("colA") != "nan", col("colA").cast("float")))
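
If several columns need the same treatment, the same pattern can be applied in a loop; a minimal sketch, assuming colA and colB are the affected columns (hypothetical names):

# Hypothetical list of numeric columns currently stored as strings:
for c in ["colA", "colB"]:
    sdf = sdf.withColumn(c, when(col(c) != "nan", col(c).cast("float")))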

2 Comments

Thank you for the help. I did as you suggested, but now I am getting this error: FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Yash\\AppData\\Local\\Temp\\spark-912d2316-6f98-469a-8abb-f6f4f99b6060\\pyspark-4be513ad-a898-4a41-a911-be158ff813b5\\tmppqmxo4ou'
That looks completely unrelated to this issue. Did you try restarting your spark context?
