I am reading a CSV file with pandas into a two-column dataframe, and then trying to convert it to a Spark dataframe. The code for the conversion is:

from pyspark.sql import SQLContext

# sc is an existing SparkContext
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(df)
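
The CSV is read beforehand with pandas; a minimal sketch of that step (the filename here is just illustrative):

import pandas as pd

# Read the two-column CSV into a pandas dataframe (hypothetical filename):
df = pd.read_csv("companies.csv")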

The dataframe:

print(df) 

gives this:

                                      Name                                           Category
0   EDSJOBLIST apply at www.edsjoblist.com  ['biotechnology', 'clinical', 'diagnostic', 'd...
1                   Power Direct Marketing  ['advertising', 'analytics', 'brand positionin...
2       CHA Hollywood Medical Center, L.P.  ['general medical and surgical hospital', 'hea...
3                        JING JING GOURMET                                               [nan]
4             TRUE LIFE KINGDOM MINISTRIES                          ['religious organization']
5                              fasterproms                                  ['microsoft .net']
6                              STEREO ZONE  ['accessory', 'audio', 'car audio', 'chrome', ...
7       SAN FRANCISCO NEUROLOGICAL SOCIETY                                               [nan]
8                              Fl Advisors  ['comprehensive financial planning', 'financia...
9                           Fortunatus LLC  ['bottle', 'bottling', 'charitable', 'dna', 'f...
10                              TREADS LLC                           ['retail', 'wholesaling']

The conversion fails with an error. Can anyone help me with this?

  • Your pandas dataframe columns have different types, as the error suggests. Commented Jul 3, 2018 at 17:20
  • It would be helpful if you edit your question and include the output of print(df.dtypes) (sketched just below) and a small sample of your data. Commented Jul 3, 2018 at 17:33
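
For reference, the type check suggested above looks like this; for a frame like the one shown, where the columns hold strings and lists, both would typically report object:

# Inspect the dtypes pandas inferred for each column:
print(df.dtypes)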

1 Answer

Spark can have difficulty inferring a schema from pandas object-dtype columns. A potential workaround is to convert everything to strings first:

sdf = sqlCtx.createDataFrame(df.astype(str))

One consequence of this is that everything, including nan, will be converted to the string "nan". You will need to handle these values properly and cast the columns back to their appropriate types.

For instance, if you had a column "colA" with floating point values, you can use something like the following to convert the string "nan" to a null:

from pyspark.sql.functions import col, when

# Values equal to the string "nan" fail the condition; with no otherwise()
# clause they become null, while everything else is cast to float.
sdf = sdf.withColumn("colA", when(col("colA") != "nan", col("colA").cast("float")))
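
If several columns need the same treatment, the same pattern can be applied in a loop; a minimal sketch, assuming colA and colB are the affected columns (hypothetical names):

# Hypothetical list of numeric columns currently stored as strings:
for c in ["colA", "colB"]:
    sdf = sdf.withColumn(c, when(col(c) != "nan", col(c).cast("float")))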

2 Comments

Thank you for the help. I did as you suggested, but now I am getting this error: FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Yash\\AppData\\Local\\Temp\\spark-912d2316-6f98-469a-8abb-f6f4f99b6060\\pyspark-4be513ad-a898-4a41-a911-be158ff813b5\\tmppqmxo4ou'
That looks completely unrelated to this issue. Did you try restarting your spark context?
