
I have a Pandas dataframe called pdf that is simply four columns of float64s. Here are the first five rows:

pdf[:5]

      x1         x2        x3          y
0   9.082060  12.837502  6.484107  10.985202
1   9.715981  14.870818  8.026042  12.815644
2  11.303901  21.286343  7.787188  15.786915
3   9.910293  20.533151  6.991775  14.775010
4  12.394907  15.401446  7.101058  13.213897

And the dtypes:

pdf.dtypes

x1    float64
x2    float64
x3    float64
y     float64
dtype: object

But when I try to convert this into a Spark dataframe:

sdf = sqlContext.createDataFrame(pdf)

TypeErrorTraceback (most recent call last)
<ipython-input-54-a40cb79104b5> in <module>()
      5                     ])
      6 
----> 7 sdf = sqlContext.createDataFrame(pdf)

/usr/lib/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
    423             rdd, schema = self._createFromRDD(data, schema, samplingRatio)
    424         else:
--> 425             rdd, schema = self._createFromLocal(data, schema)
    426         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    427         jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/lib/spark/python/pyspark/sql/context.py in _createFromLocal(self, data, schema)
    339 
    340         if schema is None or isinstance(schema, (list, tuple)):
--> 341             struct = self._inferSchemaFromList(data)
    342             if isinstance(schema, (list, tuple)):
    343                 for i, name in enumerate(schema):

/usr/lib/spark/python/pyspark/sql/context.py in _inferSchemaFromList(self, data)
    239             warnings.warn("inferring schema from dict is deprecated,"
    240                           "please use pyspark.sql.Row instead")
--> 241         schema = reduce(_merge_type, map(_infer_schema, data))
    242         if _has_nulltype(schema):
    243             raise ValueError("Some of types cannot be determined after inferring")

/usr/lib/spark/python/pyspark/sql/types.py in _infer_schema(row)
    829 
    830     else:
--> 831         raise TypeError("Can not infer schema for type: %s" % type(row))
    832 
    833     fields = [StructField(k, _infer_type(v), True) for k, v in items]

TypeError: Can not infer schema for type: <type 'str'>
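A plausible reading of that traceback (my assumption, not confirmed from the Spark source): schema inference is iterating over the DataFrame object itself, and iterating a column-oriented, dict-like object yields the column *names*, which are strings. A minimal stand-in using a plain dict:

```python
# Stand-in for the DataFrame: a dict of columns (hypothetical data).
pdf_like = {"x1": [9.08, 9.72], "x2": [12.84, 14.87],
            "x3": [6.48, 8.03], "y": [10.99, 12.82]}

# Iterating a dict-like object yields its keys, i.e. the column names,
# not the rows of floats -- hence "Can not infer schema for type: str".
rows_seen = list(pdf_like)
print(rows_seen)  # ['x1', 'x2', 'x3', 'y']
```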

If I try to specify a schema:

from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([StructField('y', DoubleType()),
                     StructField('x1', DoubleType()),
                     StructField('x2', DoubleType()),
                     StructField('x3', DoubleType())
                    ])
sdf = sqlContext.createDataFrame(pdf, schema)

Then we get a slightly different error:

TypeErrorTraceback (most recent call last)
<ipython-input-55-a7d2b6d09ed3> in <module>()
      5                     ])
      6 
----> 7 sdf = sqlContext.createDataFrame(pdf, schema)

/usr/lib/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
    423             rdd, schema = self._createFromRDD(data, schema, samplingRatio)
    424         else:
--> 425             rdd, schema = self._createFromLocal(data, schema)
    426         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    427         jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/lib/spark/python/pyspark/sql/context.py in _createFromLocal(self, data, schema)
    348         elif isinstance(schema, StructType):
    349             for row in data:
--> 350                 _verify_type(row, schema)
    351 
    352         else:

/usr/lib/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1132     if _type is StructType:
   1133         if not isinstance(obj, (tuple, list)):
-> 1134             raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
   1135     else:
   1136         # subclass of them can not be fromInternald in JVM

TypeError: StructType can not accept object 'x1' in type <type 'str'>

Am I missing something obvious? Has anyone had success building a Spark dataframe from a Pandas dataframe? This is on Python 2.7, Spark v1.6.1, and Pandas v0.18.1.
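For what it's worth, one possible workaround (an assumption, assuming the data itself is clean) is to hand createDataFrame a plain list of row tuples instead of the DataFrame object, e.g. built from pdf.to_records(index=False). The reshaping step, sketched here with plain lists standing in for the pandas columns:

```python
# Hypothetical columns standing in for pdf's data.
cols = {"x1": [9.082060, 9.715981], "x2": [12.837502, 14.870818],
        "x3": [6.484107, 8.026042], "y": [10.985202, 12.815644]}
names = ["x1", "x2", "x3", "y"]

# Transpose columns into row tuples -- the shape that
# sqlContext.createDataFrame(rows, schema) accepts directly.
rows = list(zip(*(cols[n] for n in names)))
print(rows[0])  # (9.08206, 12.837502, 6.484107, 10.985202)
```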

  • Hmmm... I think it's trying to take your column headers and treating them as the data as well. Try taking out the headers. I'm basing my assumption off of the last example in this section. Commented Jun 24, 2016 at 19:19
  • It definitely appears to be related to the headers; if I change them to integers the error changes from warning about strings to TypeError: Can not infer schema for type: <type 'numpy.int64'>. But I don't think a Pandas dataframe can have no headers at all, can it? Commented Jun 24, 2016 at 19:24
  • it works fine for python 2.7.10 , spark 1.6.0 and pandas 0.16.2. Commented Jun 24, 2016 at 19:33
  • Could you provide a minimal reproducible example? In particular it would be great to see how you create the Pandas dataframe, including input data if relevant. Also, could you double-check the Pandas version: `pd.__version__`? Commented Jun 24, 2016 at 20:03
  • @shivsn So I rolled back to Pandas 0.16.2, and lo and behold, it worked! But then I upgraded back to 0.18.1 to verify the problem... and it still worked. So now I can't replicate my own problem, and I'm not sure what changed. Maybe similar to this issue? I'll update if I find it again. Commented Jun 24, 2016 at 20:55

1 Answer


I've successfully reproduced this, and the fix seems to be simply closing the IPython notebook and reopening it. When I spin up a new cluster with nothing but Python 2.7, pip, and numpy installed (the bootstrap default), install Pandas 0.18.1 using pip.main(), and then try to convert to a Spark dataframe using createDataFrame(), it fails with the errors above. But when I close and halt the notebook and then start it again, it works fine.
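The restart fixing it is consistent with a stale-module effect (my assumption, not verified against Spark's internals): installing a package with pip.main() inside a running kernel does not refresh what is already cached in sys.modules, so code that checked for pandas at import time can keep seeing the pre-install state until the interpreter restarts. A sketch of that caching behaviour, using a hypothetical module name:

```python
import sys
import types

# Hypothetical module 'mylib' as the kernel cached it before an
# in-process upgrade via pip.main().
stale = types.ModuleType("mylib")
stale.__version__ = "0.16.2"
sys.modules["mylib"] = stale

# A later 'import mylib' returns the cached object, not the fresh
# install; only restarting the kernel (or an explicit reload) clears it.
import mylib
print(mylib.__version__)  # 0.16.2
```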
