
I have a Pandas Series object

dates = pd.Series(pd.date_range(start_date, end_date)) \
    .dt.strftime('%y%m%d') \
    .astype(int)

And I would like to create a Spark DataFrame directly from the Series object, without an intermediate Pandas DataFrame:

    _schema = StructType([
        StructField("date_id", IntegerType(), True),
    ])

    dates_rdd = sc.parallelize(dates)
    self.date_table = spark.createDataFrame(dates_rdd, _schema)

Error:

    raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
    TypeError: StructType can not accept object 160101 in type <class 'numpy.int64'>

If I change the Series object to:

    dates = pd.Series(pd.date_range(start_date, end_date)) \
        .dt.strftime('%y%m%d') \
        .astype(int).values.tolist()

the error becomes:

    raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
    TypeError: StructType can not accept object 160101 in type <class 'int'>

How can I properly map the int values contained in the dates list/RDD to native Python integers that are accepted by Spark DataFrames?
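For what it's worth, a quick pandas-only check (using an illustrative date range, since start_date and end_date aren't given) shows what element types the two variants actually produce, which suggests the type itself may not be the problem in the second case:

```python
import pandas as pd

dates = (
    pd.Series(pd.date_range("2016-01-01", "2016-01-03"))
    .dt.strftime("%y%m%d")
    .astype(int)
)
print(type(dates.iloc[0]))             # numpy.int64 straight from the Series
print(type(dates.values.tolist()[0]))  # plain Python int after tolist()
```

So `.values.tolist()` does yield native ints; the remaining mismatch is the shape of each element versus the schema.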

  • @Suresh still same error Commented Nov 13, 2017 at 13:24
  • start_date,end_date values please ? Commented Nov 13, 2017 at 13:25

2 Answers


This will work:

dates_rdd = sc.parallelize(dates).map(lambda x: tuple([int(x)]))
date_table = spark.createDataFrame(dates_rdd, _schema)

The purpose of the additional map in defining dates_rdd is to make the format of the RDD match the schema: a StructType expects each element to be a row (a tuple), not a bare scalar, and int() converts the numpy.int64 value to a native Python int.
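The transformation the map applies can be sketched without a Spark cluster at all (pandas-only, with an illustrative date range):

```python
import pandas as pd

dates = (
    pd.Series(pd.date_range("2016-01-01", "2016-01-03"))
    .dt.strftime("%y%m%d")
    .astype(int)
)
# Same effect as the RDD map: each numpy.int64 becomes a native int,
# wrapped in a 1-tuple so it reads as one row with one column.
rows = [(int(x),) for x in dates]
print(rows)  # [(160101,), (160102,), (160103,)]
```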


2 Comments

ok basically the same answer as below, beat me to it by 20 secs ;)
Yes, basically he already gave a comment before, so it's fair to accept his answer, I believe.

I believe you have missed creating a tuple for each series value:

>>> dates = pd.Series(pd.date_range(start='1/1/1980', end='1/11/1980')).dt.strftime('%y%m%d').astype(int).values.tolist()
>>> rdd = sc.parallelize(dates).map(lambda x:(x,))
>>> _schema = StructType([StructField("date_id", IntegerType(), True),])
>>> df = spark.createDataFrame(rdd,schema=_schema)
>>> df.show()
+-------+
|date_id|
+-------+
| 800101|
| 800102|
| 800103|
| 800104|
| 800105|
| 800106|
| 800107|
| 800108|
| 800109|
| 800110|
| 800111|
+-------+

>>> df.printSchema()
root
 |-- date_id: integer (nullable = true)

