
I have a Pandas Series object

dates = pd.Series(pd.date_range(start_date, end_date)) \
    .dt.strftime('%y%m%d') \
    .astype(int)

And I would like to create a Spark DataFrame directly from the Series object, without an intermediate Pandas DataFrame:

    _schema = StructType([
        StructField("date_id", IntegerType(), True),
    ])

    dates_rdd = sc.parallelize(dates)
    self.date_table = spark.createDataFrame(dates_rdd, _schema)

Error:

    raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
    TypeError: StructType can not accept object 160101 in type <class 'numpy.int64'>

If I change the Series object to:

    dates = pd.Series(pd.date_range(start_date, end_date)) \
        .dt.strftime('%y%m%d') \
        .astype(int).values.tolist()

the error becomes:

    raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
    TypeError: StructType can not accept object 160101 in type <class 'int'>

How can I properly map the int values contained in the dates list/RDD to native Python integers that are accepted by Spark DataFrames?
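For what it's worth, a quick pandas-only check (using an illustrative date range, since start_date and end_date aren't given) shows what element types the two variants actually produce, which suggests the type itself may not be the problem in the second case:

```python
import pandas as pd

dates = (
    pd.Series(pd.date_range("2016-01-01", "2016-01-03"))
    .dt.strftime("%y%m%d")
    .astype(int)
)
print(type(dates.iloc[0]))             # numpy.int64 straight from the Series
print(type(dates.values.tolist()[0]))  # plain Python int after tolist()
```

So `.values.tolist()` does yield native ints; the remaining mismatch is the shape of each element versus the schema.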

  • @Suresh still same error Commented Nov 13, 2017 at 13:24
  • start_date,end_date values please ? Commented Nov 13, 2017 at 13:25

2 Answers


This will work:

dates_rdd = sc.parallelize(dates).map(lambda x: tuple([int(x)]))
date_table = spark.createDataFrame(dates_rdd, _schema)

The purpose of the additional map in defining dates_rdd is to make the format of the RDD match the schema: a StructType expects each element to be a row (a tuple), not a bare scalar, and int() converts the numpy.int64 value to a native Python int.
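The transformation the map applies can be sketched without a Spark cluster at all (pandas-only, with an illustrative date range):

```python
import pandas as pd

dates = (
    pd.Series(pd.date_range("2016-01-01", "2016-01-03"))
    .dt.strftime("%y%m%d")
    .astype(int)
)
# Same effect as the RDD map: each numpy.int64 becomes a native int,
# wrapped in a 1-tuple so it reads as one row with one column.
rows = [(int(x),) for x in dates]
print(rows)  # [(160101,), (160102,), (160103,)]
```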


2 Comments

ok basically the same answer as below, beat me to it by 20 secs ;)
Yes, basically he already gave a comment before, so it's fair to accept his answer, I believe.

I believe you have missed creating a tuple for each series value:

>>> dates = pd.Series(pd.date_range(start='1/1/1980', end='1/11/1980')).dt.strftime('%y%m%d').astype(int).values.tolist()
>>> rdd = sc.parallelize(dates).map(lambda x:(x,))
>>> _schema = StructType([StructField("date_id", IntegerType(), True),])
>>> df = spark.createDataFrame(rdd,schema=_schema)
>>> df.show()
+-------+
|date_id|
+-------+
| 800101|
| 800102|
| 800103|
| 800104|
| 800105|
| 800106|
| 800107|
| 800108|
| 800109|
| 800110|
| 800111|
+-------+

>>> df.printSchema()
root
 |-- date_id: integer (nullable = true)

