
Hoping this is fairly elementary. I have a Spark DataFrame containing a date column, and I want to add a new column with the number of days since that date. My Google-fu is failing me.

Here's what I've tried:

from pyspark.sql.types import *
import datetime
today = datetime.date.today()

schema = StructType([StructField("foo", DateType(), True)])
l = [(datetime.date(2016,12,1),)]
df = sqlContext.createDataFrame(l, schema)
df = df.withColumn('daysBetween',today - df.foo)
df.show()

It fails with the error:

u"cannot resolve '(17212 - foo)' due to data type mismatch: '(17212 - foo)' requires (numeric or calendarinterval) type, not date;"

I've tried fiddling around but have gotten nowhere. I can't imagine this is too hard. Can anyone help?

2 Answers


OK, figured it out: use datediff from pyspark.sql.functions and wrap the Python date in lit():

from pyspark.sql.types import *
import pyspark.sql.functions as funcs
import datetime
today = datetime.date(2017,2,15)

schema = StructType([StructField("foo", DateType(), True)])
l = [(datetime.date(2017,2,14),)]
df = sqlContext.createDataFrame(l, schema)
df = df.withColumn('daysBetween',funcs.datediff(funcs.lit(today), df.foo))
df.collect()

This returns [Row(foo=datetime.date(2017, 2, 14), daysBetween=1)]
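
As a sanity check outside Spark, the same numbers can be reproduced with plain Python date arithmetic. This is only a sketch: it assumes the question was run on 2017-02-15 (the date hard-coded above), in which case the 17212 in the error message looks like that date expressed as a day count since 1970-01-01, which would explain why Spark saw an integer minus a date. The variable names are illustrative.

```python
import datetime

# Assumption: "today" in the question was 2017-02-15, as hard-coded in this answer.
today = datetime.date(2017, 2, 15)
foo = datetime.date(2017, 2, 14)
epoch = datetime.date(1970, 1, 1)

# Day count since the Unix epoch; compare with the literal in the error message.
days_since_epoch = (today - epoch).days

# Plain-Python equivalent of datediff(lit(today), foo) for a single value.
days_between = (today - foo).days

print(days_since_epoch, days_between)
```

The second value matches the daysBetween column returned by Spark.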



You can simply do the following:

import pyspark.sql.functions as F

df = df.withColumn('daysSince', F.datediff(F.current_date(), df.foo))

1 Comment

So others know: the difference is returned in days. See spark.apache.org/docs/2.1.0/api/python/…
