I have a dataset that looks like this:

item_nbr | date
123 | 2016-09-23
123 | 2016-10-23
112 | 2016-08-15
112 | 2016-09-15

I use groupByKey to make it look like this:

'123',['2016-09-23','2016-10-23']
'112',['2016-08-15','2016-09-15']

Now I want to calculate the difference between these two dates. I have a function that looks like this:

def ipi_generate(x):
    member_ipi_list = []
    master_ans = []
    for j in range(1,len(x[1])):
        ans = x[1][j]-x[1][j-1] 
        master_ans.append(ans)
    member_ipi_list.append(x[0])
    member_ipi_list.append(master_ans)
    return [member_ipi_list]

This treats the dates as strings. How do I convert my string dates into a date type that supports subtraction in PySpark? Thanks.

  • Have you tried using the datetime library? As in datetime.strptime(x[1][j], '%Y-%m-%d') Commented Aug 29, 2017 at 23:18
  • Also, is there a reason you are not transforming these to datetime objects before grouping by key? I'm also not aware of your larger goal, so this may or may not be appropriate, but Window functions or aggregating functions might be easier here. Look into them. Commented Aug 29, 2017 at 23:24
  • I used the datetime library in the function and it worked fine, thanks. :) I tried transforming the strings to datetime objects, but that is not the format required for the final output, so I did not do it before grouping by key. Commented Aug 30, 2017 at 17:16
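
The datetime suggestion from the comments could look roughly like the sketch below. It is not the asker's final code, which was not posted. It assumes each record is a (key, iterable of 'YYYY-MM-DD' strings) pair, that the dates are already in chronological order, and that the gap should be expressed in whole days. The name DATE_FORMAT is introduced here for illustration.

```python
from datetime import datetime

# Assumed input format for the date strings (matches the sample data).
DATE_FORMAT = "%Y-%m-%d"

def ipi_generate(x):
    """Return [[key, [gaps in days between consecutive dates]]] for one record."""
    key, raw_dates = x
    # Parse every string once; this also works when raw_dates is a
    # non-indexable iterable, such as the values produced by groupByKey.
    dates = [datetime.strptime(d, DATE_FORMAT) for d in raw_dates]
    # Pair each date with its successor and take the difference in days.
    # Assumption: the dates arrive in chronological order, as in the sample.
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    return [[key, gaps]]

print(ipi_generate(("123", ["2016-09-23", "2016-10-23"])))
# [['123', [30]]]
```

Because the function returns the original key and only the integer gaps, the date strings themselves never need to be converted before grouping.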

1 Answer

You should use window functions instead of a UDF.

First let's create our dataframe:

df = spark.createDataFrame(
    sc.parallelize([["123", "2016-09-23"], ["123", "2016-10-23"], ["123", "2016-11-23"], ["123", "2017-01-01"], ["112", "2016-08-15"], ["112", "2016-09-15"]]), 
    ["item_nbr", "date"]
)

Now let's use the lag function to bring the current row's date and the previous row's date onto the same row:

import pyspark.sql.functions as psf
from pyspark.sql import Window

w = Window.partitionBy("item_nbr").orderBy("date")
df.withColumn(
    "date_diff", 
    psf.datediff("date", psf.lag("date").over(w))
).show()

    +--------+----------+---------+
    |item_nbr|      date|date_diff|
    +--------+----------+---------+
    |     112|2016-08-15|     null|
    |     112|2016-09-15|       31|
    |     123|2016-09-23|     null|
    |     123|2016-10-23|       30|
    |     123|2016-11-23|       31|
    |     123|2017-01-01|       39|
    +--------+----------+---------+
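
To make the window logic concrete, here is a plain-Python sketch (not Spark code) of what partitioning by item_nbr, ordering by date, taking the lagged date and computing the day difference amounts to on the same rows. The helper name date_diffs is hypothetical, and None stands in for Spark's null.

```python
from collections import defaultdict
from datetime import date

rows = [
    ("123", "2016-09-23"), ("123", "2016-10-23"), ("123", "2016-11-23"),
    ("123", "2017-01-01"), ("112", "2016-08-15"), ("112", "2016-09-15"),
]

def date_diffs(rows):
    """Emulate partitionBy(item_nbr).orderBy(date) with lag + datediff."""
    # "partitionBy": collect the dates of each item_nbr.
    partitions = defaultdict(list)
    for item_nbr, day in rows:
        partitions[item_nbr].append(day)

    result = []
    for item_nbr in sorted(partitions):
        previous = None  # "lag": the first row of a partition has no predecessor
        # "orderBy": ISO-formatted date strings sort chronologically.
        for day in sorted(partitions[item_nbr]):
            current = date.fromisoformat(day)
            diff = None if previous is None else (current - previous).days
            result.append((item_nbr, day, diff))
            previous = current
    return result

for row in date_diffs(rows):
    print(row)
```

The first row of each partition gets None because there is no earlier date to compare against, which corresponds to the null entries in the table above.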