1

Input DF:

id   sub_id   id_created   id_last_modified   sub_id_created   lead_
1    10       12:00        7:00               12:00            1:00
1    20       12:00        7:00               1:00             2:30
1    30       12:00        7:00               2:30             7:00
1    40       12:00        7:05               7:00             null

Use case: I am trying to create a new column "time", where:

1. for (id, max(sub_id)): id_last_modified - sub_id_created
2. otherwise: sub_id_created - lead_

Code:

from pyspark.sql import Window

window = Window.partitionBy("id").orderBy("sub_id")

I am getting the expected output for all the rows except for the combination of:

(id, max(sub_id))

for which I am getting null.
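
The lead over that window is indeed null for the last sub_id of each id (that is the null in the lead_ column above); for illustration:

from pyspark.sql import functions as F

# the last row per id has nothing to look ahead to, so its lead is null
df.select("id", "sub_id",
          F.lead("sub_id_created", 1).over(window).alias("next_sub_id_created")).show()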

Any suggestions on where I am going wrong would be helpful. Thanks.

2
  • The code you tried seems to be a mix of Scala and PySpark. Commented Jun 27, 2018 at 4:58
  • And how does unix_timestamp convert a value such as 7:00 to a valid timestamp? As you say, it is only partially working (see the note after these comments). Commented Jun 27, 2018 at 5:17
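
On that second comment: if those columns are plain strings such as 7:00, unix_timestamp with its default pattern (yyyy-MM-dd HH:mm:ss) returns null for them, so an explicit format has to be passed. A quick check, assuming string columns:

from pyspark.sql import functions as F

df.select(
    F.unix_timestamp("sub_id_created").alias("default_pattern"),          # null: '7:00' does not match the default pattern
    F.unix_timestamp("sub_id_created", "H:mm").alias("explicit_pattern")  # parses; missing date parts default to 1970-01-01
).show()

Differences between two values parsed this way still give the gap in seconds, so dividing by 3600.0 gives hours.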

2 Answers

1

Guess something like this might work (PySpark). One caveat: the max has to be taken over an unordered window; with the ordered window the default frame makes it a running max, and the condition would then be true on every row:

from pyspark.sql import functions as F

w_all = Window.partitionBy("id")  # whole partition, so max() is the true max per id

df = df.withColumn("time",
    F.when(F.col("sub_id") == F.max("sub_id").over(w_all),
        (F.unix_timestamp("id_last_modified") - F.unix_timestamp("sub_id_created")) / 3600.0
    ).otherwise(
        (F.unix_timestamp("sub_id_created") -
         F.unix_timestamp(F.lead("sub_id_created", 1).over(window))) / 3600.0))
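
If you would rather not introduce a second window, an equivalent sketch (same df and window as above) is to test the lead value directly, since lead() has nothing to look ahead to on the last row of each partition and returns null there:

from pyspark.sql import functions as F

next_created = F.lead("sub_id_created", 1).over(window)

df = df.withColumn("time",
    F.when(next_created.isNull(),
        (F.unix_timestamp("id_last_modified") - F.unix_timestamp("sub_id_created")) / 3600.0
    ).otherwise(
        (F.unix_timestamp("sub_id_created") - F.unix_timestamp(next_created)) / 3600.0))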


0
import pandas_datareader as web
import datetime

# daily GOOG quotes from Yahoo Finance for the given date range
start = datetime.datetime(2018, 5, 1)
end = datetime.datetime(2019, 5, 31)
df = web.DataReader("goog", 'yahoo', start, end)

