Suppose my dataframe looks like this:
import pandas as pd

z = {
    'Cust': ["a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "d"],
    'datediff': [1, 3, 9, 26, 30, 1, 2, 7, 10, 5, 7],
    'row_number': [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 1],
    'Referer': ["URL1", "URL2", "URL2", "URL1", "URL1", "URL3", "URL1", "URL1",
                "URL1", "URL1", "URL1"],
}
df1 = pd.DataFrame(z)
row_number marks each record's position in date-sorted order within each customer (the data is preprocessed in SQL). SQL returns only datediff, the difference to the previous visit (record). I can add a date column if needed.
I need to populate a derived column with the very first URL visited by each customer, copied to all of the rows below it (until row_number reverts to 1, which marks another customer).
This will let me calculate the overall datediff across all visits that started with a certain URL (with some basic tricks using derived columns), using something like DF3_derived.groupby(['Referer'])['datediff'].mean()
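To make the goal concrete, here is a minimal sketch of one way to get there, assuming the rows are already sorted by visit order within each customer and that the Cust column can be used as the grouping key. The column name FirstReferer and the variable mean_by_first are hypothetical names chosen for illustration.

```python
import pandas as pd

z = {
    'Cust': ["a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "d"],
    'datediff': [1, 3, 9, 26, 30, 1, 2, 7, 10, 5, 7],
    'row_number': [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 1],
    'Referer': ["URL1", "URL2", "URL2", "URL1", "URL1", "URL3", "URL1", "URL1",
                "URL1", "URL1", "URL1"],
}
df1 = pd.DataFrame(z)

# Broadcast each customer's first Referer to all of that customer's rows.
# "FirstReferer" is a hypothetical column name; this relies on the rows
# already being in visit order within each customer.
df1["FirstReferer"] = df1.groupby("Cust")["Referer"].transform("first")

# Mean datediff over all visits, grouped by the URL the customer started with.
mean_by_first = df1.groupby("FirstReferer")["datediff"].mean()
print(mean_by_first)
```

With the sample data, customers a, c and d all start with URL1 and customer b starts with URL3, so the mean is taken over seven rows for URL1 and four rows for URL3.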
I don't know how to do this with plain [] boolean indexing, so maybe the best way is a loop that reads dataframe1, modifies it, and saves the result to dataframe2?
Basically (in Excel terms), I want to get the value from the row above, but skip that when a flag marking another beginning is met. The Excel formula in D2 would be =IF(B2>B1,A1,A2), then drag the formula down.
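The "carry the value down until the flag resets" idea can probably be written without a loop. The sketch below is one possible approach, not necessarily the best one: it keeps Referer only on rows where row_number is 1 and forward-fills the gaps. It assumes every customer block starts with row_number == 1 and that the rows are in the order shown; FirstReferer is again a hypothetical column name.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Cust': ["a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "d"],
    'datediff': [1, 3, 9, 26, 30, 1, 2, 7, 10, 5, 7],
    'row_number': [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 1],
    'Referer': ["URL1", "URL2", "URL2", "URL1", "URL1", "URL3", "URL1", "URL1",
                "URL1", "URL1", "URL1"],
})

# Keep Referer only where a new customer block starts (row_number == 1);
# every other row becomes NaN.
starts_only = df1["Referer"].where(df1["row_number"] == 1)

# Forward-fill so each row takes the most recent block-starting Referer,
# which mirrors "take the value from the row above unless the flag resets".
df1["FirstReferer"] = starts_only.ffill()
print(df1)
```

This version uses only row_number as the reset flag, so it does not need the Cust column, but it does depend on the row order being preserved.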