
I am trying to add a row to a DataFrame. The condition: when a user comes back to the app after more than 300 seconds, I need to add a row. Below is my code. It works fine, but execution is very slow because the real DataFrame has 10 million rows.

# Row-by-row scan: compare each row with the previous one
for i in range(1, len(df)):
    # Same user as the previous event and more than 300 seconds elapsed
    if df['user_id'][i] == df['user_id'][i-1] and (df['start_time'][i] - df['start_time'][i-1]).seconds > 300:
        # Append a synthetic "app start" row at the end of the frame
        df.loc[len(df)] = [df['user_id'][i], df['start_time'][i], 'psuedo_App_start_2']

Input:

user_id   start_time        event
100       03/04/19 6:11     psuedo_App_start
100       03/04/19 6:11     notification_receive
100       03/04/19 8:56     notification_dismiss
10        03/04/19 22:05    psuedo_App_start
10        03/04/19 22:05    subcategory_click
10        03/04/19 22:06    subcategory_click

Output should look like:

user_id   start_time        event
100       03/04/19 6:11     psuedo_App_start
100       03/04/19 6:11     notification_receive
100       03/04/19 8:56     psuedo_App_start_2
100       03/04/19 8:56     notification_dismiss
10        03/04/19 22:05    psuedo_App_start
10        03/04/19 22:05    subcategory_click
10        03/04/19 22:06    subcategory_click

As seen in the output, a row is added for user_id = 100 because the user came back at 8:56, i.e. after more than 300 seconds.
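
For reference, the sample frame can be reconstructed like this (an illustrative sketch, assuming month-first MM/DD/YY timestamps, which is pandas' default parsing):

import pandas as pd

df = pd.DataFrame({
    'user_id': [100, 100, 100, 10, 10, 10],
    'start_time': pd.to_datetime([
        '03/04/19 6:11', '03/04/19 6:11', '03/04/19 8:56',
        '03/04/19 22:05', '03/04/19 22:05', '03/04/19 22:06']),
    'event': ['psuedo_App_start', 'notification_receive', 'notification_dismiss',
              'psuedo_App_start', 'subcategory_click', 'subcategory_click'],
})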

  • Do you have control over how these events are inserted into the dataframe? Commented Apr 7, 2019 at 9:35
  • no, other events are automatically generated Commented Apr 7, 2019 at 9:42
  • Why don't you do this: 1) remember the last timestamp you saw during a full scan of the dataframe, 2) in the next scan only process the rows that have a higher timestamp (i.e. the new rows) Commented Apr 7, 2019 at 9:44
  • Can't really test it right now, but you could groupby ['user_id','start_time'], then use a timedelta to check whether the start_time gap for each id is bigger than 300 seconds, and insert a new line if the condition is met (with the last start_time and user_id pulled from df) Commented Apr 7, 2019 at 9:59

2 Answers


First filter by two conditions: compare user_id with the per-group shifted values via DataFrameGroupBy.shift, and take the per-group time difference via DataFrameGroupBy.diff; then reassign the event column with DataFrame.assign, and finally concat everything together and sort with DataFrame.sort_values:

# If start_time is still a string, parse it first:
# MM/DD/YY HH:MM
#df['start_time'] = pd.to_datetime(df['start_time'])
# DD/MM/YY HH:MM
#df['start_time'] = pd.to_datetime(df['start_time'], dayfirst=True)

# m1: same user_id as in the previous row of the group
m1 = df['user_id'].eq(df.groupby('user_id')['user_id'].shift())
# m2: per-user gap to the previous event exceeds 300 seconds
m2 = df.groupby('user_id')['start_time'].diff().dt.total_seconds() > 300

# rows that mark a "return" session, relabelled
df1 = df[m1 & m2].assign(event='psuedo_App_start_2')

# append the new rows and sort back into order
df1 = (pd.concat([df, df1], ignore_index=True)
         .sort_values(['user_id','start_time'], ascending=[False, True]))
print(df1)
   user_id          start_time                 event
0      100 2019-03-04 06:11:00      psuedo_App_start
1      100 2019-03-04 06:11:00  notification_receive
2      100 2019-03-04 08:56:00  notification_dismiss
6      100 2019-03-04 08:56:00    psuedo_App_start_2
3       10 2019-03-04 22:05:00      psuedo_App_start
4       10 2019-03-04 22:05:00     subcategory_click
5       10 2019-03-04 22:06:00     subcategory_click

2 Comments

can you help me understand what is stored in m1 and m2?
@nk23 Those are boolean masks: the first compares user_id values with eq (i.e. ==), the second compares the gap in seconds with >
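
To make that concrete, here is what the two masks evaluate to for the sample input (a quick check, assuming start_time is already parsed to datetime):

print(m1.tolist())  # [False, True, True, False, True, True]
print(m2.tolist())  # [False, False, True, False, False, False]
# m1 & m2 is True only for the 08:56 notification_dismiss row,
# which is the one duplicated as 'psuedo_App_start_2'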

Usually in such cases you need to convert explicit loops into vectorized operations. Try something like this:

import numpy as np
import pandas as pd

# Same user as the previous row and gap greater than 300 seconds
i = ((df.user_id.values[1:] == df.user_id.values[:-1])
     & ((df.start_time.values[1:] - df.start_time.values[:-1]) / np.timedelta64(1, 's') > 300))
newRows = df[np.append(False, i)].copy()  # `tt` in the original was a typo for df
newRows['event'] = 'psuedo_App_start_2'
df = pd.concat([df, newRows], ignore_index=True)  # append() discards its result, so keep it
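
The appended rows land at the end of the frame; as in the first answer, a final sort puts them back in order (assuming the same user_id-descending, start_time-ascending ordering):

df = df.sort_values(['user_id', 'start_time'], ascending=[False, True])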

