
I am trying to add a row to a DataFrame. The condition: when a user comes back to the app after more than 300 seconds, I need to add a row. Below is my code. It works fine, but execution is very slow because the real DataFrame has 10 million rows.

# Row-by-row scan: compare each row with the previous one
for i in range(1, len(df)):
    # Same user as the previous event and more than 300 seconds elapsed
    if df['user_id'][i] == df['user_id'][i-1] and (df['start_time'][i] - df['start_time'][i-1]).seconds > 300:
        # Append a synthetic "app start" row at the end of the frame
        df.loc[len(df)] = [df['user_id'][i], df['start_time'][i], 'psuedo_App_start_2']

Input:

user_id   start_time        event
100       03/04/19 6:11     psuedo_App_start
100       03/04/19 6:11     notification_receive
100       03/04/19 8:56     notification_dismiss
10        03/04/19 22:05    psuedo_App_start
10        03/04/19 22:05    subcategory_click
10        03/04/19 22:06    subcategory_click

Output should look like:

user_id   start_time        event
100       03/04/19 6:11     psuedo_App_start
100       03/04/19 6:11     notification_receive
100       03/04/19 8:56     psuedo_App_start_2
100       03/04/19 8:56     notification_dismiss
10        03/04/19 22:05    psuedo_App_start
10        03/04/19 22:05    subcategory_click
10        03/04/19 22:06    subcategory_click

As seen in the output, a row is added for user_id = 100 because the user came back at 8:56, i.e. after more than 300 seconds.
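
For reference, the sample frame can be reconstructed like this (an illustrative sketch, assuming month-first MM/DD/YY timestamps, which is pandas' default parsing):

import pandas as pd

df = pd.DataFrame({
    'user_id': [100, 100, 100, 10, 10, 10],
    'start_time': pd.to_datetime([
        '03/04/19 6:11', '03/04/19 6:11', '03/04/19 8:56',
        '03/04/19 22:05', '03/04/19 22:05', '03/04/19 22:06']),
    'event': ['psuedo_App_start', 'notification_receive', 'notification_dismiss',
              'psuedo_App_start', 'subcategory_click', 'subcategory_click'],
})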

  • Do you have control over how these events are inserted into the dataframe? Commented Apr 7, 2019 at 9:35
  • no, other events are automatically generated Commented Apr 7, 2019 at 9:42
  • Why don't you do this: 1) remember the last timestamp you saw during a full scan of the dataframe, 2) in the next scan only process the rows that have a higher timestamp (i.e. the new rows) Commented Apr 7, 2019 at 9:44
  • Can't really test it right now, but you could groupby ['user_id','start_time'], then use a timedelta to check whether the start_time gap for each id is bigger than 300 seconds, and insert a new line if the condition is met (with the last start_time and user_id pulled from df) Commented Apr 7, 2019 at 9:59

2 Answers


First filter by two conditions: compare user_id with the per-group shifted values via DataFrameGroupBy.shift, and take the per-group time difference via DataFrameGroupBy.diff; then reassign the event column with DataFrame.assign, and finally concat everything together and sort with DataFrame.sort_values:

# If start_time is still a string, parse it first:
# MM/DD/YY HH:MM
#df['start_time'] = pd.to_datetime(df['start_time'])
# DD/MM/YY HH:MM
#df['start_time'] = pd.to_datetime(df['start_time'], dayfirst=True)

# m1: same user_id as in the previous row of the group
m1 = df['user_id'].eq(df.groupby('user_id')['user_id'].shift())
# m2: per-user gap to the previous event exceeds 300 seconds
m2 = df.groupby('user_id')['start_time'].diff().dt.total_seconds() > 300

# rows that mark a "return" session, relabelled
df1 = df[m1 & m2].assign(event='psuedo_App_start_2')

# append the new rows and sort back into order
df1 = (pd.concat([df, df1], ignore_index=True)
         .sort_values(['user_id','start_time'], ascending=[False, True]))
print(df1)
   user_id          start_time                 event
0      100 2019-03-04 06:11:00      psuedo_App_start
1      100 2019-03-04 06:11:00  notification_receive
2      100 2019-03-04 08:56:00  notification_dismiss
6      100 2019-03-04 08:56:00    psuedo_App_start_2
3       10 2019-03-04 22:05:00      psuedo_App_start
4       10 2019-03-04 22:05:00     subcategory_click
5       10 2019-03-04 22:06:00     subcategory_click

2 Comments

can you help me understand what is stored in m1 and m2?
@nk23 Those are boolean masks: the first compares user_id values with eq (i.e. ==), the second compares the gap in seconds with >
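
To make that concrete, here is what the two masks evaluate to for the sample input (a quick check, assuming start_time is already parsed to datetime):

print(m1.tolist())  # [False, True, True, False, True, True]
print(m2.tolist())  # [False, False, True, False, False, False]
# m1 & m2 is True only for the 08:56 notification_dismiss row,
# which is the one duplicated as 'psuedo_App_start_2'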

Usually in such cases you need to convert explicit loops into vectorized operations. Try something like this:

import numpy as np
import pandas as pd

# Same user as the previous row and gap greater than 300 seconds
i = ((df.user_id.values[1:] == df.user_id.values[:-1])
     & ((df.start_time.values[1:] - df.start_time.values[:-1]) / np.timedelta64(1, 's') > 300))
newRows = df[np.append(False, i)].copy()  # `tt` in the original was a typo for df
newRows['event'] = 'psuedo_App_start_2'
df = pd.concat([df, newRows], ignore_index=True)  # append() discards its result, so keep it
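
The appended rows land at the end of the frame; as in the first answer, a final sort puts them back in order (assuming the same user_id-descending, start_time-ascending ordering):

df = df.sort_values(['user_id', 'start_time'], ascending=[False, True])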

