
I have a dataframe df like below


import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

data = {'Name': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR', 'XYZ', 'XYZ', 'ABC', 'XYZ', 'ABC'], 'Init_Time': ['2022-02-16 14:00:31', '2022-02-16 14:03:15', '2022-02-16 14:05:26',
                                                                                           '2022-02-16 14:06:23', '2022-02-16 14:10:00', '2022-02-16 14:12:36', 
                                                                                           '2022-02-16 14:14:11', '2022-02-17 07:07:25', '2022-02-17 15:08:35', 
                                                                                           '2022-02-17 15:09:46'], 'Category_flag': [1,1,0,0,1,0,1,1,0,0], '10min_window_group': [1,1,1,1,1,2,2,3,4,4]}
df = pd.DataFrame(data)
df['Init_Time'] = pd.to_datetime(df['Init_Time'])
print(df)

  Name           Init_Time  Category_flag  10min_window_group
0  XYZ 2022-02-16 14:00:31              1                   1
1  XYZ 2022-02-16 14:03:15              1                   1
2  XYZ 2022-02-16 14:05:26              0                   1
3  XYZ 2022-02-16 14:06:23              0                   1
4  PQR 2022-02-16 14:10:00              1                   1
5  XYZ 2022-02-16 14:12:36              0                   2
6  XYZ 2022-02-16 14:14:11              1                   2
7  ABC 2022-02-17 07:07:25              1                   3
8  XYZ 2022-02-17 15:08:35              0                   4
9  ABC 2022-02-17 15:09:46              0                   4


I'm assigning a duplicate flag (`Duplicates_flag`, 1/0) to each name in column `Name` that falls within a 10-minute window interval for each category flag, by filtering on:

  1. The column `Name` first (XYZ, PQR, ...).
  2. The column `Category_flag` second (1/0).
  3. The column `10min_window_group` third (1/2/3/4).

For instance, to find duplicates of XYZ in the first 10-minute interval of category 1, we first filter `Name` for XYZ among the unique names, then filter `Category_flag` for the category we want duplicates of (here 1), and finally filter on the 10-minute window grouping value (here 1). To achieve this I have used 3 nested for loops, which work well in most cases; however, they consume a lot of computational time when the number of data points is very large (say 2 million), since the code has to iterate through all 3 loops.
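The three-step filter described above can be sketched for a single combination, using a trimmed-down copy of the data that keeps only the columns involved:

```python
import pandas as pd

# First five rows of the example data (Init_Time omitted; it plays no role in the filter).
df = pd.DataFrame({
    'Name': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR'],
    'Category_flag': [1, 1, 0, 0, 1],
    '10min_window_group': [1, 1, 1, 1, 1],
})

# Filter one (Name, Category_flag, 10min_window_group) combination in one pass.
mask = (
    (df['Name'] == 'XYZ')
    & (df['Category_flag'] == 1)
    & (df['10min_window_group'] == 1)
)
subset = df[mask]
print(subset)  # rows 0 and 1 -> row 1 is a duplicate of row 0
```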


for name in df['Name'].unique(): # Iterate over unique names of column `Name`.
  df1 = df[df['Name'] == name]
  for category in df1['Category_flag'].unique(): # Iterate over unique category flag values.
    df2 = df1[df1['Category_flag'] == category]
    for group in df2['10min_window_group'].unique(): # Iterate over unique window interval values.
      df3 = df2[df2['10min_window_group'] == group]

      if len(df3) > 0: # Skip empty combinations.
        # First occurrence -> 1, later occurrences -> 0.
        df3['Duplicates_flag'] = np.where(df3['Name'].duplicated(), 0, 1)

        # Write the flags back to the main `df` via the preserved indices.
        df.loc[df3.index, 'Duplicates_flag'] = df3['Duplicates_flag'].values

print(df)

  Name           Init_Time  Category_flag  10min_window_group  Duplicates_flag
0  XYZ 2022-02-16 14:00:31              1                   1              1.0
1  XYZ 2022-02-16 14:03:15              1                   1              0.0
2  XYZ 2022-02-16 14:05:26              0                   1              1.0
3  XYZ 2022-02-16 14:06:23              0                   1              0.0
4  PQR 2022-02-16 14:10:00              1                   1              1.0
5  XYZ 2022-02-16 14:12:36              0                   2              1.0
6  XYZ 2022-02-16 14:14:11              1                   2              1.0
7  ABC 2022-02-17 07:07:25              1                   3              1.0
8  XYZ 2022-02-17 15:08:35              0                   4              1.0
9  ABC 2022-02-17 15:09:46              0                   4              1.0

So, is there a way where in I can optimize the code by reducing the number of 3 for loops/replacing the 3 for loops? The primary aim is to reduce the computation time and make the code more computationally time efficient so that it results in the same output as above.

  • The data you've given doesn't produce the dataframe at the bottom of the question. Commented Oct 25, 2022 at 17:30
  • @NuriTaş : updated the data to reflect the dataframe. Commented Oct 25, 2022 at 17:33

1 Answer


IIUC you can use transform after groupby:

df['Duplicates_flag'] = (
    df.groupby(['Name', 'Category_flag', '10min_window_group'])
      ['10min_window_group']
      .transform(lambda x: (~x.duplicated()).astype(int))
)

Output:

  Name           Init_Time  Category_flag  10min_window_group  Duplicates_flag
0  XYZ 2022-02-16 14:00:31              1                   1                1
1  XYZ 2022-02-16 14:03:15              1                   1                0
2  XYZ 2022-02-16 14:05:26              0                   1                1
3  XYZ 2022-02-16 14:06:23              0                   1                0
4  PQR 2022-02-16 14:10:00              1                   1                1
5  XYZ 2022-02-16 14:12:36              0                   2                1
6  XYZ 2022-02-16 14:14:11              1                   2                1
7  ABC 2022-02-17 07:07:25              1                   3                1
8  XYZ 2022-02-17 15:08:35              0                   4                1
9  ABC 2022-02-17 15:09:46              0                   4                1
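Since the flag only marks the first occurrence per (Name, Category_flag, 10min_window_group) combination, `DataFrame.duplicated` with a `subset` would avoid the groupby/lambda entirely and is typically faster on large frames. A minimal sketch on the example data (dropping `Init_Time`, which doesn't affect the flags):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR', 'XYZ', 'XYZ', 'ABC', 'XYZ', 'ABC'],
    'Category_flag': [1, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    '10min_window_group': [1, 1, 1, 1, 1, 2, 2, 3, 4, 4],
})

# First occurrence within each combination -> 1, later occurrences -> 0.
df['Duplicates_flag'] = (
    ~df.duplicated(subset=['Name', 'Category_flag', '10min_window_group'])
).astype(int)
print(df['Duplicates_flag'].tolist())  # [1, 0, 1, 0, 1, 1, 1, 1, 1, 1]
```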

2 Comments

@NuriTaş: thanks for the answer, I have a quick question: how do I get the count of duplicate values based on Duplicates_flag? For instance, the duplicate count value will be 2 at rows 0 and 3, and 1 at row 4.
Use a similar function, but with count: `.transform('count')`
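A sketch of the count suggestion from that comment, assuming the per-group size is what's wanted (the `Group_count` column name here is my own, not from the thread):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR', 'XYZ', 'XYZ', 'ABC', 'XYZ', 'ABC'],
    'Category_flag': [1, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    '10min_window_group': [1, 1, 1, 1, 1, 2, 2, 3, 4, 4],
})

# Every row gets the size of its (Name, Category_flag, 10min_window_group) group,
# e.g. rows 0-1 form one group of size 2, row 4 is alone in its group.
df['Group_count'] = (
    df.groupby(['Name', 'Category_flag', '10min_window_group'])
      ['10min_window_group'].transform('count')
)
print(df['Group_count'].tolist())  # [2, 2, 2, 2, 1, 1, 1, 1, 1, 1]
```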
