
I have a dataframe df like below


import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

data = {'Name': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR', 'XYZ', 'XYZ', 'ABC', 'XYZ', 'ABC'], 'Init_Time': ['2022-02-16 14:00:31', '2022-02-16 14:03:15', '2022-02-16 14:05:26',
                                                                                           '2022-02-16 14:06:23', '2022-02-16 14:10:00', '2022-02-16 14:12:36', 
                                                                                           '2022-02-16 14:14:11', '2022-02-17 07:07:25', '2022-02-17 15:08:35', 
                                                                                           '2022-02-17 15:09:46'], 'Category_flag': [1,1,0,0,1,0,1,1,0,0], '10min_window_group': [1,1,1,1,1,2,2,3,4,4]}
df = pd.DataFrame(data)
df['Init_Time'] = pd.to_datetime(df['Init_Time'])
print(df)

  Name           Init_Time  Category_flag  10min_window_group
0  XYZ 2022-02-16 14:00:31              1                   1
1  XYZ 2022-02-16 14:03:15              1                   1
2  XYZ 2022-02-16 14:05:26              0                   1
3  XYZ 2022-02-16 14:06:23              0                   1
4  PQR 2022-02-16 14:10:00              1                   1
5  XYZ 2022-02-16 14:12:36              0                   2
6  XYZ 2022-02-16 14:14:11              1                   2
7  ABC 2022-02-17 07:07:25              1                   3
8  XYZ 2022-02-17 15:08:35              0                   4
9  ABC 2022-02-17 15:09:46              0                   4


I'm assigning a duplicate flag (`Duplicates_flag`, 1/0) to each name in column `Name` that falls within a 10-minute window interval for each category flag, by filtering on:

  1. The column `Name` first (XYZ, PQR, ...).
  2. The column `Category_flag` second (1/0).
  3. The column `10min_window_group` third (1/2/3/4).

For instance, to find duplicates of XYZ in the first 10-minute interval of category 1, we first filter `Name` for XYZ among the unique names, then filter `Category_flag` for the category we want duplicates of (here 1), and finally filter on the 10-minute window grouping value (here 1). To achieve this I have used 3 nested for loops, which work well in most cases; however, they consume a lot of computational time when the number of data points is very large (say 2 million), since the code has to iterate through all 3 loops.
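The three-step filter described above can be sketched for a single combination, using a trimmed-down copy of the data that keeps only the columns involved:

```python
import pandas as pd

# First five rows of the example data (Init_Time omitted; it plays no role in the filter).
df = pd.DataFrame({
    'Name': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR'],
    'Category_flag': [1, 1, 0, 0, 1],
    '10min_window_group': [1, 1, 1, 1, 1],
})

# Filter one (Name, Category_flag, 10min_window_group) combination in one pass.
mask = (
    (df['Name'] == 'XYZ')
    & (df['Category_flag'] == 1)
    & (df['10min_window_group'] == 1)
)
subset = df[mask]
print(subset)  # rows 0 and 1 -> row 1 is a duplicate of row 0
```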


for name in df['Name'].unique(): # Iterate over unique names of column `Name`.
  df1 = df[df['Name'] == name]
  for category in df1['Category_flag'].unique(): # Iterate over unique category flag values.
    df2 = df1[df1['Category_flag'] == category]
    for group in df2['10min_window_group'].unique(): # Iterate over unique window interval values.
      df3 = df2[df2['10min_window_group'] == group]

      if len(df3) > 0: # Skip empty combinations.
        # First occurrence -> 1, later occurrences -> 0.
        df3['Duplicates_flag'] = np.where(df3['Name'].duplicated(), 0, 1)

        # Write the flags back to the main `df` via the preserved indices.
        df.loc[df3.index, 'Duplicates_flag'] = df3['Duplicates_flag'].values

print(df)

  Name           Init_Time  Category_flag  10min_window_group  Duplicates_flag
0  XYZ 2022-02-16 14:00:31              1                   1              1.0
1  XYZ 2022-02-16 14:03:15              1                   1              0.0
2  XYZ 2022-02-16 14:05:26              0                   1              1.0
3  XYZ 2022-02-16 14:06:23              0                   1              0.0
4  PQR 2022-02-16 14:10:00              1                   1              1.0
5  XYZ 2022-02-16 14:12:36              0                   2              1.0
6  XYZ 2022-02-16 14:14:11              1                   2              1.0
7  ABC 2022-02-17 07:07:25              1                   3              1.0
8  XYZ 2022-02-17 15:08:35              0                   4              1.0
9  ABC 2022-02-17 15:09:46              0                   4              1.0

So, is there a way where in I can optimize the code by reducing the number of 3 for loops/replacing the 3 for loops? The primary aim is to reduce the computation time and make the code more computationally time efficient so that it results in the same output as above.

  • The data you've given doesn't produce the dataframe at the bottom of the question. Commented Oct 25, 2022 at 17:30
  • @NuriTaş : updated the data to reflect the dataframe. Commented Oct 25, 2022 at 17:33

1 Answer


IIUC you can use transform after groupby:

df['Duplicates_flag'] = (
    df.groupby(['Name', 'Category_flag', '10min_window_group'])
      ['10min_window_group']
      .transform(lambda x: (~x.duplicated()).astype(int))
)

Output:

  Name           Init_Time  Category_flag  10min_window_group  Duplicates_flag
0  XYZ 2022-02-16 14:00:31              1                   1                1
1  XYZ 2022-02-16 14:03:15              1                   1                0
2  XYZ 2022-02-16 14:05:26              0                   1                1
3  XYZ 2022-02-16 14:06:23              0                   1                0
4  PQR 2022-02-16 14:10:00              1                   1                1
5  XYZ 2022-02-16 14:12:36              0                   2                1
6  XYZ 2022-02-16 14:14:11              1                   2                1
7  ABC 2022-02-17 07:07:25              1                   3                1
8  XYZ 2022-02-17 15:08:35              0                   4                1
9  ABC 2022-02-17 15:09:46              0                   4                1
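Since the flag only marks the first occurrence per (Name, Category_flag, 10min_window_group) combination, `DataFrame.duplicated` with a `subset` would avoid the groupby/lambda entirely and is typically faster on large frames. A minimal sketch on the example data (dropping `Init_Time`, which doesn't affect the flags):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR', 'XYZ', 'XYZ', 'ABC', 'XYZ', 'ABC'],
    'Category_flag': [1, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    '10min_window_group': [1, 1, 1, 1, 1, 2, 2, 3, 4, 4],
})

# First occurrence within each combination -> 1, later occurrences -> 0.
df['Duplicates_flag'] = (
    ~df.duplicated(subset=['Name', 'Category_flag', '10min_window_group'])
).astype(int)
print(df['Duplicates_flag'].tolist())  # [1, 0, 1, 0, 1, 1, 1, 1, 1, 1]
```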

2 Comments

@NuriTaş: thanks for the answer, I have a quick question: how do I get the count of duplicate values based on Duplicates_flag? For instance, the duplicate count value will be 2 at rows 0 and 3, and 1 at row 4.
Use a similar function, but with count: `.transform('count')`
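A sketch of the count suggestion from that comment, assuming the per-group size is what's wanted (the `Group_count` column name here is my own, not from the thread):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR', 'XYZ', 'XYZ', 'ABC', 'XYZ', 'ABC'],
    'Category_flag': [1, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    '10min_window_group': [1, 1, 1, 1, 1, 2, 2, 3, 4, 4],
})

# Every row gets the size of its (Name, Category_flag, 10min_window_group) group,
# e.g. rows 0-1 form one group of size 2, row 4 is alone in its group.
df['Group_count'] = (
    df.groupby(['Name', 'Category_flag', '10min_window_group'])
      ['10min_window_group'].transform('count')
)
print(df['Group_count'].tolist())  # [2, 2, 2, 2, 1, 1, 1, 1, 1, 1]
```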
