rolling window count based on parameter

Question

I need to calculate the number of consecutive days from today (2022-01-04) backwards a client logged in my application. I'm having a really hard time coming up with a logic for this because my table is HUGE (can get to pentabyte). Input

Name       Date         
    John    2022-01-01  
    John    2022-01-01
    Mary    2022-01-01
    Steve   2022-01-03
    Mary    2022-01-03
    John    2022-01-02
    John    2022-01-03
    Mary    2022-01-04
    John    2022-01-04

Output

Name       consecutive_days          
    John         4
    Mary         2

Can you guys help me out please?

I doubt you'll get anything satisfying with pandas if you really petabytes (!!) of data — mozway
– mozway, Commented Jul 12, 2022 at 15:16

Baron Legendre · Accepted Answer · 2022-07-12 15:46:18Z

Preparation

df = pd.DataFrame({
    'Name':['John', 'John', 'Mary', 'Steve', 'Mary', 'John', 'John', 'Mary', 'John'],
    'Date':['2022-01-01','2022-01-01','2022-01-01','2022-01-03','2022-01-03','2022-01-02','2022-01-03','2022-01-04','2022-01-04']
})
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Name','Date']).reset_index(drop=True)
df
###
    Name       Date
0   John 2022-01-01
1   John 2022-01-01
2   John 2022-01-02
3   John 2022-01-03
4   John 2022-01-04
5   Mary 2022-01-01
6   Mary 2022-01-03
7   Mary 2022-01-04
8  Steve 2022-01-03

check df's continuity

consecutive_check = df.groupby(['Name'])['Date'].diff().dt.days.ne(1).cumsum()
df['consecutive_days'] = df.groupby(consecutive_check).cumcount()+1
df
###
    Name       Date   consecutive_days
0   John 2022-01-01                  1
1   John 2022-01-01                  1
2   John 2022-01-02                  2
3   John 2022-01-03                  3
4   John 2022-01-04                  4
5   Mary 2022-01-01                  1
6   Mary 2022-01-03                  1
7   Mary 2022-01-04                  2
8  Steve 2022-01-03                  1

Filtration

result = df[['Name','consecutive_days']].groupby('Name').tail(1)
result = result[result['consecutive_days'] > 1].reset_index(drop=True)
result
###
   Name   consecutive_days
0  John                  4
1  Mary                  2

Michael S. · Accepted Answer · 2022-07-12 15:25:02Z

########## Recreating OP's DF ##############
Name = ["John", "John", "Mary", "Steve", "Mary", "John", "John", "Mary", "John"]
Date = ["2022-01-01", "2022-01-01", "2022-01-01", "2022-01-03", "2022-01-03", 
        "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-04"]
df = pd.DataFrame({"Name":Name, "Date":Date})
df.Date = pd.to_datetime(df.Date)
############################################

# Sort the dataframe first by name then by date
df = df.sort_values(by=["Name", "Date"]).reset_index(drop=True)

# Loop through all of the unique names and create a sub-dataframe for them
allCount = [] # empty count list
for name in df.Name.unique():
    nameDF = df[df.Name == name].reset_index(drop=True)
    
    # Check if the next day is 1 day after the previous day (consecutive)
    # If it is, add to the count value
    count = 1
    for y in range(len(nameDF)-1):
        if nameDF.Date[y+1] == nameDF.Date[y] + pd.Timedelta(days=1):
            count += 1
    allCount.append(count) # append all count values for all names
    
# Make a new DF with the consecutive day count
consecutiveDF = pd.DataFrame({"Name":df.Name.unique(), "consecutive_days":allCount})

# Drop any row that does not have at least 1 consecutive day
consecutiveDF = consecutiveDF[consecutiveDF.consecutive_days > 1].reset_index(drop=True)
consecutiveDF

Output:

    Name    consecutive_days
0   John    4
1   Mary    2

Maybe some list comprehension instead of a nest for loop can speed it up:

# Sort the dataframe first by name then by date
df = df.sort_values(by=["Name", "Date"]).reset_index(drop=True)

# Loop through all of the unique names and create a sub-dataframe for them
allCount = [] # empty count list
for name in df.Name.unique():
    nameDF = df[df.Name == name].reset_index(drop=True)
    
    # Check if the next day is 1 day after the previous day (consecutive)
    # If it is, add to the count value
    days = ["day" for y in range(len(nameDF)-1) if nameDF.Date[y+1] == nameDF.Date[y] + pd.Timedelta(days=1)]
    allCount.append(len(days)+1)
    
# Make a new DF with the consecutive day count
consecutiveDF = pd.DataFrame({"Name":df.Name.unique(), "consecutive_days":allCount})

# Drop any row that does not have at least 1 consecutive day
consecutiveDF = consecutiveDF[consecutiveDF.consecutive_days > 1].reset_index(drop=True)
consecutiveDF

Don't know how great this will be for a petabyte of data, maybe it can help with your logic though.

Collectives™ on Stack Overflow

rolling window count based on parameter

2 Answers 2

Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related