1

I need to calculate the number of consecutive days from today (2022-01-04) backwards a client logged in my application. I'm having a really hard time coming up with a logic for this because my table is HUGE (can get to pentabyte). Input

Name       Date         
    John    2022-01-01  
    John    2022-01-01
    Mary    2022-01-01
    Steve   2022-01-03
    Mary    2022-01-03
    John    2022-01-02
    John    2022-01-03
    Mary    2022-01-04
    John    2022-01-04

Output

Name       consecutive_days          
    John         4
    Mary         2 

Can you guys help me out please?

1
  • 1
    I doubt you'll get anything satisfying with pandas if you really petabytes (!!) of data Commented Jul 12, 2022 at 15:16

2 Answers 2

2

Preparation

df = pd.DataFrame({
    'Name':['John', 'John', 'Mary', 'Steve', 'Mary', 'John', 'John', 'Mary', 'John'],
    'Date':['2022-01-01','2022-01-01','2022-01-01','2022-01-03','2022-01-03','2022-01-02','2022-01-03','2022-01-04','2022-01-04']
})
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Name','Date']).reset_index(drop=True)
df
###
    Name       Date
0   John 2022-01-01
1   John 2022-01-01
2   John 2022-01-02
3   John 2022-01-03
4   John 2022-01-04
5   Mary 2022-01-01
6   Mary 2022-01-03
7   Mary 2022-01-04
8  Steve 2022-01-03



check df's continuity

consecutive_check = df.groupby(['Name'])['Date'].diff().dt.days.ne(1).cumsum()
df['consecutive_days'] = df.groupby(consecutive_check).cumcount()+1
df
###
    Name       Date   consecutive_days
0   John 2022-01-01                  1
1   John 2022-01-01                  1
2   John 2022-01-02                  2
3   John 2022-01-03                  3
4   John 2022-01-04                  4
5   Mary 2022-01-01                  1
6   Mary 2022-01-03                  1
7   Mary 2022-01-04                  2
8  Steve 2022-01-03                  1



Filtration

result = df[['Name','consecutive_days']].groupby('Name').tail(1)
result = result[result['consecutive_days'] > 1].reset_index(drop=True)
result
###
   Name   consecutive_days
0  John                  4
1  Mary                  2
Sign up to request clarification or add additional context in comments.

Comments

1
########## Recreating OP's DF ##############
Name = ["John", "John", "Mary", "Steve", "Mary", "John", "John", "Mary", "John"]
Date = ["2022-01-01", "2022-01-01", "2022-01-01", "2022-01-03", "2022-01-03", 
        "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-04"]
df = pd.DataFrame({"Name":Name, "Date":Date})
df.Date = pd.to_datetime(df.Date)
############################################

# Sort the dataframe first by name then by date
df = df.sort_values(by=["Name", "Date"]).reset_index(drop=True)

# Loop through all of the unique names and create a sub-dataframe for them
allCount = [] # empty count list
for name in df.Name.unique():
    nameDF = df[df.Name == name].reset_index(drop=True)
    
    # Check if the next day is 1 day after the previous day (consecutive)
    # If it is, add to the count value
    count = 1
    for y in range(len(nameDF)-1):
        if nameDF.Date[y+1] == nameDF.Date[y] + pd.Timedelta(days=1):
            count += 1
    allCount.append(count) # append all count values for all names
    
# Make a new DF with the consecutive day count
consecutiveDF = pd.DataFrame({"Name":df.Name.unique(), "consecutive_days":allCount})

# Drop any row that does not have at least 1 consecutive day
consecutiveDF = consecutiveDF[consecutiveDF.consecutive_days > 1].reset_index(drop=True)
consecutiveDF

Output:

    Name    consecutive_days
0   John    4
1   Mary    2

Maybe some list comprehension instead of a nest for loop can speed it up:

# Sort the dataframe first by name then by date
df = df.sort_values(by=["Name", "Date"]).reset_index(drop=True)

# Loop through all of the unique names and create a sub-dataframe for them
allCount = [] # empty count list
for name in df.Name.unique():
    nameDF = df[df.Name == name].reset_index(drop=True)
    
    # Check if the next day is 1 day after the previous day (consecutive)
    # If it is, add to the count value
    days = ["day" for y in range(len(nameDF)-1) if nameDF.Date[y+1] == nameDF.Date[y] + pd.Timedelta(days=1)]
    allCount.append(len(days)+1)
    
# Make a new DF with the consecutive day count
consecutiveDF = pd.DataFrame({"Name":df.Name.unique(), "consecutive_days":allCount})

# Drop any row that does not have at least 1 consecutive day
consecutiveDF = consecutiveDF[consecutiveDF.consecutive_days > 1].reset_index(drop=True)
consecutiveDF

1 Comment

Don't know how great this will be for a petabyte of data, maybe it can help with your logic though.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.