
I have a DataFrame of transactions. I want to group these transactions by their item and time column values: the goal is to group items that are within 1 hour of each other. We start a new group at the time of the next observation that wasn't within an hour of the observation prior (see the start time column in DataFrame B).

Here is the data: I want to convert A to B.

A=
item    time             result
A   2016-04-18 13:08:25  Y
A   2016-04-18 13:57:05  N
A   2016-04-18 14:00:12  N
A   2016-04-18 23:45:50  Y
A   2016-04-20 16:53:48  Y
A   2016-04-20 17:11:47  N
B   2016-04-18 15:24:48  N
C   2016-04-23 13:20:44  N
C   2016-04-23 14:02:23  Y


B=
item    start time            end time      Ys  Ns  total count
A   2016-04-18 13:08:25 2016-04-18 14:08:25 1   2   3
A   2016-04-18 23:45:50 2016-04-19 00:45:50 1   0   1
A   2016-04-20 16:53:48 2016-04-20 17:53:48 1   1   2
B   2016-04-18 15:24:48 2016-04-18 16:24:48 0   1   1
C   2016-04-23 13:20:44 2016-04-23 14:20:44 1   1   2

Here is what I did:

grouped = A.groupby('item')
# end of a 1-hour window measured from each item's first transaction
A['end'] = grouped['time'].transform(lambda grp: grp.min() + pd.Timedelta(hours=1))
# keep only the transactions that fall inside that first window
A2 = A.loc[A['time'] <= A['end']]

This gives me only one group per item: the transactions within 1 hour of that item's first transaction. So I'm missing other transactions in the same day that are more than 1 hour apart from the first. My struggle is how to get those groups. I can then use pd.crosstab to get the details I want from the result column.

Another idea I have is to sort A by item and time and then go row by row: if a row's time is within 1 hour of the previous row, it joins that group; otherwise it starts a new group.
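A minimal sketch of that row-by-row idea, vectorized with groupby/diff/cumsum (the grp column and the aggregation names are made up for illustration, not from the question): a gap of more than 1 hour to the previous row of the same item starts a new group.

```python
import pandas as pd

A = pd.DataFrame({
    'item': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'C', 'C'],
    'time': pd.to_datetime([
        '2016-04-18 13:08:25', '2016-04-18 13:57:05', '2016-04-18 14:00:12',
        '2016-04-18 23:45:50', '2016-04-20 16:53:48', '2016-04-20 17:11:47',
        '2016-04-18 15:24:48', '2016-04-23 13:20:44', '2016-04-23 14:02:23']),
    'result': ['Y', 'N', 'N', 'Y', 'Y', 'N', 'N', 'N', 'Y'],
})

A = A.sort_values(['item', 'time'])
# True whenever the gap to the previous row of the same item exceeds 1 hour;
# the cumulative sum then yields a distinct id per run of close-together rows
new_group = A.groupby('item')['time'].diff() > pd.Timedelta(hours=1)
A['grp'] = new_group.cumsum()

B = (A.groupby(['item', 'grp'])
       .agg(start_time=('time', 'min'),
            Ys=('result', lambda s: (s == 'Y').sum()),
            Ns=('result', lambda s: (s == 'N').sum()),
            total_count=('result', 'size'))
       .reset_index()
       .drop(columns='grp'))
```

pd.crosstab([A['item'], A['grp']], A['result']) would give the Y/N counts directly, as mentioned above. Note this chains windows off the previous row rather than off the group's first row, so it can differ from the fixed 1-hour windows in B when transactions trickle in (e.g. 13:00, 13:50, 14:40 would all land in one group here).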

  • There are lots of questions left unanswered. Like, grouped within one hour of when? One hour of first observation? What about the next hour? Does it start when the last hour left off? Or do we start a new hour at the time of the next observation that wasn't within an hour of the observation prior? Commented May 11, 2016 at 18:17
  • what is grouped in your code? How did you get it? Commented May 11, 2016 at 18:18
  • @piRSquared I added more details to the question to clarify. Commented May 11, 2016 at 18:30
  • @MaxU I grouped by item, I added that to the question. Commented May 11, 2016 at 18:38

3 Answers


1) Set up a window_end column for later use with .groupby(), and define get_windows() to check, for each item group, whether a row fits the current 1-hour window, or else leave the initialized value as-is. Apply to all item groups:

df['window_end'] = df.time + pd.Timedelta('1H')

def get_windows(data):
    # the group's first row opens the first 1-hour window
    window_end = data.iloc[0].window_end
    for index, row in data.iloc[1:].iterrows():
        if window_end > row.time:
            # row falls inside the open window: stamp it with that window's end
            df.loc[index, 'window_end'] = window_end
        else:
            # row starts a new window; keep its own window_end
            window_end = row.window_end

df.groupby('item').apply(get_windows)

2) Group by window_end and item with .groupby(), return .value_counts() as a transposed DataFrame, clean up the index, and add a total:

df = df.groupby(['window_end', 'item']).result.apply(lambda x: x.value_counts().to_frame().T)
df = df.fillna(0).astype(int).reset_index(level=2, drop=True)
df['total'] = df.sum(axis=1)

to get:

                            N  Y  total
window_end          item               
2016-04-18 14:08:25 A     2  1      3
2016-04-18 16:24:48 B     1  0      1
2016-04-19 00:45:50 A     0  1      1
2016-04-20 17:53:48 A     1  1      2
2016-04-23 14:20:44 C     1  1      2

3 Comments

Thanks, yes unfortunately I cannot use Hour as my grouper.
Thanks. A couple of comments: in your second step, windows should be replaced by window_end, right? Also you may want to use another name for your result DataFrame so it is not mistaken for the result column.
That's right, had been fiddling with the code while editing here, never a good idea. Should be working now.

Inspired (+1) by Stefan's solution, I came to this one:

B = (A.groupby(['item', A.groupby('item')['time']
                         .diff().fillna(0).dt.total_seconds()//60//60
               ],
               as_index=False)['time'].min()
)


B[['N','Y']] = (A.groupby(['item', A.groupby('item')['time']
                                    .diff().fillna(0).dt.total_seconds()//60//60
                          ])['result']
                 .apply(lambda x: x.value_counts().to_frame().T).fillna(0)
                 .reset_index()[['N','Y']]
)

Output:

In [178]: B
Out[178]:
  item                time    N    Y
0    A 2016-04-18 13:08:25  3.0  1.0
1    A 2016-04-18 23:45:50  0.0  1.0
2    A 2016-04-20 16:53:48  0.0  1.0
3    B 2016-04-18 15:24:48  1.0  0.0
4    C 2016-04-23 13:20:44  1.0  1.0

PS the idea is to use A.groupby('item')['time'].diff().fillna(0).dt.total_seconds()//60//60 as a part of grouping:

In [179]: A.groupby('item')['time'].diff().fillna(0).dt.total_seconds()//60//60
Out[179]:
0     0.0
1     0.0
2     0.0
3     9.0
4    41.0
5     0.0
6     0.0
7     0.0
8     0.0
Name: time, dtype: float64
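As an aside (an assumption about newer pandas, not part of the original answer): on versions where the grouping key above is awkward to build, the same whole-hour key can be produced without .dt.total_seconds() by floor-dividing the timedeltas by a pd.Timedelta directly:

```python
import pandas as pd

# the per-item gaps between consecutive transactions, as timedeltas
gaps = pd.Series(pd.to_timedelta(
    ['00:00:00', '00:48:40', '00:03:07', '09:45:38', '41:07:58', '00:17:59']))

# floor-divide timedelta64 values by a Timedelta to get whole hours as integers
hours = gaps // pd.Timedelta(hours=1)
```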

1 Comment

Thanks @MaxU, I get an AttributeError: 'TimedeltaProperties' object has no attribute 'total_seconds'. I have import datetime as dt.

Setup

import pandas as pd
from io import StringIO

text = """item    time             result
A   2016-04-18 13:08:25  Y
A   2016-04-18 13:57:05  N
A   2016-04-18 14:00:12  N
A   2016-04-18 23:45:50  Y
A   2016-04-20 16:53:48  Y
A   2016-04-20 17:11:47  N
B   2016-04-18 15:24:48  N
C   2016-04-23 13:20:44  N
C   2016-04-23 14:02:23  Y
"""

df = pd.read_csv(StringIO(text), delimiter=r"\s{2,}", parse_dates=[1], engine='python')

Solution

I needed to create a couple of processing functions:

def set_time_group(df):
    cur_time = pd.NaT
    for index, row in df.iterrows():
        if pd.isnull(cur_time):
            cur_time = row.time
        delta = row.time - cur_time
        # use total_seconds(): .seconds alone ignores the days component
        if delta.total_seconds() / 3600. < 1:
            df.loc[index, 'time_ref'] = cur_time
        else:
            df.loc[index, 'time_ref'] = row.time
            cur_time = row.time
    return df

def summarize_results(df):
    df_ = df.groupby('result').count().iloc[:, 0]
    df_.loc['total count'] = df_.sum()
    return df_

dfg1 = df.groupby('item').apply(set_time_group)
dfg2 = dfg1.groupby(['item', 'time_ref']).apply(summarize_results)
df_f = dfg2.unstack().fillna(0)

Demonstration

print(df_f)

result                      N    Y  total count
item time_ref                                  
A    2016-04-18 13:08:25  2.0  1.0          3.0
     2016-04-18 23:45:50  0.0  1.0          1.0
     2016-04-20 16:53:48  1.0  1.0          2.0
B    2016-04-18 15:24:48  1.0  0.0          1.0
C    2016-04-23 13:20:44  1.0  1.0          2.0
