finding streaks in pandas dataframe

Question

I have a pandas dataframe as follows:

time    winner  loser   stat
1       A       B       0
2       C       B       0
3       D       B       1
4       E       B       0
5       F       A       0
6       G       A       0
7       H       A       0
8       I       A       1

each row is a match result. the first column is the time of the match, second and third column contain winner/loser and the fourth column is one stat from the match.

I want to detect streaks of zeros for this stat per loser.

The expected result should look like this:

time    winner  loser   stat    streak
1       A       B       0       1
2       C       B       0       2
3       D       B       1       0
4       E       B       0       1
5       F       A       0       1
6       G       A       0       2
7       H       A       0       3
8       I       A       1       0

In pseudocode the algorithm should work like this:

.groupby loser column.
then iterate over each row of each loser group
in each row, look at the stat column: if it contains 0, then increment the streak value from the previous row by 0. if it is not 0, then start a new streak, that is, put 0 into the streak column.

So the .groupby is clear. But then I would need some sort of .apply where I can look at the previous row? this is where I am stuck.

Not exactly an answer, but it think the solution might be found by combining this and this — M.T
– M.T, Commented Feb 16, 2016 at 8:50

jezrael · Accepted Answer · 2016-02-16 09:33:02Z

You can apply custom function f, then cumsum, cumcount and astype:

def f(x):
    x['streak'] = x.groupby( (x['stat'] != 0).cumsum()).cumcount() + 
                  ( (x['stat'] != 0).cumsum() == 0).astype(int) 
    return x

df = df.groupby('loser', sort=False).apply(f)
print df
   time winner loser  stat  streak
0     1      A     B     0       1
1     2      C     B     0       2
2     3      D     B     1       0
3     4      E     B     0       1
4     5      F     A     0       1
5     6      G     A     0       2
6     7      H     A     0       3
7     8      I     A     1       0

For better undestanding:

def f(x):
    x['c'] = (x['stat'] != 0).cumsum()
    x['a'] = (x['c'] == 0).astype(int)
    x['b'] = x.groupby( 'c' ).cumcount()

    x['streak'] = x.groupby( 'c' ).cumcount() + x['a']

    return x
df = df.groupby('loser', sort=False).apply(f)
print df
   time winner loser  stat  c  a  b  streak
0     1      A     B     0  0  1  0       1
1     2      C     B     0  0  1  1       2
2     3      D     B     1  1  0  0       0
3     4      E     B     0  1  0  1       1
4     5      F     A     0  0  1  0       1
5     6      G     A     0  0  1  1       2
6     7      H     A     0  0  1  2       3
7     8      I     A     1  1  0  0       0

Community · Accepted Answer · 2017-05-23 10:30:06Z

5

Not as elegant as jezrael's answer, but for me easier to understand...

First, define a function that works with a single loser:

def f(df):
    df['streak2'] = (df['stat'] == 0).cumsum()
    df['cumsum'] = np.nan
    df.loc[df['stat'] == 1, 'cumsum'] = df['streak2']
    df['cumsum'] = df['cumsum'].fillna(method='ffill')
    df['cumsum'] = df['cumsum'].fillna(0)
    df['streak'] = df['streak2'] - df['cumsum']
    df.drop(['streak2', 'cumsum'], axis=1, inplace=True)
    return df

The streak is essentially a cumsum, but we need to reset it each time stat is 1. We therefore subtract the value of the cumsum where stat is 1, carried forward until the next 1.

Then groupby and apply by loser:

df.groupby('loser').apply(f)

The result is as expected.

edited May 23, 2017 at 10:30

CommunityBot

11 silver badge

answered Feb 16, 2016 at 9:52

IanS

16.3k9 gold badges64 silver badges87 bronze badges

1 Comment

beta Over a year ago

i like this solution most, because it's really transparent. moreover, jezrael's answer only works if the stat column has 0 or 1 values.. in fact, my real data also contains other values (e.g. 3, 4, 5) in the stat column. sorry, that's of course my bad, for not pointing this out in my example. thanks for your help!

Anton Protopopov · Accepted Answer · 2016-02-16 09:41:24Z

You could use iterrows to access previous row:

df['streak'] = 0

for i, row in df.iterrows():
    if i != 0:
        if row['stat'] == 0:
            if row['loser'] == df.ix[i-1, 'loser']:
                df.ix[i, 'streak'] = df.ix[i-1, 'streak'] + 1        
            else:
                df.ix[i, 'streak'] = 1
    else:
        if row['stat'] == 0:
            df.ix[i, 'streak'] = 1

Which gives:

In [210]: df
Out[210]:
   time winner loser  stat  streak
0     1      A     B     0       1
1     2      C     B     0       2
2     3      D     B     1       0
3     4      E     B     0       1
4     5      F     A     0       1
5     6      G     A     0       2
6     7      H     A     0       3
7     8      I     A     1       0

Collectives™ on Stack Overflow

finding streaks in pandas dataframe

3 Answers 3

Comments

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related