
I want to get the frequency count of strings within a column. On one hand, this is similar to collapsing a dataframe to a set of rows that only reflects the distinct strings in the column. I was able to solve this with a loop, but I know there is a better solution.

Example df:

       2017-08-09  2017-08-10
id                                                             
0             pre         pre   
2      active_1-3    active_1   
3        active_1    active_1   
4      active_3-7  active_3-7   
5        active_1    active_1

And want to get out:

       2017-08-09  2017-08-10
pre             1           1
active_1        2           3
active_1-3      1           0
active_3-7      1           1

I searched a lot of forums but couldn't find a good answer.

I'm assuming a pivot_table approach is the right one, but I couldn't find the right arguments to collapse a table that doesn't have an obvious index for the output df.

I was able to get this to work by iterating over each column, using value_counts(), and concatenating each value-count series into a new dataframe, but I know there is a better solution.

import pandas as pd

date_cols = df.columns          # the date columns to count over
output_df = pd.DataFrame()
for col in date_cols:
    new_values = df[col].value_counts()
    output_df = pd.concat([output_df, new_values], axis=1)

Thanks!

4 Answers


You can use value_counts with pd.Series (thanks for the improvement, Jon), i.e.

ndf = df.apply(pd.Series.value_counts).fillna(0)
           2017-08-09  2017-08-10
active_1             2         3.0
active_1-3           1         0.0
active_3-7           1         1.0
pre                  1         1.0
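For a self-contained run, the one-liner can be tried on a frame rebuilt from the question's example (the id index values are taken from the question):

```python
import pandas as pd

# Rebuild the question's example frame.
df = pd.DataFrame(
    {
        "2017-08-09": ["pre", "active_1-3", "active_1", "active_3-7", "active_1"],
        "2017-08-10": ["pre", "active_1", "active_1", "active_3-7", "active_1"],
    },
    index=pd.Index([0, 2, 3, 4, 5], name="id"),
)

# value_counts per column; labels missing from a column come back as NaN,
# which fillna(0) replaces with 0.
ndf = df.apply(pd.Series.value_counts).fillna(0)
print(ndf)
```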

Timings:

k = pd.concat([df]*1000)

# @cᴏʟᴅsᴘᴇᴇᴅ's method
%%timeit
pd.get_dummies(k.T).groupby(by=lambda x: x.split('_', 1)[1], axis=1).sum().T
1 loop, best of 3: 5.68 s per loop

# @cᴏʟᴅsᴘᴇᴇᴅ's method
%%timeit
k.stack().str.get_dummies().sum(level=1).T
10 loops, best of 3: 84.1 ms per loop

# My method
%%timeit
k.apply(pd.Series.value_counts).fillna(0)
100 loops, best of 3: 7.57 ms per loop

# FabienP's method
%%timeit
k.unstack().groupby(level=0).value_counts().unstack().T.fillna(0)
100 loops, best of 3: 7.35 ms per loop

# @Wen's method (fastest so far)
%%timeit
pd.concat([pd.Series(collections.Counter(k[x])) for x in df.columns], axis=1)
100 loops, best of 3: 4 ms per loop

3 Comments

You don't need the lambda here... df.apply(pd.Series.value_counts, 0).fillna(0)
I was trying to pass value_counts directly, which didn't work. I didn't think about pd.Series, thank you
@Bharathshetty, I suggested another alternative, if you want to compare.

I do not know why I'm addicted to using apply in this strange way ...

df.apply(lambda x : x.groupby(x).count()).fillna(0)
Out[31]: 
            2017-08-09  2017-08-10
active_1             2         3.0
active_1-3           1         0.0
active_3-7           1         1.0
pre                  1         1.0

Or

import collections
df.apply(lambda x : pd.Series(collections.Counter(x))).fillna(0)

As I expected, a simple for loop is faster than apply

pd.concat([pd.Series(collections.Counter(df[x])) for x in df.columns],axis=1)
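A runnable sketch of that loop on the question's frame; the keys=df.columns and fillna(0) parts are additions here, to keep the original column labels and zero-fill missing entries (the bare one-liner above leaves unlabeled 0/1 columns and NaNs):

```python
import collections

import pandas as pd

df = pd.DataFrame(
    {
        "2017-08-09": ["pre", "active_1-3", "active_1", "active_3-7", "active_1"],
        "2017-08-10": ["pre", "active_1", "active_1", "active_3-7", "active_1"],
    }
)

# One Counter per column; pd.Series turns each Counter dict into a
# label -> count series, and concat lines them up side by side.
counts = pd.concat(
    [pd.Series(collections.Counter(df[x])) for x in df.columns],
    axis=1,
    keys=df.columns,  # keep the date labels as column names (addition)
).fillna(0)           # zero-fill labels absent from a column (addition)
print(counts)
```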

1 Comment

100 loops, best of 3: 4 ms per loop — things are getting interesting

stack + get_dummies + sum:

df.stack().str.get_dummies().sum(level=1).T

            2017-08-09  2017-08-10
active_1             2           3
active_1-3           1           0
active_3-7           1           1
pre                  1           1

Very piR-esque if I do say so myself, elegance-wise, not speed-wise.


Alternative with pd.get_dummies + groupby:

pd.get_dummies(df.T).groupby(by=lambda x: x.split('_', 1)[1], axis=1).sum().T

            2017-08-09  2017-08-10
active_1             2           3
active_1-3           1           0
active_3-7           1           1
pre                  1           1
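Note that Series.sum(level=...) was removed in pandas 2.0, so on current pandas the same stack + get_dummies idea needs an explicit groupby — a sketch of the equivalent spelling:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "2017-08-09": ["pre", "active_1-3", "active_1", "active_3-7", "active_1"],
        "2017-08-10": ["pre", "active_1", "active_1", "active_3-7", "active_1"],
    }
)

# stack -> (row, date) MultiIndex Series of strings; get_dummies one-hot
# encodes each string; grouping on the date level and summing counts them,
# and .T puts the strings back on the rows.
counts = df.stack().str.get_dummies().groupby(level=1).sum().T
print(counts)
```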

4 Comments

That was my second option, I was about to add that
Not piR-esque, he cares too much about speed. It will take too much time if df exceeds 1000 rows.
Try with a bigger dataset once. It went to seconds on my PC
@coldspeed I added the timings; it's not in ms, it went up to seconds. I think get_dummies is not good here

Another solution using groupby and value_counts

df.unstack().groupby(level=0).value_counts().unstack().T.fillna(0)
Out[]:
            2017-08-09  2017-08-10
active_1           2.0         3.0
active_1-3         1.0         0.0
active_3-7         1.0         1.0
pre                1.0         1.0

Or, avoiding the last call to fillna:

df.unstack().groupby(level=0).value_counts().unstack(fill_value=0).T
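A self-contained version of this chain, with the frame rebuilt from the question's example:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "2017-08-09": ["pre", "active_1-3", "active_1", "active_3-7", "active_1"],
        "2017-08-10": ["pre", "active_1", "active_1", "active_3-7", "active_1"],
    }
)

# unstack -> (date, row) MultiIndex Series; group by the date level and
# count each string; unstack(fill_value=0) spreads the strings back out as
# columns and zero-fills strings a date never saw; .T orients the result.
counts = df.unstack().groupby(level=0).value_counts().unstack(fill_value=0).T
print(counts)
```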

1 Comment

@Bharathshetty, yep that's close. Just found I can avoid the last call to fillna if playing for speed ;). But that won't make a big difference anyway.
