
I want to get the frequency count of strings within a column. On one hand, this is similar to collapsing a dataframe to a set of rows that only reflects the distinct strings in the column. I was able to solve this with a loop, but I know there is a better solution.

Example df:

       2017-08-09  2017-08-10
id                                                             
0             pre         pre   
2      active_1-3    active_1   
3        active_1    active_1   
4      active_3-7  active_3-7   
5        active_1    active_1

And want to get out:

       2017-08-09  2017-08-10
pre             1           1
active_1        2           3
active_1-3      1           0
active_3-7      1           1

I searched a lot of forums but couldn't find a good answer.

I'm assuming a pivot_table approach is the right one, but I couldn't find the right arguments to collapse a table that doesn't have an obvious index for the output df.

I was able to get this to work by iterating over each column, using value_counts(), and concatenating each value-count series into a new dataframe, but I know there is a better solution.

import pandas as pd

date_cols = df.columns          # the date columns to count over
output_df = pd.DataFrame()
for col in date_cols:
    new_values = df[col].value_counts()
    output_df = pd.concat([output_df, new_values], axis=1)

Thanks!

4 Answers


You can use value_counts with pd.Series (thanks for the improvement, Jon), i.e.

ndf = df.apply(pd.Series.value_counts).fillna(0)
           2017-08-09  2017-08-10
active_1             2         3.0
active_1-3           1         0.0
active_3-7           1         1.0
pre                  1         1.0
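For a self-contained run, the one-liner can be tried on a frame rebuilt from the question's example (the id index values are taken from the question):

```python
import pandas as pd

# Rebuild the question's example frame.
df = pd.DataFrame(
    {
        "2017-08-09": ["pre", "active_1-3", "active_1", "active_3-7", "active_1"],
        "2017-08-10": ["pre", "active_1", "active_1", "active_3-7", "active_1"],
    },
    index=pd.Index([0, 2, 3, 4, 5], name="id"),
)

# value_counts per column; labels missing from a column come back as NaN,
# which fillna(0) replaces with 0.
ndf = df.apply(pd.Series.value_counts).fillna(0)
print(ndf)
```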

Timings:

k = pd.concat([df]*1000)

# @cᴏʟᴅsᴘᴇᴇᴅ's method
%%timeit
pd.get_dummies(k.T).groupby(by=lambda x: x.split('_', 1)[1], axis=1).sum().T
1 loop, best of 3: 5.68 s per loop

# @cᴏʟᴅsᴘᴇᴇᴅ's method
%%timeit
k.stack().str.get_dummies().sum(level=1).T
10 loops, best of 3: 84.1 ms per loop

# My method
%%timeit
k.apply(pd.Series.value_counts).fillna(0)
100 loops, best of 3: 7.57 ms per loop

# FabienP's method
%%timeit
k.unstack().groupby(level=0).value_counts().unstack().T.fillna(0)
100 loops, best of 3: 7.35 ms per loop

# @Wen's method (fastest so far)
%%timeit
pd.concat([pd.Series(collections.Counter(k[x])) for x in df.columns], axis=1)
100 loops, best of 3: 4 ms per loop

3 Comments

You don't need the lambda here... df.apply(pd.Series.value_counts, 0).fillna(0)
I was trying to pass value_counts directly, which didn't work. I didn't think about pd.Series, thank you
@Bharathshetty, I suggested another alternative, if you want to compare.

I do not know why I'm addicted to using apply in this strange way ...

df.apply(lambda x : x.groupby(x).count()).fillna(0)
Out[31]: 
            2017-08-09  2017-08-10
active_1             2         3.0
active_1-3           1         0.0
active_3-7           1         1.0
pre                  1         1.0

Or

import collections
df.apply(lambda x : pd.Series(collections.Counter(x))).fillna(0)

As I expected, a simple for loop is faster than apply

pd.concat([pd.Series(collections.Counter(df[x])) for x in df.columns],axis=1)
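A runnable sketch of that loop on the question's frame; the keys=df.columns and fillna(0) parts are additions here, to keep the original column labels and zero-fill missing entries (the bare one-liner above leaves unlabeled 0/1 columns and NaNs):

```python
import collections

import pandas as pd

df = pd.DataFrame(
    {
        "2017-08-09": ["pre", "active_1-3", "active_1", "active_3-7", "active_1"],
        "2017-08-10": ["pre", "active_1", "active_1", "active_3-7", "active_1"],
    }
)

# One Counter per column; pd.Series turns each Counter dict into a
# label -> count series, and concat lines them up side by side.
counts = pd.concat(
    [pd.Series(collections.Counter(df[x])) for x in df.columns],
    axis=1,
    keys=df.columns,  # keep the date labels as column names (addition)
).fillna(0)           # zero-fill labels absent from a column (addition)
print(counts)
```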

1 Comment

100 loops, best of 3: 4 ms per loop — things are getting interesting

stack + get_dummies + sum:

df.stack().str.get_dummies().sum(level=1).T

            2017-08-09  2017-08-10
active_1             2           3
active_1-3           1           0
active_3-7           1           1
pre                  1           1

Very piR-esque if I do say so myself, elegance-wise, not speed-wise.


Alternative with pd.get_dummies + groupby:

pd.get_dummies(df.T).groupby(by=lambda x: x.split('_', 1)[1], axis=1).sum().T

            2017-08-09  2017-08-10
active_1             2           3
active_1-3           1           0
active_3-7           1           1
pre                  1           1
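Note that Series.sum(level=...) was removed in pandas 2.0, so on current pandas the same stack + get_dummies idea needs an explicit groupby — a sketch of the equivalent spelling:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "2017-08-09": ["pre", "active_1-3", "active_1", "active_3-7", "active_1"],
        "2017-08-10": ["pre", "active_1", "active_1", "active_3-7", "active_1"],
    }
)

# stack -> (row, date) MultiIndex Series of strings; get_dummies one-hot
# encodes each string; grouping on the date level and summing counts them,
# and .T puts the strings back on the rows.
counts = df.stack().str.get_dummies().groupby(level=1).sum().T
print(counts)
```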

4 Comments

That was my second option, I was about to add that
Not piR-esque, he cares too much about speed. It will take too much time if df exceeds 1000 rows.
Try with a bigger dataset once. It went to seconds on my PC
@coldspeed I added the timings; it's not in ms, it went up to seconds. I think get_dummies is not good here

Another solution using groupby and value_counts

df.unstack().groupby(level=0).value_counts().unstack().T.fillna(0)
Out[]:
            2017-08-09  2017-08-10
active_1           2.0         3.0
active_1-3         1.0         0.0
active_3-7         1.0         1.0
pre                1.0         1.0

Or, avoiding the last call to fillna:

df.unstack().groupby(level=0).value_counts().unstack(fill_value=0).T
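A self-contained version of this chain, with the frame rebuilt from the question's example:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "2017-08-09": ["pre", "active_1-3", "active_1", "active_3-7", "active_1"],
        "2017-08-10": ["pre", "active_1", "active_1", "active_3-7", "active_1"],
    }
)

# unstack -> (date, row) MultiIndex Series; group by the date level and
# count each string; unstack(fill_value=0) spreads the strings back out as
# columns and zero-fills strings a date never saw; .T orients the result.
counts = df.unstack().groupby(level=0).value_counts().unstack(fill_value=0).T
print(counts)
```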

1 Comment

@Bharathshetty, yep that's close. Just found I can avoid the last call to fillna if playing for speed ;). But that won't make a big difference anyway.
