
I have a pandas DataFrame in Python as below:

df['column'] = [['abc', 'mno'],
                ['mno', 'pqr'],
                ['abc', 'mno'],
                ['mno', 'pqr']]

I want to get the count of each item, like below:

abc = 2,
mno = 4,
pqr = 2

I could iterate over each row to count, but that is not the kind of solution I'm looking for. If there is a way to do this with iloc or anything related in pandas, please suggest it.

I have looked at various solutions to similar problems, but none of them fit my scenario.

  • How about you use .explode() and value_counts()? Commented Jan 28, 2020 at 17:57
  • stackoverflow.com/questions/33556050/… and value_counts Commented Jan 28, 2020 at 17:58
  • I understand a comprehension would be faster than iteration, but I was expecting a simple solution using the pandas API. Commented Jan 28, 2020 at 18:01

2 Answers


Here is how I'd solve it using .explode() and .value_counts(); you can then assign the result as a column or use it however you like. In one line:

print(df.explode('column')['column'].value_counts())

Full example:

import pandas as pd
data_1 = {'index':[0,1,2,3],'column':[['abc','mno'],['mno','pqr'],['abc','mno'],['mno','pqr']]}
df = pd.DataFrame(data_1)
df = df.set_index('index')
print(df)
           column
index            
0      [abc, mno]
1      [mno, pqr]
2      [abc, mno]
3      [mno, pqr]

Here we use .explode() to turn each list into individual rows, and value_counts() to count the occurrences of each unique value:

df_new = df.explode('column')
print(df_new['column'].value_counts())

Output:

mno    4
abc    2
pqr    2
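If you want the exact mapping from the question (abc = 2, and so on), the result of value_counts() can be converted to a plain dict. A small self-contained sketch, rebuilding the sample frame from the question:

```python
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({'column': [['abc', 'mno'], ['mno', 'pqr'],
                              ['abc', 'mno'], ['mno', 'pqr']]})

# explode() the Series of lists into one row per item,
# count occurrences, then convert the resulting Series to a dict
counts = df['column'].explode().value_counts().to_dict()
print(counts)  # counts per item, e.g. mno -> 4
```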

2 Comments

You can also explode a series directly, e.g. df["column"].explode().value_counts().
Yes, I had the feeling the OP was dealing with more columns (especially because of the ML tag).

Use collections.Counter:

from collections import Counter
from itertools import chain

Counter(chain.from_iterable(df.column))

Out[196]: Counter({'abc': 2, 'mno': 4, 'pqr': 2})
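For reference, here is a self-contained version of the Counter approach; the sample frame is rebuilt here as an assumption, matching the question:

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({'column': [['abc', 'mno'], ['mno', 'pqr'],
                              ['abc', 'mno'], ['mno', 'pqr']]})

# chain.from_iterable lazily flattens the column of lists;
# Counter tallies each item in a single pass
counts = Counter(chain.from_iterable(df['column']))
print(counts == Counter({'abc': 2, 'mno': 4, 'pqr': 2}))  # True

# Wrap in a Series when a pandas object is needed downstream
s = pd.Series(counts)
```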

Timing comparison (%timeit) on a larger frame:

df1 = pd.concat([df]*10000, ignore_index=True)

In [227]: %timeit pd.Series(Counter(chain.from_iterable(df1.column)))
14.3 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [228]: %timeit df1.column.explode().value_counts()
127 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

3 Comments

@AlexHall: it is about 8x faster than explode and value_counts on a big DataFrame when using chain.from_iterable to flatten.
I take it back: the Counter API is implemented in Python, but the actual counting is in C. My tests confirm that it is faster than value_counts, and from_iterable is faster than explode (maybe this warrants a pandas issue?).
@AlexHall: nothing serious. Cheers! It is well known that np.concatenate is slower than chain.from_iterable for flattening lists; it was just laziness on my part to use it. I switched to from_iterable to show the true speed of Counter. :)
