
I have a pandas DataFrame in Python as below:

df['column'] = [['abc', 'mno'],
                ['mno', 'pqr'],
                ['abc', 'mno'],
                ['mno', 'pqr']]

I want to get the count of each item, like below:

abc = 2,
mno = 4,
pqr = 2

I could iterate over each row to count, but that is not the kind of solution I'm looking for. If there is a way to do this with iloc or anything related in pandas, please suggest it.

I have looked at various solutions to similar problems, but none of them fit my scenario.

  • How about you use .explode() and value_counts()? Commented Jan 28, 2020 at 17:57
  • stackoverflow.com/questions/33556050/… and value_counts Commented Jan 28, 2020 at 17:58
  • I understand a comprehension would be faster than iteration, but I was expecting a simple solution using the pandas API. Commented Jan 28, 2020 at 18:01

2 Answers


Here is how I'd solve it using .explode() and .value_counts(); you can then assign the result as a column or use it however you like. In one line:

print(df.explode('column')['column'].value_counts())

Full example:

import pandas as pd
data_1 = {'index':[0,1,2,3],'column':[['abc','mno'],['mno','pqr'],['abc','mno'],['mno','pqr']]}
df = pd.DataFrame(data_1)
df = df.set_index('index')
print(df)
           column
index            
0      [abc, mno]
1      [mno, pqr]
2      [abc, mno]
3      [mno, pqr]

Here we use .explode() to turn each list into individual rows, and value_counts() to count the occurrences of each unique value:

df_new = df.explode('column')
print(df_new['column'].value_counts())

Output:

mno    4
abc    2
pqr    2
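If you want the exact mapping from the question (abc = 2, and so on), the result of value_counts() can be converted to a plain dict. A small self-contained sketch, rebuilding the sample frame from the question:

```python
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({'column': [['abc', 'mno'], ['mno', 'pqr'],
                              ['abc', 'mno'], ['mno', 'pqr']]})

# explode() the Series of lists into one row per item,
# count occurrences, then convert the resulting Series to a dict
counts = df['column'].explode().value_counts().to_dict()
print(counts)  # counts per item, e.g. mno -> 4
```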

2 Comments

You can also explode a series directly, e.g. df["column"].explode().value_counts().
Yes, I had the feeling the OP was dealing with more columns (especially because of the ML tag).

Use collections.Counter:

from collections import Counter
from itertools import chain

Counter(chain.from_iterable(df.column))

Out[196]: Counter({'abc': 2, 'mno': 4, 'pqr': 2})
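For reference, here is a self-contained version of the Counter approach; the sample frame is rebuilt here as an assumption, matching the question:

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({'column': [['abc', 'mno'], ['mno', 'pqr'],
                              ['abc', 'mno'], ['mno', 'pqr']]})

# chain.from_iterable lazily flattens the column of lists;
# Counter tallies each item in a single pass
counts = Counter(chain.from_iterable(df['column']))
print(counts == Counter({'abc': 2, 'mno': 4, 'pqr': 2}))  # True

# Wrap in a Series when a pandas object is needed downstream
s = pd.Series(counts)
```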

Timing comparison (%timeit) on a larger frame:

df1 = pd.concat([df]*10000, ignore_index=True)

In [227]: %timeit pd.Series(Counter(chain.from_iterable(df1.column)))
14.3 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [228]: %timeit df1.column.explode().value_counts()
127 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

3 Comments

@AlexHall: it is about 8x faster than explode and value_counts on a big DataFrame when using chain.from_iterable to flatten.
I take it back: the Counter API is implemented in Python, but the actual counting is in C. My tests confirm that it is faster than value_counts, and from_iterable is faster than explode (maybe this warrants a pandas issue?).
@AlexHall: nothing serious. Cheers! It is well known that np.concatenate is slower than chain.from_iterable for flattening lists; it was just laziness on my part to use it. I switched to from_iterable to show the true speed of Counter. :)
