
I'm working on a pandas DataFrame that needs a new column showing the count of specific values in specific columns.

I tried various combinations of groupby and pivot, but had problems applying them to the whole DataFrame without errors.

import pandas as pd

df = pd.DataFrame([
    ['a', 'z'],
    ['a', 'x'],
    ['a', 'y'],
    ['b', 'v'],
    ['b', 'x'],
    ['b', 'v']],
    columns=['col1', 'col2'])

I need to add col3 that counts the 'v' values in col2 for each value in col1. There is no 'v' in col2 for 'a' in col1, so the count is 0 for all 'a' rows, while the expected count is 2 for every 'b' row, including the row where col2 equals 'x' instead of 'v'.

Expected output:

['a', 'z', 0]
['a', 'x', 0]
['a', 'y', 0]
['b', 'v', 2]
['b', 'x', 2]
['b', 'v', 2]

I'm looking for a pandas-specific solution, because the original DataFrame is quite big and things like row iteration are too time-expensive.

3 Answers


Create a Boolean Series checking the equality, then use groupby + transform + 'sum' to count the matches per group.

df['col3'] = df.col2.eq('v').astype(int).groupby(df.col1).transform('sum')  

#  col1 col2  col3
#0    a    z     0
#1    a    x     0
#2    a    y     0
#3    b    v     2
#4    b    x     2
#5    b    v     2
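
The same pattern extends to more grouping keys. A minimal sketch, assuming a hypothetical extra column col0 that is not part of the original question:

# hypothetical extra grouping column (illustration only)
df['col0'] = ['p', 'p', 'p', 'q', 'q', 'q']

# group the Boolean mask by several keys at once
df['col3'] = df.col2.eq('v').astype(int).groupby([df.col0, df.col1]).transform('sum')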

2 Comments

Super neat stuff :)
Works nicely and can be easily extended, e.g. by adding more columns to the groupby on more complex DataFrames. Thanks!

While ALollz's answer is neat and a one-liner, here is another, two-step solution that introduces other concepts like str.contains and np.where.

First, flag the rows whose col2 contains 'v' using np.where:

import numpy as np

df['col3'] = np.where(df['col2'].str.contains('v'), 1, 0)

Now perform a groupby on col1 and sum the flags:

df['col3'] = df.groupby('col1')['col3'].transform('sum')

Output:

  col1 col2  col3
0    a    z     0
1    a    x     0
2    a    y     0
3    b    v     2
4    b    x     2
5    b    v     2
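
If more than one value in col2 should be counted, the same two steps work with isin instead of a single match. A minimal sketch, where the target list ['v', 'x'] is just an illustration:

import numpy as np

# flag rows whose col2 is any of the target values (hypothetical list)
df['col3'] = np.where(df['col2'].isin(['v', 'x']), 1, 0)
df['col3'] = df.groupby('col1')['col3'].transform('sum')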

1 Comment

I like this one because the arguments to both np.where and groupby can be freely extended for more sophisticated calculations.

All the answers above are fine. The only caveat is that transform can be slow when the group size is very large. Alternatively, you can try the workaround below:

# build a Boolean mask column, aggregate it per col1, and join the counts back as col3
(df.assign(mask=lambda x: x.col2.eq('v'))
   .pipe(lambda x: x.join(x.groupby('col1')['mask'].sum().map(int).rename('col3'), on='col1')))

1 Comment

It doesn't work and prompts this: ValueError: columns overlap but no suffix specified: Index(['col3'], dtype='object'). Could you please advise?
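
A likely cause, assuming df already contains a col3 column from one of the earlier answers: join cannot add a second col3 without a suffix. A sketch that sidesteps the join by mapping the aggregated counts back onto col1 (plain assignment overwrites any existing col3):

# count 'v' per col1 group once, then map the counts back to each row
counts = df['col2'].eq('v').groupby(df['col1']).sum().astype(int)
df['col3'] = df['col1'].map(counts)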
