
I'm working on a pandas DataFrame that needs a new column showing the count of specific values in specific columns.

I tried various combinations of groupby and pivot, but had problems applying them to the whole DataFrame without errors.

import pandas as pd

df = pd.DataFrame([
    ['a', 'z'],
    ['a', 'x'],
    ['a', 'y'],
    ['b', 'v'],
    ['b', 'x'],
    ['b', 'v']],
    columns=['col1', 'col2'])

I need to add col3 that counts the 'v' values in col2 for each value in col1. There is no 'v' in col2 for 'a' in col1, so the count is 0 for all 'a' rows, while the expected count is 2 for every 'b' row, including the row where col2 equals 'x' instead of 'v'.

Expected output:

['a', 'z', 0]
['a', 'x', 0]
['a', 'y', 0]
['b', 'v', 2]
['b', 'x', 2]
['b', 'v', 2]

I'm looking for a pandas-specific solution, because the original DataFrame is quite big and things like row iteration are too time-expensive.

3 Answers


Create a Boolean Series checking the equality, then use groupby + transform + 'sum' to count the matches per group.

df['col3'] = df.col2.eq('v').astype(int).groupby(df.col1).transform('sum')  

#  col1 col2  col3
#0    a    z     0
#1    a    x     0
#2    a    y     0
#3    b    v     2
#4    b    x     2
#5    b    v     2
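
The same pattern extends to more grouping keys. A minimal sketch, assuming a hypothetical extra column col0 that is not part of the original question:

# hypothetical extra grouping column (illustration only)
df['col0'] = ['p', 'p', 'p', 'q', 'q', 'q']

# group the Boolean mask by several keys at once
df['col3'] = df.col2.eq('v').astype(int).groupby([df.col0, df.col1]).transform('sum')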

2 Comments

Super neat stuff :)
Works nicely and can be easily extended, e.g. by adding more columns to the groupby on more complex DataFrames. Thanks!

While ALollz's answer is neat and a one-liner, here is another, two-step solution that introduces other concepts like str.contains and np.where.

First, flag the rows whose col2 contains 'v' using np.where:

import numpy as np

df['col3'] = np.where(df['col2'].str.contains('v'), 1, 0)

Now perform a groupby on col1 and sum the flags:

df['col3'] = df.groupby('col1')['col3'].transform('sum')

Output:

  col1 col2  col3
0    a    z     0
1    a    x     0
2    a    y     0
3    b    v     2
4    b    x     2
5    b    v     2
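
If more than one value in col2 should be counted, the same two steps work with isin instead of a single match. A minimal sketch, where the target list ['v', 'x'] is just an illustration:

import numpy as np

# flag rows whose col2 is any of the target values (hypothetical list)
df['col3'] = np.where(df['col2'].isin(['v', 'x']), 1, 0)
df['col3'] = df.groupby('col1')['col3'].transform('sum')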

1 Comment

I like this one because the arguments to both np.where and groupby can be freely extended for more sophisticated calculations.

All the answers above are fine. The only caveat is that transform can be slow when the group size is very large. Alternatively, you can try the workaround below:

# build a Boolean mask column, aggregate it per col1, and join the counts back as col3
(df.assign(mask=lambda x: x.col2.eq('v'))
   .pipe(lambda x: x.join(x.groupby('col1')['mask'].sum().map(int).rename('col3'), on='col1')))

1 Comment

It doesn't work and prompts this: ValueError: columns overlap but no suffix specified: Index(['col3'], dtype='object'). Could you please advise?
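
A likely cause, assuming df already contains a col3 column from one of the earlier answers: join cannot add a second col3 without a suffix. A sketch that sidesteps the join by mapping the aggregated counts back onto col1 (plain assignment overwrites any existing col3):

# count 'v' per col1 group once, then map the counts back to each row
counts = df['col2'].eq('v').groupby(df['col1']).sum().astype(int)
df['col3'] = df['col1'].map(counts)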
