Add a new column with values based on groupby values of two other columns

Question

I'm trying to add a new column to a dataframe, based on a groupby of the original dataframe, assigning one of three values depending on the results of that groupby.

For example, in the df below I want to add a fourth column to show whether I have too many 'A's, too many 'C's, or just the right amount.

    col1 col2  col3
0   1111    A     1
1   1111    B     3
2   1111    B     3
3   1111    B     3
4   1111    C     1
5   2222    A     1
6   2222    B     1
7   2222    C     2
8   2222    C     2
9   3333    A     2
10  3333    A     2
11  3333    B     1
12  3333    C     1

becomes...

    col1 col2  col3 col4
0   1111    A     1   OK
1   1111    B     3   OK
2   1111    B     3   OK
3   1111    B     3   OK
4   1111    C     1   OK
5   2222    A     1   >C
6   2222    B     1   >C
7   2222    C     2   >C
8   2222    C     2   >C
9   3333    A     2   >A
10  3333    A     2   >A
11  3333    B     1   >A
12  3333    C     1   >A

I was thinking of using transform, but that returns a Series, and I think in this case I need to check two different columns.

Example:

import pandas as pd

d1 = {'col1': ['1111', '1111', '1111', '1111', '1111', '2222', '2222', '2222', '2222', '3333', '3333', '3333', '3333'],
      'col2': ['A', 'B', 'B', 'B', 'C', 'A', 'B', 'C', 'C', 'A', 'A', 'B', 'C'],
      'col3': [1, 3, 3, 3, 1, 1, 1, 2, 2, 2, 2, 1, 1]}
df1 = pd.DataFrame(data=d1)
d2 = {'col1': ['1111', '1111', '1111', '1111', '1111', '2222', '2222', '2222', '2222', '3333', '3333', '3333', '3333'],
      'col2': ['A', 'B', 'B', 'B', 'C', 'A', 'B', 'C', 'C', 'A', 'A', 'B', 'C'],
      'col3': [1, 3, 3, 3, 1, 1, 1, 2, 2, 2, 2, 1, 1],
      'col4': ['OK', 'OK', 'OK', 'OK', 'OK', '>C', '>C', '>C', '>C', '>A', '>A', '>A', '>A']}
df2 = pd.DataFrame(data=d2)
print(df1)
print(df2)

What happens when both A and C occur too many times?

– anky, Commented Jan 21, 2020 at 9:42
Why is the value OK? Because of 3? Or because of 3 B's?

– jezrael, Commented Jan 21, 2020 at 9:42
Can you clarify how you are populating the 4th column?

– Vishakha Lall, Commented Jan 21, 2020 at 9:42

anky · Accepted Answer · 2020-01-21 10:10:12Z

From what I understand, try this:

Get the counts with a crosstab of col2 against col1, then use .loc[] to keep only the values you want to consider. Use df.gt to check whether each count is greater than 1 and take the dot product with the column names, which joins the names of the flagged columns for each group. Finally, map the result back onto col1 and use Series.fillna to label the remaining groups 'OK'.

Solution:

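import numpy as np
import pandas as pd

values_to_filter = ['A', 'C']  # put 'B' here for testing and the first group will show >B
m = pd.crosstab(df1['col1'], df1['col2']).loc[:, values_to_filter]  # counts per col1 group
# flag counts above 1, join the flagged column names, prefix '>' and default to 'OK'
df1['col4'] = (df1['col1'].map(m.gt(1).dot(m.columns).replace('', np.nan)
                                .radd('>')).fillna('OK'))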

Output:

print(df1)

    col1 col2  col3 col4
0   1111    A     1   OK
1   1111    B     3   OK
2   1111    B     3   OK
3   1111    B     3   OK
4   1111    C     1   OK
5   2222    A     1   >C
6   2222    B     1   >C
7   2222    C     2   >C
8   2222    C     2   >C
9   3333    A     2   >A
10  3333    A     2   >A
11  3333    B     1   >A
12  3333    C     1   >A
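
On the question raised in the comments about both A and C exceeding the limit: with this approach the dot product concatenates every flagged column name, so such a group would presumably be labelled '>AC'. A minimal sketch, assuming a hypothetical group '4444' that is not in the original data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: group '4444' contains two A's and two C's,
# group '1111' contains one of each.
df = pd.DataFrame({
    'col1': ['1111', '1111', '4444', '4444', '4444', '4444'],
    'col2': ['A', 'C', 'A', 'A', 'C', 'C'],
})

# Same steps as the solution above, applied to the hypothetical frame.
m = pd.crosstab(df['col1'], df['col2']).loc[:, ['A', 'C']]
labels = m.gt(1).dot(m.columns).replace('', np.nan).radd('>')
df['col4'] = df['col1'].map(labels).fillna('OK')

print(df)  # rows of group '4444' get '>AC', rows of group '1111' get 'OK'
```

If a different label is wanted for that case, it can be substituted on the labels Series before mapping it back.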

Additional details: m holds the count of each col2 value per col1 value:

print(m)

col2  A  C
col1      
1111  1  1
2222  1  2
3333  2  1
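
To see how the one-liner gets from m to the final labels, the intermediate results can be inspected one at a time. This is a minimal sketch that rebuilds df1 from the question; the names over, joined and labels are introduced here only for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['1111'] * 5 + ['2222'] * 4 + ['3333'] * 4,
    'col2': ['A', 'B', 'B', 'B', 'C', 'A', 'B', 'C', 'C', 'A', 'A', 'B', 'C'],
    'col3': [1, 3, 3, 3, 1, 1, 1, 2, 2, 2, 2, 1, 1],
})

# Step 1: count col2 values per col1 group, keeping only columns A and C.
m = pd.crosstab(df1['col1'], df1['col2']).loc[:, ['A', 'C']]

# Step 2: boolean frame, True where a value occurs more than once in a group.
over = m.gt(1)

# Step 3: the dot product of booleans with the column names joins the names
# of the True columns in each row ('' when nothing is flagged).
joined = over.dot(m.columns)

# Step 4: turn '' into NaN so that those groups skip the '>' prefix.
labels = joined.replace('', np.nan).radd('>')

# Step 5: look up each row's group label and fill unflagged groups with 'OK'.
df1['col4'] = df1['col1'].map(labels).fillna('OK')

print(over)
print(joined)
print(df1)
```

Splitting the chain this way makes it easier to change a single step, for example the threshold in m.gt(1).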
