0

I'm trying to add a new column to a dataframe, based on a groupby of the original dataframe, and assigning one of three values based on the results of that groupby

i.e. In the df below I want to add a fourth column to show whether I have too many 'A's, too many 'C's, or just the right amount.

    col1 col2  col3
0   1111    A     1
1   1111    B     3
2   1111    B     3
3   1111    B     3
4   1111    C     1
5   2222    A     1
6   2222    B     1
7   2222    C     2
8   2222    C     2
9   3333    A     2
10  3333    A     2
11  3333    B     1
12  3333    C     1

becomes...

    col1 col2  col3 col4
0   1111    A     1   OK
1   1111    B     3   OK
2   1111    B     3   OK
3   1111    B     3   OK
4   1111    C     1   OK
5   2222    A     1   >C
6   2222    B     1   >C
7   2222    C     2   >C
8   2222    C     2   >C
9   3333    A     2   >A
10  3333    A     2   >A
11  3333    B     1   >A
12  3333    C     1   >A

I was thinking of using transform, but that returns series, and I think in this case I need to check two different columns?

Example:

d1 = {'col1': ['1111', '1111', '1111', '1111', '1111', '2222', '2222', '2222', '2222', '3333', '3333', '3333', '3333'],
      'col2': ['A', 'B', 'B', 'B', 'C', 'A', 'B', 'C', 'C', 'A', 'A', 'B', 'C'],
      'col3': [1, 3, 3, 3, 1, 1, 1, 2, 2, 2, 2, 1, 1]}
df1 = pd.DataFrame(data=d1)
d2 = {'col1': ['1111', '1111', '1111', '1111', '1111', '2222', '2222', '2222', '2222', '3333', '3333', '3333', '3333'],
      'col2': ['A', 'B', 'B', 'B', 'C', 'A', 'B', 'C', 'C', 'A', 'A', 'B', 'C'],
      'col3': [1, 3, 3, 3, 1, 1, 1, 2, 2, 2, 2, 1, 1],
      'col4': ['OK', 'OK', 'OK', 'OK', 'OK', '>C', '>C', '>C', '>C', '>A', '>A', '>A', '>A']}
df2 = pd.DataFrame(data=d2)
print(df1)
print(df2)
3
  • what happens when both A and C are more? Commented Jan 21, 2020 at 9:42
  • Why is OK value? Because 3 ? Or because 3B ? Commented Jan 21, 2020 at 9:42
  • 2
    Can you clarify how you are populating the 4th column? Commented Jan 21, 2020 at 9:42

1 Answer 1

1

From what I understand, try this :

Get counts using crosstab of col2 wrt col1 values then using .loc[] filter the values you want to consider , then using df.gt compare if count is greater than 1 and do a dot multiplication with column names , finally map it back with series.fillna

Solution:

values_to_filter = ['A','C'] #put B for testing and it will show >B for first group
m = pd.crosstab(df1['col1'],df1['col2']).loc[:,values_to_filter]
df1['col4'] = (df1['col1'].map(m.gt(1).dot(m.columns).replace('',np.nan)
                                             .radd('>')).fillna('OK'))

Output:

print(df1)

    col1 col2  col3 col4
0   1111    A     1   OK
1   1111    B     3   OK
2   1111    B     3   OK
3   1111    B     3   OK
4   1111    C     1   OK
5   2222    A     1   >C
6   2222    B     1   >C
7   2222    C     2   >C
8   2222    C     2   >C
9   3333    A     2   >A
10  3333    A     2   >A
11  3333    B     1   >A
12  3333    C     1   >A

Additional Details: Where m is a count of col2 values w.r.t col1 values:

print(m)

col2  A  C
col1      
1111  1  1
2222  1  2
3333  2  1
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.