Aggregate values by multiple columns

Question

My dataframe looks like this.

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3],
    'text': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e', 'e', 'e', 'f', 'g'],
    'out_text': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14'],
    'Rule_1': ['N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
    'Rule_2': ['Y', 'N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'N'],
    'Rule_3': ['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y']})

    ID text out_text Rule_1 Rule_2 Rule_3
0    1    a       x1      N      Y      N
1    1    a       x2      N      N      N
2    1    b       x3      N      N      N
3    1    b       x4      Y      N      N
4    2    c       x5      N      Y      N
5    2    c       x6      N      N      N
6    2    c       x7      N      N      N
7    2    d       x8      N      N      N
8    2    d       x9      N      N      N
9    2    e      x10      N      N      N
10   2    e      x11      N      N      N
11   2    e      x12      N      N      N
12   3    f      x13      Y      Y      Y
13   3    g      x14      Y      N      Y

I have to aggregate Rule_1, Rule_2, and Rule_3 such that if a combination of ID and text has a 'Y' in any of these columns, the overall result is 'Y' for that combination. In this example, 1-a and 1-b are 'Y' overall, while 2-d and 2-e are 'N'. How do I aggregate across multiple columns?

Quang Hoang · Accepted Answer · 2020-08-03 15:46:23Z

Let's try using max(axis=1) to aggregate the rules across each row, then groupby().any() to check whether any row in each (ID, text) group has a Y:

(df[['Rule_1','Rule_2','Rule_3']].eq('Y')
   .max(axis=1)
   .groupby([df['ID'],df['text']])
   .any()
)

Output:

ID  text
1   a        True
    b        True
2   c        True
    d       False
    e       False
3   f        True
    g        True
dtype: bool
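For reference, here is the same idea split into two named steps as a self-contained sketch. It uses any(axis=1) in place of max(axis=1), which should give the same result on a boolean frame; the names rule_cols, row_has_y, and group_has_y are just illustrative, and the unused out_text column is left out.

```python
import pandas as pd

# Data from the question (out_text omitted because it is not used here).
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3],
    'text': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e', 'e', 'e', 'f', 'g'],
    'Rule_1': ['N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
    'Rule_2': ['Y', 'N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'N'],
    'Rule_3': ['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
})

rule_cols = ['Rule_1', 'Rule_2', 'Rule_3']  # illustrative name

# Step 1: one boolean per row, True if any rule column equals 'Y'.
row_has_y = df[rule_cols].eq('Y').any(axis=1)

# Step 2: one boolean per (ID, text) pair, True if any row in the group is True.
group_has_y = row_has_y.groupby([df['ID'], df['text']]).any()

print(group_has_y)
```

Grouping by the two Series directly, rather than by column names, is what lets the row-wise result be grouped without first attaching it back to df.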

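Or, if you want Y/N, drop the comparison and replace any() with max(). This works because the string 'Y' sorts after 'N', so the maximum is 'Y' whenever at least one 'Y' is present: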

(df[['Rule_1','Rule_2','Rule_3']]
   .max(axis=1)
   .groupby([df['ID'],df['text']])
   .max()
)

Output:

ID  text
1   a       Y
    b       Y
2   c       Y
    d       N
    e       N
3   f       Y
    g       Y
dtype: object
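The max()-based version depends on 'Y' sorting after 'N'. If the flags were something else (say 'T'/'F', or mixed case), that ordering trick may not hold, so a possible alternative is to stay in booleans and map back to labels at the end. This is only a sketch under that assumption; has_y and labels are illustrative names.

```python
import pandas as pd

# Data from the question (out_text omitted because it is not used here).
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3],
    'text': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e', 'e', 'e', 'f', 'g'],
    'Rule_1': ['N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
    'Rule_2': ['Y', 'N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'N'],
    'Rule_3': ['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
})

rule_cols = ['Rule_1', 'Rule_2', 'Rule_3']

# Boolean per (ID, text): True if any rule in any row of the group is 'Y'.
has_y = (df[rule_cols].eq('Y')
         .any(axis=1)
         .groupby([df['ID'], df['text']])
         .any())

# Convert the booleans back to 'Y'/'N' labels without relying on string ordering.
labels = has_y.map({True: 'Y', False: 'N'})

print(labels)
```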

4 Comments

Dhruv Over a year ago

Can you help me aggregate on top of this? For instance, if I want to count the number of 'Y's for every ID: it would be 2 for ID 1, 1 for ID 2, and 2 for ID 3.

Quang Hoang Over a year ago

Use the first approach and chain with sum(level='ID'). Or chain method 2 with .eq('Y').sum(level='ID').

Dhruv Over a year ago

This works perfectly with the test data that I shared, but for some reason when I sum this on the actual data, I get True and False instead of the counts. Any idea why that might be happening?

Quang Hoang Over a year ago

Try converting the booleans to float before summing, e.g. .eq('Y').astype(float).sum(level='ID').
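A note for anyone on a newer pandas: as far as I know, the level= argument to Series.sum was deprecated in pandas 1.3 and removed in 2.0, so the snippets in these comments may fail there. A rough equivalent is groupby(level='ID').sum(). The sketch below counts the (ID, text) combinations flagged 'Y' for each ID, casting to int first in the spirit of the astype suggestion above; has_y and counts are illustrative names.

```python
import pandas as pd

# Data from the question (out_text omitted because it is not used here).
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3],
    'text': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e', 'e', 'e', 'f', 'g'],
    'Rule_1': ['N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
    'Rule_2': ['Y', 'N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'N'],
    'Rule_3': ['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
})

rule_cols = ['Rule_1', 'Rule_2', 'Rule_3']

# Boolean per (ID, text), as in the first approach of the answer.
has_y = (df[rule_cols].eq('Y')
         .any(axis=1)
         .groupby([df['ID'], df['text']])
         .any())

# Cast to int so the sum is a count, then aggregate over the 'ID' index level.
counts = has_y.astype(int).groupby(level='ID').sum()

print(counts)
```

With the sample data this should give 2 for ID 1, 1 for ID 2, and 2 for ID 3, matching the counts asked for above.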
