My dataframe looks like this.

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3],
    'text': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e', 'e', 'e', 'f', 'g'],
    'out_text': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14'],
    'Rule_1': ['N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
    'Rule_2': ['Y', 'N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'N'],
    'Rule_3': ['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y']})
    ID text out_text Rule_1 Rule_2 Rule_3
0    1    a       x1      N      Y      N
1    1    a       x2      N      N      N
2    1    b       x3      N      N      N
3    1    b       x4      Y      N      N
4    2    c       x5      N      Y      N
5    2    c       x6      N      N      N
6    2    c       x7      N      N      N
7    2    d       x8      N      N      N
8    2    d       x9      N      N      N
9    2    e      x10      N      N      N
10   2    e      x11      N      N      N
11   2    e      x12      N      N      N
12   3    f      x13      Y      Y      Y
13   3    g      x14      Y      N      Y

I have to aggregate Rule_1, Rule_2 and Rule_3 such that if a combination of ID and text has a 'Y' in any of these columns, the overall result is 'Y' for that combination. In our example, 1-a and 1-b are 'Y' overall, while 2-d and 2-e are 'N'. How do I aggregate multiple columns?

1 Answer

Let's try using max(axis=1) to aggregate the rules across each row, then groupby().any() to check whether any row in each (ID, text) group has a Y:

(df[['Rule_1','Rule_2','Rule_3']].eq('Y')
   .max(axis=1)
   .groupby([df['ID'],df['text']])
   .any()
)

Output:

ID  text
1   a        True
    b        True
2   c        True
    d       False
    e       False
3   f        True
    g        True
dtype: bool
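
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It swaps max(axis=1) for any(axis=1), which should give the same result on boolean data, and turns the result into a regular DataFrame. The names rule_cols, row_flag and the column name 'overall' are my own choices, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3],
    'text': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e', 'e', 'e', 'f', 'g'],
    'Rule_1': ['N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
    'Rule_2': ['Y', 'N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'N'],
    'Rule_3': ['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
})

# Hypothetical helper name: the columns that hold the Y/N rules.
rule_cols = ['Rule_1', 'Rule_2', 'Rule_3']

# True for each row where at least one rule column equals 'Y'.
row_flag = df[rule_cols].eq('Y').any(axis=1)

# True for each (ID, text) pair where at least one row is flagged,
# returned as a flat DataFrame with an 'overall' column.
overall = (
    row_flag.groupby([df['ID'], df['text']])
            .any()
            .rename('overall')
            .reset_index()
)
print(overall)
```

The reset_index() at the end is only there so the result can be merged back onto df by ID and text if needed.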

Or, if you want Y/N, replace both the max and any steps with max and drop the comparison. This works because 'Y' sorts after 'N' as a string:

(df[['Rule_1','Rule_2','Rule_3']]
   .max(axis=1)
   .groupby([df['ID'],df['text']])
   .max()
)

Output:

ID  text
1   a       Y
    b       Y
2   c       Y
    d       N
    e       N
3   f       Y
    g       Y
dtype: object
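
The string max trick depends on the columns holding only 'Y' and 'N'. If the real data might contain other values (blanks, lowercase, NaN), a variant that does not rely on string ordering may be safer. This is only a sketch under that assumption: it computes the boolean result first and maps it to Y/N afterwards. The names rule_cols and overall_yn are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3],
    'text': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e', 'e', 'e', 'f', 'g'],
    'Rule_1': ['N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
    'Rule_2': ['Y', 'N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'N'],
    'Rule_3': ['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
})

rule_cols = ['Rule_1', 'Rule_2', 'Rule_3']  # hypothetical helper name

overall_yn = (
    df[rule_cols].eq('Y')                   # compare explicitly against 'Y'
      .any(axis=1)                          # any rule hit in the row
      .groupby([df['ID'], df['text']])
      .any()                                # any row hit in the group
      .map({True: 'Y', False: 'N'})         # back to Y/N labels
)
print(overall_yn)
```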

4 Comments

Can you help me aggregate on this further? For instance, I want to count the number of 'Y's for every ID. It would be 2 for ID 1, 1 for ID 2, and 2 for ID 3.
Use the first approach and chain with sum(level='ID'). Or chain method 2 with .eq('Y').sum(level='ID').
This works perfectly with the test data that I had shared, but for some reason when I sum this on the actual data, I get True and False instead of the counts. Any idea why that might be happening?
Try converting the boolean to float before summing, e.g. .eq('Y').astype(float).sum(level='ID').
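
A note for readers on newer pandas: as far as I know, the level= argument of sum() was deprecated in pandas 1.3 and removed in 2.0, so sum(level='ID') may raise an error there. A rough equivalent is to group on the index level instead. The sketch below assumes the boolean result from the first approach; the names flags and counts are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3],
    'text': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e', 'e', 'e', 'f', 'g'],
    'Rule_1': ['N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
    'Rule_2': ['Y', 'N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'N'],
    'Rule_3': ['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
})

# Boolean flag per (ID, text) pair, as in the first approach.
flags = (
    df[['Rule_1', 'Rule_2', 'Rule_3']].eq('Y')
      .any(axis=1)
      .groupby([df['ID'], df['text']])
      .any()
)

# Count the flagged (ID, text) pairs per ID by grouping on the index level.
# astype(int) makes the sum an explicit integer count rather than a boolean.
counts = flags.astype(int).groupby(level='ID').sum()
print(counts)
```

On the sample data this should give 2 for ID 1, 1 for ID 2 and 2 for ID 3, matching the counts asked for in the comment.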
