4

I am trying to develop the following filter with a pandas dataframe:

  • I have four columns, A, B, A_prime and B_prime;
  • if A or B is less than a threshold, C, then I would like to find the sum between A_prime and B_prime and allocate it to the maximum of A_prime and B_prime, while setting the minimum of A_prime and B_prime to zero.

How would I write this as a pandas aggregation function?

A working example, written inefficiently, is shown below:

import pandas as pd
import numpy as np

data = {
    "A":list(np.abs(np.random.randn(10))),
    "B":list(np.abs(np.random.randn(10))),
    "A_prime":list(np.abs(np.random.randn(10))),
    "B_prime":list(np.abs(np.random.randn(10)))
    
}

df = pd.DataFrame.from_dict(data)
C = 0.2

print("BEFORE:")
print(df)


for index, row in df.iterrows():
    if(row["A"] < C or row["B"] < C):
        max_idx = np.argmax([row["A"], row["B"]])
        if(max_idx==0):
            row["A_prime"] = row["A_prime"] + row["B_prime"]
            row["B_prime"] = 0
        else:
            row["B_prime"] = row["A_prime"] + row["B_prime"]
            row["A_prime"] = 0
    
print("")
print("AFTER:")
print(df)

Output:

BEFORE:
          A         B   A_prime   B_prime
0  0.182445  0.924890  1.563398  0.562325
1  0.252587  0.273637  0.515395  0.538876
2  1.369412  1.985702  1.813962  1.643794
3  0.834666  0.143880  0.860673  0.372468
4  1.380012  0.715774  0.022681  0.892717
5  0.582497  0.477100  0.956821  1.134613
6  0.083045  0.322060  0.362513  1.386124
7  1.384267  0.251577  0.639843  0.458650
8  0.375456  0.412320  0.661661  0.086588
9  0.079226  0.385621  0.601451  0.837827

AFTER:
          A         B   A_prime   B_prime
0  0.182445  0.924890  0.000000  2.125723
1  0.252587  0.273637  0.515395  0.538876
2  1.369412  1.985702  1.813962  1.643794
3  0.834666  0.143880  1.233141  0.000000
4  1.380012  0.715774  0.022681  0.892717
5  0.582497  0.477100  0.956821  1.134613
6  0.083045  0.322060  0.000000  1.748638
7  1.384267  0.251577  0.639843  0.458650
8  0.375456  0.412320  0.661661  0.086588
9  0.079226  0.385621  0.000000  1.439278
3
  • Should't the first line in your example allocate the max to A_prime instead of B_prime (as it is the greater number)? Commented Jun 7, 2021 at 9:53
  • 2
    @nocibambi From their code, I believe check is over A and B rather than primes to determine which *_prime to put the sum. Commented Jun 7, 2021 at 9:59
  • 1
    @MustafaAydın Yeah, I read the description, not the code. :) I suppose we wait and find out what the intent was. Commented Jun 7, 2021 at 10:15

4 Answers 4

2

here is one way:

prime_cols = ["A_prime", "B_prime"]

# get the candidate sums
prime_sums = df[prime_cols].sum(axis=1)

# check which rows satisfy the `C` threshold
threshold_satisfied = df.A.lt(C) | df.B.lt(C)

# set the satisfying rows' values to sums for both columns
df.loc[threshold_satisfied, prime_cols] = prime_sums

# generate a 1-0 mask that will multiply the greater value by 1 and
# smaller value by 0 to "select" one of them and kill other
mask_A_side = df.A.gt(df.B)
the_mask = pd.concat([mask_A_side, ~mask_A_side], axis=1).set_axis(prime_cols, axis=1)

# multiply with the mask
df.loc[threshold_satisfied, prime_cols] *= the_mask

which first puts the sum of prime columns to both of the columns where threshold condition is satisfied, and then kills one of them with a 1-0 mask multiplication.

to get

>>> df

          A         B   A_prime   B_prime
0  0.182445  0.924890  0.000000  2.125723
1  0.252587  0.273637  0.515395  0.538876
2  1.369412  1.985702  1.813962  1.643794
3  0.834666  0.143880  1.233141  0.000000
4  1.380012  0.715774  0.022681  0.892717
5  0.582497  0.477100  0.956821  1.134613
6  0.083045  0.322060  0.000000  1.748637
7  1.384267  0.251577  0.639843  0.458650
8  0.375456  0.412320  0.661661  0.086588
9  0.079226  0.385621  0.000000  1.439278
Sign up to request clarification or add additional context in comments.

Comments

1

The most pandified way of doing this would be with an apply(), as the example below:

import pandas as pd
import numpy as np

data = {
    "A":list(np.abs(np.random.randn(10))),
    "B":list(np.abs(np.random.randn(10))),
    "A_prime":list(np.abs(np.random.randn(10))),
    "B_prime":list(np.abs(np.random.randn(10)))
    
}

df = pd.DataFrame.from_dict(data)
C = 0.2

def A_B_prime(row):
    A_prime_val = row["A_prime"]
    B_prime_val = row["B_prime"]
    if(row["A"] < C or row["B"] < C):
        max_idx = np.argmax([row["A"], row["B"]])
        if(max_idx==0):
            A_prime_val = row["A_prime"] + row["B_prime"]
            B_prime_val = 0
        else:
            B_prime_val = row["A_prime"] + row["B_prime"]
            A_prime_val = 0
    return A_prime_val, B_prime_val

df['A_prime'], df['B_prime'] = zip(*df.apply(A_B_prime, axis=1))

You can find some good insights on how to return multiple columns from a single apply() on this thread.

Comments

1

You can rewrite values based on a conditional map and apply.

# Columns to check the min/max on:
check_max_cols = ["A_prime", "B_prime"]


def allocate_sum(row):
    # Identify max and min values
    max_col = row[check_max_cols].idxmax(axis=1)
    min_col = check_max_cols[1] if max_col == check_max_cols[0] else check_max_cols[0]

    row[max_col] = row[["A_prime", "B_prime"]].sum()
    row[min_col] = 0
    return row


below_threshold = (df[["A", "B"]] < C).any(axis=1)

df.loc[below_threshold, :] = df.loc[below_threshold, :].apply(allocate_sum, axis=1)

1 Comment

This is what I was looking for. Thank you!
0

I don't know if this can be done with aggregation. I am sharing something that has worked for me. See if it works for you too.

Firstly, we can generate the masks

#Get mask of all the rows where A_Prime > B_Prime and either of A or B less than C(0.2)
mask_val_a = ((dataset['A'] < C) | (dataset['B'] < C)) & \
             (dataset['A_Prime'] > dataset['B_Prime']) 

#Get mask of all the rows where A_Prime < B_Prime and either of A or B less than C(0.2)
mask_val_b = ((dataset['A'] < C) | (dataset['B'] < C)) & (dataset['A_Prime'] < dataset['B_Prime'])

Once we have masks values, we can just use pandas mask to update the values

#Firstly we will update the values where A_Prime > B_Prime
dataset['A_Prime'] = dataset['A_Prime'].mask(mask_val_a, dataset[['A_Prime', 'B_Prime']].sum(axis = 1))
dataset['B_Prime'] = dataset['B_Prime'].mask(mask_val_a, 0)

#Then we will update the values where B_Prime > A_Prime
dataset['B_Prime'] = dataset['B_Prime'].mask(mask_val_b, dataset[['A_Prime', 'B_Prime']].sum(axis = 1))
dataset['A_Prime'] = dataset['A_Prime'].mask(mask_val_b, 0)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.