multiply and summing certain columns based on name pandas python

Question

I have a small sample data set:

import pandas as pd
d = {
  'measure1_x': [10,12,20,30,21],
  'measure2_x':[11,12,10,3,3],
  'measure3_x':[10,0,12,1,1],
  'measure1_y': [1,2,2,3,1],
  'measure2_y':[1,1,1,3,3],
  'measure3_y':[1,0,2,1,1]
}
df = pd.DataFrame(d)
# reindex_axis was removed in pandas 1.0; reindex(..., axis=1) is the replacement
df = df.reindex([
    'measure1_x','measure2_x', 'measure3_x','measure1_y','measure2_y','measure3_y'
], axis=1)

it looks like:

      measure1_x  measure2_x  measure3_x  measure1_y  measure2_y  measure3_y
          10          11          10           1           1           1
          12          12           0           2           1           0
          20          10          12           2           1           2
          30           3           1           3           3           1
          21           3           1           1           3           1

I named the columns identically except for the '_x' and '_y' suffixes, so that the suffix-free name identifies which pairs should be multiplied. I want to multiply each pair of columns that share the same name once '_x' and '_y' are disregarded, then sum the products to get a total. Keep in mind that my actual data set is huge and its columns are not in this perfect order, so the naming is the only way to identify the correct pairs to multiply:

total = measure1_x * measure1_y + measure2_x * measure2_y + measure3_x * measure3_y

so desired output:

measure1_x  measure2_x  measure3_x  measure1_y  measure2_y  measure3_y   total

 10          11          10           1           1           1           31 
 12          12           0           2           1           0           36 
 20          10          12           2           1           2           74
 30           3           1           3           3           1          100
 21           3           1           1           3           1           31

My attempt and thought process so far; I cannot get any further syntax-wise:

#first identify the column names that has '_x' and '_y', then identify if 
#the column names are the same after removing '_x' and '_y', if the pair has 
#the same name then multiply them, do that for all pairs and sum the results 
#up to get the total number

for colname in df.columns:
    if "_x".lower() in colname.lower() or "_y".lower() in colname.lower():
        if "_x".lower() in colname.lower():
            colnamex = colname
        if "_y".lower() in colname.lower():
            colnamey = colname

        #if colnamex[:-2] are the same for colnamex and colnamey then multiply and sum

cs95 · Accepted Answer · 2018-05-16 18:28:30Z

3

`filter` + `np.einsum`

Thought I'd try something a little different this time:

get your _x and _y columns separately
do a product-sum. This is very easy to specify with einsum (and fast).

import numpy as np

df = df.sort_index(axis=1) # optional, do this if your columns aren't sorted

i = df.filter(like='_x') 
j = df.filter(like='_y')
df['Total'] = np.einsum('ij,ij->i', i, j) # (i.values * j).sum(axis=1)

df
   measure1_x  measure2_x  measure3_x  measure1_y  measure2_y  measure3_y  Total
0          10          11          10           1           1           1     31
1          12          12           0           2           1           0     36
2          20          10          12           2           1           2     74
3          30           3           1           3           3           1    100
4          21           3           1           1           3           1     31
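The inline comment above hints that the `einsum` call is equivalent to an element-wise product followed by a row sum. The following is a minimal sketch of that equivalence on the question's sample data, not part of the original answer; the names `x`, `y`, `via_einsum`, and `via_product` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'measure1_x': [10, 12, 20, 30, 21],
    'measure2_x': [11, 12, 10, 3, 3],
    'measure3_x': [10, 0, 12, 1, 1],
    'measure1_y': [1, 2, 2, 3, 1],
    'measure2_y': [1, 1, 1, 3, 3],
    'measure3_y': [1, 0, 2, 1, 1],
})

# Sort columns so the k-th "_x" column lines up with the k-th "_y" column.
df = df.sort_index(axis=1)
x = df.filter(like='_x').to_numpy()
y = df.filter(like='_y').to_numpy()

# 'ij,ij->i': multiply matching cells, then sum over the column index j.
via_einsum = np.einsum('ij,ij->i', x, y)

# The same computation spelled out in two steps.
via_product = (x * y).sum(axis=1)

print(via_einsum)
print(via_product)
```

Both approaches pair columns purely by position, which is why the sort step matters when the columns arrive in arbitrary order.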

A slightly more robust version, which filters out non-numeric columns and performs an assertion beforehand:

df = df.sort_index(axis=1).select_dtypes(exclude=[object])
i = df.filter(regex='.*_x') 
j = df.filter(regex='.*_y')

assert i.shape == j.shape

df['Total'] = np.einsum('ij,ij->i', i, j)

If the assertion fails, then the assumptions that 1) your columns are numeric and 2) the number of x columns equals the number of y columns, as your question suggests, do not hold for your actual dataset.
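If relying on sorted column order feels fragile, the pairing can also be made explicit by name. This is a sketch under my own assumptions rather than the answer's method: `strip_suffix` is a hypothetical helper, and it assumes every column of interest ends in exactly `_x` or `_y`. It leans on the fact that pandas aligns DataFrame arithmetic on column labels.

```python
import pandas as pd

# Same sample data, with the column order deliberately shuffled.
df = pd.DataFrame({
    'measure2_y': [1, 1, 1, 3, 3],
    'measure1_x': [10, 12, 20, 30, 21],
    'measure3_x': [10, 0, 12, 1, 1],
    'measure1_y': [1, 2, 2, 3, 1],
    'measure3_y': [1, 0, 2, 1, 1],
    'measure2_x': [11, 12, 10, 3, 3],
})

def strip_suffix(frame, suffix):
    """Hypothetical helper: keep columns ending in `suffix` and drop the suffix."""
    cols = [c for c in frame.columns if c.endswith(suffix)]
    return frame[cols].rename(columns=lambda c: c[:-len(suffix)])

x = strip_suffix(df, '_x')   # columns: measure1, measure3, measure2
y = strip_suffix(df, '_y')   # columns: measure2, measure1, measure3

# Fail early if some measure has no partner.
assert set(x.columns) == set(y.columns)

# Multiplication aligns on column labels, so position no longer matters.
total = (x * y).sum(axis=1)
print(total.tolist())
```

The trade-off is that this goes through pandas alignment instead of a single `einsum` call, so it may be slower on very wide frames.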

edited May 16, 2018 at 18:28

answered May 16, 2018 at 18:07

cs95


12 Comments

piRSquared Over a year ago

I was going to force a dot product somewhere to be cool. Now I don't have to because I've been out-cooled with einsum (-:

cs95 Over a year ago

@piRSquared I was trying to force one too, until I remembered how I'd been outcooled that day like this XD

filippo Over a year ago

einsum is always super cool, but doesn't this assume ordered columns?

cs95 Over a year ago

@filippo Certainly the filter makes assumptions on the nature of the column names. But so does piR's answer. By the way, I've already added an optional df.sort_index(axis=1) step above in case it was needed.

filippo Over a year ago

uh completely missed sort_index!


piRSquared · Accepted Answer · 2018-05-16 18:14:11Z

3

Use df.columns.str.split to generate a new MultiIndex
Use prod with axis and level arguments
Use sum with axis argument
Use assign to create new column

df.assign(
    Total=df.set_axis(
        df.columns.str.split('_', expand=True),
        axis=1, inplace=False
    ).prod(axis=1, level=0).sum(1)
)

   measure1_x  measure2_x  measure3_x  measure1_y  measure2_y  measure3_y  Total
0          10          11          10           1           1           1     31
1          12          12           0           2           1           0     36
2          20          10          12           2           1           2     74
3          30           3           1           3           3           1    100
4          21           3           1           1           3           1     31
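On recent pandas releases (2.0 and later, to the best of my knowledge) the `inplace` argument of `set_axis` and the `level` argument of `prod` are no longer available, so the snippet above may raise a `TypeError`. Below is a sketch of the same idea using a transpose and `groupby(level=0)` instead; the names `wide`, `per_measure`, and `result` are mine, not the answer's.

```python
import pandas as pd

df = pd.DataFrame({
    'measure1_x': [10, 12, 20, 30, 21],
    'measure2_x': [11, 12, 10, 3, 3],
    'measure3_x': [10, 0, 12, 1, 1],
    'measure1_y': [1, 2, 2, 3, 1],
    'measure2_y': [1, 1, 1, 3, 3],
    'measure3_y': [1, 0, 2, 1, 1],
})

# Split 'measure1_x' into ('measure1', 'x') to get a two-level column index.
wide = df.set_axis(df.columns.str.split('_', expand=True), axis=1)

# Transpose so measures sit on the rows, multiply within each measure,
# then transpose back to one column per measure.
per_measure = wide.T.groupby(level=0).prod().T

# Sum the per-measure products across each row and attach the result.
result = df.assign(Total=per_measure.sum(axis=1))
print(result)
```

Because the grouping is done on the first level of the split name, the original column order does not affect which pairs are multiplied.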

Restrict the dataframe to just the columns that look like `'measure[i]_[j]'`

df.assign(
    Total=df.filter(regex=r'^measure\d+_\w+$').pipe(
        lambda d: d.set_axis(
            d.columns.str.split('_', expand=True),
            axis=1, inplace=False
        )
    ).prod(axis=1, level=0).sum(1)
)

Debugging

See if this gets you the correct Totals

d_ = df.copy()
d_.columns = d_.columns.str.split('_', expand=True)

d_.prod(axis=1, level=0).sum(1)

0     31
1     36
2     74
3    100
4     31
dtype: int64

edited May 16, 2018 at 18:14

answered May 16, 2018 at 17:59

piRSquared


10 Comments

Jessica Over a year ago

i tried with my actual dataset which is larger and got: TypeError: set_axis() got multiple values for argument 'axis'

piRSquared Over a year ago

Are you accidentally using axis twice in the set_axis call?

Jessica Over a year ago

my actual dataset columns are not in perfect order, sorry if i do not understand your code fully, but where do you identify the correct pair to multiply based on name? or is it not needed?

Jessica Over a year ago

"Are you accidentally using axis twice in the set_axis call?" No, I copied and pasted your exact code.

piRSquared Over a year ago

Try my updated suggestion. Are your columns not restricted to just those that look like measurei_j?
