
I have a small sample data set:

import pandas as pd
d = {
  'measure1_x': [10,12,20,30,21],
  'measure2_x':[11,12,10,3,3],
  'measure3_x':[10,0,12,1,1],
  'measure1_y': [1,2,2,3,1],
  'measure2_y':[1,1,1,3,3],
  'measure3_y':[1,0,2,1,1]
}
df = pd.DataFrame(d)
# reindex_axis was removed from pandas; reindex(columns=...) gives the same column order
df = df.reindex(columns=[
    'measure1_x','measure2_x', 'measure3_x','measure1_y','measure2_y','measure3_y'
])

It looks like:

      measure1_x  measure2_x  measure3_x  measure1_y  measure2_y  measure3_y
          10          11          10           1           1           1
          12          12           0           2           1           0
          20          10          12           2           1           2
          30           3           1           3           3           1
          21           3           1           1           3           1

I named the columns identically except for the '_x' and '_y' suffixes, to identify which pairs should be multiplied. I want to multiply each pair of columns that share a name once '_x' and '_y' are disregarded, then sum the products to get a total. Keep in mind that my actual data set is huge and its columns are not in this perfect order, so the naming is my way of identifying the correct pairs to multiply:

total = measure1_x * measure1_y + measure2_x * measure2_y + measure3_x * measure3_y

So the desired output is:

measure1_x  measure2_x  measure3_x  measure1_y  measure2_y  measure3_y  total
        10          11          10           1           1           1     31
        12          12           0           2           1           0     36
        20          10          12           2           1           2     74
        30           3           1           3           3           1    100
        21           3           1           1           3           1     31

Here is my attempt and thought process; I am stuck on the syntax and cannot get any further:

#first identify the column names that has '_x' and '_y', then identify if 
#the column names are the same after removing '_x' and '_y', if the pair has 
#the same name then multiply them, do that for all pairs and sum the results 
#up to get the total number

for colname in df.columns:
    if "_x" in colname.lower() or "_y" in colname.lower():
        if "_x" in colname.lower():
            colnamex = colname
        if "_y" in colname.lower():
            colnamey = colname

        # if colnamex[:-2] is the same for colnamex and colnamey then multiply and sum

2 Answers

filter + np.einsum

Thought I'd try something a little different this time—

  • get your _x and _y columns separately
  • do a product-sum. This is very easy to specify with einsum (and fast).

import numpy as np

df = df.sort_index(axis=1) # optional, do this if your columns aren't sorted

i = df.filter(like='_x') 
j = df.filter(like='_y')
df['Total'] = np.einsum('ij,ij->i', i, j) # same as (i.values * j.values).sum(axis=1)

df
   measure1_x  measure2_x  measure3_x  measure1_y  measure2_y  measure3_y  Total
0          10          11          10           1           1           1     31
1          12          12           0           2           1           0     36
2          20          10          12           2           1           2     74
3          30           3           1           3           3           1    100
4          21           3           1           1           3           1     31
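
To make the 'ij,ij->i' subscripts concrete, here is a minimal sketch on the question's sample data. It assumes every _x column has a matching _y column, so that sorting the column names lines the pairs up. The names x, y, via_einsum and via_multiply are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'measure1_x': [10, 12, 20, 30, 21],
    'measure2_x': [11, 12, 10, 3, 3],
    'measure3_x': [10, 0, 12, 1, 1],
    'measure1_y': [1, 2, 2, 3, 1],
    'measure2_y': [1, 1, 1, 3, 3],
    'measure3_y': [1, 0, 2, 1, 1],
})

# Sort so the k-th _x column lines up with the k-th _y column.
df = df.sort_index(axis=1)
x = df.filter(like='_x').to_numpy()
y = df.filter(like='_y').to_numpy()

# 'ij,ij->i': multiply element-wise, then sum over j (columns) for each row i.
via_einsum = np.einsum('ij,ij->i', x, y)

# The same computation written as an explicit multiply and row sum.
via_multiply = (x * y).sum(axis=1)

print(via_einsum.tolist())  # [31, 36, 74, 100, 31]
```

Both expressions compute a row-wise dot product. The einsum form avoids materialising the intermediate product array, which may matter on a large frame.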

A slightly more robust version which filters out non-numeric columns and performs an assertion beforehand—

df = df.sort_index(axis=1).select_dtypes(exclude=[object])
i = df.filter(regex='.*_x') 
j = df.filter(regex='.*_y')

assert i.shape == j.shape

df['Total'] = np.einsum('ij,ij->i', i, j)

If the assertion fails, the assumptions that 1) your columns are numeric and 2) the number of x and y columns is equal, as your question would suggest, do not hold for your actual dataset.
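
If relying on sort order feels fragile, one possible variation is to build the list of _y columns directly from the _x names, so each pair is matched by name regardless of column order. This is only a sketch: paired_total is a hypothetical helper, not part of pandas, and it assumes all column labels are strings ending in exactly '_x' or '_y'.

```python
import numpy as np
import pandas as pd

def paired_total(df, x_suffix='_x', y_suffix='_y'):
    """Row-wise sum of products over columns paired by base name (hypothetical helper)."""
    x_cols = [c for c in df.columns if c.endswith(x_suffix)]
    # Build the partner name for each _x column: 'measure1_x' -> 'measure1_y'.
    y_cols = [c[:-len(x_suffix)] + y_suffix for c in x_cols]
    missing = [c for c in y_cols if c not in df.columns]
    if missing:
        raise KeyError(f'no matching columns for: {missing}')
    x = df[x_cols].to_numpy()
    y = df[y_cols].to_numpy()
    return np.einsum('ij,ij->i', x, y)

# Columns deliberately shuffled, plus an unrelated text column.
df = pd.DataFrame({
    'measure2_y': [1, 1, 1, 3, 3],
    'label': list('abcde'),
    'measure1_x': [10, 12, 20, 30, 21],
    'measure3_y': [1, 0, 2, 1, 1],
    'measure2_x': [11, 12, 10, 3, 3],
    'measure1_y': [1, 2, 2, 3, 1],
    'measure3_x': [10, 0, 12, 1, 1],
})
df['Total'] = paired_total(df)
print(df['Total'].tolist())  # [31, 36, 74, 100, 31]
```

Note that this sketch only checks that every _x column has a _y partner. A _y column with no _x partner is silently ignored.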


12 Comments

I was going to force a dot product somewhere to be cool. Now I don't have to because I've been out-cooled with einsum (-:
@piRSquared I was trying to force one too, until I remembered how I'd been outcooled that day like this XD
einsum is always super cool, but doesn't this assume ordered columns?
@filippo Certainly the filter makes assumptions on the nature of the column names. But so does piR's answer. By the way, I've already added an optional df.sort_index(axis=1) step above in case it was needed.
uh completely missed sort_index!
  • Use df.columns.str.split to generate a new MultiIndex
  • Use prod with axis and level arguments
  • Use sum with axis argument
  • Use assign to create new column

df.assign(
    Total=df.set_axis(
        df.columns.str.split('_', expand=True),
        axis=1, inplace=False
    ).prod(axis=1, level=0).sum(1)
)

   measure1_x  measure2_x  measure3_x  measure1_y  measure2_y  measure3_y  Total
0          10          11          10           1           1           1     31
1          12          12           0           2           1           0     36
2          20          10          12           2           1           2     74
3          30           3           1           3           3           1    100
4          21           3           1           1           3           1     31
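
As far as I know, pandas 2.0 removed both the inplace argument of set_axis and the level argument of prod, so the snippet above may raise a TypeError on recent versions. A rough equivalent, assuming a recent pandas and the question's sample data, is to transpose and group on the first level of the split column names. The names wide and per_measure are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'measure1_x': [10, 12, 20, 30, 21],
    'measure2_x': [11, 12, 10, 3, 3],
    'measure3_x': [10, 0, 12, 1, 1],
    'measure1_y': [1, 2, 2, 3, 1],
    'measure2_y': [1, 1, 1, 3, 3],
    'measure3_y': [1, 0, 2, 1, 1],
})

# Split 'measure1_x' into ('measure1', 'x') to build a two-level column index.
wide = df.set_axis(df.columns.str.split('_', expand=True), axis=1)

# Transpose so the measure names sit on the row index, multiply within each
# measure group, then transpose back: one product column per measure.
per_measure = wide.T.groupby(level=0).prod().T

# Sum the per-measure products across each row and attach the result.
result = df.assign(Total=per_measure.sum(axis=1))
print(result['Total'].tolist())  # [31, 36, 74, 100, 31]
```

The pairing by name happens in the groupby: columns that share the same first-level label ('measure1', 'measure2', ...) are multiplied together, so the original column order does not matter.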

Restrict the dataframe to just the columns that look like 'measure[i]_[j]'

df.assign(
    Total=df.filter(regex=r'^measure\d+_\w+$').pipe(
        lambda d: d.set_axis(
            d.columns.str.split('_', expand=True),
            axis=1, inplace=False
        )
    ).prod(axis=1, level=0).sum(1)
)

Debugging

See if this gets you the correct Totals

d_ = df.copy()
d_.columns = d_.columns.str.split('_', expand=True)

d_.prod(axis=1, level=0).sum(1)

0     31
1     36
2     74
3    100
4     31
dtype: int64
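
If the totals still look wrong on a larger frame, it may help to check first whether every base name really has both suffixes. The following is a small sketch for that check. unpaired_measures is a hypothetical helper, and it assumes string column labels where the suffix is whatever follows the last underscore.

```python
import pandas as pd

def unpaired_measures(columns, suffixes=('x', 'y')):
    """Map each base name to the expected suffixes it lacks (hypothetical helper)."""
    seen = {}
    for col in columns:
        if '_' not in col:
            continue  # columns without an underscore are not part of any pair
        # rsplit on the last underscore so base names may themselves contain '_'.
        base, suffix = col.rsplit('_', 1)
        seen.setdefault(base, set()).add(suffix)
    return {base: sorted(set(suffixes) - found)
            for base, found in seen.items()
            if set(suffixes) - found}

cols = pd.Index(['measure1_x', 'measure1_y', 'measure2_x', 'id', 'measure3_y'])
print(unpaired_measures(cols))  # {'measure2': ['y'], 'measure3': ['x']}
```

An empty dictionary means every base name has both an x and a y column. Anything else lists the base names that would end up multiplied on their own or dropped.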

10 Comments

i tried with my actual dataset which is larger and got: TypeError: set_axis() got multiple values for argument 'axis'
Are you accidentally using axis twice in the set_axis call?
my actual dataset columns are not in perfect order, sorry if i do not understand your code fully, but where do you identify the correct pair to multiply based on name? or is it not needed?
Are you accidentally using axis twice in the set_axis call? : no i copied and pasted your exact code
Try my updated suggestion. Are your columns not restricted to just those that look like measurei_j?