I have a dataframe df. Each row has some columns I want to subtract from (columns_to_sub), a tag column called 'absorb', and some columns that I don't want to change. From the columns_to_sub values in each row, I want to subtract the row of another dataframe that is indexed by the tag 'absorb'. Here is a non-functional example of what I want:

import pandas as pd
import numpy as np
data = np.hstack((np.random.randint(0,10,20).reshape(-1,1),np.random.rand(20,3)))
df = pd.DataFrame(data,columns=['absorb','a','b','c'])
columns_to_sub = ['a','b']

means = df.groupby('absorb')[columns_to_sub].mean()
#This result is not what I want, because the subtraction is strange
df[columns_to_sub] = df[columns_to_sub] - means.loc[df.absorb,columns_to_sub]

How do I fix this code?

2 Answers

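You were so close. Just use .values on the result of the means lookup. means.loc[df.absorb, columns_to_sub] is labelled by the 'absorb' values rather than by the row labels of df, so pandas aligns the subtraction on the wrong labels; .values drops the labels and makes the subtraction positional.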

df[columns_to_sub] = df[columns_to_sub] - means.loc[df.absorb,columns_to_sub].values
>>> df
    absorb         a         b         c
0        2 -0.060540 -0.270233  0.416213
1        9  0.597084  0.136158  0.415023
2        1 -0.131393 -0.535288  0.158465
3        3  0.282902 -0.008801  0.872598
4        9 -0.236306 -0.337588  0.297589
5        6  0.000000  0.000000  0.283559
6        3  0.022021 -0.110693  0.671295
7        7  0.042000 -0.327157  0.736395
8        1  0.097912  0.119899  0.409241
9        1 -0.460052  0.280302  0.341200
10       1  0.002855 -0.013902  0.648113
11       1  0.490679  0.148989  0.626300
12       8  0.000000  0.000000  0.986039
13       3 -0.304923  0.119494  0.553210
14       0  0.000000  0.000000  0.626576
15       5  0.000000  0.000000  0.105102
16       2 -0.166760 -0.122624  0.750912
17       2  0.227300  0.392857  0.498822
18       7 -0.042000  0.327157  0.323361
19       9 -0.360778  0.201430  0.521043
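
To see why .values matters, here is a minimal sketch on a small hand-made frame. The data and the names looked_up, demeaned and via_transform are made up for illustration and are not from the question. The groupby(...).transform('mean') line is an alternative that is not part of this answer; it is included only as a cross-check.

```python
import pandas as pd

# Small deterministic stand-in for the random frame in the question.
df = pd.DataFrame({
    'absorb': [1, 2, 1, 2],
    'a': [1.0, 2.0, 3.0, 6.0],
    'b': [0.0, 4.0, 2.0, 8.0],
    'c': [5.0, 5.0, 5.0, 5.0],
})
columns_to_sub = ['a', 'b']

# One row of means per tag; the index of `means` holds the tag values.
means = df.groupby('absorb')[columns_to_sub].mean()

# The lookup returns one row per row of df, but labelled by tag (1, 2, 1, 2),
# not by df's own row labels (0, 1, 2, 3).
looked_up = means.loc[df['absorb'], columns_to_sub]

# .values strips those labels, so the subtraction is done row by row.
demeaned = df[columns_to_sub] - looked_up.values

# Cross-check (assumption: an alternative route, not the answer's method).
# transform('mean') returns the group means already aligned to df's index.
via_transform = df[columns_to_sub] - df.groupby('absorb')[columns_to_sub].transform('mean')

print(demeaned)
```

Because the lookup preserves the row order of df, the positional subtraction leaves every row where it was, and column 'c' is untouched.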

3 Comments

Nice answer (+1). Pretty sure df.groupby('absorb') is wrong, though.
@AmiTavory Have you tried it? Works fine for me. Just use the column names when grouping on the dataframe. There is no need for df.groupby(df.column). pandas.pydata.org/pandas-docs/stable/groupby.html
Curious what you think: for me, the two versions give different results.

If you set 'absorb' as the index of df, the subtraction becomes straightforward. Note that 'absorb' is then a non-unique index, so make sure that is what you want.

data = np.hstack((np.random.randint(0,10,20).reshape(-1,1),np.random.rand(20,3)))
df = pd.DataFrame(data,columns=['absorb','a','b','c']).set_index('absorb')
df.head()

               a         b         c
absorb                              
8       0.942156  0.675819  0.606406
0       0.801685  0.360899  0.055210
7       0.540333  0.691493  0.580708
7       0.234766  0.446549  0.295496
4       0.942021  0.338729  0.827124

That is df so far, with 'absorb' as the index.

Then, the means:

columns_to_sub = ['a','b']

means = df.groupby(level=0)[columns_to_sub].mean()
means.head()
               a         b
absorb                    
0       0.871498  0.659507
1       0.113925  0.711533
2       0.485379  0.191867
4       0.557054  0.581740

Then the subtraction can be done like so:

result = df[columns_to_sub] - means[columns_to_sub]
result.head()
               a         b
absorb                    
0      -0.069813 -0.298608
0       0.069813  0.298608
1       0.000000  0.000000
2       0.451854  0.164074
2      -0.451854 -0.164074
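
The same index-aligned subtraction can be sketched on a small hand-made frame whose group means are easy to check by hand. The data below is made up for illustration and is not from the question.

```python
import pandas as pd

# Deterministic stand-in for the random frame, with 'absorb' as a non-unique index.
df = pd.DataFrame({
    'absorb': [2, 1, 2, 1],
    'a': [2.0, 1.0, 6.0, 3.0],
    'b': [4.0, 0.0, 8.0, 2.0],
    'c': [5.0, 5.0, 5.0, 5.0],
}).set_index('absorb')
columns_to_sub = ['a', 'b']

# Group on the index level; `means` has one row per distinct tag.
means = df.groupby(level=0)[columns_to_sub].mean()

# Both frames are labelled by tag, so pandas matches each row of df
# to the mean row carrying the same 'absorb' label.
result = df[columns_to_sub] - means

print(result)
```

As the output above suggests, the rows of result may come back ordered by 'absorb' rather than in the original order of df, so this approach fits best when row order does not matter.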
