How to perform an operation on multiple columns in Pandas without using column names?

Question

I have a dataset with a large number of columns. I want to perform the same computation on all of these columns, combine the results into a single value per row, and store that value in a new column.

For example, I have a data frame like the one below:

      A1       A2       A3      ...   A120
0    0.12     0.03     0.43     ...   0.56
1    0.24     0.53     0.01     ...   0.98
.     ...       ...     ...     ...    ...
200   0.11     0.22     0.31     ...   0.08

I want to construct a similar data frame with a new column, calc:

calc = (A1**2 - A1) + (A2**2 - A2) + ... + (A120**2 - A120)

The final data frame should look like this:

      A1       A2       A3      ...   A120   calc
0    0.12     0.03     0.43     ...   0.56    x
1    0.24     0.53     0.01     ...   0.98    y
.     ...       ...     ...     ...    ...   ...
200   0.11     0.22     0.31    ...   0.08    n

I tried to do this in Python as follows:

import pandas as pd

df = pd.read_csv('sample.csv')

def construct_matrix():
    temp_sumsqc = 0
    for i in range(len(df.columns)):
        column_name_construct = 'A'+f'{i}'
        temp_sumsqc += df[column_name_construct] ** 2 - (df[column_name_construct])
    df["sumsqc"] = temp_sumsqc


construct_matrix()
print(df.to_string())

But this throws a KeyError: 'A1'.

It is difficult to do df["A1"]**2 - df["A1"] + df["A2"]**2 - df["A2"] + ... since there are 120 columns.

Since my attempt didn't work, is there a better way to do this?

Note that f'A{i}' is equivalent to, and clearer than, 'A'+f'{i}'.
– user17242583, commented Jun 2, 2022 at 17:58
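
Building on that comment, here is a minimal sketch of the loop from the question with the label construction removed. The loop in the question starts counting at 0, so the first label it builds is 'A0'; if the columns really run from A1 to A120, that label does not exist, which is one plausible source of the KeyError. The CSV headers are not shown, so this is an assumption. The small frame and the function name add_sumsqc are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv('sample.csv').
df = pd.DataFrame({"A1": [0.12, 0.24], "A2": [0.03, 0.53], "A3": [0.43, 0.01]})


def add_sumsqc(frame):
    """Return a copy of frame with a 'sumsqc' column: row-wise sum of x**2 - x."""
    total = 0
    # Iterate over the labels that actually exist instead of rebuilding
    # them from a counter, so no label can be missing.
    for name in frame.columns:
        total = total + frame[name] ** 2 - frame[name]
    # assign() returns a new frame and leaves the input untouched.
    return frame.assign(sumsqc=total)


result = add_sumsqc(df)
print(result)
```

Passing the frame in as an argument, rather than reading a global, also avoids the mismatch between names that the original snippet suffers from.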

Shubham Sharma · Accepted Answer · 2022-06-02 17:58:56Z

4

There is no need for a for loop; a vectorized approach works here:

df['calc'] = df.pow(2).sub(df).sum(1)
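
To make the one-liner concrete, here is a minimal sketch on a four-column frame built from the sample rows in the question (the real data has 120 columns). The positional 1 in sum(1) is the axis argument, written out below as axis=1. The filter(regex=...) step is an assumption added for illustration: it restricts the computation to columns named A followed by digits, which matters if the frame also holds other columns or if the line is re-run after calc has been added.

```python
import pandas as pd

# Sample rows taken from the question; the real frame has 120 columns.
df = pd.DataFrame(
    {
        "A1": [0.12, 0.24, 0.11],
        "A2": [0.03, 0.53, 0.22],
        "A3": [0.43, 0.01, 0.31],
        "A120": [0.56, 0.98, 0.08],
    },
    index=[0, 1, 200],
)

# Assumption: only columns named A<number> take part in the calculation.
features = df.filter(regex=r"^A\d+$")

# Element-wise x**2 - x over the whole selection, then a row-wise sum.
df["calc"] = features.pow(2).sub(features).sum(axis=1)

print(df["calc"].round(4).tolist())  # [-0.6262, -0.461, -0.557]
```

The printed values match the output shown in the other answer, which computes the same quantity with apply.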



2 Comments

user17242583 Over a year ago

Smart! I didn't think of trying df-wide functions.

rafaelc Over a year ago

This is the best answer in terms of performance.

user17242583 · Accepted Answer · 2022-06-02 17:55:46Z

2

You can use df.apply to execute code for each column, and then use sum(axis=1) to sum the resulting values across columns:

df['sumsqc'] = df.apply(lambda col: (col ** 2) - col).sum(axis=1)

Output:

>>> df
       A1    A2    A3  A120  sumsqc
0    0.12  0.03  0.43  0.56 -0.6262
1    0.24  0.53  0.01  0.98 -0.4610
200  0.11  0.22  0.31  0.08 -0.5570

Note that A1**2 - A1 is equivalent to A1 * (A1 - 1), so you could also write:

df['sumsqc'] = df.apply(lambda col: col * (col - 1)).sum(axis=1)
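
As a quick sanity check, the sketch below compares the forms discussed in this thread on random data. Because the lambda only uses element-wise arithmetic, the same expression can also be applied to the whole frame without apply. The random frame is a hypothetical stand-in for the real data, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real data: 201 rows, columns A1..A120.
df = pd.DataFrame(
    rng.random((201, 120)),
    columns=[f"A{i}" for i in range(1, 121)],
)

# Column-by-column via apply, using the factored form x * (x - 1).
via_apply = df.apply(lambda col: col * (col - 1)).sum(axis=1)

# The same factored expression applied to the whole frame at once.
via_frame = (df * (df - 1)).sum(axis=1)

# The pow/sub form from the other answer.
via_pow = df.pow(2).sub(df).sum(axis=1)

# All three should agree up to floating-point rounding.
print(np.allclose(via_apply, via_frame), np.allclose(via_apply, via_pow))
```

Any of the three results can be assigned to a new column; the whole-frame forms avoid calling a Python function once per column.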


1 Comment

Archibald Over a year ago

This works well too. I didn't notice any delay when I applied it to the dataset, and it solved the problem. Thanks @richardec
