1

I have a dataset with a large number of columns. I want to apply the same computation to each of these columns, combine the results into a single value per row, and add that value as a new column.

For example, I have a data frame like the one below:

      A1       A2       A3      ...   A120
0    0.12     0.03     0.43     ...   0.56
1    0.24     0.53     0.01     ...   0.98
.     ...       ...     ...     ...    ...
200   0.11     0.22     0.31     ...   0.08

I want to construct a data frame like the one below, with a new column calc, where

calc = (A1**2 - A1) + (A2**2 - A2) + ... + (A120**2 - A120)

The final data frame should look like this:

      A1       A2       A3      ...   A120   calc
0    0.12     0.03     0.43     ...   0.56    x
1    0.24     0.53     0.01     ...   0.98    y
.     ...       ...     ...     ...    ...   ...
200   0.11     0.22     0.31    ...   0.08    n

I tried to do this in Python as follows:

import pandas as pd

df = pd.read_csv('sample.csv')

def construct_matrix():
    temp_sumsqc = 0
    for i in range(len(df.columns)):
        column_name_construct = 'A'+f'{i}'
        temp_sumsqc += df[column_name_construct] ** 2 - (df[column_name_construct])
    df["sumsqc"] = temp_sumsqc


construct_matrix()
print(df.to_string())

But this throws a KeyError: 'A1'

Writing out df["A1"]**2 - df["A1"] + df["A2"]**2 - df["A2"] + ... by hand is impractical, since there are 120 columns.

Since my attempt didn't work, is there a better way to do this?

1
  • Note that f'A{i}' is equivalent to, and clearer than, 'A'+f'{i}'. Commented Jun 2, 2022 at 17:58
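
The loop in the question can also be kept if it iterates over the column names that actually exist rather than generating them: range(len(df.columns)) starts at 0, so the first name it builds is 'A0', which is not among the columns shown. A minimal sketch, assuming every column in the frame should take part in the sum (the small frame uses the rows shown in the question, and the helper name add_sumsqc is hypothetical):

```python
import pandas as pd

# Sample rows taken from the question; the real data has 120 columns.
df = pd.DataFrame({
    "A1": [0.12, 0.24, 0.11],
    "A2": [0.03, 0.53, 0.22],
    "A3": [0.43, 0.01, 0.31],
    "A120": [0.56, 0.98, 0.08],
})


def add_sumsqc(frame):
    """Return a copy of frame with a 'sumsqc' column added (hypothetical helper)."""
    total = 0
    # Iterate over the existing column labels, so no name has to be guessed.
    for name in frame.columns:
        total = total + frame[name] ** 2 - frame[name]
    return frame.assign(sumsqc=total)


result = add_sumsqc(df)
print(result.to_string())
```

Returning a new frame instead of mutating a global df keeps the function safe to call more than once, since the new column is never fed back into the sum.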

2 Answers

4

There is no need for a for loop; a vectorized approach works here:

df['calc'] = df.pow(2).sub(df).sum(1)
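
One thing to watch with whole-frame operations is that they include every column, so a non-numeric column, or a calc column left over from an earlier run, would be folded into the result. A minimal sketch of restricting the computation, assuming the columns of interest are named 'A' followed by digits (the sample rows come from the question, and the label column is hypothetical):

```python
import pandas as pd

# Sample rows from the question, plus a hypothetical non-numeric column
# that should not take part in the sum.
df = pd.DataFrame({
    "A1": [0.12, 0.24, 0.11],
    "A2": [0.03, 0.53, 0.22],
    "A3": [0.43, 0.01, 0.31],
    "A120": [0.56, 0.98, 0.08],
    "label": ["x", "y", "z"],
})

# Keep only the columns whose name is 'A' followed by one or more digits.
a_cols = df.filter(regex=r"^A\d+$")

# Same vectorized computation, applied to the selected columns only.
df["calc"] = a_cols.pow(2).sub(a_cols).sum(axis=1)
print(df)
```

Because the selection is by name pattern, running the last two statements again gives the same calc values instead of adding the previous result into the sum.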

2 Comments

Smart! I didn't think of trying df-wide functions.
This is the best answer in terms of performance.
2

You can use df.apply to execute code for each column, and then use sum(axis=1) to sum the resulting values across columns:

df['sumsqc'] = df.apply(lambda col: (col ** 2) - col).sum(axis=1)

Output:

>>> df
       A1    A2    A3  A120  sumsqc
0    0.12  0.03  0.43  0.56 -0.6262
1    0.24  0.53  0.01  0.98 -0.4610
200  0.11  0.22  0.31  0.08 -0.5570

Note that A1**2 - A1 is equivalent to A1 * (A1 - 1), so you could do

df['sumsqc'] = df.apply(lambda col: col * (col - 1)).sum(axis=1)
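
The identity above can be checked numerically. The sketch below, which assumes the sample rows from the question, compares the factored apply form with the pow/sub form and with the same arithmetic written directly on the whole frame (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

# Sample rows taken from the question.
df = pd.DataFrame({
    "A1": [0.12, 0.24, 0.11],
    "A2": [0.03, 0.53, 0.22],
    "A3": [0.43, 0.01, 0.31],
    "A120": [0.56, 0.98, 0.08],
})

# Column-by-column version using the factored form x * (x - 1).
via_apply = df.apply(lambda col: col * (col - 1)).sum(axis=1)

# Whole-frame version using x**2 - x.
via_pow = df.pow(2).sub(df).sum(axis=1)

# The factored form also works on the whole frame, without apply.
via_frame = (df * (df - 1)).sum(axis=1)

# Compare with a tolerance, since these are floating-point sums.
print(np.allclose(via_apply, via_pow), np.allclose(via_apply, via_frame))
```

The comparison uses np.allclose rather than exact equality because the two forms round differently in floating point.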

1 Comment

This works well too. I didn't notice any delay when I applied it to the dataset, and it solved the problem. Thanks @richardec
