Normalizing a huge python dataframe

Question

I have a huge csv file (~2GB) that I have imported using Dask. Now I want to normalize this dataframe. The dataframe contains about 70k columns. I have written this python function to calculate this:

def normalize(df):
   result = df.copy()
   for col in tqdm(df.columns):
     if col!=str('name')  #basically not to normalize columns with name of "name"
        max_value = df[col].max()
        min_value = df[col].min()
        result[col] = (df[col] - min_value) / (max_value - min_value)
   return result

It works okay but takes a lot of time. I put this on execution and its showing it will take appoximately 88 hours to complete. I tried switching to sklearn's minmaxscaler() but it doesn't show any progress of normalization and I am afraid that it will also take quite a lot of time. Is there any other way to normalize all the columns (and ignore a few like I did in that if condition).

don't take this feedback the wrong way, but str('name') it is redundant, is the same to compare 1=1. "name" its already a string, this makes your code look unprofessional — Enrique Benito Casado
– Enrique Benito Casado, Commented Mar 31, 2022 at 18:11

Hans Bambel · Accepted Answer · 2022-03-31 11:31:26Z

2

You don't need to loop through this. When the other columns than name are numerical values then you can just do something along the following:

num_cols = [col for col in df.columns if col != "name"]
df.loc[:, num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())

Here is a minimal code sample:

import pandas as pd

df = pd.DataFrame({"name": ["a"]*4, "a": [2,3,4,6], "b": [9,5,2,34]})
num_cols = [col for col in df.columns if col != "name"]
df.loc[:, num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())

print(df)

answered Mar 31, 2022 at 11:31

Hans Bambel

9561 gold badge10 silver badges23 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Daweo · Accepted Answer · 2022-03-31 11:54:07Z

I am afraid that it will also take quite a lot of time

Then considering that you just need numerical operations I suggest using numpy for actual number crunching and pandas only for extraction of columns to process, simple example:

import numpy as np
import pandas as pd
df = pd.DataFrame({'name':['A','B','C'],'x1':[1,2,3],'x2':[4,8,6],'x3':[10,15,30]})
num_arr = df[['x1','x2','x3']].to_numpy()
mins = np.min(num_arr,0)
maxs = np.max(num_arr,0)
result_arr = (num_arr - mins) / (maxs - mins)
result_df = pd.concat([df[['name']],pd.DataFrame(result_arr,columns=['x1','x2','x3'])],axis=1)
print(result_df)

output

  name   x1   x2    x3
0    A  0.0  0.0  0.00
1    B  0.5  1.0  0.25
2    C  1.0  0.5  1.00

Disclaimer: this solutions assumes that df has indices like 0,1,2,...

If you would need further speed increase consider using parallelization, which might be used in this case as values in each columns are computed independently from other columns.

Collectives™ on Stack Overflow

Normalizing a huge python dataframe

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related