Modify pandas dataframes inside a loop

Question

I want to modify some Pandas dataframes inside a for loop. The problem is that after the loop runs, the dataframes are not updated with the modifications. What is happening?

My code:

for i in [ages, vels, vendors, mt, base_tbl]:
    i = i.drop_duplicates(subset='IDs', keep="last")
    i['IDs'] = i['IDs'].astype(str)

You're assigning the modified dataframes to the i variable each time.
– Paul H, Commented Dec 28, 2021 at 15:27
Try using inplace=True instead of reassigning: i.drop_duplicates('IDs', keep='last', inplace=True)
– Scott Boston, Commented Dec 28, 2021 at 15:44

Paul H · Accepted Answer · 2021-12-28 15:35:19Z

1

Your modified dataframes are assigned to the i variable on each iteration of your loop, so the original names (ages, vels, and so on) still refer to the unmodified dataframes.
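A small sketch of that rebinding, using a made-up ages frame (the data here is illustrative and not from the question):

```python
import pandas as pd

# Illustrative data; the real `ages` frame is not shown in the question.
ages = pd.DataFrame({"IDs": [1, 1, 2], "age": [30, 31, 40]})

for i in [ages]:
    original = i  # `i` starts out as the same object as `ages`
    # drop_duplicates returns a NEW dataframe; only the name `i` is rebound to it.
    i = i.drop_duplicates(subset="IDs", keep="last")
    rebound = i is not original

print(rebound)    # True: `i` now refers to a different object
print(len(ages))  # 3: `ages` still has its duplicate row
```

The loop variable is just a name. Assigning to it never touches the object the list element refers to.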

You could do:

list_of_df = [ages, vels, vendors, mt, base_tbl]

list_of_df = [
    df.drop_duplicates(subset='IDs', keep="last")
      .assign(IDs=lambda df: df["IDs"].astype(str))
    for df in list_of_df
]

...but then you're stuck with a list of dataframes instead of having them individually.
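If the individual names are still wanted, one possible workaround is to unpack the cleaned list straight back into them. This is a sketch with two made-up frames, not something the answer itself proposes:

```python
import pandas as pd

# Illustrative stand-ins for two of the question's dataframes.
ages = pd.DataFrame({"IDs": [1, 1, 2], "age": [30, 31, 40]})
vels = pd.DataFrame({"IDs": [7, 8, 8], "vel": [1.5, 2.0, 2.5]})

# Build the cleaned list, then rebind each original name by tuple unpacking.
ages, vels = [
    df.drop_duplicates(subset="IDs", keep="last")
      .assign(IDs=lambda d: d["IDs"].astype(str))
    for df in (ages, vels)
]

print(ages["IDs"].tolist())
print(vels["IDs"].tolist())
```

The order of names on the left must match the order of frames on the right, which gets fragile as the list grows. A dict keyed by name is a common alternative.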

There's not enough context to your question to know how to best fix this issue.

Two options I can think of:

Concatenate them into a single dataframe and operate on that (you can assign a "source" column that distinguishes each dataset).
Do this prep/clean up as each dataframe is created.
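The first option might look roughly like the sketch below. The column name source and the sample data are assumptions made for illustration, and it assumes the frames can sensibly be stacked:

```python
import pandas as pd

# Illustrative frames; the real ones are not shown in the question.
ages = pd.DataFrame({"IDs": [1, 1, 2], "value": [30, 31, 40]})
vels = pd.DataFrame({"IDs": [2, 3, 3], "value": [5, 6, 7]})

# Tag each frame with where it came from, then stack them.
combined = pd.concat(
    [ages.assign(source="ages"), vels.assign(source="vels")],
    ignore_index=True,
)

# Deduplicate per source so equal IDs in different datasets do not collide.
cleaned = (
    combined.drop_duplicates(subset=["source", "IDs"], keep="last")
            .assign(IDs=lambda d: d["IDs"].astype(str))
)

# A single dataset can still be pulled back out by filtering on `source`.
ages_clean = cleaned[cleaned["source"] == "ages"]
print(cleaned)
```

Including source in subset matters here. Without it, an ID that appears in both datasets would be treated as a duplicate.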

Say you have a function that loads your data. You can write another that does the clean up and pipe the loader's output to it. Like this:


def cleanup(df):
    return (
      df.drop_duplicates(subset='IDs', keep="last")
        .assign(IDs=lambda df: df["IDs"].astype(str))
    )

ages = load_data("ages").pipe(cleanup)
mt = load_data("mt").pipe(cleanup)
# etc
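The answer leaves load_data undefined. For a version that runs end to end, here is a sketch with a hypothetical load_data that reads CSV text held in memory. The RAW dict and its contents are made up for the example; a real loader would presumably read files from disk:

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for files on disk.
RAW = {
    "ages": "IDs,age\n1,30\n1,31\n2,40\n",
    "mt": "IDs,mt\n5,a\n6,b\n6,c\n",
}


def load_data(name):
    """Hypothetical loader: parse the named CSV text into a dataframe."""
    return pd.read_csv(io.StringIO(RAW[name]))


def cleanup(df):
    """Keep the last row per ID and store the IDs as strings."""
    return (
        df.drop_duplicates(subset="IDs", keep="last")
          .assign(IDs=lambda d: d["IDs"].astype(str))
    )


ages = load_data("ages").pipe(cleanup)
mt = load_data("mt").pipe(cleanup)
print(ages)
print(mt)
```

Because each name is bound to the cleaned result once, at load time, nothing needs to be modified in a loop afterwards.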

answered Dec 28, 2021 at 15:35

Paul H


1 Comment

Guillermo.D Over a year ago

thanks Paul... I think I'll have to pipe it when loading the csvs.

Scott Boston · Accepted Answer · 2021-12-28 16:37:14Z

0

Try this to modify the existing objects in place instead of creating new ones:

for i in [ages, vels, vendors, mt, base_tbl]:
    i.drop_duplicates(subset='IDs', keep="last", inplace=True)
    i['IDs'] = i['IDs'].astype(str)

MCVE:

import pandas as pd
import numpy as np
np.random.seed(123)
df1 = pd.DataFrame(np.random.randint(0, 100, (5, 5)), columns=[*'abcde'])
df2 = pd.DataFrame(np.random.randint(0, 100, (5, 5)), columns=[*'abcde'])
df3 = pd.DataFrame(np.random.randint(0, 100, (5, 5)), columns=[*'abcde'])

for i in [df1, df2, df3]:
    i.drop_duplicates('b', keep='last', inplace=True)
    i['a'] = i['a'].astype(str)


df1.info()
df2.info()
df3.info()
print(df2)

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       5 non-null      object
 1   b       5 non-null      int32 
 2   c       5 non-null      int32 
 3   d       5 non-null      int32 
 4   e       5 non-null      int32 
dtypes: int32(4), object(1)
memory usage: 160.0+ bytes
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       4 non-null      object
 1   b       4 non-null      int32 
 2   c       4 non-null      int32 
 3   d       4 non-null      int32 
 4   e       4 non-null      int32 
dtypes: int32(4), object(1)
memory usage: 128.0+ bytes
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       5 non-null      object
 1   b       5 non-null      int32 
 2   c       5 non-null      int32 
 3   d       5 non-null      int32 
 4   e       5 non-null      int32 
dtypes: int32(4), object(1)
memory usage: 160.0+ bytes
    a   b   c   d   e
0  84  39  66  84  47
1  61  48   7  99  92
3  34  97  76  40   3
4  69  64  75  34  58
df1
    a   b   c   d   e
0  97  30  52  12  50
3   2  86  41  11  98  # Note the missing index values: drop_duplicates worked.
4   0  48  71  94  61
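To check the in-place approach without reading through info() output, here is a smaller variant with hand-picked values. The data is illustrative and chosen so that the duplicates are obvious:

```python
import pandas as pd

# df1 has a duplicate in column b (10 appears twice); df2 has none.
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [10, 10, 20]})
df2 = pd.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9]})

for i in [df1, df2]:
    # Both statements mutate the object `i` refers to; `i` is never rebound.
    i.drop_duplicates(subset="b", keep="last", inplace=True)
    i["a"] = i["a"].astype(str)

print(df1["a"].tolist())  # ['2', '3']
print(df2["a"].tolist())  # ['4', '5', '6']
```

Both changes are visible through the original names df1 and df2, including the astype(str) conversion.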

edited Dec 28, 2021 at 16:37

answered Dec 28, 2021 at 15:45

Scott Boston


3 Comments

Guillermo.D Over a year ago

Your suggestion doesn't take into account the astype(str) line, which is what interests me the most :(

Scott Boston Over a year ago

@Guillermo.D Try it. See if it works. I think it will. The problem happens when drop_duplicates creates a new object (new memory space) for i.

Scott Boston Over a year ago

@Guillermo.D See update.

imburningbabe · Accepted Answer · 2021-12-29 13:05:33Z

0

You just have to add inplace=True to your code, in order to overwrite the df with modifications:

for i in [ages, vels, vendors, mt, base_tbl]:
    i.drop_duplicates(subset='IDs', keep="last", inplace=True)
    i['IDs'] = i['IDs'].astype(str)

This should fix it.

edited Dec 29, 2021 at 13:05

answered Dec 28, 2021 at 15:50

imburningbabe


3 Comments

Paul H Over a year ago

In this answer, you've turned i into None since that's what in-place pandas operations return. If you fix that, this is a direct copy of another answer already provided.

Scott Boston Over a year ago

Note that in my comment I said "instead of reassigning". Do not use i = in front of a method called with inplace=True.

imburningbabe Over a year ago

Yeah, you're right, I miswrote. Unhappy to say that btw ;)
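The point raised in these comments, that a method called with inplace=True returns None, can be checked with a short sketch (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"IDs": [1, 1, 2]})

# With inplace=True the dataframe is modified and the call returns None,
# so writing `i = i.drop_duplicates(..., inplace=True)` would set i to None.
result = df.drop_duplicates(subset="IDs", keep="last", inplace=True)

print(result is None)  # True
print(len(df))         # 2: the duplicate row was removed from df itself
```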
