1

I want to modify some Pandas dataframes inside a for loop. The problem is that after the loop runs, the dataframes are not updated with the modifications. What is happening?

My code:

for i in [ages, vels, vendors, mt, base_tbl]:
    i = i.drop_duplicates(subset='IDs', keep="last")
    i['IDs'] = i['IDs'].astype(str) 
2
  • you're assigning the modified dataframes to the i variable each time Commented Dec 28, 2021 at 15:27
  • Let's try using 'inplace=True' instead of reassigning. i.drop_duplicates('IDs', keep='last', inplace=True) Commented Dec 28, 2021 at 15:44

3 Answers

1

Each iteration of your loop assigns the modified dataframe to the i variable. That only rebinds i; the original variables (ages, vels, and so on) still point to the unmodified dataframes.
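To see the rebinding in isolation, here is a minimal sketch. The sample `ages` data is made up for illustration; the question does not show the real dataframes.

```python
import pandas as pd

# Hypothetical sample data standing in for one of the question's dataframes.
ages = pd.DataFrame({"IDs": [1, 1, 2], "age": [30, 31, 40]})

for i in [ages]:
    before = i  # `before` and `ages` refer to the same object
    # drop_duplicates returns a new dataframe; the assignment rebinds `i` to it
    i = i.drop_duplicates(subset="IDs", keep="last")
    rebound = i is not before

print(rebound)    # True
print(len(ages))  # 3, the duplicate row is still in `ages`
```

Since `ages` itself is never touched, it looks unchanged once the loop ends.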

You could do:

list_of_df = [ages, vels, vendors, mt, base_tbl]

list_of_df = [
    df.drop_duplicates(subset='IDs', keep="last")
      .assign(IDs=lambda df: df["IDs"].astype(str))
    for df in list_of_df
]

...but then you're stuck with a list of dataframes instead of having them individually.
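If a plain list is awkward, one possible variation (a sketch, not something the question asked for) is to key the dataframes by name in a dict, so each cleaned dataframe can be looked up or rebound afterwards. The sample data and the `frames` name are assumptions.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's dataframes.
ages = pd.DataFrame({"IDs": [1, 1, 2], "age": [30, 31, 40]})
vels = pd.DataFrame({"IDs": [3, 4, 4], "vel": [5.0, 6.0, 7.0]})

frames = {"ages": ages, "vels": vels}

# Build a new dict holding the cleaned version of each dataframe.
frames = {
    name: df.drop_duplicates(subset="IDs", keep="last")
            .assign(IDs=lambda d: d["IDs"].astype(str))
    for name, df in frames.items()
}

# Rebind the original names if you still want them individually.
ages, vels = frames["ages"], frames["vels"]
```

The dict keeps each dataframe tied to its name, which a list does not do.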

There's not enough context to your question to know how to best fix this issue.

Two options I can think of:

  1. concatenate them into a single dataframe and operate on that (you can assign a "source" column that distinguishes each dataset)
  2. do this prep/clean up as each dataframe is created.
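The first option might look like the sketch below. The sample data and the "source" column values are assumptions. Note that duplicates are dropped per (source, IDs) pair so that equal IDs in different datasets don't collide.

```python
import pandas as pd

# Hypothetical sample data standing in for two of the question's dataframes.
ages = pd.DataFrame({"IDs": [1, 1, 2], "age": [30, 31, 40]})
vels = pd.DataFrame({"IDs": [1, 2, 2], "vel": [5.0, 6.0, 7.0]})

# Tag each dataframe with its origin, then stack them into one.
combined = pd.concat(
    [ages.assign(source="ages"), vels.assign(source="vels")],
    ignore_index=True,
)

# Clean everything in one pass; dedupe within each source, not across them.
cleaned = (
    combined.drop_duplicates(subset=["source", "IDs"], keep="last")
            .assign(IDs=lambda d: d["IDs"].astype(str))
)
```

Columns that exist in only one dataset (here `age` and `vel`) are filled with NaN for rows from the other, so this fits best when the datasets share most of their columns.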

Say you have a function that loads your data. You can write another that does the clean up and pipe the loader's output to it. Like this:


def cleanup(df):
    return (
      df.drop_duplicates(subset='IDs', keep="last")
        .assign(IDs=lambda df: df["IDs"].astype(str))
    )

ages = load_data("ages").pipe(cleanup)
mt = load_data("mt").pipe(cleanup)
# etc
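For a version you can run end to end, here is a sketch in which `load_data` is a hypothetical stand-in that reads from in-memory CSV text. The `RAW` dict and its contents are made up; a real loader would presumably call `pd.read_csv` on a file path.

```python
import io

import pandas as pd

# Hypothetical raw CSV contents keyed by dataset name.
RAW = {
    "ages": "IDs,age\n1,30\n1,31\n2,40\n",
    "mt": "IDs,mt\n7,a\n7,b\n",
}


def load_data(name):
    # Stand-in loader: parse the in-memory CSV text for `name`.
    return pd.read_csv(io.StringIO(RAW[name]))


def cleanup(df):
    # Keep the last row per ID, then store the IDs as strings.
    return (
        df.drop_duplicates(subset="IDs", keep="last")
          .assign(IDs=lambda d: d["IDs"].astype(str))
    )


ages = load_data("ages").pipe(cleanup)
mt = load_data("mt").pipe(cleanup)

print(ages["IDs"].tolist())  # ['1', '2']
```

Each name is bound to the cleaned dataframe from the start, so no loop over existing variables is needed.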

1 Comment

thanks Paul... I think I'll have to pipe it when loading the csvs.
0

Try this to modify the existing objects in place, rather than creating new ones:

for i in [ages, vels, vendors, mt, base_tbl]:
    i.drop_duplicates(subset='IDs', keep="last", inplace=True)
    i['IDs'] = i['IDs'].astype(str) 

MVCE:

import pandas as pd
import numpy as np
np.random.seed(123)
df1 =  pd.DataFrame(np.random.randint(0,100, (5,5)), columns=[*'abcde'])
df2 =  pd.DataFrame(np.random.randint(0,100, (5,5)), columns=[*'abcde'])
df3 =  pd.DataFrame(np.random.randint(0,100, (5,5)), columns=[*'abcde'])

for  i in [df1, df2, df3]:
    i.drop_duplicates('b', keep='last', inplace=True)
    i['a'] = i['a'].astype(str)


df1.info()
df2.info()
df3.info()
print(df2)

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       5 non-null      object
 1   b       5 non-null      int32 
 2   c       5 non-null      int32 
 3   d       5 non-null      int32 
 4   e       5 non-null      int32 
dtypes: int32(4), object(1)
memory usage: 160.0+ bytes
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       4 non-null      object
 1   b       4 non-null      int32 
 2   c       4 non-null      int32 
 3   d       4 non-null      int32 
 4   e       4 non-null      int32 
dtypes: int32(4), object(1)
memory usage: 128.0+ bytes
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       5 non-null      object
 1   b       5 non-null      int32 
 2   c       5 non-null      int32 
 3   d       5 non-null      int32 
 4   e       5 non-null      int32 
dtypes: int32(4), object(1)
memory usage: 160.0+ bytes
    a   b   c   d   e
0  84  39  66  84  47
1  61  48   7  99  92
3  34  97  76  40   3
4  69  64  75  34  58
1
df1
a   b   c   d   e
0   97  30  52  12  50
3   2   86  41  11  98  # Note missing second index drop duplicate worked.
4   0   48  71  94  61

3 Comments

your suggestion doesn't take into account the astype(str) line, which is what interests me the most :(
@Guillermo.D Try it. See if it works. I think it will. The problem comes from drop_duplicates creating a new memory space for i.
@Guillermo.D See update.
0

You just have to add inplace=True to your code, in order to overwrite the df with modifications:

for i in [ages, vels, vendors, mt, base_tbl]:
    i.drop_duplicates(subset='IDs', keep="last", inplace=True)
    i['IDs'] = i['IDs'].astype(str) 

This should fix it.

3 Comments

In this answer, you've turned i into None, since that's what in-place pandas operations return. If you fix that, this is a direct copy of another answer already provided.
Note that in my comment I said to use inplace instead of reassigning. Do not use i = in front of a method called with inplace=True.
Yeah, you're right, I miswrote. Unhappy to say that btw ;)
