
I want to replace multiple strings across a list of dataframes wherever a cell matches. I cannot get the matches replaced in place; instead, my attempt produces additional rows.

Here's the example data:

import pandas as pd
import re
from scipy import linalg

nm=['sr', 'pop15', 'pop75', 'dpi', 'ddpi']
df_tbl=pd.DataFrame(linalg.circulant(nm))

ls_comb = [df_tbl.loc[0:i] for i in range(0, len(df_tbl))]

extract_text=['dpi', 'pop15'] 
clean_text=['np.log(dpi)', 'np.log(pop15)']
cl_text=[re.search('(?<=\\()[^\\^\\)]+', i).group(0) for i in clean_text]
int_text=list(set(extract_text).intersection(cl_text))

I know that int_text is the same as extract_text here, but in some cases clean_text may contain only one np.log term, so I kept the intersection and use int_text to filter.

And what I have tried:

[
    i.apply(
        lambda x: [
            re.sub(rf"\b{ext_t}\b", cln_t, val)
            for val in x
            for ext_t, cln_t in zip(int_text, clean_text)
        ]
    )
    for i in ls_comb
]

It produces the following:

[    0     1            2      3              4
 0  sr  ddpi  np.log(dpi)  pop75          pop15
 1  sr  ddpi          dpi  pop75  np.log(pop15),
                0     1            2            3              4
 0             sr  ddpi  np.log(dpi)        pop75          pop15
 1             sr  ddpi          dpi        pop75  np.log(pop15)
 2          pop15    sr         ddpi  np.log(dpi)          pop75
 3  np.log(pop15)    sr         ddpi          dpi          pop75,
                0              1            2            3              4
 0             sr           ddpi  np.log(dpi)        pop75          pop15
 1             sr           ddpi          dpi        pop75  np.log(pop15)
 2          pop15             sr         ddpi  np.log(dpi)          pop75
 3  np.log(pop15)             sr         ddpi          dpi          pop75
 4          pop75          pop15           sr         ddpi    np.log(dpi)
 5          pop75  np.log(pop15)           sr         ddpi            dpi,
.
.
.

However, this produces additional rows. I expect a clean result like this:

[       0            1            2            3            4
 0     sr          ddpi       np.log(dpi)    pop75      np.log(pop15),
        0            1            2            3            4
 0     sr          ddpi       np.log(dpi)     pop75     np.log(pop15)
 1  np.log(pop15)   sr          ddpi       np.log(dpi)     pop75,
.
.
.
  • I'm afraid I don't really understand your objective. Could you perhaps give a more explicit example of the data you're working with, the output you expect, and an explanation of the logic you're applying? Commented Jun 30, 2022 at 21:45
  • @CrazyChucky I have updated the question with the actual output to compare against the expected output. Essentially, I want to replace each value that matches an entry in int_text with its log form from clean_text. I wanted to replace these in place, but my attempt loops over the patterns inside the loop over x, so it emits one result per value for np.log(pop15) and another for the other pattern, which doubles the number of rows. The expected output shows the values replaced where they stand. Commented Jun 30, 2022 at 21:58
  • Looping is pretty much never the best answer when it comes to pandas... Commented Jun 30, 2022 at 22:24
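
To make the doubling described in the comments concrete, here is a minimal sketch that uses plain lists instead of dataframes. The names column, mapping, doubled, replace_all and in_place are illustrative and not from the question. A comprehension with two for clauses yields one result per (value, pattern) pair, whereas applying every pattern to the same value yields one result per value. This only shows where the extra rows come from; it is not the approach taken in the answer below.

```python
import re

# One column of the example table (illustrative data).
column = ["sr", "ddpi", "dpi", "pop75", "pop15"]
# Explicit pairing of each name with its log form.
mapping = {"dpi": "np.log(dpi)", "pop15": "np.log(pop15)"}

# Two `for` clauses: one output per (value, pattern) pair,
# so the result is len(column) * len(mapping) items long.
doubled = [
    re.sub(rf"\b{re.escape(key)}\b", repl, val)
    for val in column
    for key, repl in mapping.items()
]


def replace_all(val):
    """Apply each pattern once to the same value and return one result."""
    for key, repl in mapping.items():
        val = re.sub(rf"\b{re.escape(key)}\b", repl, val)
    return val


# One output per value, so the length is preserved.
in_place = [replace_all(val) for val in column]

print(len(doubled), len(in_place))
print(in_place)
```

Because the question's lambda returns the longer list for every column, apply builds a frame with twice as many rows.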

1 Answer

import pandas as pd
from scipy import linalg

nm=['sr', 'pop15', 'pop75', 'dpi', 'ddpi']
df_tbl=pd.DataFrame(linalg.circulant(nm))

extract_text=['dpi', 'pop15'] 
clean_text=['np.log(dpi)', 'np.log(pop15)']
df_tbl.replace(extract_text, clean_text, inplace=True)
print(df_tbl)

Output:

               0              1              2              3              4
0             sr           ddpi    np.log(dpi)          pop75  np.log(pop15)
1  np.log(pop15)             sr           ddpi    np.log(dpi)          pop75
2          pop75  np.log(pop15)             sr           ddpi    np.log(dpi)
3    np.log(dpi)          pop75  np.log(pop15)             sr           ddpi
4           ddpi    np.log(dpi)          pop75  np.log(pop15)             sr
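
The question works with a list of dataframes (ls_comb), while the code above replaces values in a single frame. A possible way to carry the same idea over to the list is sketched below, under a few assumptions: the circulant table is rebuilt with a plain list comprehension only to avoid the scipy dependency, and the names mapping and replaced are illustrative. The mapping is derived from clean_text itself rather than by zipping int_text with clean_text, since int_text comes from a set and its order is not guaranteed to line up with clean_text.

```python
import re

import pandas as pd

nm = ["sr", "pop15", "pop75", "dpi", "ddpi"]
n = len(nm)
# Same layout as linalg.circulant(nm): cell [i][j] holds nm[(i - j) % n].
df_tbl = pd.DataFrame([[nm[(i - j) % n] for j in range(n)] for i in range(n)])

# .loc slicing is label-inclusive, so ls_comb[i] has i + 1 rows.
ls_comb = [df_tbl.loc[0:i] for i in range(len(df_tbl))]

extract_text = ["dpi", "pop15"]
clean_text = ["np.log(dpi)", "np.log(pop15)"]

# Pair each inner name with its own log form, then keep the requested names.
mapping = {re.search(r"(?<=\()[^)]+", c).group(0): c for c in clean_text}
mapping = {k: v for k, v in mapping.items() if k in extract_text}

# DataFrame.replace matches whole cell values by default, so 'ddpi' is untouched.
replaced = [df.replace(mapping) for df in ls_comb]

print(replaced[1])
```

Each call returns a new frame, so ls_comb itself is left unchanged and every frame keeps its original number of rows.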

4 Comments

I did not expect this to work! I had used replace some time ago, but it replaced both ddpi and dpi because ddpi contains dpi, which is why I went for re.sub. Does inplace=True prevent this issue?
pd.DataFrame.replace is quite different from str.replace or even pd.Series.str.replace. It's important to keep track of which one you're using.
Adding inplace=True is just a different way of doing df_tbl = df_tbl.replace(extract_text, clean_text) that's available for certain functions.
Ah, I get it. This is definitely a much better option. I had tried to call replace on ls_comb itself, which is what produced the errors.
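
The distinctions raised in these comments can be checked with a small sketch. The variable names (s, whole, substring, bounded, df, result) are illustrative. It compares whole-value matching in Series.replace with substring matching in Series.str.replace, and shows that inplace=True only changes where the result is stored, not what gets matched.

```python
import pandas as pd

s = pd.Series(["dpi", "ddpi", "pop15"])

# Series.replace / DataFrame.replace: matches the whole cell value by default.
whole = s.replace("dpi", "np.log(dpi)")

# Series.str.replace: works on substrings, so 'ddpi' is altered as well.
substring = s.str.replace("dpi", "np.log(dpi)", regex=False)

# Adding word boundaries makes the substring version behave like the first one here.
bounded = s.str.replace(r"\bdpi\b", "np.log(dpi)", regex=True)

print(whole.tolist())
print(substring.tolist())
print(bounded.tolist())

# inplace=True modifies the frame and returns None; the matching is unchanged.
df = pd.DataFrame({"a": ["dpi", "ddpi"]})
result = df.replace("dpi", "np.log(dpi)", inplace=True)
print(result, df["a"].tolist())
```

So the ddpi problem mentioned in the first comment comes from substring matching, not from the absence of inplace=True.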
