
I am trying to replace strings in a pandas DataFrame using a dictionary whose keys contain multiple sets of parentheses. When I run the script, I get an error about match groups, and the string is not replaced. I'm fairly confident that the parentheses are the cause.

To resolve this, I have been trying regular expression pattern matching with the str.contains() method. I have reviewed other solutions on Stack Overflow, but haven't been able to resolve the error.

Here is the script I am using for testing. It's important that the parentheses are kept in the strings (i.e. I don't want to have to remove them):

import pandas as pd
import numpy as np

dict= {'2017() (pat)':'2000',
       '2018() (pat)':'2001'}

df = pd.DataFrame({'YEAR': ['test2017end','test2018end','test2019end'],
                   'MONTH': ['Jan','Feb','Mar'],
                   'DD': ['1','12','22']})

for init, repl in dict.items():
    df.loc[df['YEAR'].str.contains(init),'YEAR'] = repl

print(df)

Can someone please provide guidance on using pattern matching so that the strings are properly replaced?

Thanks!

  • Don't name dictionaries dict Commented Aug 11, 2018 at 5:23

3 Answers

Don't name the variable dict, because that shadows Python's built-in dict type.

One solution is to extract the first integer from each dictionary key and match on that:

import re
import pandas as pd

d = {'2017() (pat)':'2000',
     '2018() (pat)':'2001'}

df = pd.DataFrame({'YEAR': ['test2017end','test2018end','test2019end'],
                   'MONTH': ['Jan','Feb','Mar'],
                   'DD': ['1','12','22']})

for init, repl in d.items():
    # Use the first run of digits in the key (e.g. '2017') as the search pattern
    i = re.findall(r'\d+', init)[0]
    df.loc[df['YEAR'].str.contains(i),'YEAR'] = repl

print(df)
          YEAR MONTH  DD
0         2000   Jan   1
1         2001   Feb  12
2  test2019end   Mar  22
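
As a side note on why the original loop complained about match groups: str.contains() treats its argument as a regular expression by default, so the parentheses in a key are read as capture groups rather than literal characters. A minimal sketch using only the standard re module (the sample string 'x2017() (pat)y' is made up for illustration):

```python
import re

key = '2017() (pat)'

# Unescaped, the parentheses are regex groups, not literal characters.
unescaped = re.compile(key)
print(unescaped.groups)  # number of capture groups in the pattern

# re.escape() backslash-escapes the metacharacters so the key matches itself.
escaped = re.compile(re.escape(key))
print(escaped.groups)

sample = 'x2017() (pat)y'  # hypothetical text containing the key verbatim
print(unescaped.search(sample) is not None)
print(escaped.search(sample) is not None)
```

Escaping only helps when the key appears verbatim in the data. In the question's frame the values look like 'test2017end', which is why this answer matches on the digits instead.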

Have you tried methods that don't involve looping? Something along these lines:

import re
import pandas as pd

dict_= {'2017() (pat)':'2000',
       '2018() (pat)':'2001'}

df = pd.DataFrame({'YEAR': ['test2017end','test2018end','test2019end'],
                   'MONTH': ['Jan','Feb','Mar'],
                   'DD': ['1','12','22']})

pat = r'(\d{4})'

dict_b = {re.search(pat, key).group(1):item for key, item in dict_.items()}

# Years that are not keys in dict_b become NaN
df['YEARX'] = df['YEAR'].str.extract(pat,expand=False).map(dict_b)

# Years that are not keys in dict_b fall back to the extracted year
df['YEARY'] = df['YEAR'].str.extract(pat,
                  expand=False).apply(lambda x: dict_b.get(x, x))
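
A possible variation for the fallback column is to map first and then fill the unmatched rows with fillna(). This is only a sketch: year_map and extracted are illustrative names, and year_map is assumed to already hold plain four-digit keys, as dict_b does above.

```python
import pandas as pd

# Assumed to be the already-reduced mapping (plain year -> replacement)
year_map = {'2017': '2000', '2018': '2001'}

df = pd.DataFrame({'YEAR': ['test2017end', 'test2018end', 'test2019end']})

# Pull the four-digit year out of each value
extracted = df['YEAR'].str.extract(r'(\d{4})', expand=False)

# Map known years; rows with no mapping keep the extracted year
df['YEARY'] = extracted.map(year_map).fillna(extracted)

print(df)
```

Like the apply() version, this keeps the extracted year (not the original string) when there is no match.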


Thank you for the quick responses. My actual code was a little more complicated than what I posted: I was matching words rather than numbers. I adapted jerzael's response for this, and the script now works correctly. Here is the test script I used:

import pandas as pd
import numpy as np
import re

dct= {'love (one)()':'john',
       'smith (two)()':'doe',
       'ken (three)()':'yearns'}

df = pd.DataFrame({'MAN': ['test|smith (two)()end','test|love (one)()end','test|ken (three)()end'],
                   'MONTH': ['Jan','Feb','Mar'],
                   'DD': ['1','12','22']})

for init, repl in dct.items():
    i = re.findall(r'\w+', init)[0]
    df.loc[df['MAN'].str.contains(i),'MAN'] = repl

print(df)
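
Because the keys in this test frame appear verbatim inside the MAN values, another option might be to skip regular expressions altogether and pass regex=False to str.contains(), so the parentheses are treated as ordinary characters. A minimal sketch under that assumption (the mask variable is just for readability):

```python
import pandas as pd

dct = {'love (one)()': 'john',
       'smith (two)()': 'doe',
       'ken (three)()': 'yearns'}

df = pd.DataFrame({'MAN': ['test|smith (two)()end',
                           'test|love (one)()end',
                           'test|ken (three)()end']})

for init, repl in dct.items():
    # regex=False does a plain substring test, so '(' and ')' need no escaping
    mask = df['MAN'].str.contains(init, regex=False)
    df.loc[mask, 'MAN'] = repl

print(df)
```

This would not help with the data in the original question, where the full key never occurs in the YEAR column.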

For beginners like me, the Regular Expression HOWTO in the Python documentation is a must-read (https://docs.python.org/3/howto/regex.html#regex-howto).

Cheers
