Python Pandas and regex in replacing items in Dataframe using dictionary

Question

Hi, I am trying to remap the values of a DataFrame using a dictionary in pandas, but I need regex for the matching to work properly.

Here is a sample of the dict:

di_cities = {
    "Ain Salah (town)": "Ain Salah",
    "Agadez town": "Agadez",
    "Bamako city": "Bamako",
    "Birnin Konni town": "Birni N Konni",
    "Konni": "Birni N Konni",
    "Kadunà": "Kaduna",
    "Kaduna (city)": "Kaduna",
    "Kano (city)": "Kano",
    "Matamey": "Matamey",
    "Mopti city": "Mopti",
    "N'guigmi": "Nguigmi",
    "Tunis": "Tunis",
    "Tunis (city)": "Tunis",
}

I am using this dictionary comprehension:

di_cities = {rf"\b{k}\b": v for k, v in di_cities.items()}
df_cities_clean = df.replace(di_cities, regex=True)

As you can see in the picture (final result), it works fine for Bamako, Agadez, Mopti and every single-word string. It doesn't work for any string with parentheses, and in the case of Birnin Konni it messes things up a little. I am using another dictionary in a similar way, but there every string is between parentheses and rf"\({k}\)" works perfectly.

Can you help me?

Final result

Try di_cities = {rf"\b{re.escape(k)}(?:(?<=\w)\b|(?<!\w))": v for k, v in di_cities.items()}. Note that it may not work if your dictionary has overlapping keys (those that are prefixes of other(s)). This also assumes your keys always start with a word char.
– Wiktor Stribiżew, Commented Nov 9, 2021 at 16:13
Thanks Wiktor! It's almost perfect to me! It does the job for everything except Konni i.e. overlapping keys (there is just this one at the moment) but I solved with a workaround
– Irvine, Commented Nov 9, 2021 at 18:58
what result do you need? Maybe you should simply split(" (") and get first element
– furas, Commented Nov 9, 2021 at 19:02
Great, next time, please add @username mention in the comment to notify this user of your feedback.
– Wiktor Stribiżew, Commented Nov 15, 2021 at 8:05

Wiktor Stribiżew · Accepted Answer · 2021-11-15 08:04:51Z

I suggest using

import re

di_cities = {rf"\b{re.escape(k)}(?:(?<=\w)\b|(?<!\w))": v for k, v in di_cities.items()}

With this dictionary comprehension, you create another dictionary whose keys are regular expressions. Each one matches a former key as a whole word: the key must start at a word boundary (so it is assumed to begin with a word character, that is, a letter, digit, or underscore) and, if it ends with a word char, it must not be immediately followed by another word char. If a key does not end with a word char, say, if it ends with punctuation or whitespace (adding .strip() might make it safer), no additional boundary check is applied.
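As a rough illustration of the difference, the sketch below compares the question's original pattern with the one suggested here on two keys from the sample dictionary. The helper names naive and adaptive are made up for this example and only use the standard re module.

```python
import re

sample_keys = ["Kaduna (city)", "Bamako city"]

def naive(key):
    # Pattern from the question: key is not escaped, \b on both sides.
    return rf"\b{key}\b"

def adaptive(key):
    # Pattern from the answer: escaped key, trailing boundary
    # enforced only when the key ends with a word character.
    return rf"\b{re.escape(key)}(?:(?<=\w)\b|(?<!\w))"

# Does each pattern find its own key?
naive_hits = {k: bool(re.search(naive(k), k)) for k in sample_keys}
adaptive_hits = {k: bool(re.search(adaptive(k), k)) for k in sample_keys}

print(naive_hits)     # the parenthesised key is missed
print(adaptive_hits)  # both keys are found

# The trailing check still rejects a longer word such as "cityscape".
partial = bool(re.search(adaptive("Bamako city"), "Bamako cityscape"))
print(partial)
```

The unescaped version fails on "Kaduna (city)" for two reasons: the parentheses are read as a capturing group rather than literal characters, and a trailing \b can never match between ")" and the end of the string.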

The pattern rf"\b{re.escape(k)}(?:(?<=\w)\b|(?<!\w))" first escapes all special regex metacharacters in the key and then prepends a word boundary to it. The trailing (?:(?<=\w)\b|(?<!\w)) is a non-capturing group that matches

(?<=\w)\b - a word boundary if the preceding char is a word char ((?<=...) is a positive lookbehind)
| - or
(?<!\w) - no additional check (empty string is matched) if there is no word char immediately to the left of the current location ((?<!...) is a negative lookbehind).
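For the overlapping keys mentioned in the comments ("Konni" and "Birnin Konni town"), one possible approach, which is not the asker's unspecified workaround, is to join all escaped keys into a single alternation, longest first, and replace in one pass. This is a minimal sketch using only the standard re module; the name remap and the reduced dictionary are assumptions for illustration.

```python
import re

di_cities = {
    "Birnin Konni town": "Birni N Konni",
    "Konni": "Birni N Konni",
    "Kaduna (city)": "Kaduna",
    "Kano (city)": "Kano",
}

# Longest keys first, so "Birnin Konni town" is tried before "Konni".
keys = sorted(di_cities, key=len, reverse=True)
alternation = "|".join(re.escape(k) for k in keys)

# Same boundary logic as the answer, wrapped around the alternation.
combined = re.compile(rf"\b(?:{alternation})(?:(?<=\w)\b|(?<!\w))")

def remap(text):
    # Each match is replaced once; the replacement text is not re-scanned,
    # so "Birni N Konni" produced here is not hit again by the "Konni" key.
    return combined.sub(lambda m: di_cities[m.group(0)], text)

print(remap("Birnin Konni town"))
print(remap("Kaduna (city)"))
```

For a column of strings this could be applied with something like df["city"].map(remap), where the column name "city" is hypothetical. Note that a cell that already holds "Birni N Konni" would still match the "Konni" key; if the keys are meant to match whole cell values, an exact-match df.replace(di_cities) without regex may be a better fit.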


1 Comment

Irvine Over a year ago

Thank you Wiktor! That works perfectly. You've been very clear in your explanation too.
