Hi, I am trying to remap values in a DataFrame using a dictionary in Python pandas, but I need to use regex to make the replacements work correctly.

Here is a sample of the dict:

di_cities = {
    "Ain Salah (town)": "Ain Salah",
    "Agadez town": "Agadez",
    "Bamako city": "Bamako",
    "Birnin Konni town": "Birni N Konni",
    "Konni": "Birni N Konni",
    "Kadunà": "Kaduna",
    "Kaduna (city)": "Kaduna",
    "Kano (city)": "Kano",
    "Matamey": "Matamey",
    "Mopti city": "Mopti",
    "N'guigmi": "Nguigmi",
    "Tunis": "Tunis",
    "Tunis (city)": "Tunis",
}

I am using this dictionary comprehension and replacement:

di_cities = {rf"\b{k}\b": v for k, v in di_cities.items()}
df_cities_clean = df.replace(di_cities, regex=True)

As you can see in the picture (final result), it works fine for Bamako, Agadez, Mopti and every single-word string. It doesn't work for any string with parentheses, and in the case of Birnin Konni the output gets slightly garbled. I am using another dictionary in a similar way, but there every string is between parentheses, and rf"\({k}\)" works perfectly.
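To illustrate the failure described above, here is a minimal sketch using only the standard re module. The sample strings are illustrative and not taken from the actual DataFrame; it only shows how a key with parentheses behaves once it is wrapped in \b...\b.

```python
import re

key = "Kaduna (city)"

# Without escaping, "(city)" is read as a capturing group, so the
# pattern looks for the text "Kaduna city" (no parentheses).
unescaped = rf"\b{key}\b"
print(re.search(unescaped, "Kaduna (city)"))  # None
print(re.search(unescaped, "Kaduna city"))    # a match object

# With escaping, the parentheses are literal, but the trailing \b
# still fails: ")" is not a word character, so a boundary after it
# would require a word character to follow immediately.
escaped = rf"\b{re.escape(key)}\b"
print(re.search(escaped, "Kaduna (city)"))    # None
```

So both the missing escaping and the trailing \b contribute to the problem for keys that end in ")".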

Can you help me?

Final result

  • Use re.escape. \b won't help then. Commented Nov 9, 2021 at 16:07
  • Try di_cities = {rf"\b{re.escape(k)}(?:(?<=\w)\b|(?<!\w))": v for k, v in di_cities.items()}. Note that it may not work if your dictionary has overlapping keys (those that are prefixes of others). This also assumes your keys always start with a word char. Commented Nov 9, 2021 at 16:13
  • Thanks Wiktor! It's almost perfect for me! It does the job for everything except Konni, i.e. overlapping keys (there is just this one at the moment), but I solved that with a workaround. Commented Nov 9, 2021 at 18:58
  • What result do you need? Maybe you should simply split(" (") and take the first element. Commented Nov 9, 2021 at 19:02
  • Great. Next time, please add an @username mention in the comment to notify that user of your feedback. Commented Nov 15, 2021 at 8:05
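
The overlapping-key caveat raised in the comments ("Konni" versus "Birnin Konni town") can possibly be sidestepped by doing all replacements in a single pass. The following is only a sketch under stated assumptions, not the workaround the asker used: it relies on the standard re module, the names pattern and clean are hypothetical, and it assumes every key starts with a word character.

```python
import re

di_cities = {
    "Birnin Konni town": "Birni N Konni",
    "Konni": "Birni N Konni",
    "Tunis": "Tunis",
    "Tunis (city)": "Tunis",
}

# Try longer keys first so "Birnin Konni town" wins over "Konni"
# and "Tunis (city)" wins over "Tunis".
keys = sorted(di_cities, key=len, reverse=True)
alternation = "|".join(re.escape(k) for k in keys)

# One combined pattern: leading \b, then any key, then a trailing
# boundary only when the matched key ends in a word character.
pattern = re.compile(rf"\b(?:{alternation})(?:(?<=\w)\b|(?<!\w))")

def clean(text: str) -> str:
    # Hypothetical helper: look up the replacement for whichever key matched.
    return pattern.sub(lambda m: di_cities[m.group(0)], text)

print(clean("Birnin Konni town"))  # Birni N Konni
print(clean("Konni"))              # Birni N Konni
print(clean("Tunis (city)"))       # Tunis
```

Because the text is scanned once, an inserted "Birni N Konni" is not scanned again for "Konni". A cell that already contains "Birni N Konni" would still have its "Konni" rewritten, so whether this fits depends on the data. In pandas, a compiled pattern with a callable replacement can be applied per column through Series.str.replace with regex=True.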

1 Answer

I suggest using

import re

di_cities = {rf"\b{re.escape(k)}(?:(?<=\w)\b|(?<!\w))": v for k, v in di_cities.items()}

With this dictionary comprehension, you create a new dictionary whose keys are regular expressions that match the former keys as whole words: each key must start at a word boundary (so it is assumed to begin with a word character, that is, a letter, digit, or underscore) and, if it ends with a word character, it must not be immediately followed by another word character. If a key does not end with a word character (say, it ends with punctuation or whitespace; calling .strip() on the keys might make this safer), no additional boundary check is applied.

The pattern rf"\b{re.escape(k)}(?:(?<=\w)\b|(?<!\w))" first escapes all special regex metacharacters in the key with re.escape and then prepends a word boundary (\b). The trailing (?:(?<=\w)\b|(?<!\w)) is a non-capturing group that matches

  • (?<=\w)\b - a word boundary if the preceding char is a word char ((?<=...) is a positive lookbehind)
  • | - or
  • (?<!\w) - no additional check (the empty string is matched) if there is no word char immediately to the left of the current location ((?<!...) is a negative lookbehind).
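
As a quick check of the explanation above, here is a minimal sketch that applies the same pattern with plain re.sub instead of a DataFrame. The helper names whole_key_pattern and clean and the sample inputs are hypothetical and only there for illustration.

```python
import re

def whole_key_pattern(key: str) -> str:
    # Literal key, leading word boundary, and a trailing boundary
    # that is enforced only when the key ends in a word character.
    return rf"\b{re.escape(key)}(?:(?<=\w)\b|(?<!\w))"

di_cities = {
    "Kaduna (city)": "Kaduna",
    "Mopti city": "Mopti",
    "N'guigmi": "Nguigmi",
}
patterns = {whole_key_pattern(k): v for k, v in di_cities.items()}

def clean(text: str) -> str:
    # Apply every pattern in turn to a single string.
    for pat, repl in patterns.items():
        text = re.sub(pat, repl, text)
    return text

print(clean("Kaduna (city)"))    # Kaduna
print(clean("Mopti city"))       # Mopti
print(clean("Mopti citystate"))  # Mopti citystate (unchanged)
print(clean("N'guigmi"))         # Nguigmi
```

The third call shows the trailing boundary at work: "Mopti city" ends in a word character, so it is not replaced when another word character follows it directly. The same patterns dictionary is what gets passed to df.replace(..., regex=True) in the question.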

1 Comment

Thank you Wiktor! That works perfectly. You've been very clear in your explanation too.
