Replace keywords in dataframe column using pandas dictionary

Question

I have a dataframe that contains 4 columns. One of them is called action_description; it contains free text that summarizes the actions taken to resolve an issue.

The words in this column are sometimes misspelled, and we have a dictionary of all the common misspellings (for example: REPLCD -> REPLACED, ...).

I want to replace all the misspelled words in my column using Python code.

Here's the code I use:

Code:

import pandas as pd

# Load the correction table and build a {misspelling: correction} dictionary
xls = pd.ExcelFile("test_doc_2.xls")
df = xls.parse(xls.sheet_names[0])
df.drop(df.columns[[0, 1]], inplace=True, axis=1)
df2 = pd.Series(df.TO_VALUE.values, index=df.FROM_VALUE).to_dict()

# Load the source data that contains the free-text column
xls1 = pd.ExcelFile("Test_Source_New_2.xls")
df1 = xls1.parse(xls1.sheet_names[0])

# Replace every dictionary key found anywhere in the free text
df1['WORK_PERFORMED_NEW'] = df1['WORK_PERFORMED'].replace(df2, regex=True)

This solution works, except in some cases.

In my dictionary: DEF -> DEFERRED, DEFERED -> DEFERRED

So with my solution: DEFERED -> DEFERREDERED, as it replaced DEF in DEFERED with DEFERRED, which then got concatenated with the remaining ERED (DEFERRED + ERED).

I thought about using word boundaries (r"\b"), but I got a syntax error.

How can I overcome this issue?

Thank you in advance.

Sumit S Chawla · Accepted Answer · 2018-05-17 08:43:03Z

I guess the issue you are facing is due to regex=True. As you mentioned, you have a dictionary:

DEF -> DEFERRED, DEFERED -> DEFERRED

So when you pass DEFERED, it first finds DEF and replaces it in place with DEFERRED, which is then followed by the remaining ERED. So you will get:

DEFERED -> DEFERREDERED

Simplified:

DEF + ERED -> DEFERRED + ERED -> DEFERREDERED
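The behaviour described above can be reproduced with a minimal sketch. The sample strings and the names `s`, `corrections`, and `result` are made up for illustration; only the two dictionary entries come from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the WORK_PERFORMED column.
s = pd.Series(["DEF ITEM", "DEFERED ITEM"])

# The two dictionary entries from the question.
corrections = {"DEF": "DEFERRED", "DEFERED": "DEFERRED"}

# With regex=True each key is a pattern that may match anywhere in a string,
# so "DEF" also matches the first three letters of "DEFERED".
result = s.replace(corrections, regex=True)
print(result.tolist())
# ['DEFERRED ITEM', 'DEFERREDERED ITEM']
```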

If you have any questions, feel free to comment.


18 Comments

akshat Over a year ago

It would help to suggest a solution as well. The issue with the code has already been explained by the OP.

Mouad Over a year ago

I think I found a solution: first add \b around the keys of my dictionary, and then use it.

Mouad Over a year ago

Like this:

df3 = {r'(\b){}(\b)'.format(k): r'\1{}\2'.format(v) for k, v in df2.items()}
df1['WORK_PERFORMED_NEW'] = df1['WORK_PERFORMED'].replace(df3, regex=True)

Sumit S Chawla Over a year ago

If each value is a single separate word, use regex=False. It will then look for that exact value and replace it only on an exact match.
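A small sketch of what regex=False does, assuming made-up sample values (the names `s`, `corrections`, and `exact` are illustrative only). It suggests that this option only helps when a cell holds nothing but the misspelled word, because the comparison is against the whole cell value:

```python
import pandas as pd

corrections = {"DEF": "DEFERRED", "DEFERED": "DEFERRED"}

# Hypothetical values: two single-word cells and one free-text cell.
s = pd.Series(["DEF", "DEFERED", "DEFERED ITEM"])

# With regex=False a key must equal the entire cell value to be replaced.
exact = s.replace(corrections, regex=False)
print(exact.tolist())
# ['DEFERRED', 'DEFERRED', 'DEFERED ITEM']
```

The free-text cell is left untouched, so for a column like the one in the question a word-boundary pattern is likely the better fit.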

Sumit S Chawla Over a year ago

Cool. \b would achieve the same thing: it matches only at word boundaries, so only whole words are replaced.
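The word-boundary idea from the comments above can be sketched as follows. This is a variation on it rather than the exact code posted: it wraps each key in plain \b anchors instead of capture groups, and adds re.escape (an assumption, not in the original comment) so that keys containing regex metacharacters are treated literally. The sample data and the names `corrections`, `bounded`, and `fixed` are hypothetical.

```python
import re

import pandas as pd

# Hypothetical free-text values standing in for the WORK_PERFORMED column.
s = pd.Series(["DEF ITEM", "DEFERED ITEM", "REPLCD DEFERED PART"])

# Example entries taken from the question's dictionary.
corrections = {"DEF": "DEFERRED", "DEFERED": "DEFERRED", "REPLCD": "REPLACED"}

# Wrap every key in \b ... \b so it only matches as a whole word.
# re.escape keeps characters such as "." or "(" in a key from acting as regex syntax.
bounded = {r"\b" + re.escape(k) + r"\b": v for k, v in corrections.items()}

fixed = s.replace(bounded, regex=True)
print(fixed.tolist())
# ['DEFERRED ITEM', 'DEFERRED ITEM', 'REPLACED DEFERRED PART']
```

Because "DEF" inside "DEFERED" is followed by a letter, there is no word boundary after it and the shorter key no longer matches there. Note that the replacement values are still interpreted as regex replacement strings, so a value containing a backslash would need extra care.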
