0

I have a dataframe that contain 4 columns, one of them is called action_description, it contain "free text" that resume the different actions done to resolve an issue.

The words in this column are sometimes miswritting, and we have a dictionnary, for all famous miswords (example: REPLCD -> REPLACED, .....)

I want to replace all miswords in my column using python code.

Here's the code I use:

Code:

import sys
import pyspark
import pandas_datareader
import re
import csv
import xlrd
import pandas as pd
import numpy as np
import datetime

from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pandas import DataFrame
from pandas_datareader import data, wb
from pandas import *

xls = ExcelFile("test_doc_2.xls")
df = xls.parse(xls.sheet_names[0])
df.drop(df.columns[[0, 1]],inplace=True,axis=1)
df2 = Series(df.TO_VALUE.values,index=df.FROM_VALUE).to_dict()

xls1 = ExcelFile("Test_Source_New_2.xls")
df1 = xls1.parse(xls1.sheet_names[0])

df1['WORK_PERFORMED_NEW'] = df1['WORK_PERFORMED'].replace(df2, regex=True)

This solution work, except in some cases,

in my dictionary: DEF -> DEFERRED, DEFERED -> DEFERRED

so with my solution: DEFERED -> DEFERREDERED, at it replaced DEF in DEFERED by DEFERRED and it got concatenated with ERED, DEFERRED+ERED.

I thought about using boundaries (r"\b"), but I got syntax error !!!

How can I overcome this issue .

Thank you in advance.

1 Answer 1

1

I guess the issue you are facing is due to regex = True. As you mentioned you have a dictionary:

DEF -> DEFERRED, DEFERED -> DEFERRED

So,when you pass DEFERED , it first finds DEF and replaces it with DEFERRED in place followed by ERED. So you will get:

DEFERED -> DEFERREDERED 

Simplified:

DEF +ERED -> DEFERRED + ERED -> DEFFEREDERED

In case of any query, you can comment.

Sign up to request clarification or add additional context in comments.

18 Comments

would like to suggest a solution as well. Issue with code has already been explained by OP
I think I found a solution, is to add first the \\b to my dictionary, and then use it.
like this: df3 = {r'(\b){}(\b)'.format(k):r'\1{}\2'.format(v) for k,v in df2.items()} df1['WORK_PERFORMED_NEW'] = df1['WORK_PERFORMED'].replace(df3, regex=True)
If the words are separate , use regex = False, so it will check for that exact word and then only it would replace.
Cool. \b would also perform the same thing. It will check for word only.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.