1

I have a large CSV file that has many short words and I need to change them into a full word. I found few posts here such as 1, 2 but most of these are either change the entire row or needs to do manually one by one.

My CSV file looks like:

infoID               messages
 111     we need to fix the car mag but we can't
 113         we need a shf to perform eng change
 115                      gr is needed to change
 116                            bat needs change
 117                    car towed for ext change 
 118                              car ml is high
  .
  .

My another file that has all the full word of short-form words and I want to use that to apply in my document and it is in the form of:

shf:shaft
gr:gear
ml:mileage

It would be great if you can provide your help with code that I can run in my side as well. Thanks

2 Answers 2

4

Read your text file in as a Series that looks like

s

0    mag:magnitude
1        shf:shaft
2          gr:gear
3      bat:battery
4      ext:exhaust
5       ml:mileage
Name: 0, dtype: object

Split on colon and convert the series into a dictionary mapping key to its replacement:

dict(s.str.split(':').tolist())

# {'bat': 'battery',
#  'ext': 'exhaust',
#  'gr': 'gear',
#  'mag': 'magnitude',
#  'ml': 'mileage',
#  'shf': 'shaft'}

Use this to perform a replace operation with regex=True:

df['messages'].replace(dict(s.str.split(':').tolist()), regex=True)

0    we need to fix the car magnitude but we can't
1            we need a shaft to perform eng change
2                         gear is needed to change
3                             battery needs change
4                     car towed for exhaust change
5                              car mileage is high
Name: messages, dtype: object

Note that if these are strictly whole word replacements, you can extend this solution by converting the key strings into regular expressions that use word boundaries. For good measure, escape the string as well:

import re

mapping = {fr'\b{re.escape(k)}\b': v for k, v in s.str.split(':').tolist()}
df['messages'].replace(mapping, regex=True)

0    we need to fix the car magnitude but we can't
1            we need a shaft to perform eng change
2                         gear is needed to change
3                             battery needs change
4                     car towed for exhaust change
5                              car mileage is high
Name: messages, dtype: object
Sign up to request clarification or add additional context in comments.

10 Comments

Just curious, why is the regex=True required?
Isn't this a bit error prone if there is any other word that contains the keys of dictionary? E.g. something like great will be changed to geareat as well.
@razdi Without it, pandas looks for exact match, so the whole line must match the searched text.
@Chris That's true, but without any context this is the simplest solution. If whole word replacements are required, then the solution can be extended with word boundaries.
@cs95 I see. I too love the simplicity of your post. Thanks for the response :)
|
3

Another way using pd.Series.apply:

d = dict(i.split(':') for i in d.split('\n'))
#{'bat': 'battery',
# 'ext': 'exhaust',
# 'gr': 'gear',
# 'mag': 'magnitude',
# 'ml': 'mileage',
# 'shf': 'shaft'}

df['messages'].apply(lambda x : ' '.join(d.get(i, i) for i in x.split()), 1)

Output:

0    we need to fix the car magnitude but we can't
1            we need a shaft to perform eng change
2                         gear is needed to change
3                             battery needs change
4                     car towed for exhaust change
5                              car mileage is high
Name: messages, dtype: object

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.