
I'm trying to compare the values in 2 dataframes. This is my code:

for i in df1['Searches']:
    for j in df['Tags']:
        if i == j:
            print(i, j)

The code works. However, I want to account for cases where the strings don't match exactly, due to spacing, misspelling, or punctuation, even though they have enough in common that they clearly should match.

For instance:

   Searches       |   Tags 
----------------------------------
   lightblue      |   light blue
   light-blue     |   light blue
   light blu      |   light blue
   lite blue      |   light blue
   liteblue       |   light blue
   liteblu        |   light blue
   light b l u e  |   light blue
   light.blue     |   light blue
   l i ght blue   |   light blue
  

I listed variations of possible strings that could show up under searches, and the string that it should match to under tags. Is there a way to account for those variations and still have them match?

Thank you for taking the time to read my question and help in any way you can.

2 Answers


You are getting into fuzzy string matching. One way to do that is to use a similarity metric such as jaro_similarity from the Natural Language Toolkit (NLTK):

from nltk.metrics.distance import jaro_similarity
df['jaro_similarity'] = df.apply(lambda row: jaro_similarity(row['Searches'], row['Tags']), axis=1)

Result:

     Searches       Tags  jaro_similarity
    lightblue light blue         0.966667
   light-blue light blue         0.933333
    light blu light blue         0.966667
    lite blue light blue         0.896296
     liteblue light blue         0.858333
      liteblu light blue         0.819048
light b l u e light blue         0.923077
   light.blue light blue         0.933333
 l i ght blue light blue         0.877778

You have to pick a cut-off point by experimenting on your data. Documentation on the nltk.metrics.distance module: https://www.nltk.org/api/nltk.metrics.distance.html#module-nltk.metrics.distance
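As a sketch of that cut-off step (the 0.85 threshold and the hard-coded triples below are purely illustrative, using the scores from the table above), filtering with a chosen cut-off could look like this, written with plain lists so it runs without pandas or NLTK:

```python
# Hypothetical (search, tag, score) triples, taken from the result table above.
scored = [
    ('lightblue', 'light blue', 0.966667),
    ('liteblu',   'light blue', 0.819048),
    ('lite blue', 'light blue', 0.896296),
]

cutoff = 0.85  # illustrative value only; tune it on your own data
matches = [(s, t) for s, t, score in scored if score >= cutoff]
print(matches)  # 'liteblu' (0.819) falls below the cutoff and is dropped
```

With a DataFrame, the equivalent would be a boolean filter such as `df[df['jaro_similarity'] >= cutoff]`.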


1 Comment

Thank you so much for taking the time to answer my question. Your solution was really helpful!

You can use a string similarity metric to determine a match. For example, here I use edit_distance from the nltk library:

import pandas as pd
from nltk.metrics.distance import edit_distance

searches = [
    'lightblue',
    'light-blue',
    'light blu',
    'lite blue',
    'liteblue',
    'liteblu',
    'light b l u e',
    'light.blue',
    'l i ght blue',
    'totally different string',
]
df = pd.DataFrame()
df['Searches'] = searches
df['Tags'] = 'light blue'

matches = []
distance_threshold = 5
for i in df['Searches']:
    for j in df['Tags']:
        if edit_distance(i, j) < distance_threshold:
            matches.append(i)
print(list(set(matches)))

Output:

['light.blue',
 'light b l u e',
 'lightblue',
 'light blu',
 'light-blue',
 'lite blue',
 'l i ght blue',
 'liteblu',
 'liteblue']

But you will have to adjust distance_threshold to your liking or choose another metric that works better. See a list of metrics here:

https://www.nltk.org/api/nltk.metrics.distance.html

There are many other libraries that you can try as well. Just give it a search.
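For instance, the standard library's difflib.SequenceMatcher gives a similarity ratio in [0, 1] with no extra dependencies. A minimal sketch (the 0.7 cut-off is just an illustrative value, not a recommendation):

```python
from difflib import SequenceMatcher

def similar(a, b):
    # Ratio in [0, 1]; higher means the strings are more alike.
    return SequenceMatcher(None, a, b).ratio()

searches = ['lightblue', 'light-blue', 'lite blue', 'totally different string']
tag = 'light blue'

ratio_threshold = 0.7  # illustrative value; tune it on your data
matches = [s for s in searches if similar(s, tag) >= ratio_threshold]
print(matches)  # ['lightblue', 'light-blue', 'lite blue']
```

Note that this is a ratio (higher is better), so the comparison is `>= threshold`, the opposite direction of an edit distance.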

2 Comments

Thank you so much for taking the time to answer my question. The comment above was a better answer, but your solution was still helpful!
Glad to help! :D I upvoted the other answer actually. It was quicker by a minute as well. Cheers! :)
