
I'm trying to compare the values in 2 dataframes. This is my code:

for i in df1['Searches']:
    for j in df['Tags']:
        if i == j:
            print(i, j)

The code works. However, I want to account for cases where the strings don't match exactly, due to spacing, misspelling, or punctuation, even though they have enough in common that they clearly should match.

For instance:

   Searches       |   Tags 
----------------------------------
   lightblue      |   light blue
   light-blue     |   light blue
   light blu      |   light blue
   lite blue      |   light blue
   liteblue       |   light blue
   liteblu        |   light blue
   light b l u e  |   light blue
   light.blue     |   light blue
   l i ght blue   |   light blue
  

I listed variations of possible strings that could show up under searches, and the string that it should match to under tags. Is there a way to account for those variations and still have them match?

Thank you for taking the time to read my question and help in any way you can.

2 Answers


You are getting into fuzzy string matching. One way to do that is to use a similarity metric such as jaro_similarity from the Natural Language Toolkit (NLTK):

from nltk.metrics.distance import jaro_similarity
df['jaro_similarity'] = df.apply(lambda row: jaro_similarity(row['Searches'], row['Tags']), axis=1)

Result:

     Searches       Tags  jaro_similarity
    lightblue light blue         0.966667
   light-blue light blue         0.933333
    light blu light blue         0.966667
    lite blue light blue         0.896296
     liteblue light blue         0.858333
      liteblu light blue         0.819048
light b l u e light blue         0.923077
   light.blue light blue         0.933333
 l i ght blue light blue         0.877778

You have to pick a cut-off point by experimenting on your data. Documentation on the nltk.metrics.distance module: https://www.nltk.org/api/nltk.metrics.distance.html#module-nltk.metrics.distance
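As a sketch of that cut-off step (the 0.85 threshold and the hard-coded triples below are purely illustrative, using the scores from the table above), filtering with a chosen cut-off could look like this, written with plain lists so it runs without pandas or NLTK:

```python
# Hypothetical (search, tag, score) triples, taken from the result table above.
scored = [
    ('lightblue', 'light blue', 0.966667),
    ('liteblu',   'light blue', 0.819048),
    ('lite blue', 'light blue', 0.896296),
]

cutoff = 0.85  # illustrative value only; tune it on your own data
matches = [(s, t) for s, t, score in scored if score >= cutoff]
print(matches)  # 'liteblu' (0.819) falls below the cutoff and is dropped
```

With a DataFrame, the equivalent would be a boolean filter such as `df[df['jaro_similarity'] >= cutoff]`.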


1 Comment

Thank you so much for taking the time to answer my question. Your solution was really helpful!

You can use a string similarity metric to determine a match. For example, here I use edit_distance from the nltk library:

import pandas as pd
from nltk.metrics.distance import edit_distance

searches = [
    'lightblue',
    'light-blue',
    'light blu',
    'lite blue',
    'liteblue',
    'liteblu',
    'light b l u e',
    'light.blue',
    'l i ght blue',
    'totally different string',
]
df = pd.DataFrame()
df['Searches'] = searches
df['Tags'] = 'light blue'

matches = []
distance_threshold = 5
for i in df['Searches']:
    for j in df['Tags']:
        if edit_distance(i, j) < distance_threshold:
            matches.append(i)
print(list(set(matches)))

Output:

['light.blue',
 'light b l u e',
 'lightblue',
 'light blu',
 'light-blue',
 'lite blue',
 'l i ght blue',
 'liteblu',
 'liteblue']

But you will have to adjust distance_threshold to your liking or choose another metric that works better. See a list of metrics here:

https://www.nltk.org/api/nltk.metrics.distance.html

There are many other libraries that you can try as well. Just give it a search.
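For instance, the standard library's difflib.SequenceMatcher gives a similarity ratio in [0, 1] with no extra dependencies. A minimal sketch (the 0.7 cut-off is just an illustrative value, not a recommendation):

```python
from difflib import SequenceMatcher

def similar(a, b):
    # Ratio in [0, 1]; higher means the strings are more alike.
    return SequenceMatcher(None, a, b).ratio()

searches = ['lightblue', 'light-blue', 'lite blue', 'totally different string']
tag = 'light blue'

ratio_threshold = 0.7  # illustrative value; tune it on your data
matches = [s for s in searches if similar(s, tag) >= ratio_threshold]
print(matches)  # ['lightblue', 'light-blue', 'lite blue']
```

Note that this is a ratio (higher is better), so the comparison is `>= threshold`, the opposite direction of an edit distance.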

2 Comments

Thank you so much for taking the time to answer my question. The comment above was a better answer, but your solution was still helpful!
Glad to help! :D I upvoted the other answer actually. It was quicker by a minute as well. Cheers! :)
