2

I want to find the value in a dataframe column that is most similar to a specified string, e.g. a = 'book'. Let's say the dataframe df looks like this:

col1
wijk 00 book
Wijk a 
test

Now I want to return wijk 00 book since this is the most similar to a. I am trying to do this with the fuzzywuzzy package.

I have a second dataframe A that holds the strings I want to find a similar value for. I then use:

A['similar_value'] = A.col1.apply(lambda x: [process.extract(x, df.col1, limit=1)][0][0][0])  

But when comparing a lot of strings, this takes too much time. Does anyone know how to do this faster?

  • How do you define similarity here? Commented Apr 26, 2021 at 16:05
  • @ZalakBhalani the strings in the dataframe column should contain the string a Commented Apr 26, 2021 at 16:09
  • what's your current code with fuzzywuzzy? we can try to optimize that Commented Apr 26, 2021 at 16:10
  • I added my code Commented Apr 26, 2021 at 16:16
  • What is the process variable defined as? Commented Apr 26, 2021 at 16:17

2 Answers

1

You can use the 'str.contains' method to select the strings that contain the exact substring, then take the first one:

df[df["column_name"].str.contains("book")].values[0][0]
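The one-liner above raises an IndexError when nothing matches, and it treats the search string as a regular expression. A minimal sketch of a more defensive variant follows; the helper name first_containing is hypothetical, and the case-insensitive matching is an assumption about what is wanted here, not something stated in the question.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["wijk 00 book", "Wijk a", "test"]})

def first_containing(column, needle):
    """Return the first value in `column` that contains `needle`, or None.

    Hypothetical helper: regex=False treats `needle` as a literal substring,
    case=False ignores case, and na=False treats missing values as non-matches.
    """
    mask = column.str.contains(needle, case=False, regex=False, na=False)
    hits = column[mask]
    return hits.iloc[0] if not hits.empty else None

result = first_containing(df["col1"], "book")
```

Here result should be the first row whose text contains 'book', and a search string that occurs nowhere gives None instead of an exception.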

1

I would use rapidfuzz:

import pandas as pd
from rapidfuzz import process, fuzz

df = pd.DataFrame(['wijk 00 book', 'Wijk a', 'test'], columns=['col1'])

search_str = 'book'
most_similar = process.extractOne(search_str, df['col1'], scorer=fuzz.WRatio)

Output:

most_similar
('wijk 00 book', 90.0, 0)

This gives you the most similar string in the column, a score for how similar it is to your search string, and the index of the matching row.

1 Comment

nice +1, much faster than my version using rapidfuzz with apply()
