Find matching similar keywords in Python Dataframe

Question

joined_Gravity1.head()

Comments
____________________________________________________
0   Why the old Pike/Lyrik?
1   This is good
2   So clean
3   Looks like a Decoy

Input: type(joined_Gravity1)
Output: pandas.core.frame.DataFrame

The following code selects the rows whose comment contains the substring "ender":

joined_Gravity1[joined_Gravity1["Comments"].str.contains("ender", na=False)]

Output:

Comments
___________________________
194 We need a new Sender 😂
7   What about the sender
179 what about the sender?😏

How can I revise the code so that it also matches words similar to 'Sender', such as 'snder' or 'bnder'?

Puneet Singh · Accepted Answer · 2020-08-01 05:47:05Z

1

I don't see a reason why regex=True inside the contains function won't work here.

joined_Gravity1[joined_Gravity1["Comments"].str.contains(pat="ender|snder|bnder", na=False, regex=True)]

I have used "ender|snder|bnder" only. You can make a list of all such words, say list_words, and pass pat='|'.join(list_words) to the contains function above.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
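The pattern-building step can be sketched with the standard library alone. This is a minimal illustration, not the answerer's code: the contents of list_words and the sample comments are assumptions, and re.escape is added so that any special regex characters in a keyword are matched literally.

```python
import re

# Hypothetical keyword list; extend it with whatever variants you expect.
list_words = ["ender", "snder", "bnder"]

# Escape each word so characters such as "." or "+" are matched literally.
pat = "|".join(re.escape(w) for w in list_words)

# Sample comments: two from the question, one made-up misspelling.
comments = ["We need a new Sender 😂", "This is good", "what about the snder?"]

# Plain-Python equivalent of filtering with str.contains(pat).
matched = [c for c in comments if re.search(pat, c)]
print(matched)
```

The same pat string can be passed to str.contains(pat, na=False) on the Comments column; adding case=False there makes the match case-insensitive.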


Akshay Sehgal · Accepted Answer · 2020-08-01 06:15:36Z

1

There is a massive number of possible letter combinations for such words. What you are trying to do is a fuzzy match between two strings. I recommend using the following:

#!pip install fuzzywuzzy
from fuzzywuzzy import fuzz, process

word = 'sender'
others = ['bnder', 'snder', 'sender', 'hello']

process.extractBests(word, others)

[('sender', 100), ('snder', 91), ('bnder', 73), ('hello', 18)]

Based on this, you can decide which threshold to choose and then mark the ones above the threshold as a match (using the code you used above).
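As a rough sketch of that thresholding step, the scores printed above can be filtered and turned into a pattern for the original str.contains call. The cut-off of 70 and the variable names are assumptions chosen for illustration.

```python
# Scores copied from the extractBests output above.
scored = [('sender', 100), ('snder', 91), ('bnder', 73), ('hello', 18)]

threshold = 70  # assumed cut-off; tune it for your data

# Keep only the candidates whose score is above the threshold.
keep = [word for word, score in scored if score > threshold]

# Join the surviving words into a regex alternation for str.contains.
pat = "|".join(keep)
print(pat)
```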

Here is a way to do this for your exact problem with a function:

import pandas as pd

df = pd.DataFrame(['hi there i am a sender',
                   'I dont wanna be a bnder',
                   'can i be the snder?',
                   'i think i am a nerd'], columns=['text'])

# s = sentence, w = match word, t = match threshold
def get_match(s, w, t):
    # Score the words of the sentence against w (scores range from 0 to 100)
    ss = process.extractBests(w, s.split())
    return any(score > t for _, score in ss)

# What it's doing: match each word in each row of df.text against
# the word 'sender' and see if any of the words has a score greater
# than the threshold of 70.
df['match'] = df['text'].apply(get_match, w='sender', t=70)
print(df)

                      text  match
0   hi there i am a sender   True
1  I dont wanna be a bnder   True
2      can i be the snder?   True
3      i think i am a nerd  False

Tweak the t value from 70 to 80 if you want a more exact match, or lower it for a more relaxed match.

Finally, you can filter on it:

df[df['match']][['text']]

                      text
0   hi there i am a sender
1  I dont wanna be a bnder
2      can i be the snder?

edited Aug 1, 2020 at 6:15

answered Aug 1, 2020 at 5:58


3 Comments

Luc Over a year ago

df['match'] = df['text'].apply(get_match, w='sender', t=70)
Is it possible to pass several words instead of just one in the position w? I tried the following:
1. df['match'] = df['text'].apply(get_match, w=('sender','slx'), t=70)
2. df['match'] = df['text'].apply(get_match, w=['sender','slx'], t=70)
3. w = ['sender','slx','clx'] df['match'] = df['text'].apply(get_match, w, t=70)
None of the three works. 'Sender' here is the product category, which can be further broken down into product types.

Akshay Sehgal Over a year ago

I am not sure what you need here. Do you want to separate the sentences by Sender or Slx? Or do you want sentences that contain BOTH Sender and Slx?

Akshay Sehgal Over a year ago

Also, it won't match, of course, because the fuzzywuzzy documentation clearly says that it uses the target word to match against a list of choices. It doesn't match a list of words against a list of choices. You can easily modify the function to operate over a list of words instead of one.
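One way the modification mentioned in the comment above could look is sketched below. To keep it dependency-free it swaps in difflib.SequenceMatcher from the standard library instead of fuzzywuzzy, so the scores will not be identical to those of process.extractBests. The function name get_match_any and its parameters are hypothetical.

```python
from difflib import SequenceMatcher

def get_match_any(sentence, words, threshold):
    """Return True if any token of sentence resembles any of the given words.

    Similarity is SequenceMatcher.ratio() scaled to 0-100, used here as a
    stand-in for the fuzzywuzzy score (an assumption, not the same metric).
    """
    tokens = sentence.lower().split()
    for word in words:
        for token in tokens:
            score = 100 * SequenceMatcher(None, word.lower(), token).ratio()
            if score > threshold:
                return True
    return False

# Hypothetical keyword list taken from the comment above.
keywords = ['sender', 'slx']
print(get_match_any('I dont wanna be a bnder', keywords, 70))
print(get_match_any('i think i am a nerd', keywords, 70))
```

With a DataFrame this could be applied as df['text'].apply(get_match_any, words=keywords, threshold=70), in the same way as get_match in the answer.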

Andres Ordorica · Accepted Answer · 2020-08-01 00:35:50Z

-1

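from difflib import get_close_matches

def closeMatches(patterns, word):
    # Print the entries of patterns that are close to word
    print(get_close_matches(word, patterns))

# Rows whose comment contains "ender"
list_patterns = joined_Gravity1[joined_Gravity1["Comments"].str.contains("ender", na=False)]

word = 'Sender'
# get_close_matches expects a list of strings, so pass the column values
patterns = list_patterns["Comments"].tolist()
closeMatches(patterns, word)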


1 Comment

Ehsan Over a year ago

This does not achieve what OP is asking.
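For comparison, a word-level variant of the difflib idea might look like the sketch below: each comment is split into words, and get_close_matches is asked whether any of them is close to the target. The helper name has_close_word, the tokenizing regex, and the 0.7 cutoff are assumptions for illustration only.

```python
import re
from difflib import get_close_matches

def has_close_word(comment, word, cutoff=0.7):
    """Return True if any word in comment is a close match for word."""
    # Lower-case and keep only runs of word characters (drops "?" and emoji).
    tokens = re.findall(r"\w+", comment.lower())
    return bool(get_close_matches(word.lower(), tokens, n=1, cutoff=cutoff))

# Sample comments: two from the question, one made-up misspelling.
samples = ["We need a new Sender 😂", "what about the snder?", "So clean"]
flags = [has_close_word(c, "Sender") for c in samples]
print(flags)
```

On a DataFrame, the resulting booleans could be used as a row mask, for example via joined_Gravity1["Comments"].fillna("").apply(has_close_word, word="Sender").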
