Looping through list and row for keyword match in pandas dataframe

Question

I have a dataframe with a single column labeled 'utterances'. Each row of df.utterances is a string containing an arbitrary number of words.

  
                             utterances
0                                        okay go ahead.
1                                     Um, let me think.
2     nan that's not very encouraging. If they had a...
3     they wouldn't make you want to do it. nan nan ...
4     Yeah. The problem is though, it just, if we pu...

I also have a list of specific words. It is called specific_words. It looks like this:

specific_words = ['happy', 'good', 'encouraging', 'joyful']

I want to check whether any of the words from specific_words are found in any of the utterances. Essentially, I want to loop through every row in df.utterances and, for each row, loop through specific_words to look for matches. If there is a match, I want a boolean column next to df.utterances that shows this.

def query_text_by_keyword(df, word_list):
    for word in word_list:
        for utt in df.utterance:
            if word in utt:
                match = True
            else:
                match = False
            return match
    
df['query_match'] = df.apply(query_text_by_keyword, 
                                               axis=1, 
                                               args=(specific_words,))

It doesn't break, but it just returns False for every row, when it shouldn't. For example, the first few rows should look like this:

 utterances                                                    query_match
    0                                        okay go ahead.       False
    1                                     Um, let me think.       False
    2     nan that's not very encouraging. If they had a...       True
    3     they wouldn't make you want to do it. nan nan ...       False
    4     Yeah. The problem is though, it just, if we pu...       False

Edit

@furas made a great suggestion to solve the initial question. However, I would also like to add another column that contains the specific word(s) from the query that indicates a match. Example:

 utterances                                                 query_match   word  
    0                                    okay go ahead    False      NaN
    1                                 Um, let me think    False      NaN
    2 nan that's not very encouraging. If they had a..    True   'encouraging'
    3 they wouldn't make you want to do it. nan nan ..    False      NaN
    4 Yeah. The problem is though, it just, if we pu..    False      NaN

Use a regex: df['utterances'].str.contains("happy|good|encouraging|joyful"). You can build this regex with "|".join(specific_words).
– furas, Commented Feb 11, 2020 at 2:40

furas · Accepted Answer · 2020-02-11 16:54:10Z


You can use a regex with str.contains(regex):

df['utterances'].str.contains("happy|good|encouraging|joyful")

You can create this regex with

query = '|'.join(specific_words)

You can also use str.lower() because the strings may contain uppercase characters.

import pandas as pd

df = pd.DataFrame({
    'utterances':[
        'okay go ahead',
        'Um, let me think.',
        'nan that\'s not very encouraging. If they had a...',
        'they wouldn\'t make you want to do it. nan nan ...',
        'Yeah. The problem is though, it just, if we pu...',
    ]
})

specific_words = ['happy', 'good', 'encouraging', 'joyful']

query = '|'.join(specific_words)

df['query_match'] = df['utterances'].str.lower().str.contains(query)

print(df)

Result

                                          utterances  query_match
0                                      okay go ahead        False
1                                  Um, let me think.        False
2  nan that's not very encouraging. If they had a...         True
3  they wouldn't make you want to do it. nan nan ...        False
4  Yeah. The problem is though, it just, if we pu...        False
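For comparison, here is a minimal sketch of the row-by-row approach the question was aiming for. As far as I can tell, the original function always returns False because df.apply(..., axis=1) hands it one row at a time, so the inner loop walks over the characters of a single string, and the return inside the loop exits after the first comparison. The helper name contains_any is hypothetical, and the sample data is shortened to three rows.

```python
import pandas as pd

df = pd.DataFrame({
    'utterances': [
        'okay go ahead.',
        'Um, let me think.',
        "nan that's not very encouraging. If they had a...",
    ]
})
specific_words = ['happy', 'good', 'encouraging', 'joyful']


def contains_any(utterance, word_list):
    """Return True if any word in word_list occurs in the utterance.

    This is a case-insensitive substring test, not a whole-word test.
    """
    text = str(utterance).lower()
    return any(word in text for word in word_list)


# Apply to the column (a Series), so the function receives one string at a time.
df['query_match'] = df['utterances'].apply(contains_any, args=(specific_words,))
print(df)
```

The str.contains version is usually shorter and faster, so this is mainly useful for seeing where the loop went wrong.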

EDIT: as @HenryYik mentioned in a comment, you can use case=False instead of str.lower():

df['query_match'] = df['utterances'].str.contains(query, case=False)

More in doc: pandas.Series.str.contains
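One caveat with a plain '|'.join(...): the pattern matches substrings, so 'good' would also match inside 'Goodbye', and any word containing regex metacharacters would be interpreted as a pattern. A hedged sketch of one way to tighten this, assuming whole-word matching is what you want, is to escape each word with re.escape and wrap the alternation in \b word boundaries. The sample rows below are made up for illustration, and na=False is added on the assumption that missing utterances should count as no match.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'utterances': [
        'okay go ahead',
        'Goodbye for now.',
        'That is GOOD news.',
        None,
    ]
})
specific_words = ['happy', 'good', 'encouraging', 'joyful']

# Escape each word so characters such as '.' or '+' are matched literally,
# and wrap the alternation in \b so only whole words match.
# (?:...) is a non-capturing group, which keeps str.contains from warning
# about unused capture groups.
query = r'\b(?:' + '|'.join(re.escape(w) for w in specific_words) + r')\b'

# na=False makes missing utterances count as "no match" instead of NaN.
df['query_match'] = df['utterances'].str.contains(query, case=False, na=False)
print(df)
```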

EDIT: to get the matching word, you can use str.extract() with the regex wrapped in a capture group (...):

df['word'] = df['utterances'].str.extract( "(happy|good|encouraging|joyful)" )

Working example:

import pandas as pd

df = pd.DataFrame({
    'utterances':[
        'okay go ahead',
        'Um, let me think.',
        'nan that\'s not very encouraging. If they had a...',
        'they wouldn\'t make you want to do it. nan nan ...',
        'Yeah. The problem is though, it just, if we pu...',
        'Yeah. happy good',
    ]
})

specific_words = ['happy', 'good', 'encouraging', 'joyful']

query = '|'.join(specific_words)

df['query_match'] = df['utterances'].str.contains(query, case=False)
df['word'] = df['utterances'].str.extract( '({})'.format(query) )

print(df)

In this example I added 'Yeah. happy good' to test which word is returned, happy or good. It returns the first matching word.

Result:

                                          utterances  query_match         word
0                                      okay go ahead        False          NaN
1                                  Um, let me think.        False          NaN
2  nan that's not very encouraging. If they had a...         True  encouraging
3  they wouldn't make you want to do it. nan nan ...        False          NaN
4  Yeah. The problem is though, it just, if we pu...        False          NaN
5                                   Yeah. happy good         True        happy
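Note that str.contains(query, case=False) ignores case, but str.extract has no case parameter and is case-sensitive by default, so a row like 'Happy to help' would get query_match True with word NaN. A small sketch of one way to keep the two columns consistent, assuming you want case-insensitive extraction, is to pass flags=re.IGNORECASE. The two sample rows are made up, and expand=False is used here so that extract returns a Series rather than a one-column DataFrame.

```python
import re
import pandas as pd

df = pd.DataFrame({'utterances': ['Happy to help', 'okay go ahead']})
specific_words = ['happy', 'good', 'encouraging', 'joyful']
query = '|'.join(specific_words)

df['query_match'] = df['utterances'].str.contains(query, case=False)

# Without flags, str.extract is case-sensitive and would miss 'Happy'.
df['word'] = df['utterances'].str.extract(
    '({})'.format(query), flags=re.IGNORECASE, expand=False
)
print(df)
```

The extracted value keeps the casing found in the text, so add .str.lower() if you need it to match the entries in specific_words exactly.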

BTW: now you can even do

df['query_match'] = ~df['word'].isna()

or

df['query_match'] = df['word'].notna()
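If you need every matching word rather than only the first one, a possible variation is str.findall, which returns a list of all matches per row. This is a sketch on two made-up rows. The column name words is my own choice, and flags=re.IGNORECASE is an assumption about the desired behaviour.

```python
import re
import pandas as pd

df = pd.DataFrame({'utterances': ['okay go ahead', 'Yeah. happy good']})
specific_words = ['happy', 'good', 'encouraging', 'joyful']
query = '|'.join(specific_words)

# str.findall returns a list of every match per row (an empty list if none).
df['words'] = df['utterances'].str.findall(query, flags=re.IGNORECASE)

# A row matches when its list of words is non-empty.
df['query_match'] = df['words'].str.len() > 0
print(df)
```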

edited Feb 11, 2020 at 16:54

answered Feb 11, 2020 at 2:52


3 Comments

Henry Yik Over a year ago

You could pass case=False to str.contains so you don't have to lowercase.

connor449 Over a year ago

@furas Thanks, this is great. How would you also add another column that held the specific word(s) from the query that were a match? See example in edit above.

furas Over a year ago

BTW: now you can even do df['query_match'] = df['word'].notna() or df['query_match'] = ~df['word'].isna()
