Python pandas counting matches of regex with compound words in a string

Question

I have a dictionary that maps topics to regular expressions, and I want to count the matches for each topic in a text column. Some of the expressions contain compound (multi-word) terms.

import pandas as pd


terms = {'animals':"(fox|russian brown deer|bald eagle|arctic fox)",
'people':'(John Adams|Rob|Steve|Superman|Super man)',
'games':'(basketball|basket ball|bball)'
}

df=pd.DataFrame({
'Score': [4,6,2,7,8],
'Foo': ['Superman was looking for a russian brown deer.', 'John adams started to play basket ball with rob yesterday before steve called him','Basketball or bball is a sport played by Steve afterschool','The bald eagle flew pass the arctic fox three times','The fox was sptted playing basket ball?']
})

To count the matches I can use code similar to that in the question "Python pandas count number of Regex matches in a string". But that approach splits the strings on whitespace and then counts the terms, so compound terms are never matched. What is an alternative way to do this so that compound terms joined by a space are also counted?

df1 = df.Foo.str.split(expand=True).stack().reset_index(level=1, drop=True).reset_index(name='Foo')



for k, v in terms.items():
    # raw strings keep \s, \. and \? from being read as string escapes
    df1[k] = df1.Foo.str.contains(r'(?i)(^|\s)' + v + r'($|\s|\.|,|\?)')


df2= df1.groupby('index').sum().astype(int)


df = pd.concat([df,df2], axis=1)
print(df)

The end result should look like:

                                                 Foo  Score  animals  people  games
0     Superman was looking for a russian brown deer.      4        1       1      0
1  John adams started to play basket ball with ro...      6        0       3      1
2  Basketball or bball is a sport played by Steve...      2        0       1      2
3  The bald eagle flew pass the arctic fox three ...      7        3       0      0
4            The fox was sptted playing basket ball?      8        1       0      1

Note that for row 3, the word fox inside arctic fox and the compound term arctic fox should each be counted once (2 matches together) in the animals column.
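
To illustrate why this needs more than one scan with the combined pattern, here is a minimal sketch using only the standard `re` module. The variable names are made up for the example; the sentence and the terms come from the data above. Matches found in a single scan never overlap, so the fox inside arctic fox is not reported on its own unless each term is searched separately.

```python
import re

text = 'The bald eagle flew pass the arctic fox three times'
alternatives = ['fox', 'russian brown deer', 'bald eagle', 'arctic fox']

# One scan with the combined alternation: matches cannot overlap.
combined = '(?i)(' + '|'.join(alternatives) + ')'
single_pass = re.findall(combined, text)

# One scan per term: overlapping terms are each counted.
per_term = sum(len(re.findall('(?i)' + alt, text)) for alt in alternatives)

print(single_pass)  # ['bald eagle', 'arctic fox']
print(per_term)     # 3
```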

Sergey Bushmanov · Accepted Answer · 2016-04-06 09:51:39Z

Please see if this is what you were looking for:

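import re

for k in terms.keys():
    df[k] = 0
    # strip the parentheses, then test each alternative separately
    for words in re.sub("[()]", "", terms[k]).split('|'):
        # True where the term occurs anywhere in the string, ignoring case
        mask = df.Foo.str.contains(words, case=False)
        df[k] += mask
df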


                                              Foo   Score   people  animals games
0   Superman was looking for a russian brown deer.      4        1        1     0
1   John adams started to play basket ball with ro...   6        3        0     1
2   Basketball or bball is a sport played by Steve...   2        1        0     2
3   The bald eagle flew pass the arctic fox three ...   7        0        3     0
4   The fox was sptted playing basket ball?             8        0        1     1
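
Note that `str.contains` only reports whether a term is present, and it matches substrings, so a term such as Rob would also match inside a longer word. If occurrences should be counted and matched as whole words, a variation along the following lines may help. This is only a sketch, not part of the original answer: `count_terms` is a hypothetical helper, and the `\b` word boundaries and `re.escape` call are assumptions about the desired matching behaviour.

```python
import re
import pandas as pd

terms = {
    'animals': '(fox|russian brown deer|bald eagle|arctic fox)',
    'people': '(John Adams|Rob|Steve|Superman|Super man)',
    'games': '(basketball|basket ball|bball)',
}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': [
        'Superman was looking for a russian brown deer.',
        'John adams started to play basket ball with rob yesterday before steve called him',
        'Basketball or bball is a sport played by Steve afterschool',
        'The bald eagle flew pass the arctic fox three times',
        'The fox was sptted playing basket ball?',
    ],
})


def count_terms(text_column, pattern):
    """Hypothetical helper: count whole-word, case-insensitive hits per alternative."""
    total = pd.Series(0, index=text_column.index)
    for alt in pattern.strip('()').split('|'):
        # \b keeps 'rob' from matching inside a longer word such as 'problem'
        regex = r'\b' + re.escape(alt) + r'\b'
        total += text_column.str.count(regex, flags=re.IGNORECASE)
    return total


for topic, pattern in terms.items():
    df[topic] = count_terms(df['Foo'], pattern)

print(df[['Score', 'animals', 'people', 'games']])
```

Because each alternative is searched on its own, fox and arctic fox are both counted in row 3, as the question asks.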

edited Apr 6, 2016 at 9:51

answered Apr 6, 2016 at 9:36

2 Comments

ccsv Over a year ago

Yes, thanks. I am not really familiar with the sub function in the regex library.

Sergey Bushmanov Over a year ago

sub stands for substitute. If you drop the parentheses in your initial dictionary of terms, you would not need this sub call.
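
As a small illustration of this comment (the variable names are invented for the example), `re.sub("[()]", "", ...)` simply deletes the parentheses, so a term string stored without them can be split directly:

```python
import re

with_parens = '(basketball|basket ball|bball)'
without_parens = 'basketball|basket ball|bball'

# re.sub replaces every '(' or ')' with an empty string
words_a = re.sub("[()]", "", with_parens).split('|')

# with no parentheses in the dictionary value, split('|') is enough
words_b = without_parens.split('|')

print(words_a)  # ['basketball', 'basket ball', 'bball']
print(words_a == words_b)  # True
```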
