I have a dictionary that maps topics to regular expressions, some of which contain compound (multi-word) terms. For each row of a DataFrame column, I want to count how many terms from each topic match.

import pandas as pd


terms = {'animals':"(fox|russian brown deer|bald eagle|arctic fox)",
'people':'(John Adams|Rob|Steve|Superman|Super man)',
'games':'(basketball|basket ball|bball)'
}

df=pd.DataFrame({
'Score': [4,6,2,7,8],
'Foo': ['Superman was looking for a russian brown deer.', 'John adams started to play basket ball with rob yesterday before steve called him','Basketball or bball is a sport played by Steve afterschool','The bald eagle flew pass the arctic fox three times','The fox was sptted playing basket ball?']
})

To count the matches, I can use code similar to that in the question "Python pandas count number of Regex matches in a string". However, that approach splits each string on whitespace before counting, so compound terms are never matched. What is an alternative that also counts compound terms whose words are separated by a space?

df1 = df.Foo.str.split(expand=True).stack().reset_index(level=1, drop=True).reset_index(name='Foo')



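for k, v in terms.items():
    # Raw strings keep \s, \. and \? from being read as string escapes
    df1[k] = df1.Foo.str.contains(r'(?i)(^|\s)' + v + r'($|\s|\.|,|\?)')


# Sum only the topic columns so the text column is not aggregated
df2 = df1.groupby('index')[list(terms)].sum().astype(int)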


df = pd.concat([df,df2], axis=1)
print(df)

The end result should look like:

                                                 Foo  Score  animals  people  games
0     Superman was looking for a russian brown deer.      4        1       1      0
1  John adams started to play basket ball with ro...      6        0       3      1
2  Basketball or bball is a sport played by Steve...      2        0       1      2
3  The bald eagle flew pass the arctic fox three ...      7        3       0      0
4            The fox was sptted playing basket ball?      8        1       0      1

Note that for row 3, both "fox" (inside "arctic fox") and "arctic fox" itself should each be counted once, so together they contribute 2 to the animals column.
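
To make that overlap requirement concrete, here is a minimal sketch using only the standard `re` module (no pandas). It assumes the animals pattern and the row 3 sentence from above; the variable names `single_pass` and `per_term` are made up for illustration. Because `re.findall` returns non-overlapping matches, a single pass over the combined pattern cannot report both "fox" and "arctic fox" for the same stretch of text, whereas testing each alternative separately can.

```python
import re

pattern = "(fox|russian brown deer|bald eagle|arctic fox)"
sentence = "The bald eagle flew pass the arctic fox three times"

# One pass over the combined alternation: matches cannot overlap,
# so the "fox" inside "arctic fox" is consumed by the longer match.
single_pass = re.findall(pattern, sentence, flags=re.IGNORECASE)

# Testing each alternative on its own lets overlapping terms both count.
alternatives = pattern.strip("()").split("|")
per_term = sum(
    bool(re.search(re.escape(term), sentence, flags=re.IGNORECASE))
    for term in alternatives
)

print(single_pass)  # the terms found by the single pass
print(per_term)     # how many distinct alternatives occur in the sentence
```

Here the single pass finds two terms, while the per-term check finds three, which is the count wanted for the animals column.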

1 Answer

Please see if this is what you were looking for:

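import re

for k in terms:
    df[k] = 0
    # Strip the enclosing parentheses, then test each alternative on its own
    for word in re.sub("[()]", "", terms[k]).split('|'):
        # True where the term occurs anywhere in the row (case-insensitive)
        mask = df.Foo.str.contains(word, case=False)
        df[k] += mask
print(df)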


                                                 Foo  Score  people  animals  games
0     Superman was looking for a russian brown deer.      4       1        1      0
1  John adams started to play basket ball with ro...      6       3        0      1
2  Basketball or bball is a sport played by Steve...      2       1        0      2
3  The bald eagle flew pass the arctic fox three ...      7       0        3      0
4            The fox was sptted playing basket ball?      8       0        1      1
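
One possible caveat, offered as a sketch rather than as part of the answer above: `str.contains` matches substrings, so a short term such as "Rob" would also match inside a word like "problem". Assuming whole-word matching is what is wanted, each term can be escaped and wrapped in `\b` word boundaries. The helper name `count_whole_terms` and the plain list `sentences` are hypothetical, and the example uses only the standard `re` module so that it runs without pandas.

```python
import re

terms = {
    'animals': '(fox|russian brown deer|bald eagle|arctic fox)',
    'people': '(John Adams|Rob|Steve|Superman|Super man)',
    'games': '(basketball|basket ball|bball)',
}

# The same five sentences as the Foo column in the question
sentences = [
    'Superman was looking for a russian brown deer.',
    'John adams started to play basket ball with rob yesterday before steve called him',
    'Basketball or bball is a sport played by Steve afterschool',
    'The bald eagle flew pass the arctic fox three times',
    'The fox was sptted playing basket ball?',
]


def count_whole_terms(text, pattern):
    """Return how many alternatives of `pattern` occur in `text` as whole words."""
    alternatives = pattern.strip('()').split('|')
    return sum(
        # \b on both sides prevents matches inside longer words
        bool(re.search(r'\b' + re.escape(term) + r'\b', text, flags=re.IGNORECASE))
        for term in alternatives
    )


# One list of per-sentence counts for every topic
counts = {
    topic: [count_whole_terms(s, pattern) for s in sentences]
    for topic, pattern in terms.items()
}
print(counts)
```

With the DataFrame from the question, the same helper could be applied per row, for example `df[k] = df['Foo'].apply(lambda s: count_whole_terms(s, v))` inside a loop over `terms.items()`.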

2 Comments

Yes, thanks. I am not really familiar with the sub function in the regex library.
sub stands for substitute. If you drop the parentheses from your initial dictionary of terms, you will not need the sub call.
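
As a small illustration of the comment above (a sketch with made-up variable names, not part of the original answer): `re.sub(pattern, replacement, string)` replaces every match of `pattern` in `string`, so substituting the character class `[()]` with an empty string removes the parentheses before the split on `|`.

```python
import re

with_parens = "(fox|russian brown deer|bald eagle|arctic fox)"

# Replace every "(" or ")" with nothing, then split on the alternation bar
words = re.sub("[()]", "", with_parens).split("|")
print(words)

# If the dictionary values are written without parentheses,
# a plain split gives the same list and no substitution is needed.
plain = "fox|russian brown deer|bald eagle|arctic fox"
print(plain.split("|") == words)
```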
