I have a dictionary that maps topics to regular expressions, and I want to count how many times each topic's pattern matches in a text column. Some of the patterns contain compound (multi-word) terms.
import pandas as pd

terms = {
    'animals': '(fox|russian brown deer|bald eagle|arctic fox)',
    'people': '(John Adams|Rob|Steve|Superman|Super man)',
    'games': '(basketball|basket ball|bball)',
}
df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': [
        'Superman was looking for a russian brown deer.',
        'John adams started to play basket ball with rob yesterday before steve called him',
        'Basketball or bball is a sport played by Steve afterschool',
        'The bald eagle flew pass the arctic fox three times',
        'The fox was sptted playing basket ball?',
    ],
})
To count the matches I can use code similar to the question "Python pandas count number of Regex matches in a string". However, that approach splits each string on whitespace and then tests the individual tokens, so compound terms are never matched. What is an alternative way to do this so that compound terms whose words are separated by a space are also counted?
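# Split each sentence into one row per whitespace-separated token
df1 = df.Foo.str.split(expand=True).stack().reset_index(level=1, drop=True).reset_index(name='Foo')
for k, v in terms.items():
    # Flag tokens that match the topic's pattern (case-insensitive)
    df1[k] = df1.Foo.str.contains(r'(?i)(^|\s)' + v + r'($|\s|\.|,|\?)')
# Sum the flags per original row; select only the topic columns so the text column is not summed
df2 = df1.groupby('index')[list(terms)].sum().astype(int)
df = pd.concat([df, df2], axis=1)
print(df)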
The end result should look like:
                                                  Foo  Score  animals  people  games
0      Superman was looking for a russian brown deer.      4        1       1      0
1   John adams started to play basket ball with ro...      6        0       3      1
2   Basketball or bball is a sport played by Steve...      2        0       1      2
3  The bald eagle flew pass the arctic fox three t...      7        3       0      0
4             The fox was sptted playing basket ball?      8        1       0      1
Note that for row 3, the word fox inside "arctic fox" and the compound term arctic fox should each be counted once (2 matches together). With bald eagle, that gives 3 in the animals column.
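To make the counting rule I am after concrete, here is a minimal sketch, not a definitive solution. It assumes every pattern is a flat alternation of literal terms wrapped in a single pair of parentheses (which holds for the terms dictionary above), and the helper name count_topic is hypothetical. Each alternative is counted separately with word boundaries, so overlapping terms such as fox and arctic fox are both counted.

```python
import re
import pandas as pd

terms = {
    'animals': '(fox|russian brown deer|bald eagle|arctic fox)',
    'people': '(John Adams|Rob|Steve|Superman|Super man)',
    'games': '(basketball|basket ball|bball)',
}
df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': [
        'Superman was looking for a russian brown deer.',
        'John adams started to play basket ball with rob yesterday before steve called him',
        'Basketball or bball is a sport played by Steve afterschool',
        'The bald eagle flew pass the arctic fox three times',
        'The fox was sptted playing basket ball?',
    ],
})


def count_topic(text, pattern):
    """Count case-insensitive whole-word hits of every alternative in `pattern`.

    Hypothetical helper; assumes `pattern` is a flat alternation
    of literal terms such as '(a|b c|d)'.
    """
    alternatives = pattern.strip('()').split('|')
    return sum(
        # Each alternative is searched on its own, so 'fox' is still found
        # inside 'arctic fox' even though 'arctic fox' is also counted.
        len(re.findall(r'\b' + re.escape(alt) + r'\b', text, flags=re.IGNORECASE))
        for alt in alternatives
    )


# One count column per topic, computed on the full, unsplit sentences
counts = pd.DataFrame({
    topic: [count_topic(text, pattern) for text in df['Foo']]
    for topic, pattern in terms.items()
})
result = pd.concat([df, counts], axis=1)
print(result)
```

On the sample data this yields the counts shown in the expected output, including 3 in the animals column for row 3. The trade-off is that the patterns are treated as lists of literal terms rather than as arbitrary regular expressions.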