This is a follow-up to this previous topic. I have a Series of strings w called "British", like this:
British
\bSkilful\b
\bWilful\b
\bfulfil\b
\b.*favour.*\b
\bappal\b
\bappall.*\b
\barbour.*\b
\barmor.*\b
\bstrange\b
\brumor.*\b
\b.*color.*\b
\b.*centre's\b
and a DataFrame df like this:
User_ID Tweet
01 hi all
02 see you something
03 that's my favourite spot
04 the strangest rumors
05 my appal is nice
06 check my rumor
07 #brborboncheckruMoreThanever
08 look @mycentre's
I would like to get a new column containing the SINGLE keywords found in the strings. So far I did:
import re
import pandas as pd

List = pd.read_csv('w.txt')
r = re.compile(r'.*({}).*'.format('|'.join(List['British'].values)), re.IGNORECASE)
and then mask the DataFrame:
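masked = list(map(bool, map(r.search, df['Tweet'])))  # list() so the mask also works on Python 3
df2 = df[masked].copy()  # copy so a column can be added without a SettingWithCopyWarning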
Then I masked it again to add the 'keyword' column:
mask = [m.group(1) if m else None for m in map(r.search, df2['Tweet'])]
df2['keyword'] = mask
which returns:
User_ID Tweet keyword
2 3 that's my favourite spot favourite spot
4 5 my appal is nice appal
5 6 check my rumor rumor
7 8 look @mycentre's mycentre's
So the boolean mask works fine and detects only the tweets containing at least one keyword. But what if I would like to extract only the single keyword found? The final DataFrame should look like this:
User_ID Tweet keyword
2 3 that's my favourite spot favourite
4 5 my appal is nice appal
5 6 check my rumor rumor
7 8 look @mycentre's centre's
Thanks so much for your kind help.
keyword column if there are multiple keywords? So, keyword1 would have "favourite" and keyword2 would have 'spot', for example?
centre's and mycentre's, both are chunks of non-whitespace chars. The logic that you describe is [m.group(1).split()[0] if m else None for m in map(r.search, df2['Tweet'])]
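For reference, here is a minimal sketch of the logic in the last comment, using only the standard library so it runs without pandas. The pattern list is copied from the "British" column above; the names british, tweets and first_keyword are made up for illustration, and the tweets are just a few rows from the example DataFrame.

```python
import re

# Patterns copied from the "British" column in the question.
british = [
    r"\bSkilful\b", r"\bWilful\b", r"\bfulfil\b", r"\b.*favour.*\b",
    r"\bappal\b", r"\bappall.*\b", r"\barbour.*\b", r"\barmor.*\b",
    r"\bstrange\b", r"\brumor.*\b", r"\b.*color.*\b", r"\b.*centre's\b",
]

# Same construction as in the question: one capture group around all alternatives.
r = re.compile(r".*({}).*".format("|".join(british)), re.IGNORECASE)

# A few of the example tweets (illustrative subset).
tweets = ["hi all", "that's my favourite spot", "my appal is nice", "look @mycentre's"]


def first_keyword(text):
    """Return the first whitespace-delimited chunk of the captured match, or None."""
    m = r.search(text)
    return m.group(1).split()[0] if m else None


keywords = [first_keyword(t) for t in tweets]
print(keywords)
```

With these inputs, "favourite spot" is cut down to "favourite", but "@mycentre's" still yields "mycentre's" rather than "centre's", because the wildcard in \b.*centre's\b captures the whole non-whitespace chunk. In the DataFrame version, the same function could presumably be applied with df2['Tweet'].map(first_keyword).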