This is a follow-up to this previous topic. I have a Series of strings, w, called "British", like this:

British
\bSkilful\b
\bWilful\b
\bfulfil\b
\b.*favour.*\b
\bappal\b
\bappall.*\b
\barbour.*\b
\barmor.*\b
\bstrange\b
\brumor.*\b
\b.*color.*\b
\b.*centre's\b

and a DataFrame df like this:

 User_ID     Tweet
 01          hi all
 02          see you something
 03          that's my favourite spot
 04          the strangest rumors
 05          my appal is nice
 06          check my rumor
 07          #brborboncheckruMoreThanever
 08          look @mycentre's

I would like to get a new column containing the SINGLE keywords found in the strings. So far I did:

 import re
 import pandas as pd

 List = pd.read_csv('w.txt')
 r = re.compile(r'.*({}).*'.format('|'.join(List['British'].values)), re.IGNORECASE)

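and then masked the DataFrame:

  # Build a plain list of booleans (map() returns a lazy iterator in Python 3)
  masked = [bool(r.search(t)) for t in df['Tweet']]
  df2 = df[masked].copy()  # copy so a new column can be added safely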

Then I masked it again to add the 'keyword' column:

 mask = [m.group(1) if m else None for m in map(r.search, df2['Tweet'])]
 df2['keyword'] = mask

which returns:

   User_ID                     Tweet         keyword
2        3  that's my favourite spot  favourite spot
4        5          my appal is nice           appal
5        6            check my rumor           rumor
7        8          look @mycentre's      mycentre's

So the boolean mask works fine and detects only the tweets containing at least one keyword. But what if I would like to extract only the single keyword that was found? The final DataFrame should look like this:

   User_ID                     Tweet         keyword
2        3  that's my favourite spot       favourite
4        5          my appal is nice           appal
5        6            check my rumor           rumor
7        8          look @mycentre's        centre's

Thanks so much for your kind help.

  • In the instance of index 2 in your returned dataframe, are you looking to have a different keyword column if there are multiple keywords? So, keyword1 would have "favourite" and keyword2 would have 'spot', for example? Commented Nov 8, 2016 at 14:47
  • It is impossible, you cannot differentiate between centre's and mycentre's, both are chunks of non-whitespace chars. The logic that you describe is [m.group(1).split()[0] if m else None for m in map(r.search, df2['Tweet'])] Commented Mar 25, 2019 at 18:19
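
The logic from the second comment can be tried out without pandas. What follows is only a minimal sketch, assuming the patterns are exactly those listed in the question and using a handful of the tweets; `first_token` is a hypothetical helper name, not something from the original code. It keeps the original regex and simply takes the first whitespace-delimited chunk of the captured group:

```python
import re

# Patterns copied from the "British" list in the question.
patterns = [
    r"\bSkilful\b", r"\bWilful\b", r"\bfulfil\b", r"\b.*favour.*\b",
    r"\bappal\b", r"\bappall.*\b", r"\barbour.*\b", r"\barmor.*\b",
    r"\bstrange\b", r"\brumor.*\b", r"\b.*color.*\b", r"\b.*centre's\b",
]

# Same construction as in the question: one alternation inside a capture group.
r = re.compile(r".*({}).*".format("|".join(patterns)), re.IGNORECASE)

# A few of the tweets from the question's DataFrame.
tweets = [
    "hi all",
    "that's my favourite spot",
    "my appal is nice",
    "check my rumor",
    "look @mycentre's",
]


def first_token(text):
    """Return the first whitespace-delimited chunk of the match, or None.

    Hypothetical helper: the greedy '.*' inside patterns such as
    '\\b.*favour.*\\b' lets the group run to the end of the tweet,
    so the capture is trimmed back to its first token here.
    """
    m = r.search(text)
    return m.group(1).split()[0] if m else None


keywords = [first_token(t) for t in tweets]
print(keywords)
# [None, 'favourite', 'appal', 'rumor', "mycentre's"]
```

Note that the last tweet still yields `mycentre's` rather than `centre's`, which is the limitation the comment points out: the leading `.*` in `\b.*centre's\b` absorbs `my`, and nothing in the pattern marks where the keyword itself begins. With the DataFrame from the question, the same helper could be applied as `df2['keyword'] = [first_token(t) for t in df2['Tweet']]`.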
