Counting occurrences of word in a string in Pandas

Question

I am trying to count how many times a word is present in all of the strings of a Pandas Series I have a dataframe df which follows this logic:

word

hi    
hello
bye
goodbye

And a df_2 which looks like this ( scroll to the right to see another column )

sentence                                                                            metric_x

hello, what a wonderful day                                                         10
I did not said hello today                                                          15
what comes first, hi or hello                                                       25
the most used word is hi                                                            30
hi or hello, which is more formal                                                   50
he said goodbye, even though he never said hi or hello in the first place           5

I am trying to achieve the following in df: counting how many times a word is present and what is the metric_x sum of the values which matches the word.

word        count       metric_x_sum
        
hi          4           110
hello       5           105
bye         0           0
goodbye     1           5

I am using this:

df['count'] = df['word'].apply(lambda x: df_2['sentence'].str.count(x).sum())

The problem lies in the length of the dataframes, I have 70,000 unique words in df and 250,000 unique sentences in df_2, the line above is running for 15 minutes and I have no clue for how long it might run.

After letting it run for 15 minutes I got this error:

error: multiple repeat at position 2

Is there a smarter, faster way of achieving this?

jezrael · Accepted Answer · 2020-09-23 07:15:37Z

1

First split sentences for words and DataFrame.explode, remove trailing values , by Series.str.strip:

df2 = df_2.assign(word = df_2['sentence'].str.split()).explode('word')
df2['word'] = df2['word'].str.strip(',')
#print (df2)

Then DataFrame.merge with left join and aggregate GroupBy.count for exclude missing values with sum:

df3 = (df.merge(df2, on='word', how='left')
         .groupby('word')
         .agg(count=('metric_x', 'count'), metric_x_sum=('metric_x','sum')))
# print (df3)

Last add to original:

df = df.join(df3, on='word')
df['metric_x_sum'] = df['metric_x_sum'].astype(int)
print (df)
      word  count  metric_x_sum
0       hi      4           110
1    hello      5           105
2      bye      0             0
3  goodbye      1             5

edited Sep 23, 2020 at 7:15

answered Sep 23, 2020 at 7:08

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Jonas Palačionis Over a year ago

hello and thank you. Could you elaborate why your method is way faster than mine?

jezrael Over a year ago

@JonasPalačionis - Hmmm, not sure how get values of metric_x to metric_x_sum, also faster should be because merge is faster here.

Collectives™ on Stack Overflow

Counting occurrences of word in a string in Pandas

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related