I am trying to count how many times a word is present in all of the strings of a Pandas Series
I have a dataframe df which follows this logic:
word
hi
hello
bye
goodbye
And a df_2 which looks like this ( scroll to the right to see another column )
sentence metric_x
hello, what a wonderful day 10
I did not said hello today 15
what comes first, hi or hello 25
the most used word is hi 30
hi or hello, which is more formal 50
he said goodbye, even though he never said hi or hello in the first place 5
I am trying to achieve the following in df: counting how many times a word is present and what is the metric_x sum of the values which matches the word.
word count metric_x_sum
hi 4 110
hello 5 105
bye 0 0
goodbye 1 5
I am using this:
df['count'] = df['word'].apply(lambda x: df_2['sentence'].str.count(x).sum())
The problem lies in the length of the dataframes, I have 70,000 unique words in df and 250,000 unique sentences in df_2, the line above is running for 15 minutes and I have no clue for how long it might run.
After letting it run for 15 minutes I got this error:
error: multiple repeat at position 2
Is there a smarter, faster way of achieving this?