
I have created a list of words associated with a certain category. For example:

care = ["safe", "peace", "empathy"]

And I have a dataframe containing speeches that on average consist of 450 words. I have counted the number of matches for each category using this line of code:

df['Care'] = df['Speech'].apply(lambda x: len([val for val in x.split() if val in care]))

This gives me the total number of matches for each category.
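
For reference, a minimal runnable version of that setup (the example speeches below are invented, just to show the shape of the result):

import pandas as pd

care = ["safe", "peace", "empathy"]

# Two made-up speeches, only to illustrate the per-category count
df = pd.DataFrame({'Speech': ["we want peace and empathy for all",
                              "a safe country is a strong country"]})

df['Care'] = df['Speech'].apply(lambda x: len([val for val in x.split() if val in care]))
print(df['Care'].tolist())  # [2, 1]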

However, I need to review the frequency of each individual word in the list. I tried using this code to solve the problem (here Tal is the speech column and auktoritet is another of the category lists):

df.Tal.str.extractall('({})'.format('|'.join(auktoritet)))\
    .iloc[:, 0].str.get_dummies().sum(level=0)

I've tried different methods, but the problem is that I always get partial matches included. For example, "hammer" would be counted as a match for "ham".
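
A minimal reproduction of that behaviour with the standard re module (the words and sentence here are invented); one common fix is to wrap the alternation in \b word boundaries so only whole words match:

import re

words = ["ham", "safe"]
pattern = '({})'.format('|'.join(words))

# The plain alternation also matches "ham" inside "hammer"
print(re.findall(pattern, "the hammer is safe"))   # ['ham', 'safe']

# Word boundaries keep only whole-word matches
bounded = r'\b({})\b'.format('|'.join(words))
print(re.findall(bounded, "the hammer is safe"))   # ['safe']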

Any ideas on how to solve this?

3 Answers


You can use Counter, which is available in the collections package:

from collections import Counter

# Tally every word across all speeches
word_count = Counter()
for line in df['Speech']:
    for word in line.split(' '):
        word_count[word] += 1

This stores the count of every word in word_count. You can then use

word_count.most_common()

to see the words with the highest frequencies.
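
For example, on two short invented speeches (data made up just to show the output shape):

import pandas as pd
from collections import Counter

df = pd.DataFrame({'Speech': ["peace peace and empathy", "a safe safe safe land"]})

word_count = Counter()
for line in df['Speech']:
    for word in line.split(' '):
        word_count[word] += 1

print(word_count.most_common(3))  # [('safe', 3), ('peace', 2), ('and', 1)]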


3 Comments

Thanks for your reply. The problem is that the words I need frequencies for are rarely the most common ones. That's why I want to count the words based on the list.
You can reverse word_count.most_common() to see the words that occur least often.
The code would be list(reversed(word_count.most_common())).

You could transform each word into a tuple with 1 as the second element, ('word', 1), and then sum those 1s per word for each word in the list.

The output will be a list of tuples with the words and the frequencies:

[('word1', 3), ('word2', 10) ... ]

This is the main idea.
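
A rough sketch of that idea (the helper name count_exact and the sample sentence are mine, not from the answer):

from collections import defaultdict

care = ["safe", "peace", "empathy"]

def count_exact(words, vocabulary=care):
    # Pair every exact match with 1 ...
    pairs = [(word, 1) for word in words if word in vocabulary]
    # ... then sum the 1s per word
    totals = defaultdict(int)
    for word, one in pairs:
        totals[word] += one
    return sorted(totals.items(), key=lambda item: -item[1])

print(count_exact("peace and empathy keep us safe in peace".split()))
# [('peace', 2), ('empathy', 1), ('safe', 1)]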

2 Comments

Thanks for your reply. But would this solve the issue with partial matches? If I'm looking for "ham", "hammer" would be considered a match because it has "ham" in it.
If you use string equality, you won't have that problem; you don't need regex for this situation. If you type "ham" == "hammer" you'll get False, since the strings are different.

I built on Akash's answer and managed to get the frequencies of the prespecified words stored in the list, counting them in the dataframe, by simply adding one line.

from collections import Counter

word_count = Counter()
for line in df['Speech']:
    for word in line.split(' '):
        # Only count words that appear in the category list
        if word in care:
            word_count[word] += 1

word_count.most_common()
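
If you prefer the result as a small table rather than a Counter, one possible follow-up (assuming pandas is imported as pd) is:

import pandas as pd

# Turn the Counter into a two-column frequency table
freq = pd.DataFrame(word_count.most_common(), columns=['word', 'frequency'])
print(freq)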
