
I have created a list of words associated with a certain category. For example:

care = ["safe", "peace", "empathy"]

And I have a dataframe containing speeches that on average consist of 450 words. I have counted the number of matches for each category using this line of code:

df['Care'] = df['Speech'].apply(lambda x: len([val for val in x.split() if val in care]))

This gives me the total number of matches for each category.
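
For reference, a minimal runnable version of that setup (the example speeches below are invented, just to show the shape of the result):

import pandas as pd

care = ["safe", "peace", "empathy"]

# Two made-up speeches, only to illustrate the per-category count
df = pd.DataFrame({'Speech': ["we want peace and empathy for all",
                              "a safe country is a strong country"]})

df['Care'] = df['Speech'].apply(lambda x: len([val for val in x.split() if val in care]))
print(df['Care'].tolist())  # [2, 1]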

However, I need to review the frequency of each individual word in the list. I tried using this code to solve the problem (here Tal is the speech column and auktoritet is another of the category lists):

df.Tal.str.extractall('({})'.format('|'.join(auktoritet)))\
    .iloc[:, 0].str.get_dummies().sum(level=0)

I've tried different methods, but the problem is that I always get partial matches included. For example, "hammer" would be counted as a match for "ham".
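
A minimal reproduction of that behaviour with the standard re module (the words and sentence here are invented); one common fix is to wrap the alternation in \b word boundaries so only whole words match:

import re

words = ["ham", "safe"]
pattern = '({})'.format('|'.join(words))

# The plain alternation also matches "ham" inside "hammer"
print(re.findall(pattern, "the hammer is safe"))   # ['ham', 'safe']

# Word boundaries keep only whole-word matches
bounded = r'\b({})\b'.format('|'.join(words))
print(re.findall(bounded, "the hammer is safe"))   # ['safe']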

Any ideas on how to solve this?

3 Answers


You can use Counter, which is available in the collections package:

from collections import Counter

# Tally every word across all speeches
word_count = Counter()
for line in df['Speech']:
    for word in line.split(' '):
        word_count[word] += 1

This stores the count of every word in word_count. You can then use

word_count.most_common()

to see the words with the highest frequencies.
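
For example, on two short invented speeches (data made up just to show the output shape):

import pandas as pd
from collections import Counter

df = pd.DataFrame({'Speech': ["peace peace and empathy", "a safe safe safe land"]})

word_count = Counter()
for line in df['Speech']:
    for word in line.split(' '):
        word_count[word] += 1

print(word_count.most_common(3))  # [('safe', 3), ('peace', 2), ('and', 1)]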


3 Comments

Thanks for your reply. The problem is that the words I need frequencies for are rarely the most common ones. That's why I want to count the words based on the list.
You can reverse word_count.most_common() to see the words that occur least often.
The code would be list(reversed(word_count.most_common())).

You could transform each word into a tuple with 1 as the second element, ('word', 1), and then sum those 1s per word for each word in the list.

The output will be a list of tuples with the words and the frequencies:

[('word1', 3), ('word2', 10) ... ]

This is the main idea.
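
A rough sketch of that idea (the helper name count_exact and the sample sentence are mine, not from the answer):

from collections import defaultdict

care = ["safe", "peace", "empathy"]

def count_exact(words, vocabulary=care):
    # Pair every exact match with 1 ...
    pairs = [(word, 1) for word in words if word in vocabulary]
    # ... then sum the 1s per word
    totals = defaultdict(int)
    for word, one in pairs:
        totals[word] += one
    return sorted(totals.items(), key=lambda item: -item[1])

print(count_exact("peace and empathy keep us safe in peace".split()))
# [('peace', 2), ('empathy', 1), ('safe', 1)]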

2 Comments

Thanks for your reply. But would this solve the issue with partial matches? If I'm looking for "ham", "hammer" would be considered a match because it has "ham" in it.
If you use string equality, you won't have that problem; you don't need regex for this situation. If you type "ham" == "hammer" you'll get False, since the strings are different.

I built on Akash's answer and managed to get the frequencies of the prespecified words stored in the list, counting them in the dataframe, by simply adding one line.

from collections import Counter

word_count = Counter()
for line in df['Speech']:
    for word in line.split(' '):
        # Only count words that appear in the category list
        if word in care:
            word_count[word] += 1

word_count.most_common()
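
If you prefer the result as a small table rather than a Counter, one possible follow-up (assuming pandas is imported as pd) is:

import pandas as pd

# Turn the Counter into a two-column frequency table
freq = pd.DataFrame(word_count.most_common(), columns=['word', 'frequency'])
print(freq)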
