Frequency count based on column values in Pandas

Question

For example, I have a data frame that looks like this: First Image

I would like to make a new data frame that shows how many times each word appears in messages marked as spam or ham. I want it to look like this: Second image

As a test, I tried the following code to build a list of only the spam counts per word, but it does not seem to work and crashes the kernel in Jupyter Notebook:

words = []
for word in df["Message"]:
    words.extend(word.split())

sentences = []
for word in df["Message"]:
    sentences.append(word.split())        

spam = []
ham = []

for word in words:
    sc = 0
    hc = 0
    for index,sentence in enumerate(sentences):
        if word in sentence:
            print(word)
            if(df["Category"][index])=="ham":
                hc+=1
            else:
                sc+=1
    spam.append(sc)
spam

Where df is the data frame shown in the First Image. How can I go about doing this?

Please provide a minimal reproducible example. Also, please do not share information as images unless absolutely necessary. See: meta.stackoverflow.com/questions/303812/…, idownvotedbecau.se/imageofcode, idownvotedbecau.se/imageofanexception.
– AMC, Commented Apr 26, 2020 at 2:27

score 1 · Accepted Answer · 2020-04-26 14:26:15Z


You can build two dictionaries, spam and ham, to store the number of occurrences of each word in spam and ham messages respectively.

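from collections import defaultdict

# `sentences` is the list of tokenized messages built in the question.
spam = defaultdict(int)
ham = defaultdict(int)
# zip pairs each label with its message without relying on the index labels.
for category, sentence in zip(df['Category'], sentences):
    counts = ham if category == 'ham' else spam
    for word in sentence:
        counts[word] += 1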

The output obtained from the code above for similar input to yours is as below.

>>> spam
defaultdict(<class 'int'>, {'ok': 1, 'lar': 1, 'joking': 1, 'wif': 1, 'u': 1, 'oni': 1, 'free': 1, 'entry': 1, 'in': 1, '2': 1, 'a': 1, 'wkly': 1, 'comp': 1})
>>> ham
defaultdict(<class 'int'>, {'go': 1, 'until': 1, 'jurong': 1, 'crazy': 1, 'available': 1, 'only': 1, 'in': 1, 'u': 1, 'dun': 1, 'say': 1, 's': 1, 'oearly': 1, 'nah': 1, 'I': 1, 'don’t': 1, 'think': 1, 'he': 1, 'goes': 1, 'to': 1, 'usf': 1})

Now you can manipulate the data and export it in the required format.

EDIT:

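answer = []
# Words that occur in spam; .get avoids adding zero entries to the defaultdict.
for x in spam:
    answer.append([x, spam[x], ham.get(x, 0)])

# Words that occur only in ham.
for x in ham:
    if x not in spam:
        answer.append([x, 0, ham[x]])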

Here the number of rows in the answer list is equal to the number of distinct words across all the messages. The first column of each row is the word itself, and the second and third columns are the number of occurrences of that word in spam and ham messages respectively.

The output obtained for my code is as below.

['ok', 1, 0]
['lar', 1, 0]
['joking', 1, 0]
['wif', 1, 0]
['u', 1, 1]
['oni', 1, 0]
['free', 1, 0]
['entry', 1, 0]
['in', 1, 1]
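
The rows in answer can then be loaded into a data frame. This is only a minimal sketch: it assumes pandas is installed and uses a few hypothetical sample rows, and the column names word, spam and ham are my own assumption, since the target layout in the question was only shown as an image.

```python
import pandas as pd

# Hypothetical sample rows in the [word, spam_count, ham_count] layout used above.
answer = [
    ["ok", 1, 0],
    ["u", 1, 1],
    ["free", 1, 0],
    ["in", 1, 1],
]

# Column names are an assumption; adjust them to match the layout you need.
counts = pd.DataFrame(answer, columns=["word", "spam", "ham"]).set_index("word")
print(counts)
```

From there, counts.to_csv(...) or similar can export the table if needed.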

edited Apr 26, 2020 at 14:26

answered Apr 26, 2020 at 1:49

user10002519


3 Comments

user10002519 Over a year ago

Glad to be of help.

Apple Krumble Over a year ago

After getting the dictionaries, how do I go about manipulating it ?

user10002519 Over a year ago

Wait I'll edit my answer with some further code that I think can work.

ZeFeng Zhu · Accepted Answer · 2020-04-26 15:56:32Z

0

This would be better: https://docs.python.org/3.8/library/collections.html#collections.Counter

from collections import Counter
import pandas as pd

df # the data frame in your first image
df['Counter'] = df.Message.apply(lambda x: Counter(x.split()))

def func(df: pd.DataFrame):
    for category, data in df.groupby('Category'):
        count = Counter()
        for var in data.Counter:
            count += var
        cur = pd.DataFrame.from_dict(count, orient='index', columns=[category])
        yield cur

demo = func(df)
df2 = next(demo)
for cur in demo:
    df2 = df2.merge(cur, how='outer', left_index=True, right_index=True)
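
Note that an outer merge leaves NaN wherever a word never occurs in one of the categories. If you would rather see zeros, something like the following should work. It is a sketch with two small hypothetical frames standing in for the ones yielded by func, not output from the real data.

```python
import pandas as pd

# Hypothetical per-category count frames, shaped like the ones yielded by func().
ham = pd.DataFrame({"ham": [1, 1]}, index=["u", "go"])
spam = pd.DataFrame({"spam": [1, 1]}, index=["u", "free"])

# Outer merge on the index leaves NaN where a word is missing from one category.
df2 = ham.merge(spam, how="outer", left_index=True, right_index=True)

# Treat "never seen in this category" as a count of zero.
df2 = df2.fillna(0).astype(int)
print(df2)
```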

EDIT:

from collections import Counter
import pandas as pd

df # the data frame in your first image; this version works whether or not it is a slice of the complete data frame
def func(df: pd.DataFrame):
    res = df.groupby('Category').Message.apply(' '.join).str.split().apply(Counter)
    for category, count in res.to_dict().items():
        yield pd.DataFrame.from_dict(count, orient='index', columns=[category])

demo = func(df)
df2 = next(demo)
for cur in demo:
    df2 = df2.merge(cur, how='outer', left_index=True, right_index=True)
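
As a different technique from the Counter approach above, the same table can probably be built with pandas alone, using explode (available since pandas 0.25) and crosstab. This is a sketch under assumptions: the sample data frame is a hypothetical stand-in for the one in the question, and words are split on whitespace only.

```python
import pandas as pd

# Hypothetical stand-in for the data frame in the question.
df = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": [
        "ok lar joking wif u",
        "free entry in 2 a wkly comp",
        "u dun say so early",
    ],
})

# One row per (message, word) pair; assign returns a new frame, so df is untouched.
words = (
    df.assign(word=df["Message"].str.split())
      .explode("word")
      .reset_index(drop=True)
)

# Count occurrences of each word per category; missing combinations become 0.
counts = pd.crosstab(words["word"], words["Category"])
print(counts)
```

Because assign creates a new data frame instead of writing a column into df, this form should also not trigger a SettingWithCopyWarning when df is a slice.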

edited Apr 26, 2020 at 15:56

answered Apr 26, 2020 at 4:26

ZeFeng Zhu


2 Comments

Apple Krumble Over a year ago

Thank you for your answer but it seems to produce a SettingWithCopyWarning at df['Counter'] = df.Message.apply(lambda x: Counter(x.split()))

ZeFeng Zhu Over a year ago

@Apple Krumble Maybe the data frame in your first image is a slice of the whole data frame. You can either explicitly add df = df.copy() before assigning df['Counter'], or assign through the complete data frame like this: the_complete_df.loc[index, 'Counter']
