For example, I have a data frame which looks like this: First Image

I would like to make a new data frame that shows the number of times each word was marked as spam or ham. I want it to look like this: Second Image

As a test, I tried the following code to build a list of only the spam counts for each word, but it does not work and crashes the kernel in Jupyter Notebook:

words = []
for word in df["Message"]:
    words.extend(word.split())

sentences = []
for word in df["Message"]:
    sentences.append(word.split())        

spam = []
ham = []

for word in words:
    sc = 0
    hc = 0
    for index,sentence in enumerate(sentences):
        if word in sentence:
            print(word)
            if(df["Category"][index])=="ham":
                hc+=1
            else:
                sc+=1
    spam.append(sc)
spam

Where df is the data frame shown in the First Image. How can I go about doing this?

2 Answers

You can build two dictionaries, spam and ham, to store the number of occurrences of each word in spam and ham messages.

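from collections import defaultdict

# word -> number of occurrences in spam / ham messages
spam = defaultdict(int)
ham = defaultdict(int)

# `sentences` is the list of tokenized messages built in the question.
# zip pairs each message with its category without relying on index labels.
for category, sentence in zip(df['Category'], sentences):
    counts = ham if category == 'ham' else spam
    for word in sentence:
        counts[word] += 1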

For input similar to yours, the code above produces the following output.

>>> spam
defaultdict(<class 'int'>, {'ok': 1, 'lar': 1, 'joking': 1, 'wtf': 1, 'u': 1, 'oni': 1, 'free': 1, 'entry': 1, 'in': 1, '2': 1, 'a': 1, 'wkly': 1, 'comp': 1})
>>> ham
defaultdict(<class 'int'>, {'go': 1, 'until': 1, 'jurong': 1, 'crazy': 1, 'available': 1, 'only': 1, 'in': 1, 'u': 1, 'dun': 1, 'say': 1, 's': 1, 'oearly': 1, 'nah': 1, 'I': 1, 'don’t': 1, 'think': 1, 'he': 1, 'goes': 1, 'to': 1, 'usf': 1})

You can now manipulate the data and export it in the required format.

EDIT:

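answer = []
# Words seen in spam; the ham count is 0 if the word never appeared in ham
for x in spam:
    answer.append([x, spam[x], ham.get(x, 0)])

# Words that appear only in ham
for x in ham:
    if x not in spam:
        answer.append([x, 0, ham[x]])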

The number of rows in the answer list equals the number of distinct words across all messages. The first column of each row is the word, and the second and third columns are the number of occurrences of that word in spam and ham messages, respectively.

The output from my code is shown below.

['ok', 1, 0]
['lar', 1, 0]
['joking', 1, 0]
['wif', 1, 0]
['u', 1, 1]
['oni', 1, 0]
['free', 1, 0]
['entry', 1, 0]
['in', 1, 1]
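
To get from the answer list to a data frame like the one asked for, something along these lines might work. This is only a sketch: it assumes pandas is available, the sample rows are made up in the same [word, spam count, ham count] layout, and the column names Word, Spam and Ham are my own choice.

```python
import pandas as pd

# Hypothetical sample rows in the [word, spam_count, ham_count] layout built above
answer = [["ok", 1, 0], ["u", 1, 1], ["in", 1, 1], ["go", 0, 1]]

# Column names are assumptions; rename them to match the output you want
result = pd.DataFrame(answer, columns=["Word", "Spam", "Ham"])

# Sort alphabetically so the table is easier to scan
result = result.sort_values("Word").reset_index(drop=True)
print(result)
```

From there, result.to_csv(...) or similar can export it in whatever format is required.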

3 Comments

Glad to be of help.
After getting the dictionaries, how do I go about manipulating them?
Wait, I'll edit my answer with some further code that I think can work.

collections.Counter would be a better fit here: https://docs.python.org/3.8/library/collections.html#collections.Counter

from collections import Counter
import pandas as pd

df # the data frame in your first image
df['Counter'] = df.Message.apply(lambda x: Counter(x.split()))

def func(df: pd.DataFrame):
    for category, data in df.groupby('Category'):
        count = Counter()
        for var in data.Counter:
            count += var
        cur = pd.DataFrame.from_dict(count, orient='index', columns=[category])
        yield cur

demo = func(df)
df2 = next(demo)
for cur in demo:
    df2 = df2.merge(cur, how='outer', left_index=True, right_index=True)

EDIT:

from collections import Counter
import pandas as pd

df  # the data frame in your first image; this version works whether or not it is a slice of the complete data frame
def func(df: pd.DataFrame):
    res = df.groupby('Category').Message.apply(' '.join).str.split().apply(Counter)
    for category, count in res.to_dict().items():
        yield pd.DataFrame.from_dict(count, orient='index', columns=[category])

demo = func(df)
df2 = next(demo)
for cur in demo:
    df2 = df2.merge(cur, how='outer', left_index=True, right_index=True)
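
As a possible shortcut (a different technique from the generator-and-merge code above, not a replacement for it), the same table can probably be built with explode and pd.crosstab. The sample data frame and the helper column name Word are assumptions made for illustration; words missing from a category come out as 0 rather than NaN.

```python
import pandas as pd

# Hypothetical sample data with the same two columns as the question's data frame
df = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["go until jurong", "free entry in 2", "u dun say in"],
})

# One row per (message, word) pair; "Word" is an assumed helper column name
words = df.assign(Word=df["Message"].str.split()).explode("Word")

# Count how often each word occurs under each category
counts = pd.crosstab(words["Word"], words["Category"])
print(counts)
```

If you stay with the merge approach instead, df2.fillna(0) would turn the missing counts into zeros.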

2 Comments

Thank you for your answer, but it seems to produce a SettingWithCopyWarning at df['Counter'] = df.Message.apply(lambda x: Counter(x.split())).
@Apple Krumble Maybe the data frame in your first image is a slice of the whole data frame. You can either explicitly add df = df.copy() before the df['Counter'] line, or assign through the complete data frame with the index, like this: the_complete_df.loc[index, 'Counter'].
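
A minimal sketch of the df.copy() suggestion, assuming df was obtained by filtering a larger data frame (full_df and its rows are made up here). Whether the warning shows up at all depends on the pandas version, but the explicit copy keeps the new column out of the original frame either way.

```python
from collections import Counter

import pandas as pd

# Hypothetical complete data frame; only the ham/spam rows are of interest
full_df = pd.DataFrame({
    "Category": ["ham", "spam", "other"],
    "Message": ["go until jurong", "free entry in 2", "ignored row"],
})

# An explicit copy makes df independent of full_df, so adding a column is safe
df = full_df[full_df["Category"].isin(["ham", "spam"])].copy()
df["Counter"] = df["Message"].apply(lambda x: Counter(x.split()))
print(df)
```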
