
I am trying to get the word frequencies for terms within each tweet contained in a dataframe. This is my code:

import pandas as pd
import numpy as np
import nltk
import string
import collections
from collections import Counter
nltk.download('stopwords')
sw = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)
data = pd.read_csv('~/Desktop/tweets.csv.zip', compression='zip')

print(data.columns)
print(data.text)
data['text'] = [str.lower() for str in data.text if str.lower() not in sw and str.lower() not in punctuation]
print(data.text)
data["text"] = data["text"].str.split()
data['text'] = data['text'].apply(lambda x: [item for item in x if item not in sw])
print(data.text)
data['text'] = data.text.astype(str)
print(type(data.text))
tweets=data.text

data['words']= tweets.apply(nltk.FreqDist(tweets))
print(data.words)

And this is my error and traceback:

Name: text, Length: 14640, dtype: object

Traceback (most recent call last):

  File "", line 1, in
    runfile('C:/Users/leska/.spyder-py3/untitled1.py', wdir='C:/Users/leska/.spyder-py3')

  File "C:\Users\leska\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)

  File "C:\Users\leska\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/leska/.spyder-py3/untitled1.py", line 30, in
    data['words']= tweets.apply(nltk.FreqDist(tweets))

  File "C:\Users\leska\Anaconda3\lib\site-packages\pandas\core\series.py", line 4018, in apply
    return self.aggregate(func, *args, **kwds)

  File "C:\Users\leska\Anaconda3\lib\site-packages\pandas\core\series.py", line 3883, in aggregate
    result, how = self._aggregate(func, *args, **kwargs)

  File "C:\Users\leska\Anaconda3\lib\site-packages\pandas\core\base.py", line 506, in _aggregate
    result = _agg(arg, _agg_1dim)

  File "C:\Users\leska\Anaconda3\lib\site-packages\pandas\core\base.py", line 456, in _agg
    result[fname] = func(fname, agg_how)

  File "C:\Users\leska\Anaconda3\lib\site-packages\pandas\core\base.py", line 440, in _agg_1dim
    return colg.aggregate(how, _level=(_level or 0) + 1)

  File "C:\Users\leska\Anaconda3\lib\site-packages\pandas\core\series.py", line 3902, in aggregate
    result = func(self, *args, **kwargs)

TypeError: 'int' object is not callable

I have verified that the type of data.text is a Pandas series.

I had tried a solution earlier that I thought worked, using tokenizing and a word list to get the counts, but it produced a single frequency distribution over all the tweets rather than one per tweet. This was the code I had tried, based on my earlier question:

import pandas as pd
import numpy as np
import nltk
import string
import collections
from collections import Counter
nltk.download('stopwords')
sw = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)
data = pd.read_csv('~/Desktop/tweets.csv.zip', compression='zip')

print(data.columns)
print(len(data.tweet_id))
tweets = data.text
test = pd.DataFrame(data)
test.column = ["text"]
# Exclude stopwords with Python's list comprehension and pandas.DataFrame.apply.
test['tweet_without_stopwords'] = test['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (sw) and word for word in x.split() if word not in punctuation]))
print(test)
chirps = test.text
splitwords = [ nltk.word_tokenize( str(c) ) for c in chirps ]
allWords = []
for wordList in splitwords:
    allWords += wordList
allWords_clean = [w.lower() for w in allWords if w.lower() not in sw and w.lower() not in punctuation]
tweets2 = pd.Series(allWords)

words = nltk.FreqDist(tweets2)

I really need the terms and counts for each tweet, and I am stumped as to what I am doing wrong.

1 Answer

In the first code snippet, the way you applied the function to the column is the root of the issue: Series.apply expects a callable, but nltk.FreqDist(tweets) is evaluated immediately, so apply receives an already-built FreqDist (a dict-like mapping of words to integer counts) instead of a function. Pandas then treats that mapping as an aggregation spec and tries to call its integer values, which is why you get TypeError: 'int' object is not callable.

# this line caused the problem
data['words']= tweets.apply(nltk.FreqDist(tweets))
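To see the difference, here is a minimal sketch with a made-up two-element Series: passing a function works, because pandas invokes it once per element, whereas the line above evaluates nltk.FreqDist(tweets) before apply ever runs.

```python
import pandas as pd

s = pd.Series(["a b a", "b c"])

# apply receives a callable and invokes it on each element
counts = s.apply(lambda text: len(text.split()))
print(counts.tolist())  # [3, 2]

# by contrast, nltk.FreqDist(tweets) in the question is evaluated *before*
# apply runs, so apply is handed a dict-like object rather than a function
```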

Let's say you end up with this simple dataframe after cleaning up the tweets, and you want to compute the word frequencies in each tweet with nltk.FreqDist. Since apply takes a callable, we define a small function and pass that instead.

import pandas as pd

df = pd.DataFrame(
    {
        "tweets": [
            "Hello world",
            "I am the abominable snowman",
            "I did not copy this text",
        ]
    }
)

The dataframe looks like this:

|    | tweets                      |
|---:|:----------------------------|
|  0 | Hello world                 |
|  1 | I am the abominable snowman |
|  2 | I did not copy this text    |

Now let's find out the word frequencies in each of the three sentences here.

import nltk

# define the fdist function
def find_fdist(sentence):
    tokens = nltk.tokenize.word_tokenize(sentence)
    fdist = nltk.FreqDist(tokens)

    return dict(fdist)

# apply the function on `tweets` column
df["words"] = df["tweets"].apply(find_fdist)

The resulting dataframe should look like this:

|    | tweets                      | words                                                         |
|---:|:----------------------------|:--------------------------------------------------------------|
|  0 | Hello world                 | {'Hello': 1, 'world': 1}                                      |
|  1 | I am the abominable snowman | {'I': 1, 'am': 1, 'the': 1, 'abominable': 1, 'snowman': 1}    |
|  2 | I did not copy this text    | {'I': 1, 'did': 1, 'not': 1, 'copy': 1, 'this': 1, 'text': 1} |
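From there, if you also want one row per (tweet, term) pair, the dict column can be flattened. This is just a sketch under simplifying assumptions: a plain whitespace tokenizer and a tiny stand-in stopword set, not the NLTK tokenizer or stopword list from the question.

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({"tweets": ["Hello world", "I am the abominable snowman"]})

STOPWORDS = frozenset({"i", "am", "the"})  # stand-in for the NLTK stopword set

def word_counts(sentence):
    # lowercase, split on whitespace, drop stopwords, count the rest
    tokens = [w.lower() for w in sentence.split() if w.lower() not in STOPWORDS]
    return dict(Counter(tokens))

df["words"] = df["tweets"].apply(word_counts)

# flatten the dict column to one row per (tweet, term, count)
rows = [
    {"tweet": i, "term": term, "count": n}
    for i, counts in df["words"].items()
    for term, n in counts.items()
]
long_df = pd.DataFrame(rows)
```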

3 Comments

Inspect what your data["text"] column looks like. If it has valid sentences in each row, this shouldn't happen. In that case, maybe you can share the dataset with me and I can attempt to reproduce the issue.
That worked! Thank you so much. I would like to clean the result better, but this is what I needed and I think I even understand it. Many thanks!
And the error I was getting was because I forgot the return statement.....supremely stupid mistake!
