
The data I have is stored in a pandas dataframe - please see a reproducible example below. The real dataframe will have more than 10k rows and many more words / phrases per line. I'd like to count the number of times each two-word phrase appears in the column ReviewContent. If this were a text file rather than a dataframe column, I would use NLTK's Collocations module (something along the lines of the answers here or here). My question is: how can I transform the column ReviewContent into a single corpus text?

import numpy as np
import pandas as pd

data = {'ReviewContent' : ['Great food',
'Low prices but above average food',
'Staff was the worst',
'Great location and great food',
'Really low prices',
'The daily menu is usually great',
'I waited a long time to be served, but it was worth it. Great food']}

df = pd.DataFrame(data)

Expected output:

[(('great', 'food'), 3), (('low', 'prices'), 2), ...]

or

[('great food', 3), ('low prices', 2)...]

3 Answers


As a sequence/iterable, df["ReviewContent"] is structured exactly like the result of applying nltk.sent_tokenize() to a text file: a list of strings, one sentence each. So just use it the same way.

import collections

import nltk

counts = collections.Counter()
for sent in df["ReviewContent"]:
    words = nltk.word_tokenize(sent)
    counts.update(nltk.bigrams(words))

If you aren't sure what to do next, that's not connected to using a dataframe. For counting bigrams you don't need the collocations module, just nltk.bigrams() and a counting dictionary.
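To see the loop produce the question's expected output without pulling in NLTK's tokenizer, here is a self-contained sketch that substitutes a plain lowercase-and-split tokenization for nltk.word_tokenize (so punctuation handling differs slightly from the NLTK version):

```python
import collections

import pandas as pd

df = pd.DataFrame({'ReviewContent': [
    'Great food',
    'Low prices but above average food',
    'Staff was the worst',
    'Great location and great food',
    'Really low prices',
    'The daily menu is usually great',
    'I waited a long time to be served, but it was worth it. Great food',
]})

counts = collections.Counter()
for sent in df["ReviewContent"]:
    # lowercase + whitespace split stands in for nltk.word_tokenize here
    words = sent.lower().split()
    # zip(words, words[1:]) yields the same pairs as nltk.bigrams(words)
    counts.update(zip(words, words[1:]))

print(counts.most_common(2))  # [(('great', 'food'), 3), (('low', 'prices'), 2)]
```

Once punkt is downloaded, swapping the split back to nltk.word_tokenize gives proper handling of the commas and periods in the longer reviews.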




I suggest using join:

corpus = ' '.join(df.ReviewContent)

Here's the result:

In [102]: corpus
Out[102]: 'Great food Low prices but above average food Staff was the worst Great location and great food Really low prices The daily menu is usually great I waited a long time to be served, but it was worth it. Great food'

2 Comments

That would work, but it would create "artificial" phrases - the last word of a review joined with the first word of the next review. I could probably work around this somehow - if I don't receive a better answer I'll certainly choose this one.
Hopefully my answer gets at your question "how can I transform column ReviewContent into a single corpus text?" I agree about the downside of artificial phrases and wonder how others handle this. In the past, I've tried joining the text with an indicator symbol like ~ instead of a space, and then use finder = BigramCollocationFinder.from_words(corpus) followed by a filter to remove the artificial phrases: finder.apply_word_filter(lambda w: w == '~'), based on this example code: nltk.org/howto/collocations.html.
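The sentinel-token idea from the comment above can be sketched without NLTK as well - filtering plain bigrams instead of calling BigramCollocationFinder.apply_word_filter. The '~' marker is just an assumed separator that never appears inside a review:

```python
import collections

reviews = ['Great food',
           'Low prices but above average food',
           'Great location and great food']

# join reviews with a sentinel token so no real bigram spans a review boundary
corpus = ' ~ '.join(r.lower() for r in reviews)
words = corpus.split()

# count bigrams, dropping any pair that touches the sentinel
counts = collections.Counter(
    bg for bg in zip(words, words[1:]) if '~' not in bg
)
print(counts[('great', 'food')])  # 2
```

Without the filter, the joined corpus would also yield boundary pairs like ('food', 'low'), which is exactly the artificial-phrase problem the comment describes.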

With pandas version 0.20.1+, you can create a SparseDataFrame directly from a sparse matrix:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(2,2))

r = pd.SparseDataFrame(cv.fit_transform(df.ReviewContent), 
                       columns=cv.get_feature_names(),
                       index=df.index,
                       default_fill_value=0)

Result:

In [52]: r
Out[52]:
   above average  and great  average food  be served  but above  but it  daily menu  great food  great location  \
0              0          0             0          0          0       0           0           1               0
1              1          0             1          0          1       0           0           0               0
2              0          0             0          0          0       0           0           0               0
3              0          1             0          0          0       0           0           1               1
4              0          0             0          0          0       0           0           0               0
5              0          0             0          0          0       0           1           0               0
6              0          0             0          1          0       1           0           1               0

   is usually    ...     staff was  the daily  the worst  time to  to be  usually great  waited long  was the  was worth  \
0           0    ...             0          0          0        0      0              0            0        0          0
1           0    ...             0          0          0        0      0              0            0        0          0
2           0    ...             1          0          1        0      0              0            0        1          0
3           0    ...             0          0          0        0      0              0            0        0          0
4           0    ...             0          0          0        0      0              0            0        0          0
5           1    ...             0          1          0        0      0              1            0        0          0
6           0    ...             0          0          0        1      1              0            1        0          1

   worth it
0         0
1         0
2         0
3         0
4         0
5         0
6         1

[7 rows x 29 columns]

If you simply want to concatenate the strings from all rows into a single one, use Series.str.cat() method:

text = df.ReviewContent.str.cat(sep=' ')

Result:

In [57]: print(text)
Great food Low prices but above average food Staff was the worst Great location and great food Really low prices The daily menu is usually great I waited a long time to be served, but it was worth it. Great food

