
The data I have is stored in a pandas dataframe - please see a reproducible example below. The real dataframe will have more than 10k rows and many more words / phrases per line. I'd like to count the number of times each two-word phrase appears in the column ReviewContent. If this were a text file rather than a dataframe column, I would use NLTK's Collocations module (something along the lines of the answers here or here). My question is: how can I transform the column ReviewContent into a single corpus text?

import numpy as np
import pandas as pd

data = {'ReviewContent' : ['Great food',
'Low prices but above average food',
'Staff was the worst',
'Great location and great food',
'Really low prices',
'The daily menu is usually great',
'I waited a long time to be served, but it was worth it. Great food']}

df = pd.DataFrame(data)

Expected output:

[(('great', 'food'), 3), (('low', 'prices'), 2), ...]

or

[('great food', 3), ('low prices', 2)...]

3 Answers


As a sequence/iterable, df["ReviewContent"] is structured exactly like the result of applying nltk.sent_tokenize() to a text file: a list of strings, one sentence each. So just use it the same way.

import collections

import nltk

counts = collections.Counter()
for sent in df["ReviewContent"]:
    words = nltk.word_tokenize(sent)
    counts.update(nltk.bigrams(words))

If you aren't sure what to do next, that's not connected to using a dataframe. For counting bigrams you don't need the collocations module, just nltk.bigrams() and a counting dictionary.
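To see the loop produce the question's expected output without pulling in NLTK's tokenizer, here is a self-contained sketch that substitutes a plain lowercase-and-split tokenization for nltk.word_tokenize (so punctuation handling differs slightly from the NLTK version):

```python
import collections

import pandas as pd

df = pd.DataFrame({'ReviewContent': [
    'Great food',
    'Low prices but above average food',
    'Staff was the worst',
    'Great location and great food',
    'Really low prices',
    'The daily menu is usually great',
    'I waited a long time to be served, but it was worth it. Great food',
]})

counts = collections.Counter()
for sent in df["ReviewContent"]:
    # lowercase + whitespace split stands in for nltk.word_tokenize here
    words = sent.lower().split()
    # zip(words, words[1:]) yields the same pairs as nltk.bigrams(words)
    counts.update(zip(words, words[1:]))

print(counts.most_common(2))  # [(('great', 'food'), 3), (('low', 'prices'), 2)]
```

Once punkt is downloaded, swapping the split back to nltk.word_tokenize gives proper handling of the commas and periods in the longer reviews.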




I suggest using join:

corpus = ' '.join(df.ReviewContent)

Here's the result:

In [102]: corpus
Out[102]: 'Great food Low prices but above average food Staff was the worst Great location and great food Really low prices The daily menu is usually great I waited a long time to be served, but it was worth it. Great food'

2 Comments

That would work, but it would create "artificial" phrases - the last word of a review joined with the first word of the next review. I could probably work around this somehow - if I don't receive a better answer I'll certainly choose this one.
Hopefully my answer gets at your question "how can I transform column ReviewContent into a single corpus text?" I agree about the downside of artificial phrases and wonder how others handle this. In the past, I've tried joining the text with an indicator symbol like ~ instead of a space, and then use finder = BigramCollocationFinder.from_words(corpus) followed by a filter to remove the artificial phrases: finder.apply_word_filter(lambda w: w == '~'), based on this example code: nltk.org/howto/collocations.html.
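The sentinel-token idea from the comment above can be sketched without NLTK as well - filtering plain bigrams instead of calling BigramCollocationFinder.apply_word_filter. The '~' marker is just an assumed separator that never appears inside a review:

```python
import collections

reviews = ['Great food',
           'Low prices but above average food',
           'Great location and great food']

# join reviews with a sentinel token so no real bigram spans a review boundary
corpus = ' ~ '.join(r.lower() for r in reviews)
words = corpus.split()

# count bigrams, dropping any pair that touches the sentinel
counts = collections.Counter(
    bg for bg in zip(words, words[1:]) if '~' not in bg
)
print(counts[('great', 'food')])  # 2
```

Without the filter, the joined corpus would also yield boundary pairs like ('food', 'low'), which is exactly the artificial-phrase problem the comment describes.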

With pandas version 0.20.1+, you can create a SparseDataFrame directly from a sparse matrix:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(2,2))

r = pd.SparseDataFrame(cv.fit_transform(df.ReviewContent), 
                       columns=cv.get_feature_names(),
                       index=df.index,
                       default_fill_value=0)

Result:

In [52]: r
Out[52]:
   above average  and great  average food  be served  but above  but it  daily menu  great food  great location  \
0              0          0             0          0          0       0           0           1               0
1              1          0             1          0          1       0           0           0               0
2              0          0             0          0          0       0           0           0               0
3              0          1             0          0          0       0           0           1               1
4              0          0             0          0          0       0           0           0               0
5              0          0             0          0          0       0           1           0               0
6              0          0             0          1          0       1           0           1               0

   is usually    ...     staff was  the daily  the worst  time to  to be  usually great  waited long  was the  was worth  \
0           0    ...             0          0          0        0      0              0            0        0          0
1           0    ...             0          0          0        0      0              0            0        0          0
2           0    ...             1          0          1        0      0              0            0        1          0
3           0    ...             0          0          0        0      0              0            0        0          0
4           0    ...             0          0          0        0      0              0            0        0          0
5           1    ...             0          1          0        0      0              1            0        0          0
6           0    ...             0          0          0        1      1              0            1        0          1

   worth it
0         0
1         0
2         0
3         0
4         0
5         0
6         1

[7 rows x 29 columns]

If you simply want to concatenate the strings from all rows into a single one, use Series.str.cat() method:

text = df.ReviewContent.str.cat(sep=' ')

Result:

In [57]: print(text)
Great food Low prices but above average food Staff was the worst Great location and great food Really low prices The daily menu is usually great I waited a long time to be served, but it was worth it. Great food

