
I'm looking for an effective way to construct a Term Document Matrix in Python that can be used together with extra data.

I have some text data along with a few other attributes. I would like to run some analyses on the text and correlate features extracted from it (such as individual word tokens or LDA topics) with the other attributes.

My plan was to load the data as a pandas data frame, with each response representing a document. Unfortunately, I ran into an issue:

import pandas as pd
import nltk

pd.options.display.max_colwidth = 10000

txt_data = pd.read_csv("data_file.csv",sep="|")
txt = str(txt_data.comment)
len(txt)
Out[7]: 71581 

txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[10]: 45

txt_lines = []
f = open("txt_lines_only.txt")
for line in f:
    txt_lines.append(line)

txt = str(txt_lines)
len(txt)
Out[14]: 1668813

txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[17]: 10086

Note that in both cases, the text was preprocessed so that anything other than spaces, letters, and ,.?! was removed (for simplicity).

As you can see, converting a pandas column into a string returns fewer matches, and the resulting string is also shorter.

Is there any way to improve the above code?

Also, str(x) creates one big string out of the comments, while [str(x) for x in txt_data.comment] creates a list object that cannot be broken into a bag of words. What is the best way to produce an nltk.Text object that retains document indices? In other words, I'm looking for a way to create a Term Document Matrix, the equivalent of R's TermDocumentMatrix() from the tm package.

Many thanks.

2 Comments

  • not sure what your question is, but there are other libraries for NLP that might be of help for you, libraries like pattern, textblob, C&C. If you reach a dead end you can try those libraries too; each of them has its own advantages over the others. Commented Jan 14, 2016 at 8:01
  • Thanks @mid, I'm aware of gensim, but I'd never heard of textblob before; it does indeed look useful! I'm quite new to Python (I usually work in R) and I really doubt that I've reached a dead end with NLTK. Considering how popular the package is, I'm certain that I'm just missing something. Commented Jan 16, 2016 at 2:57

1 Answer


The benefit of using a pandas DataFrame would be to apply the nltk functionality to each row like so:

import numpy as np
import pandas as pd
import nltk
from nltk import word_tokenize

# toy corpus: 50 "documents", each made of 1000 random dictionary words
word_file = "/usr/share/dict/words"
words = open(word_file).read().splitlines()[10:50]
random_word_list = [[' '.join(np.random.choice(words, size=1000, replace=True))] for i in range(50)]

df = pd.DataFrame(random_word_list, columns=['text'])
df.head()

                                                text
0  Aaru Aaronic abandonable abandonedly abaction ...
1  abampere abampere abacus aback abalone abactor...
2  abaisance abalienate abandonedly abaff abacina...
3  Ababdeh abalone abac abaiser abandonable abact...
4  abandonable abandon aba abaiser abaft Abama ab...

len(df)

50

txt = df.text.apply(word_tokenize)
txt.head()

0    [Aaru, Aaronic, abandonable, abandonedly, abac...
1    [abampere, abampere, abacus, aback, abalone, a...
2    [abaisance, abalienate, abandonedly, abaff, ab...
3    [Ababdeh, abalone, abac, abaiser, abandonable,...
4    [abandonable, abandon, aba, abaiser, abaft, Ab...

txt.apply(len)

0     1000
1     1000
2     1000
3     1000
4     1000
....
44    1000
45    1000
46    1000
47    1000
48    1000
49    1000
Name: text, dtype: int64
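
Since txt is a Series aligned with df's index, per-document features like these can be stored next to the other attributes, which is what makes correlating them straightforward. A minimal sketch (the column names n_tokens and n_types are illustrative, not from the question):

# attach per-document features as new columns, aligned by index
df['n_tokens'] = txt.apply(len)
df['n_types'] = txt.apply(lambda tokens: len(set(tokens)))  # unique words
df[['n_tokens', 'n_types']].describe()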

In the same way, you can get the .count() for each row entry:

txt = txt.apply(lambda x: nltk.Text(x).count('abac'))
txt.head()

0    27
1    24
2    17
3    25
4    32

You can then sum the result using:

txt.sum()

1239
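
If what you ultimately want is a full term-document matrix rather than counts for a single word, one option (an assumption on my part; NLTK does not provide this directly) is scikit-learn's CountVectorizer, which produces roughly the transpose of R's TermDocumentMatrix() from the tm package:

from sklearn.feature_extraction.text import CountVectorizer

# a minimal sketch, assuming scikit-learn is installed:
# CountVectorizer builds a sparse document-term matrix of token counts
vec = CountVectorizer()
dtm = vec.fit_transform(df.text)  # shape: (n_documents, n_terms)

# wrap it in a DataFrame so each row keeps its document index and can be
# joined back onto the other attributes
# (use vec.get_feature_names() on older scikit-learn releases)
dtm_df = pd.DataFrame(dtm.toarray(), columns=vec.get_feature_names_out(), index=df.index)
dtm_df['abac'].head()

Each row of dtm_df lines up with the corresponding row of df, so word counts can be correlated with the other attributes directly.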

2 Comments

Thanks @Stefan, that just about resolves my problem; however, the txt object is still a pandas Series, which means that I can only use some of the NLTK functions via apply, map or for loops. However, if I want to do something like nltk.Text(txt).concordance("the") I will run into problems. To resolve this I will still need to convert the entire text variable into a string, and as we saw in my first example, that string gets truncated for some reason. Any thoughts on how to overcome this? Many thanks!
You can convert the entire text column into one list of words using: [t for t in df.text.tolist()] - either after creation or after .tokenize().
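
For what it's worth, the shortened string in the question comes from str(txt_data.comment) returning the Series' printed repr, which elides rows beyond pd.options.display.max_rows; it never contains the full column. To get one corpus-wide nltk.Text for calls like concordance(), you can instead flatten the tokenized column. A sketch building on the txt Series produced by df.text.apply(word_tokenize) above, before it is overwritten by the counting step:

# flatten per-document token lists into one token list; document
# boundaries are lost, so use this only for corpus-level calls
all_tokens = [token for doc in txt for token in doc]
corpus = nltk.Text(all_tokens)
corpus.concordance('abac')

Alternatively, ' '.join(df.text) yields the full, untruncated string, which can then be tokenized as in the question.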
