
I'm trying to use this lda package to process a term-document matrix CSV file with 39568 rows and 27519 columns, containing only non-negative integer counts.

Problem: I'm getting a MemoryError with my approach of reading the file and storing it in a numpy array.

Goal: Read the counts from the TDM CSV file into a numpy array that I can use as input for lda.

import numpy as np

with open("Results/TDM - Matrix Only.csv", 'r') as matrix_file:
    matrix = np.array([[int(value) for value in line.strip().split(',')] for line in matrix_file])

I've also tried numpy's append, vstack, and concatenate, and I still get the MemoryError.

Is there a way to avoid the MemoryError?

Edit:

I've tried using dtype int32 and int and it gives me:

WindowsError: [Error 8] Not enough storage is available to process this command

I've also tried using dtype float64 and it gives me:

OverflowError: cannot fit 'long' into an index-sized integer

With these two snippets:

fp = np.memmap("Results/TDM-memmap.txt", dtype='float64', mode='w+', shape=(len(documents), len(vocabulary)))
matrix = np.genfromtxt("Results/TDM.csv", dtype='float64', delimiter=',', skip_header=1)
fp[:] = matrix[:]

and

with open("Results/TDM.csv", 'r') as tdm_file:
    vocabulary = [value for value in tdm_file.readline().strip().split(',')]
    fp = np.memmap("Results/TDM-memmap.txt", dtype='float64', mode='w+', shape=(len(documents), len(vocabulary)))
    for idx, line in enumerate(tdm_file):
        fp[idx] = np.array(line.strip().split(','))

Other info that might matter

  • Win10 64bit
  • 8GB RAM (7.9 usable); memory usage rises from roughly 3GB (around 2GB already in use) to a peak of 5.5GB before the MemoryError is reported
  • Python 2.7.10 [MSC v.1500 32 bit (Intel)]
  • Using PyCharm Community Edition 5.0.3
  • Have you tried numpy.loadtxt? Commented Jan 3, 2016 at 20:08
  • Separate the list comprehension (that makes a nested list of lists) from the array call. Which one produces the memory error? loadtxt, genfromtxt do essentially what you are doing - collecting values in a list and making the array at the end. Commented Jan 3, 2016 at 20:08
  • Depending on how many zeros are in your dataset, it may be useful to use a sparse matrix format to avoid memory errors. Commented Jan 3, 2016 at 20:21
  • @karlson Yes, just now, and I get the error from ...\numpy\lib\npyio.py, line 916, in loadtxt, which says "for i, line in enumerate(itertools.chain([first_line], fh)):", followed by the MemoryError. Commented Jan 3, 2016 at 20:44
  • What dtype(s) will the final array contain? If you can't hold the entire .csv file in memory you can read sequential chunks of rows (e.g. here), then write them to a (possibly memory-mapped) numpy array or an HDF5 file. Commented Jan 3, 2016 at 22:42

1 Answer


Since your word counts will be almost all zeros, it would be much more efficient to store them in a scipy.sparse matrix. For example:

from scipy import sparse
import textmining
import lda

# a small example matrix
tdm = textmining.TermDocumentMatrix()
tdm.add_doc("here's a bunch of words in a sentence")
tdm.add_doc("here's some more words")
tdm.add_doc("and another sentence")
tdm.add_doc("have some more words")

# tdm.sparse is a list of dicts, where each dict contains {word:count} for a single
# document
ndocs = len(tdm.sparse)
nwords = len(tdm.doc_count)
words = tdm.doc_count.keys()

# map each word to its column index up front; calling words.index(word)
# inside the loop would be an O(nwords) scan per lookup
word_to_col = dict((word, jj) for jj, word in enumerate(words))

# initialize output sparse matrix
X = sparse.lil_matrix((ndocs, nwords), dtype=int)

# iterate over documents, fill in rows of X
for ii, doc in enumerate(tdm.sparse):
    for word, count in doc.iteritems():
        X[ii, word_to_col[word]] = count

X is now an (ndocs, nwords) scipy.sparse.lil_matrix, and words is a list corresponding to the columns of X:

print(words)
# ['a', 'and', 'another', 'sentence', 'have', 'of', 'some', 'here', 's', 'words', 'in', 'more', 'bunch']

print(X.todense())
# [[2 0 0 1 0 1 0 1 1 1 1 0 1]
#  [0 0 0 0 0 0 1 1 1 1 0 1 0]
#  [0 1 1 1 0 0 0 0 0 0 0 0 0]
#  [0 0 0 0 1 0 1 0 0 1 0 1 0]]

You could pass X directly to lda.LDA.fit, although it will probably be faster to convert it to a scipy.sparse.csr_matrix first:

X = X.tocsr()
model = lda.LDA(n_topics=2, random_state=0, n_iter=100)
model.fit(X)
# INFO:lda:n_documents: 4
# INFO:lda:vocab_size: 13
# INFO:lda:n_words: 21
# INFO:lda:n_topics: 2
# INFO:lda:n_iter: 100
# INFO:lda:<0> log likelihood: -126
# INFO:lda:<10> log likelihood: -102
# INFO:lda:<20> log likelihood: -99
# INFO:lda:<30> log likelihood: -97
# INFO:lda:<40> log likelihood: -100
# INFO:lda:<50> log likelihood: -100
# INFO:lda:<60> log likelihood: -104
# INFO:lda:<70> log likelihood: -108
# INFO:lda:<80> log likelihood: -98
# INFO:lda:<90> log likelihood: -98
# INFO:lda:<99> log likelihood: -99
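
Applied to the TDM CSV from the question, the same idea avoids ever materializing the dense 39568 x 27519 array: stream the file one row at a time and keep only the nonzero counts. A minimal sketch, assuming the layout from the question (a header row of vocabulary terms, then one row of counts per document):

import numpy as np
from scipy import sparse

with open("Results/TDM.csv", 'r') as tdm_file:
    vocabulary = tdm_file.readline().strip().split(',')
    rows = []
    for line in tdm_file:
        counts = np.fromstring(line, dtype=np.int32, sep=',')
        # each row is stored sparsely, keeping only its nonzero entries
        rows.append(sparse.csr_matrix(counts))

X = sparse.vstack(rows).tocsr()  # (ndocs, nwords), ready for lda.LDA.fit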

1 Comment

Took me a while to install SciPy and use it in PyCharm. I ended up using SciPy from the Unofficial Windows Binaries for Python Extension Packages. Tried the code above with my data, and it's working and much faster! Thank you for the quick guide on converting the TDM to a scipy sparse matrix, and for your time!
