
I'm trying to use this lda package to process a term-document matrix CSV file with 39568 rows and 27519 columns, containing only non-negative integer counts.

Problem: I'm getting a MemoryError with my approach of reading the file and storing it in a numpy array.

Goal: Read the counts from the TDM CSV file into a numpy array that I can use as input for lda.

import numpy as np

with open("Results/TDM - Matrix Only.csv", 'r') as matrix_file:
    matrix = np.array([[int(value) for value in line.strip().split(',')] for line in matrix_file])

I've also tried numpy's append, vstack, and concatenate, and I still get the MemoryError.

Is there a way to avoid the MemoryError?

Edit:

I've tried using dtype int32 and int and it gives me:

WindowsError: [Error 8] Not enough storage is available to process this command

I've also tried using dtype float64 and it gives me:

OverflowError: cannot fit 'long' into an index-sized integer

With these two snippets:

fp = np.memmap("Results/TDM-memmap.txt", dtype='float64', mode='w+', shape=(len(documents), len(vocabulary)))
matrix = np.genfromtxt("Results/TDM.csv", dtype='float64', delimiter=',', skip_header=1)
fp[:] = matrix[:]

and

with open("Results/TDM.csv", 'r') as tdm_file:
    vocabulary = [value for value in tdm_file.readline().strip().split(',')]
    fp = np.memmap("Results/TDM-memmap.txt", dtype='float64', mode='w+', shape=(len(documents), len(vocabulary)))
    for idx, line in enumerate(tdm_file):
        fp[idx] = np.array(line.strip().split(','))

Other info that might matter

  • Win10 64bit
  • 8GB RAM (7.9 usable); memory usage rises from roughly 3GB (around 2GB already in use) to a peak of 5.5GB before the MemoryError is reported
  • Python 2.7.10 [MSC v.1500 32 bit (Intel)]
  • Using PyCharm Community Edition 5.0.3
  • Have you tried numpy.loadtxt? Commented Jan 3, 2016 at 20:08
  • Separate the list comprehension (that makes a nested list of lists) from the array call. Which one produces the memory error? loadtxt, genfromtxt do essentially what you are doing - collecting values in a list and making the array at the end. Commented Jan 3, 2016 at 20:08
  • Depending on how many zeros are in your dataset, it may be useful to use a sparse matrix format to avoid memory errors. Commented Jan 3, 2016 at 20:21
  • @karlson Yes, just now, and I get the error from ...\numpy\lib\npyio.py, line 916, in loadtxt, which says "for i, line in enumerate(itertools.chain([first_line], fh)):", followed by the MemoryError. Commented Jan 3, 2016 at 20:44
  • What dtype(s) will the final array contain? If you can't hold the entire .csv file in memory you can read sequential chunks of rows (e.g. here), then write them to a (possibly memory-mapped) numpy array or an HDF5 file. Commented Jan 3, 2016 at 22:42

1 Answer


Since your word counts will be almost all zeros, it would be much more efficient to store them in a scipy.sparse matrix. For example:

from scipy import sparse
import textmining
import lda

# a small example matrix
tdm = textmining.TermDocumentMatrix()
tdm.add_doc("here's a bunch of words in a sentence")
tdm.add_doc("here's some more words")
tdm.add_doc("and another sentence")
tdm.add_doc("have some more words")

# tdm.sparse is a list of dicts, where each dict contains {word:count} for a single
# document
ndocs = len(tdm.sparse)
nwords = len(tdm.doc_count)
words = tdm.doc_count.keys()

# map each word to its column index up front; calling words.index(word)
# inside the loop would be an O(nwords) scan per lookup
word_to_col = dict((word, jj) for jj, word in enumerate(words))

# initialize output sparse matrix
X = sparse.lil_matrix((ndocs, nwords), dtype=int)

# iterate over documents, fill in rows of X
for ii, doc in enumerate(tdm.sparse):
    for word, count in doc.iteritems():
        X[ii, word_to_col[word]] = count

X is now an (ndocs, nwords) scipy.sparse.lil_matrix, and words is a list corresponding to the columns of X:

print(words)
# ['a', 'and', 'another', 'sentence', 'have', 'of', 'some', 'here', 's', 'words', 'in', 'more', 'bunch']

print(X.todense())
# [[2 0 0 1 0 1 0 1 1 1 1 0 1]
#  [0 0 0 0 0 0 1 1 1 1 0 1 0]
#  [0 1 1 1 0 0 0 0 0 0 0 0 0]
#  [0 0 0 0 1 0 1 0 0 1 0 1 0]]

You could pass X directly to lda.LDA.fit, although it will probably be faster to convert it to a scipy.sparse.csr_matrix first:

X = X.tocsr()
model = lda.LDA(n_topics=2, random_state=0, n_iter=100)
model.fit(X)
# INFO:lda:n_documents: 4
# INFO:lda:vocab_size: 13
# INFO:lda:n_words: 21
# INFO:lda:n_topics: 2
# INFO:lda:n_iter: 100
# INFO:lda:<0> log likelihood: -126
# INFO:lda:<10> log likelihood: -102
# INFO:lda:<20> log likelihood: -99
# INFO:lda:<30> log likelihood: -97
# INFO:lda:<40> log likelihood: -100
# INFO:lda:<50> log likelihood: -100
# INFO:lda:<60> log likelihood: -104
# INFO:lda:<70> log likelihood: -108
# INFO:lda:<80> log likelihood: -98
# INFO:lda:<90> log likelihood: -98
# INFO:lda:<99> log likelihood: -99
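
Applied to the TDM CSV from the question, the same idea avoids ever materializing the dense 39568 x 27519 array: stream the file one row at a time and keep only the nonzero counts. A minimal sketch, assuming the layout from the question (a header row of vocabulary terms, then one row of counts per document):

import numpy as np
from scipy import sparse

with open("Results/TDM.csv", 'r') as tdm_file:
    vocabulary = tdm_file.readline().strip().split(',')
    rows = []
    for line in tdm_file:
        counts = np.fromstring(line, dtype=np.int32, sep=',')
        # each row is stored sparsely, keeping only its nonzero entries
        rows.append(sparse.csr_matrix(counts))

X = sparse.vstack(rows).tocsr()  # (ndocs, nwords), ready for lda.LDA.fit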

1 Comment

Took me a while to install SciPy and use it in PyCharm. I ended up using SciPy from the Unofficial Windows Binaries for Python Extension Packages. Tried the code above with my data, and it's working and much faster! Thank you for the quick guide on converting the TDM to a scipy sparse matrix, and for your time!
