I'm trying to plug a bunch of data (sentiment-tagged tweets) into an SVM using scikit-learn. I've been using CountVectorizer to build a sparse array of word counts, and it all works fine with smallish data sets (~5000 tweets). However, when I try to use a larger corpus (ideally 150,000 tweets, but I'm currently exploring with 15,000), .toarray(), which converts the sparse format to a dense one, immediately starts taking up immense amounts of memory (30k tweets hit over 50 GB before the MemoryError).

So my question is -- is there a way to feed LinearSVC() or a different manifestation of SVM a sparse matrix? Am I necessarily required to use a dense matrix? It doesn't seem like a different vectorizer would help fix this problem (as this problem seems to be solved by: MemoryError in toarray when using DictVectorizer of Scikit Learn). Is a different model the solution? It seems like all of the scikit-learn models require a dense array representation at some point, unless I've been looking in the wrong places.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm

cv = CountVectorizer(analyzer=str.split)
clf = svm.LinearSVC()

X = cv.fit_transform(data)
trainArray = X[:breakpt].toarray()
testArray = X[breakpt:].toarray()

clf.fit(trainArray, label)
guesses = clf.predict(testArray)

1 Answer

LinearSVC.fit and its predict method can both handle a sparse matrix as the first argument, so just removing the toarray calls from your code should work.

All estimators that take sparse inputs are documented as doing so. E.g., the docstring for LinearSVC states:

Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
    Training vector, where n_samples is the number of samples and
    n_features is the number of features.
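The fix described above can be sketched as follows — a minimal example (with a tiny made-up corpus for illustration) showing LinearSVC trained directly on the sparse matrix that CountVectorizer returns, with no .toarray() call anywhere:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Tiny illustrative corpus with sentiment labels (1 = positive, 0 = negative).
data = ["good happy tweet", "bad sad tweet", "great fun day", "awful terrible day"]
labels = [1, 0, 1, 0]

cv = CountVectorizer(analyzer=str.split)
X = cv.fit_transform(data)   # returns a scipy.sparse matrix; stays sparse

clf = LinearSVC()
clf.fit(X, labels)           # accepts the sparse matrix directly
guesses = clf.predict(X)     # predict also handles sparse input
print(guesses)
```

Because the word-count matrix is mostly zeros, keeping it sparse means memory scales with the number of nonzero entries rather than n_samples × n_features, which is what makes the 150,000-tweet corpus feasible.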

1 Comment

Wow - I can't believe I missed that. Thanks for pointing it out!
