Prepare data for text classification using Scikit Learn SVM

Question

I'm trying to use an SVM from scikit-learn to classify tweets I collected. There are two categories; call them A and B. For now, I have all the tweets sorted into two text files, 'A.txt' and 'B.txt'. However, I'm not sure what kind of input the scikit-learn SVM expects. I have a dictionary with the labels (A and B) as keys and, as values, a dictionary of features (unigrams) and their frequencies. Sorry, I'm really new to machine learning and not sure what I need to do to get the SVM to work. I found that the SVM takes a numpy.ndarray as its input. Do I need to create one from my own data? Should it look something like this?

Labels    features    frequency
  A        'book'        54
  B       'movies'       32

Any help is appreciated.

ogrisel · Accepted Answer · 2015-04-29 11:59:14Z

21

Have a look at the documentation on text feature extraction.
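As a rough illustration of what that documentation covers, here is a minimal sketch of turning raw tweets into the matrix that scikit-learn estimators expect. The tweets below are made-up stand-ins for the contents of A.txt and B.txt (assumed to hold one tweet per line). Note that the result has one row per tweet and one column per unigram, rather than one row per (label, feature) pair.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the lines of A.txt and B.txt (one tweet per line).
tweets_a = ["I loved this book", "a great book to read"]
tweets_b = ["the movies were boring", "two movies tonight"]

# One entry per tweet, with a parallel list of labels.
texts = tweets_a + tweets_b
labels = ["A"] * len(tweets_a) + ["B"] * len(tweets_b)

# CountVectorizer tokenizes each tweet and counts its unigrams.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# X is a SciPy sparse matrix: rows are tweets, columns are unigrams.
print(X.shape)
# Column index assigned to the unigram "book".
print(vectorizer.vocabulary_["book"])
```

The sparse matrix X and the list labels can be passed directly to the fit method of a classifier; there is no need to build a dense numpy.ndarray by hand.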

Also have a look at the text classification example.

There is also a tutorial here:

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In particular, don't focus too much on SVM models (especially not sklearn.svm.SVC, which is more interesting for kernel models and hence not for text classification): a simple Perceptron, LogisticRegression, or Bernoulli naive Bayes model might work as well while being much faster to train.
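A minimal sketch of that advice, assuming a handful of made-up tweets in place of the real A.txt and B.txt data: a pipeline that chains a TfidfVectorizer with a LogisticRegression, so that raw strings go in and predicted labels come out. The choice of TF-IDF weighting here is an assumption for illustration; plain counts would also work.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data standing in for the tweets in A.txt and B.txt.
texts = [
    "I loved this book",
    "a great book to read",
    "the movies were boring",
    "two movies tonight",
]
labels = ["A", "A", "B", "B"]

# The pipeline vectorizes the text and then fits a linear classifier on it.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# New tweets are vectorized with the vocabulary learned during fit.
new_tweets = ["reading a new book", "movies on friday night"]
predicted = model.predict(new_tweets)
print(list(predicted))
```

Swapping LogisticRegression() for Perceptron() or sklearn.svm.LinearSVC() leaves the rest of the sketch unchanged, which makes it cheap to compare the models mentioned above.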

edited Apr 29, 2015 at 11:59

answered Dec 18, 2012 at 22:59


6 Comments

Divyang Shah Over a year ago

Multinomial naive Bayes and SVM will both work for you.

Alex Plugaru Over a year ago

The link to the text classification example returns a 404.

ogrisel Over a year ago

Thanks for the report. I fixed the broken link.

user123 Over a year ago

@ogrisel: I am trying naive Bayes with 10 classes, but I'm not satisfied with the results. An SVM is a good fit if the dataset is small, with around 100 sentences per class.

ogrisel Over a year ago

For a small number of samples (e.g. fewer than 10,000 or so), SVC(kernel='linear') might be fast enough to converge. However, it should give predictive performance similar to LinearSVC and comparable to LogisticRegression, both of which should be faster and can scale to hundreds of thousands of samples. In each case you need to pick the best value for C via cross-validation. Furthermore, LogisticRegression provides good probability estimates by default (with the predict_proba method). This is why I advise you to use linear models rather than the generic SVC by default.
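Picking C via cross-validation could look roughly like the sketch below. The toy tweets, the grid of C values, and the number of folds are all assumptions chosen for illustration, not recommendations; real data would need a much larger sample.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: six tweets per class.
texts = [
    "I loved this book", "a great book to read", "this book is long",
    "new book from the library", "reading a book tonight", "my favourite book",
    "the movies were boring", "two movies tonight", "movies with friends",
    "watching movies all day", "old movies are fun", "the best movies ever",
]
labels = ["A"] * 6 + ["B"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate values for the regularization parameter C (illustrative grid).
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}

# 3-fold stratified cross-validation over the grid; the best model is
# refit on all the data afterwards.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)
print(search.best_params_)

# Class probabilities from the refit LogisticRegression pipeline.
proba = search.predict_proba(["a new book"])
print(proba)
```

The same grid search works with LinearSVC in place of LogisticRegression, except that LinearSVC has no predict_proba method.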
