I'm trying to apply an SVM from scikit-learn to classify tweets I collected. There will be two categories; call them A and B. For now, I have all the tweets sorted into two text files, 'A.txt' and 'B.txt'. However, I'm not sure what type of input data the scikit-learn SVM expects. I have a dictionary with the labels (A and B) as its keys and, as values, a dictionary of features (unigrams) and their frequencies. Sorry, I'm really new to machine learning and not sure what I should do to get the SVM to work. I found that the SVM takes a numpy.ndarray as its input data. Do I need to create one from my own data? Should it look something like this?

Labels    features    frequency
  A        'book'        54
  B       'movies'       32

Any help is appreciated.

1 Answer

Have a look at the documentation on text feature extraction.

Also have a look at the text classification example.

There is also a tutorial here:

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
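To make the feature-extraction step concrete, here is a minimal sketch of what the input usually looks like: one row per tweet and one column per unigram, plus a separate list holding one label per tweet. The tweet texts below are made-up stand-ins for the contents of A.txt and B.txt, and the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for tweets read from A.txt and B.txt (one tweet per line).
tweets_a = ["I loved this book", "great book about history"]
tweets_b = ["the movies were boring", "two movies this weekend"]

texts = tweets_a + tweets_b
labels = ["A"] * len(tweets_a) + ["B"] * len(tweets_b)  # one label per tweet

# CountVectorizer tokenizes each tweet and counts unigrams.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)  # sparse matrix, shape (n_tweets, n_unigrams)

print(X.shape)
print(sorted(vectorizer.vocabulary_))  # the unigrams that became columns
```

So the frequencies are counted per tweet rather than per class, and scikit-learn's linear models accept the resulting sparse matrix directly, so there should be no need to build a dense numpy.ndarray by hand.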

In particular, don't focus too much on SVM models (especially not sklearn.svm.SVC, which is more interesting for kernel models and hence not for text classification): a simple Perceptron, LogisticRegression, or Bernoulli naive Bayes model might work just as well while being much faster to train.
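As a rough sketch of that advice, a vectorizer and a linear classifier can be chained in a pipeline so that raw tweet strings go in and predicted labels come out. The tiny dataset and the choice of TfidfVectorizer are assumptions made for illustration; LogisticRegression could be swapped for Perceptron, BernoulliNB, or LinearSVC without changing the rest.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; in practice, read one tweet per line from A.txt and B.txt.
texts = [
    "I loved this book",
    "great book about history",
    "the movies were boring",
    "two movies this weekend",
]
labels = ["A", "A", "B", "B"]

# The pipeline vectorizes the text and then fits the linear model on the result.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# New, unseen tweets are passed as plain strings.
predictions = model.predict(["a new book to read", "which movies are out"])
print(predictions)
```

With only four training tweets the predictions are not meaningful; the point is the shape of the workflow.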

6 Comments

Multinomial naive Bayes and SVM will both work for you.
The link to the text classification example returns a 404.
Thanks for the report; I fixed the broken link.
@ogrisel: I am trying naive Bayes with 10 classes, but I'm not satisfied with the results. An SVM is a good fit if the dataset is small, with around 100 sentences per class.
For a small number of samples (e.g. fewer than 10,000 or so), SVC(kernel='linear') might be fast enough to converge. However, it should give predictive performance similar to LinearSVC and comparable to LogisticRegression, both of which should be faster and can scale to hundreds of thousands of samples. In each case you need to pick the best value of C via cross-validation. Furthermore, LogisticRegression provides good probability estimates by default (via the predict_proba method). This is why I advise you to use linear models over the generic SVC by default.
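A minimal sketch of picking C by cross-validation and then reading probabilities from LogisticRegression might look like the following. The toy tweets, the candidate values for C, and cv=3 are all assumptions for illustration; with real data you would use far more samples and likely a wider grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: six tweets per class so that 3-fold CV has samples to split.
texts = [
    "I loved this book", "great book about history", "reading a long novel",
    "the author signed my book", "a new chapter every night", "library books are free",
    "the movies were boring", "two movies this weekend", "that film had great actors",
    "watching a movie tonight", "the cinema was crowded", "best film of the year",
]
labels = ["A"] * 6 + ["B"] * 6

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())

# make_pipeline names each step after its lowercased class name,
# hence the "logisticregression__C" parameter key.
param_grid = {"logisticregression__C": [0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)

print(search.best_params_)

# After refitting on all data, predict_proba returns one probability per class.
proba = search.predict_proba(["which book should I read next"])
print(dict(zip(search.classes_, proba[0])))
```

The same grid search works for LinearSVC by replacing the estimator and using "linearsvc__C" as the parameter key, although LinearSVC has no predict_proba method.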