
I have a text classification task with 2599 documents and five labels from 1 to 5. The distribution of documents across labels is:

label | texts
------+------
5     | 1190
4     |  839
3     |  239
1     |  204
2     |  127

I have already classified this textual data, but with very low performance, and I also get warnings about ill-defined metrics:

Accuracy: 0.461057692308

score: 0.461057692308

precision: 0.212574195636

recall: 0.461057692308

(warning output, truncated: 'precision', 'predicted', average, warn_for)

confusion matrix:
[[  0   0   0   0 153]
 [  0   0   0   0  94]
 [  0   0   0   0 194]
 [  0   0   0   0 680]
 [  0   0   0   0 959]]

classification report:
             precision    recall  f1-score   support

          1       0.00      0.00      0.00       153
          2       0.00      0.00      0.00        94
          3       0.00      0.00      0.00       194
          4       0.00      0.00      0.00       680
          5       0.46      1.00      0.63       959

avg / total       0.21      0.46      0.29      2080

Clearly this is happening because I have an imbalanced dataset, so I found this paper where the authors propose several approaches to deal with the issue:

The problem is that with imbalanced datasets, the learned boundary is too close to the positive instances. We need to bias SVM in a way that will push the boundary away from the positive instances. Veropoulos et al [14] suggest using different error costs for the positive (C+) and negative (C−) classes.

I know that this could be very complicated, but SVC offers several hyperparameters. So my question is: is there any way to bias SVC so that it pushes the boundary away from the positive instances, using the hyperparameters that the SVC classifier offers? I know this could be a difficult problem, but any help is welcome. Thanks in advance.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True,
                             sublinear_tf=False, ngram_range=(2, 2))

df = pd.read_csv('/path/of/the/file.csv',
                 header=0, sep=',', names=['id', 'text', 'label'])



# Vectorize the text and extract the labels.
reduced_data = tfidf_vect.fit_transform(df['text'].values)
y = df['label'].values

# Reduce the tf-idf matrix to 5 dimensions.
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=5)
reduced_data = svd.fit_transform(reduced_data)

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(reduced_data, y,
                                                    test_size=0.33)

# Separating hyperplane with no class weights:

from sklearn.svm import SVC
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)

w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]


# get the separating hyperplane using weighted classes
wclf = SVC(kernel='linear', class_weight={1: 10})
wclf.fit(X_train, y_train)
wprediction = wclf.predict(X_test)

ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]

# plot both separating hyperplanes and the samples; only the first two
# of the five SVD components can be shown on a 2d plot
import matplotlib.pyplot as plt
h0 = plt.plot(xx, yy, 'k-', label='no weights')
h1 = plt.plot(xx, wyy, 'k--', label='with weights')
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=y, cmap=plt.cm.Paired)
plt.legend()

plt.axis('tight')
plt.show()

But I get nothing meaningful and I can't understand what happened. This is the plot:

[plot: weighted vs normal]

then:

#Let's show some metrics [unweighted]:
from sklearn.metrics import precision_score, recall_score, \
    confusion_matrix, classification_report, accuracy_score
print '\nAccuracy:', accuracy_score(y_test, prediction)
print '\nscore:', clf.score(X_train, y_train)
print '\nrecall:', recall_score(y_test, prediction)
print '\nprecision:', precision_score(y_test, prediction)
print '\nclassification report:\n', classification_report(y_test, prediction)
print '\nconfusion matrix:\n', confusion_matrix(y_test, prediction)

#Let's show some metrics [weighted]:
print 'weighted:\n'

print '\nAccuracy:', accuracy_score(y_test, wprediction)
print '\nscore:', wclf.score(X_train, y_train)
print '\nrecall:', recall_score(y_test, wprediction)
print '\nprecision:', precision_score(y_test, wprediction)
print '\nclassification report:\n', classification_report(y_test, wprediction)
print '\nconfusion matrix:\n', confusion_matrix(y_test, wprediction)

This is the data I'm using. How can I fix this and plot this problem the right way? Thanks in advance!

Following an answer to this question, I removed these lines:

#
# from sklearn.decomposition.truncated_svd import TruncatedSVD
# svd = TruncatedSVD(n_components=5)
# reduced_data = svd.fit_transform(reduced_data)


#
# w = clf.coef_[0]
# a = -w[0] / w[1]
# xx = np.linspace(-10, 10)
# yy = a * xx - clf.intercept_[0] / w[1]

# ww = wclf.coef_[0]
# wa = -ww[0] / ww[1]
# wyy = wa * xx - wclf.intercept_[0] / ww[1]
#
# # plot separating hyperplanes and samples
# import matplotlib.pyplot as plt
# h0 = plt.plot(xx, yy, 'k-', label='no weights')
# h1 = plt.plot(xx, wyy, 'k--', label='with weights')
# plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=y, cmap=plt.cm.Paired)
# plt.legend()
#
# plt.axis('tight')
# plt.show()

These were the results:

Accuracy: 0.787878787879

score: 0.779437105112

recall: 0.787878787879

precision: 0.827705441238

These metrics improved. How can I plot these results so that I get a nice example like the one in the documentation? I would like to see the behavior of the two hyperplanes. Thanks!

3 Comments

  • Clearly this is happening by the fact that I have an unbalanced dataset - I don't find that clear at all based on what you've said. Can you please show us your code and maybe even data? Commented Feb 12, 2015 at 11:05
  • What do you get without the SVD and without touching the class_weight parameter? Try to focus on the performance first and then on plotting. Commented Feb 15, 2015 at 9:17
  • @Ivlad without using the example from the documentation for unbalanced datasets, this is the performance I got: Accuracy: 0.461057692308, score: 0.461057692308, precision: 0.212574195636, recall: 0.461057692308. This was the best I could do with grid search. Commented Feb 15, 2015 at 18:10

5 Answers

4

By reducing your data to 5 features with the SVD:

svd = TruncatedSVD(n_components=5)
reduced_data = svd.fit_transform(reduced_data)

You lose a lot of information. Just by removing those lines I get 78% accuracy.

Leaving the class_weight parameter as you set it seems to do better than removing it. I haven't tried giving it other values.

Look into using k-fold cross validation and grid search to tune the parameters of your model. You can also use a pipeline if you want to reduce the dimensionality of your data, in order to figure out how much you want to reduce it without affecting performance. Here is an example that shows how to tune your entire pipeline using grid search.
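A minimal sketch of what that tuning could look like for this setup; X_tfidf is a placeholder name for the full tf-idf matrix from the question (called reduced_data there, before the SVD), and the grid values are illustrative, not recommendations:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in 0.18+

# Chain the dimensionality reduction and the classifier so that both
# are tuned together.
pipeline = Pipeline([
    ('svd', TruncatedSVD()),
    ('svc', SVC(kernel='linear')),
])

# Search over the number of SVD components and the error cost C at
# the same time, to see how much reduction hurts (or helps).
param_grid = {
    'svd__n_components': [5, 50, 100, 300],
    'svc__C': [0.1, 1, 10],
}

grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid.fit(X_tfidf, y)  # X_tfidf: the tf-idf matrix, before any SVD
print grid.best_params_
print grid.best_score_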

As for plotting, you can only plot 2d or 3d data. After you train using more dimensions, you can reduce your data to 2 or 3 dimensions and plot that. See here for a plotting example. The code looks similar to what you're plotting, and I got similar results to yours. The problem is that your data has many features, and you can only project it onto a 2d or 3d surface. That will usually look weird and make it hard to tell what is going on.

I suggest you don't bother with plotting, as it's not going to tell you much for data in high dimensions. Use k-fold cross validation with a grid search to get the best parameters, and if you want to look into overfitting more closely, plot learning curves instead.
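For the learning curves, a sketch along these lines should work (clf and X_tfidf are placeholders for your estimator and feature matrix, as above):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.learning_curve import learning_curve  # sklearn.model_selection in 0.18+

# Mean train/test scores for increasing training set sizes; widely
# diverging curves suggest overfitting, two low curves underfitting.
train_sizes, train_scores, test_scores = learning_curve(
    clf, X_tfidf, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='training score')
plt.plot(train_sizes, test_scores.mean(axis=1), 'o-', label='cross-validation score')
plt.xlabel('training set size')
plt.ylabel('score')
plt.legend(loc='best')
plt.show()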

All this combined will tell you a lot more about the behavior of your model than plotting the hyperplane.


11 Comments

Thanks for the help. I removed those lines and I get this exception: Traceback (most recent call last): File "/Users/user/test.py", line 35, in <module> a = -w[0] / w[1] File "/usr/local/lib/python2.7/site-packages/scipy/sparse/csr.py", line 253, in __getitem__ return self._get_row_slice(row, col) File "/usr/local/lib/python2.7/site-packages/scipy/sparse/csr.py", line 320, in _get_row_slice raise IndexError('index (%d) out of range' % i) IndexError: index (1) out of range. Any idea how to fix this?
@ml_guy - I also commented w = clf.coef_[0] a = -w[0] / w[1] xx = np.linspace(-5, 5) yy = a * xx - clf.intercept_[0] / w[1] in order to fix that. I also commented the plotting code.
@ml_guy - you can choose to plot it in 2 or 3 dimensions, you cannot plot it in higher dimensions. For this, you need to reduce your data using PCA or another dimensionality reduction algorithm. A cross validation method will not prevent overfitting, but it will help you identify it: if you get good results using cross validation, you can assume your algorithm is good. Otherwise you can assume there's a problem. I am going to add plotting code to my answer in a short while.
@ml_guy - you're not doing something wrong, if you reduce your dimensions too much, you'll get worse performance. That's to be expected. You shouldn't reduce them unless you have a good reason, such as to improve execution time or accuracy, the latter not being the case here. I've added more information to my post regarding how you can diagnose issues with your model.
@ml_guy - I don't know how else to do it other than the way shown on the scikit-learn site that I linked to. If that doesn't give good results, then your data just cannot be plotted satisfactorily with reduced dimensions. I guess you can also try Principal Component Analysis (PCA) from scikit-learn, see if that helps.
2

If I understood your input correctly you have:

  • 1190 texts labeled 5
  • 1409 texts labeled 1-4

You may try to do a sequential classification. First, treat all texts labeled 5 as 1 and all others as 0, and train a classifier for this task.

Second, drop all the 5-labeled examples from your dataset and train a classifier to classify the labels 1-4.

At prediction time, run the first classifier; if it returns 0, run the second classifier to obtain the final label, as in the sketch below.
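A rough sketch of this two-stage scheme (binary_clf, rest_clf, and predict_sequential are placeholder names, and X_train/X are assumed to be dense feature arrays, e.g. the SVD output):

import numpy as np
from sklearn.svm import SVC

# Stage 1: is the text labeled 5 or not?
binary_clf = SVC(kernel='linear')
binary_clf.fit(X_train, (y_train == 5).astype(int))

# Stage 2: trained only on the texts labeled 1-4.
mask = y_train != 5
rest_clf = SVC(kernel='linear')
rest_clf.fit(X_train[mask], y_train[mask])

def predict_sequential(X):
    # Run stage 1; wherever it says "not a 5", ask stage 2 instead.
    labels = np.where(binary_clf.predict(X) == 1, 5, 0)
    rest = labels == 0
    if rest.any():
        labels[rest] = rest_clf.predict(X[rest])
    return labels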

Though I don't think this distribution is really skewed and unbalanced (it should be something like 90% label 5 and 10% everything else to be really skewed, so that it might be interesting to introduce a bias to the SVC). Thus I think you might want to try some other classification algorithm, since it looks like your choice is not suitable for this task. Or maybe you need to use a different kernel with your SVC (I assume you use a linear kernel; try something different, RBF or polynomial maybe).

5 Comments

I am using RBF, and I tried multinomial, RF, and LR; with SVC I got the best performance.
Thanks for the feedback. Any idea of how to bias the SVC classifier?
I believe klubow provided some useful info below. In scikit-learn you can read about the proper way of working with unbalanced classes here: scikit-learn.org/stable/auto_examples/svm/… Keep in mind that you will need to manually tune the class weights for optimal performance, or try to use GridSearchCV to tune them automatically (though you will need to use a specific scoring metric, otherwise you won't get a good result).
Thanks for the reference. I don't get what class_weight={1: 10} means (i.e. {1: 10}). For my case would it be {1: 5}?
It is your class label followed by the weight you give it. In your case it would be {5: 10}, for example (if your '5' class has this label), if you want to give the C parameter a bigger value for the 5th class.
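To make the {label: weight} mechanics concrete: scikit-learn sets the error cost of class i to class_weight[i] * C, which is exactly the Veropoulos-style asymmetric cost from the question. A minimal sketch, upweighting the rarest class (2) by an illustrative, untuned factor:

from sklearn.svm import SVC

# With C=1.0 and class_weight={2: 10}, mistakes on class 2 cost ten
# times as much as mistakes on any other class, pushing the decision
# boundary away from the minority class.
clf = SVC(kernel='linear', C=1.0, class_weight={2: 10})
clf.fit(X_train, y_train)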
2

As a simple solution, just replicate the instances of the smaller classes until the number of instances is balanced. Even if it seems stupid, this works, and it does not require any intricate configuration.

The idea behind this approach is to mimic the behaviour of a learning rate scaled per class according to class size. That is, in gradient-based optimization methods, you would scale the learning rate inversely proportional to the class size for each class, so that you prevent the model from over-learning some classes at the expense of others.

If your problem is pretty big and you are using batch updates, then instead of scanning the whole dataset and counting classes, consider only the mini-batch and tune the learning rates dynamically according to the number of instances of each class in the mini-batch.

That means: if your master learning rate is 0.01 and in a batch of instances 0.4 of them are class A and 0.6 are class B, then for class A you keep the master learning rate and for class B you use 2/3 of it (0.4/0.6). Hence you step wider for class A and more conservatively for class B, as in the sketch below.
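A tiny sketch of that arithmetic, for a hypothetical gradient-based learner (note this knob does not exist on SVC; see the comments below):

import numpy as np

def per_class_learning_rates(batch_labels, master_rate=0.01):
    # Scale the rate inversely proportional to each class's share of
    # the mini-batch; the rarest class keeps the master rate.
    batch_labels = np.asarray(batch_labels)
    classes, counts = np.unique(batch_labels, return_counts=True)
    fractions = counts / float(len(batch_labels))
    return dict(zip(classes, master_rate * fractions.min() / fractions))

# 40% class A, 60% class B -> A keeps 0.01, B gets 2/3 * 0.01.
print per_class_learning_rates(['A'] * 4 + ['B'] * 6)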

My choice, especially for large problems, is to augment the data of the smaller classes by replicating instances or, as a more robust choice, by adding some noise and variance to the replicated instances; a sketch follows. That way (depending on your problem) you can also train a model that is more robust to small changes (this is very common especially for image classification problems).
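A minimal numpy sketch of that replicate-and-perturb idea (oversample_with_noise is a placeholder name, the noise scale is arbitrary, and X is assumed to be a dense array, e.g. the SVD output):

import numpy as np

def oversample_with_noise(X, y, noise_scale=0.01, seed=0):
    # Replicate minority-class rows until every class matches the
    # largest one, adding small Gaussian noise to the copies.
    rng = np.random.RandomState(seed)
    classes, counts = np.unique(y, return_counts=True)
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        n_extra = counts.max() - count
        if n_extra == 0:
            continue
        picks = rng.choice(np.where(y == cls)[0], size=n_extra, replace=True)
        noise = rng.normal(scale=noise_scale, size=(n_extra, X.shape[1]))
        X_parts.append(X[picks] + noise)
        y_parts.append(np.repeat(cls, n_extra))
    return np.vstack(X_parts), np.concatenate(y_parts)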

12 Comments

Do you have any references for the statement that this works? I don't see it working with SVMs.
You just balance the number of model updates for each class. Theoretically, it is more meaningful to set learning rate inversely proportional to number of instances per class. By setting the number of instances you just mimic the same behavior in a naive way.
@ml_guy Because SVMs find a large-margin separating hyperplane based on the points you have. Having more of the same points will not change what the resulting hyperplane looks like.
Before downvoting just try this with a toy problem. Then we can discuss again.
1. A toy problem doesn't mean anything; 2. You did not give any arguments or references for your suggestion; 3. It seems like you're suggesting something that scikit-learn already allows you to do in a better way; 4. Really, what learning rate for SVC? There's no such thing to tweak. So really, this is a bad answer regardless of whether or not it works.
2
+50

You probably already tried setting class_weight to 'auto', but I'd like to check for certain.

Maybe experiments with balancing (oversampling or undersampling) can help; a library for this was already suggested by klubow.
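For reference, the 'auto' option is a one-liner; note that newer scikit-learn versions (0.17+) call it 'balanced' instead:

from sklearn.svm import SVC

# 'auto' weights each class inversely proportional to its frequency,
# so the rare classes get larger error costs automatically.
clf = SVC(kernel='linear', class_weight='auto')  # 'balanced' in 0.17+
clf.fit(X_train, y_train)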

1 Comment

With class_weight='auto' this was the result: Accuracy: 0.453846153846, score: 0.458315519453, recall: 0.453846153846, precision: 0.205976331361. The metrics went down.
1

You may want to check the class_weight parameter (http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) of the SVM classifier, or balance your data (https://github.com/fmfn/UnbalancedDataset/blob/master/UnbalancedDataset.py).

2 Comments

For this issue, what kind of class_weight do I need to set in order to bias the SVM? Thanks for the help!
This helped me a lot; any idea of how to plot this? I just want to be fair about the bounty.
