Python-Scikit. Training and testing data using SVM

Question

I am training and testing on data using an SVM (scikit-learn). I train the SVM and save it as a pickle, then use that pickle to test my system. First, I read the training and testing data into the variables train_data and test_data, respectively.

After that, the code I am using for training is:

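from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib  # formerly sklearn.externals.joblib, which newer scikit-learn releases no longer ship

# Fit the TF-IDF vocabulary and weights on the training data only
vectorizer = TfidfVectorizer(max_df=0.8,
                             sublinear_tf=True,
                             use_idf=True)
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)

# Train an SVM (SVC defaults to the RBF kernel) and save it to disk
classifier_rbf = svm.SVC()
classifier_rbf.fit(train_vectors, train_labels)
joblib.dump(classifier_rbf, 'pickl/train_rbf_SVM.pkl', 1)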

Again, while testing, I read the training and testing data into the variables train_data and test_data, respectively. The code I am using for testing is:

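from sklearn.feature_extraction.text import TfidfVectorizer
import joblib  # formerly sklearn.externals.joblib, which newer scikit-learn releases no longer ship

# The vectorizer is rebuilt and refitted on the training data here
vectorizer = TfidfVectorizer(max_df=0.8,
                             sublinear_tf=True,
                             use_idf=True)
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)

# Load the saved classifier and predict on the test vectors
classifier_rbf = joblib.load('pickl/train_rbf_SVM.pkl')
prediction_rbf = classifier_rbf.predict(test_vectors)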

This code works fine and gives me the correct output. My question is: is it compulsory to read the training data whenever I want to do testing?

Thank you.

Vivek Kumar · Accepted Answer · 2017-02-06 05:35:40Z

2

In your case, yes, because you are not saving (pickling) the TfidfVectorizer. The test data must be transformed in exactly the same way as the training data was for the predictions to be meaningful. So, if you do not want to read the training data again and again, pickle the fitted TfidfVectorizer along with the estimator and unpickle it during testing.
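To illustrate why the fitted vectorizer matters, here is a minimal sketch using made-up toy documents (the corpora and variable names below are assumptions for illustration, not taken from the question). A vectorizer refitted on other data learns a different vocabulary, so its output columns no longer line up with the features the classifier was trained on:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented purely for illustration.
train_docs = ["good movie", "bad movie", "great plot"]
test_docs = ["good plot", "terrible acting"]

# Fitted on the training data: this is the feature space the classifier sees.
fitted = TfidfVectorizer()
fitted.fit(train_docs)
test_ok = fitted.transform(test_docs)

# Refitted on the test data: a different vocabulary, hence different columns.
refitted = TfidfVectorizer()
test_wrong = refitted.fit_transform(test_docs)

n_train_features = len(fitted.vocabulary_)
print("training vocabulary size:", n_train_features)
print("transformed with fitted vectorizer:", test_ok.shape)
print("transformed with refitted vectorizer:", test_wrong.shape)
```

Only the matrix produced by the vectorizer fitted on the training data has the column count and column meaning the trained classifier expects.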

Also, you may want to look at the Pipeline provided in scikit-learn, which combines data preprocessing and estimation into one object that you can pickle and unpickle easily, without having to worry about saving and loading the various parts of the training separately.
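A rough sketch of the Pipeline approach could look like the following. The toy data, the step names "tfidf" and "svm", and the temporary file path are assumptions made up for this example; the vectorizer settings are copied from the question.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data, invented for illustration only.
train_data = ["good movie", "great plot", "bad movie", "terrible plot"]
train_labels = ["pos", "pos", "neg", "neg"]
test_data = ["good plot", "bad acting"]

# One object holds both the vectorizer and the classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer(max_df=0.8, sublinear_tf=True, use_idf=True)),
    ("svm", SVC()),
])
model.fit(train_data, train_labels)

# Save the whole fitted pipeline to a single file (hypothetical path).
model_path = os.path.join(tempfile.mkdtemp(), "svm_pipeline.pkl")
joblib.dump(model, model_path, compress=1)

# At test time: load the pipeline and predict on raw text.
# No training data and no separate vectorizer file are needed.
loaded = joblib.load(model_path)
predictions = loaded.predict(test_data)
print(predictions)
```

Because the pipeline is fitted and saved as a whole, the fitted vectorizer and the classifier always stay in sync.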

Edit - Added code

When training for the first time, add this line at the end of your code:

joblib.dump(vectorizer, 'pickl/train_vectorizer.pkl',1)

Now, when testing, there is no need to load the training data. Just load the already fitted vectorizer:

classifier_rbf = joblib.load('pickl/train_rbf_SVM.pkl')
vectorizer = joblib.load('pickl/train_vectorizer.pkl')

test_vectors = vectorizer.transform(test_data)
prediction_rbf = classifier_rbf.predict(test_vectors)

edited Feb 6, 2017 at 5:35

answered Feb 4, 2017 at 4:35

Vivek Kumar


5 Comments

Himadri Over a year ago

I pickled my train_vectors as well as train_labels. Even then, if I remove the line with the call to fit_transform, it gives me an error that the vocabulary wasn't fitted.

Vivek Kumar Over a year ago

train_vectors and train_labels don't matter. What you should pickle is the vectorizer and classifier_rbf.

Himadri Over a year ago

That is what I did in the code shown in my question. Can you please write the code in your answer for me? Thank you.

Vivek Kumar Over a year ago

No, in your code you are not pickling the vectorizer. I have changed the answer to reflect the same.

Himadri Over a year ago

Done! It worked as I expected.
