I am training and testing on data using an SVM (scikit-learn). I train the SVM and save it as a pickle, then use that pickle to test my system. First, I read the training data and testing data into the variables train_data and test_data respectively.

After that, the code I am using for training is:

from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

vectorizer = TfidfVectorizer(max_df=0.8,
                             sublinear_tf=True,
                             use_idf=True)
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)

classifier_rbf = svm.SVC()
classifier_rbf.fit(train_vectors, train_labels)
joblib.dump(classifier_rbf, 'pickl/train_rbf_SVM.pkl', compress=1)

When testing, I again read the training data and testing data into the variables train_data and test_data respectively. The code I am using for testing is:

from sklearn.feature_extraction.text import TfidfVectorizer
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

vectorizer = TfidfVectorizer(max_df=0.8,
                             sublinear_tf=True,
                             use_idf=True)
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)
classifier_rbf = joblib.load('pickl/train_rbf_SVM.pkl')
prediction_rbf = classifier_rbf.predict(test_vectors)

This code works fine and gives me the correct output. My question is: is it necessary to read the training data every time I want to test?

Thank you.

1 Answer

In your case, yes, because you are not saving (pickling) the TfidfVectorizer. The test data must be transformed in exactly the same way as the training data for the predictions to be meaningful. So, if you do not want to read the training data again and again, pickle the fitted TfidfVectorizer along with the estimator and unpickle it during testing.
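
To illustrate why the fitted vectorizer is the piece that has to be saved, here is a minimal sketch. The toy documents are made up for illustration and are not the asker's data. It shows that a vectorizer restored from a pickle transforms new text exactly like the original, while a freshly constructed one cannot transform anything until it has been fitted.

```python
import pickle

import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents standing in for train_data / test_data.
train_data = [
    "good movie, great acting",
    "great plot and good fun",
    "bad movie, awful acting",
    "awful plot and bad pacing",
]
test_data = ["good fun", "awful pacing"]

# Fit once; the learned vocabulary and idf weights live inside this object.
vectorizer = TfidfVectorizer(max_df=0.8, sublinear_tf=True, use_idf=True)
vectorizer.fit(train_data)

# Round-trip the fitted vectorizer through pickle (joblib works the same way).
restored = pickle.loads(pickle.dumps(vectorizer))
same_features = np.array_equal(
    vectorizer.transform(test_data).toarray(),
    restored.transform(test_data).toarray(),
)

# A new, unfitted vectorizer has no vocabulary, so transform() fails.
fresh = TfidfVectorizer(max_df=0.8, sublinear_tf=True, use_idf=True)
try:
    fresh.transform(test_data)
    unfitted_error = False
except NotFittedError:
    unfitted_error = True

print("restored vectorizer gives identical features:", same_features)
print("unfitted vectorizer raised NotFittedError:", unfitted_error)
```

Pickling train_vectors or train_labels does not help here, because the vocabulary is stored in the vectorizer object, not in the matrices it produces.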

Also, you may want to look at the Pipeline provided in scikit-learn, which combines data preprocessing and estimation into one object that you can pickle and unpickle easily, without having to worry about saving and loading the various parts of the training separately.
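
A minimal sketch of the Pipeline approach, under a few assumptions: the toy data, the step names "tfidf" and "svm", and the temporary file path are all hypothetical, and joblib is imported as a standalone package. The vectorizer settings mirror the ones in the question.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for train_data / train_labels / test_data.
train_data = [
    "good movie, great acting",
    "great plot and good fun",
    "bad movie, awful acting",
    "awful plot and bad pacing",
]
train_labels = ["pos", "pos", "neg", "neg"]
test_data = ["good fun", "awful pacing"]

# One object that holds both the vectorizer and the classifier.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(max_df=0.8, sublinear_tf=True, use_idf=True)),
    ("svm", SVC()),
])

# fit() takes raw text; the pipeline vectorizes it before training the SVM.
text_clf.fit(train_data, train_labels)

# Training side: a single dump covers the whole pipeline.
model_path = os.path.join(tempfile.mkdtemp(), "train_rbf_SVM_pipeline.pkl")
joblib.dump(text_clf, model_path, compress=1)

# Testing side: load and predict on raw text, no training data needed.
loaded_clf = joblib.load(model_path)
prediction_rbf = loaded_clf.predict(test_data)
print(prediction_rbf)
```

With this layout there is only one file to keep in sync, so the classifier can never be paired with a vectorizer that was fitted on different data.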

Edit - Added code

When training, add this line at the end of your code:

joblib.dump(vectorizer, 'pickl/train_vectorizer.pkl',1)

Now, when testing, there is no need to load the training data. Just load the already fitted vectorizer:

classifier_rbf = joblib.load('pickl/train_rbf_SVM.pkl')
vectorizer = joblib.load('pickl/train_vectorizer.pkl')

test_vectors = vectorizer.transform(test_data)
prediction_rbf = classifier_rbf.predict(test_vectors)

5 Comments

I pickled my train_vectors as well as train_labels. Even then, if I remove the line that calls fit_transform, I get an error saying the vocabulary wasn't fitted.
train_vectors and train_labels don't matter. What you should pickle is the vectorizer and classifier_rbf.
That is what I did in the code shown in my question. Can you please write the code in your answer for me? Thank you.
No, in your code you are not pickling the vectorizer. I have updated the answer to show this.
Done! It worked as I expected.
