Error when classifying new Linear SVM dataframe

Question

I created a multi-class classification model with Linear SVM, but I am not able to classify a newly loaded dataframe (the data that must be classified). I get the error shown below.

What should I do to convert my new text (df.reason_text) to TF-IDF and classify it with my model (that is, what should I pass to model.predict)?

Training Model

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# `stopwords` is a stop-word list defined elsewhere
tfidf = TfidfVectorizer(ngram_range=(1,2), stop_words=stopwords)
features = tfidf.fit_transform(training.Description).toarray()
labels = training.category_id

model = LinearSVC()
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, training.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Now I am not able to convert my new dataframe so that it can be classified.

Load New DataFrame for Classification

from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='s3://athenaxxxxxxxx/result/', 
                   region_name='us-east-2')
df = pd.read_sql("select * from data.classification_text_reason", conn)

features2 = tfidf.fit_transform(df.reason_text).toarray()
features2.shape

After I convert the text of the new dataframe with TF-IDF and try to classify it, I get the following message

y_pred1 = model.predict(features2)

error

ValueError: X has 1272 features per sample; expecting 5319


Change features2 = tfidf.fit_transform(df.reason_text).toarray() to features2 = tfidf.transform(df.reason_text).toarray().
Parthasarathy Subburaj, commented Feb 3, 2020 at 8:10

Noki · Accepted Answer · 2020-02-03 08:09:29Z


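When you load a new DataFrame for classification, you are calling fit_transform() again, but you should call only transform(). Calling fit_transform() relearns the vocabulary from the new text, so the new matrix has a different number of columns (1272) than the model was trained on (5319).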

fit_transform() description: Learn vocabulary and idf, return term-document matrix.

transform() description: Transform documents to document-term matrix.
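To make the difference concrete, here is a minimal sketch on toy data. The two text lists are invented for illustration and are not the asker's data; only the TfidfVectorizer calls come from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented for illustration only.
train_texts = [
    "late delivery of package",
    "refund not received",
    "package arrived damaged",
]
new_texts = ["refund still pending", "damaged box"]

tfidf = TfidfVectorizer()

# fit_transform learns the vocabulary (and idf weights) from the training texts.
train_features = tfidf.fit_transform(train_texts)
n_train_columns = train_features.shape[1]

# transform reuses that learned vocabulary, so the column count is unchanged.
new_features = tfidf.transform(new_texts)
n_new_columns = new_features.shape[1]

# Fitting again on the new texts learns a different vocabulary,
# which is what leads to a mismatched feature count at predict time.
refit_features = TfidfVectorizer().fit_transform(new_texts)
n_refit_columns = refit_features.shape[1]

print(n_train_columns, n_new_columns, n_refit_columns)
```

The first two numbers should agree, while the third depends only on the words in the new texts.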

You need to use the transformer created when training the algorithm, so the code would be:

tfidf.transform(df.reason_text).toarray()
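Put together, the corrected flow might look like the sketch below. The two small DataFrames are hypothetical stand-ins for the asker's `training` and `df` (which come from elsewhere), and the `predicted_category_id` column name is made up for the example. The sketch also skips `.toarray()`, since LinearSVC accepts the sparse matrix directly.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for the question's `training` and `df` frames.
training = pd.DataFrame({
    "Description": [
        "late delivery of package",
        "package never arrived",
        "refund not received",
        "charged twice on card",
    ],
    "category_id": [0, 0, 1, 1],
})
df = pd.DataFrame({"reason_text": ["my package is late", "I want my refund"]})

# Fit the vectorizer once, on the training text only.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
features = tfidf.fit_transform(training.Description)

model = LinearSVC()
model.fit(features, training.category_id)

# For new data, reuse the same fitted vectorizer and call transform only.
features2 = tfidf.transform(df.reason_text)
y_pred1 = model.predict(features2)

# Attach the predictions to the new frame (column name is illustrative).
df["predicted_category_id"] = y_pred1
print(df)
```

The key point is that `tfidf` is the same object in both steps, so `features2` has exactly as many columns as `features`.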

If you still get the feature shape error after that, there may be another problem with the shapes of the arrays. Fix the transform call first; if the error persists, post an example of the training and test data in array format and I will keep helping.
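If the error persists, one way to narrow it down is to compare the three numbers that have to agree: the size of the fitted vocabulary, the number of columns the model was trained on, and the number of columns in the new matrix. This is only a diagnostic sketch on made-up data; the variable names other than `tfidf`, `model`, and `features2` are assumptions for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up data, for illustration only.
train_texts = [
    "late delivery of package",
    "package never arrived",
    "refund not received",
    "charged twice on card",
]
train_labels = [0, 0, 1, 1]
new_texts = ["my package is late", "I want my refund"]

tfidf = TfidfVectorizer()
model = LinearSVC().fit(tfidf.fit_transform(train_texts), train_labels)
features2 = tfidf.transform(new_texts)

# These three values must be equal for model.predict(features2) to work.
vocabulary_size = len(tfidf.vocabulary_)   # columns the vectorizer produces
expected_by_model = model.coef_.shape[1]   # columns the model was fitted on
new_columns = features2.shape[1]           # columns in the new matrix

shapes_match = vocabulary_size == expected_by_model == new_columns
print(vocabulary_size, expected_by_model, new_columns, shapes_match)
```

If the vocabulary size differs from what the model expects, the vectorizer was probably refitted or recreated after training.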


