I created a multi-class classification model with a linear SVM, but I am not able to classify a newly loaded dataframe (the data that must be classified). I get the error shown below.
What should I do to convert my new text (df.reason_text) to TF-IDF and classify it with my model (by calling model.predict)?
Training Model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
tfidf = TfidfVectorizer(ngram_range=(1,2), stop_words=stopwords)
features = tfidf.fit_transform(training.Description).toarray()
labels = training.category_id
model = LinearSVC()
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, training.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Now I am not able to convert my new dataframe so that it can be classified.
Load New DataFrame for Classification
from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='s3://athenaxxxxxxxx/result/',
region_name='us-east-2')
df = pd.read_sql("select * from data.classification_text_reason", conn)
features2 = tfidf.fit_transform(df.reason_text).toarray()
features2.shape
After I convert the new dataframe text with TF-IDF and try to classify it, I get the following message
y_pred1 = model.predict(features2)
error
ValueError: X has 1272 features per sample; expecting 5319
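Change
features2 = tfidf.fit_transform(df.reason_text).toarray()
to
features2 = tfidf.transform(df.reason_text).toarray()
Calling fit_transform on the new data refits the vectorizer and builds a new vocabulary from df.reason_text (1272 features), while the model was trained on the vocabulary learned from training.Description (5319 features). Calling transform reuses the already fitted vocabulary, so the new matrix has the number of columns the model expects.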
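To illustrate the difference, here is a minimal sketch on toy data. The texts and labels are hypothetical stand-ins for training.Description, training.category_id and df.reason_text, and the stop_words argument is left out because the stopwords list is not shown in the question. It fits the vectorizer once on the training texts and then only calls transform on the new texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for training.Description and training.category_id
train_texts = [
    "refund not received for order",
    "refund requested after cancellation",
    "package arrived damaged",
    "package was delivered late",
]
train_labels = [0, 0, 1, 1]

# Hypothetical stand-in for df.reason_text
new_texts = ["refund still missing", "damaged package received"]

# Fit the vectorizer ONCE, on the training texts only
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_train = tfidf.fit_transform(train_texts)

model = LinearSVC()
model.fit(X_train, train_labels)

# Reuse the fitted vocabulary for the new texts: transform, not fit_transform
X_new = tfidf.transform(new_texts)
y_new = model.predict(X_new)

# For comparison only: a vectorizer fitted on the new texts learns a
# different vocabulary, which is what causes the feature-count mismatch
X_refit = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(new_texts)

print("training columns:", X_train.shape[1])
print("transform columns:", X_new.shape[1])
print("refit columns:", X_refit.shape[1])
print("predictions:", y_new)
```

The first two column counts match, while the refitted one does not. The sketch also passes the sparse matrices directly to LinearSVC, which accepts them, so the .toarray() call is optional and can be dropped to save memory on large vocabularies.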