
I created a multi-class classification model with LinearSVC, but I am not able to classify a newly loaded DataFrame (the data that must be classified). I get the error shown below.

What should I do to convert my new text (df.reason_text) to TF-IDF features and classify it with my model (call model.predict(?))?

Training Model

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# stopwords and training are defined earlier (not shown here)
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words=stopwords)
features = tfidf.fit_transform(training.Description).toarray()
labels = training.category_id

model = LinearSVC()
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, training.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Now I am not able to convert my new DataFrame so that it can be classified.

Load New DataFrame for Classification

from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='s3://athenaxxxxxxxx/result/', 
                   region_name='us-east-2')
df = pd.read_sql("select * from data.classification_text_reason", conn)

features2 = tfidf.fit_transform(df.reason_text).toarray()
features2.shape

After I convert the text of the new DataFrame with TF-IDF and try to classify it, I get the following message:

y_pred1 = model.predict(features2)

Error

ValueError: X has 1272 features per sample; expecting 5319

Change features2 = tfidf.fit_transform(df.reason_text).toarray() to features2 = tfidf.transform(df.reason_text).toarray(). (Commented Feb 3, 2020 at 8:10)

1 Answer

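When you load the new DataFrame for classification, you call fit_transform() again. That refits the vectorizer on the new text and learns a different vocabulary, which is why X has 1272 features while the model expects the 5319 it was trained on. You should call only transform().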

fit_transform() description: Learn vocabulary and idf, return term-document matrix.

transform() description: Transform documents to document-term matrix.
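
To make the difference concrete, here is a minimal sketch. The toy sentences are hypothetical stand-ins for training.Description and df.reason_text, and stop_words is left out for brevity. It only illustrates that transform() keeps the column count learned during training, while refitting on new text produces a different one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for training.Description and df.reason_text
train_texts = [
    "late delivery of the order",
    "wrong item in the box",
    "refund not received",
]
new_texts = ["order arrived late", "still waiting for refund"]

tfidf = TfidfVectorizer(ngram_range=(1, 2))

# fit_transform() learns the vocabulary and idf weights from the training text
train_features = tfidf.fit_transform(train_texts)
n_train_features = train_features.shape[1]

# transform() reuses that vocabulary, so the column count matches training
new_features = tfidf.transform(new_texts)

# Fitting a second vectorizer on the new text learns a different vocabulary
refit_features = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(new_texts)

print(n_train_features, new_features.shape[1], refit_features.shape[1])
```

Words in the new text that were never seen during training are simply ignored by transform().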

You need to reuse the vectorizer that was fitted when training the model, so the code would be:

tfidf.transform(df.reason_text).toarray()
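
One way to avoid this mistake altogether (not part of the original code, so treat it as an optional alternative) is to bundle the vectorizer and the classifier in a scikit-learn pipeline. The sketch below uses hypothetical toy texts and labels; text_clf is an illustrative name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for training.Description / training.category_id
train_texts = [
    "late delivery of the order",
    "package arrived two weeks late",
    "wrong item in the box",
    "received a different product",
    "refund not received",
    "still waiting for my money back",
]
train_labels = [0, 0, 1, 1, 2, 2]

# Hypothetical stand-in for df.reason_text
new_texts = ["my order is late", "where is my refund"]

# The pipeline fits the vectorizer and the classifier together ...
text_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
text_clf.fit(train_texts, train_labels)

# ... and predict() only calls transform() on the vectorizer, never fit
predictions = text_clf.predict(new_texts)
print(predictions)
```

With this layout the raw text goes straight into predict(), so there is no separate fit_transform() call to get wrong.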

If you still get the feature shape error after this change, there may be another problem with the shapes of the arrays. Fix the transform call first; if the error still occurs, post an example of the training and test data in array format and I will keep helping.
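
If the error persists, a quick check along these lines may help narrow it down. check_feature_width is a hypothetical helper, not a scikit-learn function, and the toy data is made up; it compares the width of the transformed matrix with the number of features stored in the fitted model's coef_.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC


def check_feature_width(model, features):
    """Raise a readable error if features do not match the fitted model."""
    expected = model.coef_.shape[1]  # coef_ has shape (n_classes, n_features)
    actual = features.shape[1]
    if actual != expected:
        raise ValueError(
            f"got {actual} features, model expects {expected}; "
            "was the vectorizer refitted on the new data?"
        )
    return actual


# Hypothetical toy data
train_texts = [
    "late delivery of the order",
    "package arrived two weeks late",
    "wrong item in the box",
    "received a different product",
]
train_labels = [0, 0, 1, 1]

tfidf = TfidfVectorizer()
model = LinearSVC().fit(tfidf.fit_transform(train_texts), train_labels)

# Passes, because the same fitted vectorizer is reused
width = check_feature_width(model, tfidf.transform(["order is late again"]))
print(width)
```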
