scikit learn transform multiple text features

Question

I'm trying to classify records with multiple text features into a state. The data consists of messages (errors and warnings) from different servers and their components, and each message is labeled with a state. For example:

ServerName     Name     Description                               Severity   State
-------------- -------- ----------------------------------------- ---------- -------------
QWERT-XY-123   MySQL    Service not available on target machine   error      important
QWERT-XY-146   Oracle   Service caused an error                   warning    unimportant
...

This is part of the vectorizing code:

from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer()

X_Servername = df["ServerName"].values
X_Name = df["Name"].values
X_Description = df["Description"].values
X_Severity = df["Severity"].values
y = df["State"].values

X_Servername = vectorizer.transform(X_Servername)
X_Name = vectorizer.transform(X_Name)
X_Description = vectorizer.transform(X_Description)

features=list(zip(X_Servername,X_Name,X_Description,X_Severity))

Now I want to fit the model:

from sklearn.svm import SVC

model = SVC(kernel = "linear", probability=True)
model.fit(features, y)

And the result is the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-183-71455dd49f0b> in <module>()
  2 
  3 model = SVC(kernel = "linear", probability=True)
----> 4 model.fit(features, y)
  5 
  6 #print(model.score(X_test, y))

D:\Enviroment\Anaconda3\lib\site-packages\sklearn\svm\base.py in fit(self, X, y, sample_weight)
147         self._sparse = sparse and not callable(self.kernel)
148 
149 -->     X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
150         y = self._validate_targets(y)
151 

D:\Enviroment\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
571     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
572                     ensure_2d, allow_nd, ensure_min_samples,
573 -->                 ensure_min_features, warn_on_dtype, estimator)
574     if multi_output:
575         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

D:\Enviroment\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
431                                       force_all_finite)
432     else:
433 -->     array = np.array(array, dtype=dtype, order=order, copy=copy)
434 
435         if ensure_2d:

ValueError: setting an array element with a sequence.

So my question is: how can I use multiple features with the HashingVectorizer, or is the only way to put all features into one line?

Thanks for your help.

Update

The failure was in how I built the vectorized feature list. Instead of:

features=list(zip(X_Servername,X_Name,X_Description,X_Severity))

I now use this function, where extracted is a list of all the vectorized feature matrices (X_Servername, X_Name, ...):

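import numpy as np
from scipy import sparse

def combine(extracted):
    # Stack the per-column feature blocks side by side (one row per sample).
    if any(sparse.issparse(fea) for fea in extracted):
        stacked = sparse.hstack(extracted).tocsr()
        # Note: toarray() densifies the matrix, which can use a lot of memory.
        stacked = stacked.toarray()
    else:
        stacked = np.hstack(extracted)

    return stacked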
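
For reference, the same stacking idea can be kept sparse end to end, since SVC accepts CSR input. The following is only a minimal sketch under assumptions: the toy rows are modelled on the table above, n_features is reduced to keep the example small, Severity is vectorized like the other columns, and probability=True is left out because its internal cross-validation needs more samples than this toy data has.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import SVC

# Toy data modelled on the table in the question (illustrative values only).
server = ["QWERT-XY-123", "QWERT-XY-146", "QWERT-XY-123", "QWERT-XY-146"]
name = ["MySQL", "Oracle", "MySQL", "Oracle"]
description = [
    "Service not available on target machine",
    "Service caused an error",
    "Service not available on target machine",
    "Service caused an error",
]
severity = ["error", "warning", "error", "warning"]
y = np.array(["important", "unimportant", "important", "unimportant"])

# HashingVectorizer is stateless, so one instance can transform every column.
# n_features is an assumption chosen to keep the example small.
vectorizer = HashingVectorizer(n_features=2**10)
blocks = [vectorizer.transform(col) for col in (server, name, description, severity)]

# Stack the per-column matrices side by side: one row per sample.
X = sparse.hstack(blocks).tocsr()

# SVC accepts sparse CSR input, so no call to toarray() is needed.
model = SVC(kernel="linear")
model.fit(X, y)
predictions = model.predict(X)
print(X.shape)
```

Unlike zip, which yields one tuple of 1-row sparse matrices per sample, hstack produces a single 2-D matrix, which is the shape that fit expects.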

You never fit your vectorizer before you attempt to transform your data. I'm guessing your output isn't what you think it is before you try to fit the SVC
– G. Anderson, Commented Feb 19, 2019 at 17:03
Hi @G.Anderson thanks for your reply. I fit the vectorizer with fit_transform but there is still the same error
– Ax3l, Commented Feb 19, 2019 at 17:15
Possible duplicate of ValueError: setting an array element with a sequence. while using SVM in scikit-learn
– G. Anderson, Commented Feb 19, 2019 at 17:49

Sergey Bushmanov · Accepted Answer · 2019-02-19 18:45:07Z

Please try the code below:

from sklearn_pandas import DataFrameMapper, gen_features
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import LabelEncoder

cat_features = ["ServerName", "Name", "Description", "Severity"]
gf = gen_features(cat_features, [HashingVectorizer])
mapper = DataFrameMapper(gf)
cat_features_transformed = mapper.fit_transform(df)

target_name_encoded = LabelEncoder().fit_transform(df["State"])

from sklearn.svm import SVC

model = SVC(kernel = "linear", probability=True)
model.fit(cat_features_transformed, target_name_encoded)
# Output (repr of the fitted estimator):
# SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
#   decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
#   kernel='linear', max_iter=-1, probability=True, random_state=None,
#   shrinking=True, tol=0.001, verbose=False)

### For test/prediction part ###

test_features_transformed = mapper.transform(df_test)
predictions = model.predict(test_features_transformed)

Note, you may need to run

pip install sklearn-pandas

if you do not have sklearn-pandas installed on your machine.

This solution allows you to (1) transform your data into a suitable format and later (2) apply the same fitted transformations to your test data via the transform method.

Please let us know if this helps

edited Feb 19, 2019 at 18:45

answered Feb 19, 2019 at 18:28

Sergey Bushmanov


4 Comments

KRKirov Over a year ago

Is there an advantage to using sklearn-pandas over building a solution based on ColumnTransformer or FeatureUnion and incorporating these into a pipeline?

Ax3l Over a year ago

Seems to solve my problem. The model can be fit. I will test it tomorrow :-)

Sergey Bushmanov Over a year ago

@KRKirov DataFrameMapper and ColumnTransformer are basically the same; the code using gen_features is neater. But you can always achieve the same by writing the sequence of transformations explicitly.

KRKirov Over a year ago

@SergeyBushmanov, thanks for the response. Pardon me for saying this, but I find the solution based on sklearn-pandas somewhat untidy. It would have probably been easier to read a solution based on a pipeline using the standard sklearn transformers.
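
For readers wondering what the pipeline variant raised in these comments might look like, here is a minimal sketch using only standard scikit-learn transformers. It is an illustration under assumptions, not code posted by anyone in the thread: the DataFrame rows are toy values modelled on the question, n_features is reduced for brevity, and probability=True is omitted because the toy data is too small for its internal cross-validation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy frame modelled on the table in the question (illustrative values only).
df = pd.DataFrame({
    "ServerName": ["QWERT-XY-123", "QWERT-XY-146", "QWERT-XY-123", "QWERT-XY-146"],
    "Name": ["MySQL", "Oracle", "MySQL", "Oracle"],
    "Description": [
        "Service not available on target machine",
        "Service caused an error",
        "Service not available on target machine",
        "Service caused an error",
    ],
    "Severity": ["error", "warning", "error", "warning"],
    "State": ["important", "unimportant", "important", "unimportant"],
})

text_columns = ["ServerName", "Name", "Description", "Severity"]

# Each column is given as a plain string (not a list), so the vectorizer
# receives a 1-D sequence of strings, which is what text vectorizers expect.
preprocess = ColumnTransformer(
    [(col, HashingVectorizer(n_features=2**10), col) for col in text_columns]
)

# The pipeline applies the same preprocessing at fit and predict time.
clf = make_pipeline(preprocess, SVC(kernel="linear"))
clf.fit(df[text_columns], df["State"])
predictions = clf.predict(df[text_columns])
print(predictions)
```

Because the vectorizers live inside the pipeline, calling clf.predict on a test DataFrame with the same columns reuses the same transformations without a separate transform step.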
