1

I am currently following a video on using machine learning algorithms against the KDD 99 cup dataset. when running the code below I get an error saying "could not convert string to float 'normal'".'normal' is one the labels that is found in the Y feature set shown below. the y feature set has 23 labels when I test the algorithm to only predict against 3 y features (normal, smurf and neptune) it works perfectly fine but as soon as I try and get it to predict against all the labels I get the error. any guidance would be appreciated as I have been working on this for 2 days now.

feature_cols =['duration','src_bytes','dst_bytes','land',
   'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in',
   'num_compromised', 'root_shell', 'su_attempted', 'num_root',
   'num_file_creations', 'num_shells', 'num_access_files',
   'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count',
   'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate',
   'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate',
   'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
   'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
   'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
   'dst_host_serror_rate', 'dst_host_srv_serror_rate',
   'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'label',
   'proto__icmp', 'proto__tcp', 'proto__udp']

x = dataset[feature_cols]
y = dataset.label

y.value_counts(normalize=True)

Y feature labels

smurf.          
neptune.  
normal.   
back.    
satan.     
ipsweep.     
portsweep.    
warezclient.   
teardrop.    
pod.        
nmap.     
guess_passwd.
buffer_overflow.
land.
warezmaster.
imap.
rootkit.
loadmodule.
ftp_write.
multihop.
phf.
perl.
spy.
Name: label, dtype: float64

Code and Error

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
scores = cross_val_score(dt, x, y, scoring='accuracy', cv=10)
print (scores)
print ("Accuracy: %2.10f" % np.mean(scores))

ValueError                                Traceback (most recent call last)
<ipython-input-70-722f95b657f5> in <module>()
      1 from sklearn.tree import DecisionTreeClassifier
      2 dt = DecisionTreeClassifier()
----> 3 scores = cross_val_score(dt, x, y, scoring='accuracy', cv=10)
      4 print (scores)
      5 print ("Accuracy: %2.10f" % np.mean(scores))

~\Anaconda3\lib\site-packages\sklearn\cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
   1579                                               train, test, verbose, None,
   1580                                               fit_params)
-> 1581                       for train, test in cv)
   1582     return np.array(scores)[:, 0]
   1583 

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable)
    777             # was dispatched. In particular this covers the edge
    778             # case of Parallel used with an exhausted iterator.
--> 779             while self.dispatch_one_batch(iterator):
    780                 self._iterating = True
    781             else:

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
    623                 return False
    624             else:
--> 625                 self._dispatch(tasks)
    626                 return True
    627 

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in _dispatch(self, batch)
    586         dispatch_timestamp = time.time()
    587         cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588         job = self._backend.apply_async(batch, callback=cb)
    589         self._jobs.append(job)
    590 

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in apply_async(self, func, callback)
    109     def apply_async(self, func, callback=None):
    110         """Schedule a func to be run"""
--> 111         result = ImmediateResult(func)
    112         if callback:
    113             callback(result)

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in __init__(self, batch)
    330         # Don't delay the application, to avoid keeping the input
    331         # arguments in memory
--> 332         self.results = batch()
    333 
    334     def get(self):

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in <listcomp>(.0)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):

~\Anaconda3\lib\site-packages\sklearn\cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1673             estimator.fit(X_train, **fit_params)
   1674         else:
-> 1675             estimator.fit(X_train, y_train, **fit_params)
   1676 
   1677     except Exception as e:

~\Anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    788             sample_weight=sample_weight,
    789             check_input=check_input,
--> 790             X_idx_sorted=X_idx_sorted)
    791         return self
    792 

~\Anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    114         random_state = check_random_state(self.random_state)
    115         if check_input:
--> 116             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    117             y = check_array(y, ensure_2d=False, dtype=None)
    118             if issparse(X):

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    431                                       force_all_finite)
    432     else:
--> 433         array = np.array(array, dtype=dtype, order=order, copy=copy)
    434 
    435         if ensure_2d:

ValueError: could not convert string to float: 'normal.'

Full Code As Requested

import pandas as pd

import warnings
warnings.filterwarnings('ignore')

col_names = ["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
dataset = pd.read_csv('../data/kddcup.data', header=None, names=col_names)

# Warning, takes a while to load

# make dummy variables for protocol type

protocol_dummies = pd.get_dummies(dataset['protocol_type'], prefix='proto_')

# concatenate the dummy variable columns onto the original DataFrame (axis=0 means rows, axis=1 means columns)
dataset = pd.concat([dataset, protocol_dummies], axis=1)

del dataset['protocol_type']

x = dataset.drop(['label'], axis=1)
y = dataset.label

from sklearn.cross_validation import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.cross_validation import train_test_split
from datetime import datetime

feature_cols =['duration','src_bytes','dst_bytes','land',
       'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in',
       'num_compromised', 'root_shell', 'su_attempted', 'num_root',
       'num_file_creations', 'num_shells', 'num_access_files',
       'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count',
       'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate',
       'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate',
       'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
       'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
       'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
       'dst_host_serror_rate', 'dst_host_srv_serror_rate',
       'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'label',
       'proto__icmp', 'proto__tcp', 'proto__udp']

x = dataset[feature_cols]
y = dataset.label

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
scores = cross_val_score(dt, x, y, scoring='accuracy', cv=10)
print (scores)
print ("Accuracy: %2.10f" % np.mean(scores))

one single line from the kdd dataset

0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.

6
  • what is logreg? how did you assign a value for logreg? Commented Apr 25, 2018 at 14:26
  • sorry @âńōŋŷXmoůŜ that should say dt not logreg but either way still same issue cheers Commented Apr 25, 2018 at 14:31
  • something is incorrect in assigning y labels. y = dataset.label. I cannot see the whole picture since you did not give us the input data. But, I can see that y is assigned to a label (string) rather than the value of y (numeric array). Can you also show us the entire code from the begnning? Commented Apr 25, 2018 at 14:35
  • added full code onto end Commented Apr 25, 2018 at 14:44
  • @âńōŋŷXmoůŜ ive also provided one line from the kdd dataset file to help you understand what the file looks like. the original has millions of lines like this all representing a network connection Commented Apr 25, 2018 at 14:55

1 Answer 1

1

I have just realized that I have left the label column in the x features. I have taken it out and now it works.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.