Python/Sklearn - Value Error: could not convert string to float

Question

I'm trying to run a kNN classifier across my dataset using 10-fold CV. I have some experience with models in WEKA, but I'm struggling to transfer this over to Sklearn.

Below is my code

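import pandas
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

filename = 'train4.csv'
names = ['attribute names are here']

df = pandas.read_csv(filename, names=names)

num_folds = 10
# recent scikit-learn versions only accept random_state together with shuffle=True
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=7)
model = KNeighborsClassifier()
results = cross_val_score(model, df.drop('mix1_instrument', axis=1), df['mix1_instrument'], cv=kfold)
print(results.mean())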

I am receiving this error

 ValueError: could not convert string to float: ''

How can I convert this attribute? It contains useful information for classifying my instances, so would a conversion affect this?

There are two attributes of dtype 'object' that I believe need converting, named 'class1' and 'class2'.

Sample data below...

{
    'temporalCentroid': {
        0: 'temporalCentroid',
        1: '1.67324',
        2: '1.330722',
        3: '0.786984',
        4: '1.850129'
    },
    'LogSpecCentroid': {
        0: 'LogSpecCentroid',
        1: '-1.043802',
        2: '-0.82943',
        3: '-2.441297',
        4: '-0.837145'
    },
    'LogSpecSpread': {
        0: 'LogSpecSpread',
        1: '0.747558',
        2: '1.378373',
        3: '0.667634',
        4: '1.238404'
    },
    'MFCC1': {
        0: 'MFCC1',
        1: '3.502117',
        2: '6.697601',
        3: '4.011488',
        4: '0.823614'
    },
    'MFCC2': {
        0: 'MFCC2',
        1: '-9.208897',
        2: '-9.741549',
        3: '15.27665',
        4: '-15.22256'
    },
    'MFCC3': {
        0: 'MFCC3',
        1: '-2.334097',
        2: '-9.868089',
        3: '0.802509',
        4: '-4.978688'
    },
    'MFCC4': {
        0: 'MFCC4',
        1: '-9.013086',
        2: '0.609091',
        3: '2.50685',
        4: '-2.489553'
    },
    'MFCC5': {
        0: 'MFCC5',
        1: '4.847481',
        2: '1.733307',
        3: '0.10459',
        4: '1.066615'
    },
    'MFCC6': {
        0: 'MFCC6',
        1: '-4.770421',
        2: '-5.381835',
        3: '-0.260118',
        4: '-1.020861'
    },
    'MFCC7': {
        0: 'MFCC7',
        1: '-3.362488',
        2: '-1.261088',
        3: '0.593255',
        4: '-2.007349'
    },
    'MFCC8': {
        0: 'MFCC8',
        1: '-9.527529',
        2: '-3.809237',
        3: '-0.362287',
        4: '-8.938164'
    },
    'MFCC9': {
        0: 'MFCC9',
        1: '-9.629579',
        2: '1.486923',
        3: '-2.957592',
        4: '-2.324424'
    },
    'MFCC10': {
        0: 'MFCC10',
        1: '1.848685',
        2: '-3.938455',
        3: '-1.884439',
        4: '-2.535579'
    },
    'MFCC11': {
        0: 'MFCC11',
        1: '-2.311295',
        2: '-2.159865',
        3: '-0.827179',
        4: '0.638553'
    },
    'MFCC12': {
        0: 'MFCC12',
        1: '-7.696675',
        2: '-3.138412',
        3: '-0.605056',
        4: '-1.116259'
    },
    'MFCC13': {
        0: 'MFCC13',
        1: '10.35572',
        2: '9.095669',
        3: '6.426399',
        4: '15.04535'
    },
    'MFCCMin': {
        0: 'MFCCMin',
        1: '-9.629579',
        2: '-9.868089',
        3: '-2.957592',
        4: '-15.22256'
    },
    'MFCCMax': {
        0: 'MFCCMax',
        1: '10.35572',
        2: '9.095669',
        3: '15.27665',
        4: '15.04535'
    },
    'MFCCSum': {
        0: 'MFCCSum',
        1: '-37.300064',
        2: '-19.675939',
        3: '22.82507',
        4: '-23.059305'
    },
    'MFCCAvg': {
        0: 'MFCCAvg',
        1: '-2.869235692',
        2: '-1.513533769',
        3: '1.755774615',
        4: '-1.773792692'
    },
    'MFCCStd': {
        0: 'MFCCStd',
        1: '6.409842944',
        2: '5.558499123',
        3: '4.756836281',
        4: '6.76039911'
    },
    'Energy': {
        0: 'Energy',
        1: '-2.96148',
        2: '-3.522993',
        3: '-3.409359',
        4: '-2.235853'
    },
    'ZeroCrossings': {
        0: 'ZeroCrossings',
        1: '128',
        2: '188',
        3: '43',
        4: '288'
    },
    'SpecCentroid': {
        0: 'SpecCentroid',
        1: '284.0513',
        2: '414.8489',
        3: '102.2096',
        4: '405.1262'
    },
    'SpecSpread': {
        0: 'SpecSpread',
        1: '207.5526',
        2: '350.7937',
        3: '53.52178',
        4: '360.0353'
    },
    'Rolloff': {
        0: 'Rolloff',
        1: '263.7817',
        2: '783.2703',
        3: '129.1992',
        4: '912.4695'
    },
    'Flux': {
        0: 'Flux',
        1: '0',
        2: '0',
        3: '0',
        4: '0'
    },
    'bandsCoefMin': {
        0: 'bandsCoefMin',
        1: '-0.224957',
        2: '-0.247903',
        3: '-0.22283',
        4: '-0.232534'
    },
    'bandsCoefMax': {
        0: 'bandsCoefMax',
        1: '-0.074945',
        2: '-0.113654',
        3: '-0.062254',
        4: '-0.080883'
    },
    'bandsCoefSum1': {
        0: 'bandsCoefSum1',
        1: '-5.575428',
        2: '-5.524777',
        3: '-5.511125',
        4: '-5.532536'
    },
    'bandsCoefAvg': {
        0: 'bandsCoefAvg',
        1: '-0.168952364',
        2: '-0.167417485',
        3: '-0.167003788',
        4: '-0.167652606'
    },
    'bandsCoefStd': {
        0: 'bandsCoefStd',
        1: '0.042580181',
        2: '0.048429973',
        3: '0.049881374',
        4: '0.0475839'
    },
    'bandsCoefSum': {
        0: 'bandsCoefSum',
        1: '382.5963',
        2: '360.9232',
        3: '384.3541',
        4: '368.9903'
    },
    'prjmin': {
        0: 'prjmin',
        1: '-0.999362',
        2: '-0.999719',
        3: '-0.988315',
        4: '-0.999421'
    },
    'prjmax': {
        0: 'prjmax',
        1: '0.023797',
        2: '0.009596',
        3: '0.028112',
        4: '0.024612'
    },
    'prjSum': {
        0: 'prjSum',
        1: '-0.99911',
        2: '-1.006792',
        3: '-1.084054',
        4: '-1.002478'
    },
    'prjAvg': {
        0: 'prjAvg',
        1: '-0.030276061',
        2: '-0.030508848',
        3: '-0.032850121',
        4: '-0.030378121'
    },
    'prjStd': {
        0: 'prjStd',
        1: '0.174082468',
        2: '0.174040569',
        3: '0.173600498',
        4: '0.174064118'
    },
    'LogAttackTime': {
        0: 'LogAttackTime',
        1: '0.365883',
        2: '-0.35427',
        3: '-0.669283',
        4: '-0.026181'
    },
    'HamoPkMin': {
        0: 'HamoPkMin',
        1: '0',
        2: '0',
        3: '0',
        4: '0'
    },
    'HamoPkMax': {
        0: 'HamoPkMax',
        1: '1.025473',
        2: '1.05761',
        3: '0.986766',
        4: '0.957316'
    },
    'HamoPkSum': {
        0: 'HamoPkSum',
        1: '14.391206',
        2: '20.306125',
        3: '9.727358',
        4: '14.772449'
    },
    'HamoPkAvg': {
        0: 'HamoPkAvg',
        1: '0.513971643',
        2: '0.72521875',
        3: '0.347405643',
        4: '0.527587464'
    },
    'HamoPkStd': {
        0: 'HamoPkStd',
        1: '0.376622124',
        2: '0.325929503',
        3: '0.388971641',
        4: '0.381693476'
    },
    'class1': {
        0: 'class1',
        1: 'aerophone',
        2: 'aerophone',
        3: 'chordophone',
        4: 'aerophone'
    },
    'class2': {
        0: 'class2',
        1: 'aero_single-reed',
        2: 'aero_lip-vibrated',
        3: 'chrd_simple',
        4: 'aero_single-reed'
    },
    'mix1_instrument': {
        0: 'mix1_instrument',
        1: 'Saxophone',
        2: 'Trumpet',
        3: 'Piano',
        4: 'Clarinet'
    }
}

Thanks

You should get rid of the first row, because it's duplicating column names...
– MaxU - stand with Ukraine, Commented Nov 15, 2017 at 16:54
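
A minimal sketch of what that comment describes, assuming the CSV file itself begins with a header line (the sample data in the question suggests it does, since row 0 repeats the column names). The inline CSV text and the three column names are a hypothetical stand-in for train4.csv, not the real file. Passing names= on its own makes pandas treat the header line as data, so every column is read as strings; adding header=0 tells pandas to replace that line with the supplied names.

```python
import io

import pandas as pd

# Hypothetical stand-in for train4.csv: one header line followed by data rows.
csv_text = (
    "Energy,class1,mix1_instrument\n"
    "-2.96148,aerophone,Saxophone\n"
    "-3.522993,aerophone,Trumpet\n"
)
names = ["Energy", "class1", "mix1_instrument"]

# names= alone: the header line is kept as the first data row,
# so even the numeric column is read as strings.
bad = pd.read_csv(io.StringIO(csv_text), names=names)

# header=0: the first line is treated as a header and replaced by names=.
good = pd.read_csv(io.StringIO(csv_text), names=names, header=0)

print(bad.dtypes)
print(good.dtypes)
```

If the file has already been loaded, dropping the first row and converting the numeric columns afterwards would be another option, but re-reading with header=0 avoids the manual conversion.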

MaxU - stand with Ukraine · Accepted Answer · 2017-11-15 17:20:38Z

Here is a small demo:

Source DF:

In [43]: df
Out[43]:
     Energy  HamoPkStd       class1             class2 mix1_instrument
0 -2.961480  14.391206    aerophone   aero_single-reed       Saxophone
1 -3.522993  20.306125  chordophone  aero_lip-vibrated         Trumpet
2 -3.409359   9.727358    aerophone        chrd_simple           Piano

Labels encoding:

In [44]: %paste
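from sklearn.preprocessing import LabelEncoder

# select the text columns by name and create one encoder per column
str_cols = df.columns[df.columns.str.contains('(?:class|instrument)')]
clfs = {c: LabelEncoder() for c in str_cols}

# fit each encoder on its own column and replace the strings with integer codes
for col, clf in clfs.items():
    df[col] = clf.fit_transform(df[col])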
## -- End pasted text --

Result: all text/string columns have been converted to numbers, so the data can now be fed to a model:

In [45]: df
Out[45]:
     Energy  HamoPkStd  class1  class2  mix1_instrument
0 -2.961480  14.391206       0       1                1
1 -3.522993  20.306125       1       0                2
2 -3.409359   9.727358       0       2                0

Inverse transformation:

In [48]: clfs['class1'].inverse_transform(df['class1'])
Out[48]: array(['aerophone', 'chordophone', 'aerophone'], dtype=object)

In [49]: clfs['mix1_instrument'].inverse_transform(df['mix1_instrument'])
Out[49]: array(['Saxophone', 'Trumpet', 'Piano'], dtype=object)
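
On the question of whether the conversion affects the model: integer codes from LabelEncoder carry an arbitrary ordering, and a distance-based model such as kNN will treat that ordering as meaningful for feature columns. The sketch below shows one possible alternative, one-hot encoding the categorical feature with pandas.get_dummies while leaving the string target untouched, since scikit-learn classifiers accept string labels. The toy frame, its random values, and the choice of 5 folds are assumptions for illustration only, not the asker's data or settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy frame standing in for the real data set.
rng = np.random.RandomState(0)
n = 60
df = pd.DataFrame({
    "Energy": rng.normal(size=n),
    "HamoPkStd": rng.normal(size=n),
    "class1": rng.choice(["aerophone", "chordophone"], size=n),
    "mix1_instrument": rng.choice(["Saxophone", "Trumpet", "Piano"], size=n),
})

# The target can stay as strings; only the features must be numeric.
y = df["mix1_instrument"]

# One-hot encode the categorical feature so no artificial order is introduced.
X = pd.get_dummies(
    df.drop("mix1_instrument", axis=1), columns=["class1"], dtype=float
)

kfold = KFold(n_splits=5, shuffle=True, random_state=7)
results = cross_val_score(KNeighborsClassifier(), X, y, cv=kfold)
print(results.mean())
```

Because the toy labels are random, the printed accuracy is not meaningful; the point is only that the pipeline runs without the string-to-float error once every feature column is numeric.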

edited Nov 15, 2017 at 17:20

answered Nov 15, 2017 at 16:39

8 Comments

0xgareth Over a year ago

will this impact the accuracy of the model? considering the information is extremely useful? I'm just trying to get a grasp of the theory behind this for my own understanding

MaxU - stand with Ukraine Over a year ago

@Gareth, can you provide a small reproducible data set? If you have text in some columns and this is important information that must be treated by your model, then you want first to binarize it (convert it to numbers) and only then feed it to your model

0xgareth Over a year ago

these attributes contain _ (underscores) so when I run this code and then my model I am receiving the same error

MaxU - stand with Ukraine Over a year ago

@Gareth, as I've written in the answer, those underscores will be converted to NaN values...

0xgareth Over a year ago

I have included sample data in my question
