I'm trying to run a kNN classifier across my dataset using 10-fold CV. I have some experience with models in WEKA but am struggling to transfer this over to scikit-learn.

Below is my code:

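import pandas
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

filename = 'train4.csv'
names = ['attribute names are here']

df = pandas.read_csv(filename, names=names)

num_folds = 10
# random_state only takes effect when shuffle=True
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=7)
model = KNeighborsClassifier()
results = cross_val_score(model, df.drop('mix1_instrument', axis=1), df['mix1_instrument'], cv=kfold)
print(results.mean())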

I am receiving this error:

 ValueError: could not convert string to float: ''

How can I convert this attribute? It contains useful information for classifying my instances, so would a conversion affect that?

There are two attributes of dtype 'object' that I believe need converting, named 'class1' and 'class2'.

Sample data below...

{
    'temporalCentroid': {
        0: 'temporalCentroid',
        1: '1.67324',
        2: '1.330722',
        3: '0.786984',
        4: '1.850129'
    },
    'LogSpecCentroid': {
        0: 'LogSpecCentroid',
        1: '-1.043802',
        2: '-0.82943',
        3: '-2.441297',
        4: '-0.837145'
    },
    'LogSpecSpread': {
        0: 'LogSpecSpread',
        1: '0.747558',
        2: '1.378373',
        3: '0.667634',
        4: '1.238404'
    },
    'MFCC1': {
        0: 'MFCC1',
        1: '3.502117',
        2: '6.697601',
        3: '4.011488',
        4: '0.823614'
    },
    'MFCC2': {
        0: 'MFCC2',
        1: '-9.208897',
        2: '-9.741549',
        3: '15.27665',
        4: '-15.22256'
    },
    'MFCC3': {
        0: 'MFCC3',
        1: '-2.334097',
        2: '-9.868089',
        3: '0.802509',
        4: '-4.978688'
    },
    'MFCC4': {
        0: 'MFCC4',
        1: '-9.013086',
        2: '0.609091',
        3: '2.50685',
        4: '-2.489553'
    },
    'MFCC5': {
        0: 'MFCC5',
        1: '4.847481',
        2: '1.733307',
        3: '0.10459',
        4: '1.066615'
    },
    'MFCC6': {
        0: 'MFCC6',
        1: '-4.770421',
        2: '-5.381835',
        3: '-0.260118',
        4: '-1.020861'
    },
    'MFCC7': {
        0: 'MFCC7',
        1: '-3.362488',
        2: '-1.261088',
        3: '0.593255',
        4: '-2.007349'
    },
    'MFCC8': {
        0: 'MFCC8',
        1: '-9.527529',
        2: '-3.809237',
        3: '-0.362287',
        4: '-8.938164'
    },
    'MFCC9': {
        0: 'MFCC9',
        1: '-9.629579',
        2: '1.486923',
        3: '-2.957592',
        4: '-2.324424'
    },
    'MFCC10': {
        0: 'MFCC10',
        1: '1.848685',
        2: '-3.938455',
        3: '-1.884439',
        4: '-2.535579'
    },
    'MFCC11': {
        0: 'MFCC11',
        1: '-2.311295',
        2: '-2.159865',
        3: '-0.827179',
        4: '0.638553'
    },
    'MFCC12': {
        0: 'MFCC12',
        1: '-7.696675',
        2: '-3.138412',
        3: '-0.605056',
        4: '-1.116259'
    },
    'MFCC13': {
        0: 'MFCC13',
        1: '10.35572',
        2: '9.095669',
        3: '6.426399',
        4: '15.04535'
    },
    'MFCCMin': {
        0: 'MFCCMin',
        1: '-9.629579',
        2: '-9.868089',
        3: '-2.957592',
        4: '-15.22256'
    },
    'MFCCMax': {
        0: 'MFCCMax',
        1: '10.35572',
        2: '9.095669',
        3: '15.27665',
        4: '15.04535'
    },
    'MFCCSum': {
        0: 'MFCCSum',
        1: '-37.300064',
        2: '-19.675939',
        3: '22.82507',
        4: '-23.059305'
    },
    'MFCCAvg': {
        0: 'MFCCAvg',
        1: '-2.869235692',
        2: '-1.513533769',
        3: '1.755774615',
        4: '-1.773792692'
    },
    'MFCCStd': {
        0: 'MFCCStd',
        1: '6.409842944',
        2: '5.558499123',
        3: '4.756836281',
        4: '6.76039911'
    },
    'Energy': {
        0: 'Energy',
        1: '-2.96148',
        2: '-3.522993',
        3: '-3.409359',
        4: '-2.235853'
    },
    'ZeroCrossings': {
        0: 'ZeroCrossings',
        1: '128',
        2: '188',
        3: '43',
        4: '288'
    },
    'SpecCentroid': {
        0: 'SpecCentroid',
        1: '284.0513',
        2: '414.8489',
        3: '102.2096',
        4: '405.1262'
    },
    'SpecSpread': {
        0: 'SpecSpread',
        1: '207.5526',
        2: '350.7937',
        3: '53.52178',
        4: '360.0353'
    },
    'Rolloff': {
        0: 'Rolloff',
        1: '263.7817',
        2: '783.2703',
        3: '129.1992',
        4: '912.4695'
    },
    'Flux': {
        0: 'Flux',
        1: '0',
        2: '0',
        3: '0',
        4: '0'
    },
    'bandsCoefMin': {
        0: 'bandsCoefMin',
        1: '-0.224957',
        2: '-0.247903',
        3: '-0.22283',
        4: '-0.232534'
    },
    'bandsCoefMax': {
        0: 'bandsCoefMax',
        1: '-0.074945',
        2: '-0.113654',
        3: '-0.062254',
        4: '-0.080883'
    },
    'bandsCoefSum1': {
        0: 'bandsCoefSum1',
        1: '-5.575428',
        2: '-5.524777',
        3: '-5.511125',
        4: '-5.532536'
    },
    'bandsCoefAvg': {
        0: 'bandsCoefAvg',
        1: '-0.168952364',
        2: '-0.167417485',
        3: '-0.167003788',
        4: '-0.167652606'
    },
    'bandsCoefStd': {
        0: 'bandsCoefStd',
        1: '0.042580181',
        2: '0.048429973',
        3: '0.049881374',
        4: '0.0475839'
    },
    'bandsCoefSum': {
        0: 'bandsCoefSum',
        1: '382.5963',
        2: '360.9232',
        3: '384.3541',
        4: '368.9903'
    },
    'prjmin': {
        0: 'prjmin',
        1: '-0.999362',
        2: '-0.999719',
        3: '-0.988315',
        4: '-0.999421'
    },
    'prjmax': {
        0: 'prjmax',
        1: '0.023797',
        2: '0.009596',
        3: '0.028112',
        4: '0.024612'
    },
    'prjSum': {
        0: 'prjSum',
        1: '-0.99911',
        2: '-1.006792',
        3: '-1.084054',
        4: '-1.002478'
    },
    'prjAvg': {
        0: 'prjAvg',
        1: '-0.030276061',
        2: '-0.030508848',
        3: '-0.032850121',
        4: '-0.030378121'
    },
    'prjStd': {
        0: 'prjStd',
        1: '0.174082468',
        2: '0.174040569',
        3: '0.173600498',
        4: '0.174064118'
    },
    'LogAttackTime': {
        0: 'LogAttackTime',
        1: '0.365883',
        2: '-0.35427',
        3: '-0.669283',
        4: '-0.026181'
    },
    'HamoPkMin': {
        0: 'HamoPkMin',
        1: '0',
        2: '0',
        3: '0',
        4: '0'
    },
    'HamoPkMax': {
        0: 'HamoPkMax',
        1: '1.025473',
        2: '1.05761',
        3: '0.986766',
        4: '0.957316'
    },
    'HamoPkSum': {
        0: 'HamoPkSum',
        1: '14.391206',
        2: '20.306125',
        3: '9.727358',
        4: '14.772449'
    },
    'HamoPkAvg': {
        0: 'HamoPkAvg',
        1: '0.513971643',
        2: '0.72521875',
        3: '0.347405643',
        4: '0.527587464'
    },
    'HamoPkStd': {
        0: 'HamoPkStd',
        1: '0.376622124',
        2: '0.325929503',
        3: '0.388971641',
        4: '0.381693476'
    },
    'class1': {
        0: 'class1',
        1: 'aerophone',
        2: 'aerophone',
        3: 'chordophone',
        4: 'aerophone'
    },
    'class2': {
        0: 'class2',
        1: 'aero_single-reed',
        2: 'aero_lip-vibrated',
        3: 'chrd_simple',
        4: 'aero_single-reed'
    },
    'mix1_instrument': {
        0: 'mix1_instrument',
        1: 'Saxophone',
        2: 'Trumpet',
        3: 'Piano',
        4: 'Clarinet'
    }
}

Thanks

  • You should get rid of the first row, because it duplicates the column names... Commented Nov 15, 2017 at 16:54
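One way to act on that comment is sketched below. It assumes the file's first line is a header row; the two-column CSV text is a hypothetical stand-in for train4.csv. With `names=` alone, pandas reads the header line as data, whereas adding `header=0` tells it to replace that line with the supplied names.

```python
import io
import pandas as pd

# Hypothetical two-column excerpt standing in for train4.csv.
csv_text = (
    "Energy,mix1_instrument\n"
    "-2.96148,Saxophone\n"
    "-3.522993,Trumpet\n"
)
names = ["Energy", "mix1_instrument"]

# names= alone: the header line is kept as the first data row,
# so every column ends up holding strings.
with_dup = pd.read_csv(io.StringIO(csv_text), names=names)

# header=0: the first line is treated as a header and replaced by names.
clean = pd.read_csv(io.StringIO(csv_text), names=names, header=0)

print(len(with_dup), len(clean))
print(clean.dtypes)
```

If the file already carries the right column names, dropping `names=` entirely should have the same effect.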

1 Answer

Here is a small demo:

Source DF:

In [43]: df
Out[43]:
     Energy  HamoPkStd       class1             class2 mix1_instrument
0 -2.961480  14.391206    aerophone   aero_single-reed       Saxophone
1 -3.522993  20.306125  chordophone  aero_lip-vibrated         Trumpet
2 -3.409359   9.727358    aerophone        chrd_simple           Piano

Labels encoding:

In [44]: %paste
from sklearn.preprocessing import LabelEncoder

# select the text columns by name
str_cols = df.columns[df.columns.str.contains('(?:class|instrument)')]
# one encoder per column, kept so the encoding can be inverted later
clfs = {c: LabelEncoder() for c in str_cols}

for col, clf in clfs.items():
    df[col] = clf.fit_transform(df[col])
## -- End pasted text --

Result: all text/string columns have been converted to numbers, so we can feed them to an estimator such as KNeighborsClassifier:

In [45]: df
Out[45]:
     Energy  HamoPkStd  class1  class2  mix1_instrument
0 -2.961480  14.391206       0       1                1
1 -3.522993  20.306125       1       0                2
2 -3.409359   9.727358       0       2                0

Inverse transformation:

In [48]: clfs['class1'].inverse_transform(df['class1'])
Out[48]: array(['aerophone', 'chordophone', 'aerophone'], dtype=object)

In [49]: clfs['mix1_instrument'].inverse_transform(df['mix1_instrument'])
Out[49]: array(['Saxophone', 'Trumpet', 'Piano'], dtype=object)
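
As a rough sketch of how an encoded target might then go through the 10-fold cross-validation from the question. The data is synthetic and purely illustrative (it is not the asker's file), and `shuffle=True` is an assumption added so that `random_state` has an effect.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in data (NOT the asker's dataset): two numeric features
# and a string target with three instrument names.
rng = np.random.default_rng(7)
n = 60
df = pd.DataFrame({
    "Energy": rng.normal(size=n),
    "HamoPkStd": rng.normal(size=n),
    "mix1_instrument": rng.choice(["Saxophone", "Trumpet", "Piano"], size=n),
})

# Features are every column except the target.
X = df.drop("mix1_instrument", axis=1)
# Encode the string target as integers 0..n_classes-1.
y = LabelEncoder().fit_transform(df["mix1_instrument"])

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(KNeighborsClassifier(), X, y, cv=kfold)
print(results.mean())
```

scikit-learn classifiers generally accept string class labels for `y` as well, so the encoding step matters mainly for text columns used as features.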

Comments

Will this impact the accuracy of the model, considering the information is extremely useful? I'm just trying to get a grasp of the theory behind this for my own understanding.
@Gareth, can you provide a small reproducible data set? If some columns contain text and that information is important to your model, you first want to binarize it (convert it to numbers) and only then feed it to your model.
These attributes contain _ (underscores), so when I run this code and then my model I receive the same error.
@Gareth, as I've written in the answer, those underscores will be converted to NaN values...
I have included sample data in my question.
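
The two points raised in these comments (unparseable cells becoming NaN, and turning text features into numbers without distorting them) can be sketched as follows. The rows are hypothetical, and the empty string in `Energy` is only an assumed example of a cell that could trigger the `ValueError`. One-hot encoding via `pandas.get_dummies` is swapped in here for `LabelEncoder` on the feature column, because kNN is distance based and integer codes would imply an ordering between categories.

```python
import pandas as pd

# Hypothetical rows shaped like the sample data; the '' in "Energy" is an
# assumed example of a cell that cannot be converted to float.
df = pd.DataFrame({
    "Energy": ["-2.96148", "", "-3.409359", "-2.235853"],
    "class1": ["aerophone", "aerophone", "chordophone", "aerophone"],
    "mix1_instrument": ["Saxophone", "Trumpet", "Piano", "Clarinet"],
})

# Convert the numeric column; unparseable cells (such as '') become NaN.
df["Energy"] = pd.to_numeric(df["Energy"], errors="coerce")
n_missing = int(df["Energy"].isna().sum())

# Drop the rows with missing values (imputing them is another option).
df = df.dropna(subset=["Energy"])

# One-hot encode the categorical feature so no artificial order is implied.
X = pd.get_dummies(df.drop("mix1_instrument", axis=1), columns=["class1"])
y = df["mix1_instrument"]

print(n_missing)
print(list(X.columns))
```

Whether this helps or hurts accuracy depends on the data; comparing cross-validation scores with and without the encoded columns is one way to check.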