Problem in code: "could not convert string to float"

Question

I am learning linear regression from this GitHub code: "https://github.com/Anubhav1107/Machine_Learning_A-Z/blob/master/Part%202%20-%20Regression/Section%205%20-%20Multiple%20Linear%20Regression/multiple_linear_regression.py"

but when I tried running it myself, I got this error:

ValueError                                Traceback (most recent call last)
<ipython-input-26-860be404cdc9> in <module>()
      1 sc_y = StandardScaler()
----> 2 y_train = sc_y.fit_transform(y_train)

4 frames
/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    536 
    537     """
--> 538     return array(a, dtype, copy=False, order=order)
    539 
    540 

ValueError: could not convert string to float: 'Florida'

I am running it on Google Colab. I have already encoded the categorical features, so I don't understand what the problem is.

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()


# Splitting the dataset into the Training set and Test set

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)

Please show a sample of your y_train

– desertnaut, Commented Sep 28, 2019 at 18:58

desertnaut · Accepted Answer · 2019-09-28 20:03:10Z

There is a reason why, in "How to create a Minimal, Reproducible Example", we ask that you:

Make sure all information necessary to reproduce the problem is included in the question itself

and not in some external file, parts of which you may or may not have executed correctly.

I am saying this because I cannot reproduce your error; executing the relevant parts of the linked code works OK here:

import numpy as np
import pandas as pd
import sklearn
sklearn.__version__
# '0.21.3'

# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()

# Avoiding the Dummy Variable Trap
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # model_selection here, due to newer version of scikit_learn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# FutureWarning here, irrelevant to the issue

At this stage, we have:

y_train
# result:
array([ 96778.92,  96479.51, 105733.54,  96712.8 , 124266.9 , 155752.6 ,
       132602.65,  64926.08,  35673.41, 101004.64, 129917.04,  99937.59,
        97427.84, 126992.93,  71498.49, 118474.03,  69758.98, 152211.77,
       134307.35, 107404.34, 156991.12, 125370.37,  78239.91,  14681.4 ,
       191792.06, 141585.52,  89949.14, 108552.04, 156122.51, 108733.99,
        90708.19, 111313.02, 122776.86, 149759.96,  81005.76,  49490.75,
       182901.99, 192261.83,  42559.73,  65200.33])

which I bet is not the case with your (not shown) full code.

Slightly modifying the last line below to use y_train.reshape(-1,1) (again, irrelevant to the issue; without it we get a different error asking us to reshape), we have:

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1,1))  # reshape here

which works OK, giving

y_train
# result
array([[-0.31304376],
       [-0.32044287],
       [-0.09175449],
       [-0.31467774],
       [ 0.3662475 ],
       [ 1.14433163],
       [ 0.57224308],
       [-1.10020076],
       [-1.82310158],
       [-0.20861649],
       [ 0.50587547],
       [-0.23498575],
       [-0.29700745],
       [ 0.43361398],
       [-0.93778138],
       [ 0.22309235],
       [-0.98076868],
       [ 1.05682957],
       [ 0.61437014],
       [-0.05046517],
       [ 1.17493831],
       [ 0.39351679],
       [-0.77118537],
       [-2.34186247],
       [ 2.03494965],
       [ 0.79423047],
       [-0.48182335],
       [-0.02210286],
       [ 1.15347296],
       [-0.01760646],
       [-0.46306547],
       [ 0.04612731],
       [ 0.32942519],
       [ 0.9962397 ],
       [-0.70283485],
       [-1.4816433 ],
       [ 1.81525556],
       [ 2.04655875],
       [-1.65292476],
       [-1.09342341]])
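
The reshape step above can be illustrated in isolation. This is a minimal sketch with made-up target values (not the 50_Startups data), assuming a scikit-learn version in which StandardScaler expects a 2D array of shape (n_samples, n_features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical target values, invented for illustration only.
y = np.array([10.0, 20.0, 30.0, 40.0])

scaler = StandardScaler()

# The 1D target is reshaped into a single column before scaling.
y_scaled = scaler.fit_transform(y.reshape(-1, 1))

print(y_scaled.shape)    # one column, one row per sample
print(y_scaled.ravel())  # standardized values with zero mean

# inverse_transform maps the scaled values back to the original units,
# which is useful later when interpreting predictions.
y_restored = scaler.inverse_transform(y_scaled).ravel()
print(y_restored)
```

Keeping the fitted scaler around (here named scaler) is what makes it possible to convert scaled predictions back to the original units.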

It certainly seems that, instead of y = dataset.iloc[:, 4].values, you have asked for y = dataset.iloc[:, 3].values, which gives:

dataset.iloc[:, 3].values
# result:
array(['New York', 'California', 'Florida', 'New York', 'Florida',
       'New York', 'California', 'Florida', 'New York', 'California',
       'Florida', 'California', 'Florida', 'California', 'Florida',
       'New York', 'California', 'New York', 'Florida', 'New York',
       'California', 'New York', 'Florida', 'Florida', 'New York',
       'California', 'Florida', 'New York', 'Florida', 'New York',
       'Florida', 'New York', 'California', 'Florida', 'California',
       'New York', 'Florida', 'California', 'New York', 'California',
       'California', 'Florida', 'California', 'New York', 'California',
       'New York', 'Florida', 'California', 'New York', 'California'],
      dtype=object)

With this change, the above code indeed gives:

y_train
# result:
array(['Florida', 'New York', 'Florida', 'California', 'Florida',
       'Florida', 'Florida', 'New York', 'New York', 'New York',
       'New York', 'Florida', 'California', 'California', 'California',
       'California', 'New York', 'New York', 'California', 'California',
       'New York', 'New York', 'California', 'California', 'California',
       'Florida', 'California', 'New York', 'California', 'Florida',
       'Florida', 'New York', 'New York', 'California', 'California',
       'Florida', 'New York', 'New York', 'California', 'California'],
      dtype=object)

and eventually:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-4a9512e0c95c> in <module>
      5 X_test = sc_X.transform(X_test)
      6 sc_y = StandardScaler()
----> 7 y_train = sc_y.fit_transform(y_train.reshape(-1,1))

[...]
ValueError: could not convert string to float: 'Florida'
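
One way to catch this kind of mix-up earlier is to select the target by column name and check its dtype before scaling. The following is only a sketch: the helper numeric_target is hypothetical, the column names are assumed to match the CSV header, and the values are invented for illustration.

```python
import pandas as pd

# Tiny made-up frame; the column names are an assumption about the
# layout of 50_Startups.csv, and the numbers are not real data.
dataset = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0],
    "Administration": [50.0, 60.0, 70.0],
    "Marketing Spend": [10.0, 20.0, 30.0],
    "State": ["New York", "California", "Florida"],
    "Profit": [1000.0, 2000.0, 3000.0],
})


def numeric_target(frame, column):
    """Return the named column as a float array, failing early if it is not numeric."""
    if not pd.api.types.is_numeric_dtype(frame[column]):
        raise TypeError(
            f"target column {column!r} has non-numeric dtype {frame[column].dtype}"
        )
    return frame[column].to_numpy(dtype=float)


# Selecting by name avoids off-by-one mistakes with positional indices.
y = numeric_target(dataset, "Profit")
print(y)

# Picking the categorical column by mistake now fails immediately,
# with a message that names the offending column.
try:
    numeric_target(dataset, "State")
except TypeError as exc:
    print(exc)
```

With a check like this, the failure points at the target column itself instead of surfacing later inside StandardScaler.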

Thanks very much. I understand my mistake now, and sorry for the incomplete question.
