In pandas
This error (or a very similar one) commonly appears when changing the dtype of a pandas column from object to float using astype() or apply(). The cause is that the column contains non-numeric strings that cannot be converted to floats. One solution is to use pd.to_numeric() with errors='coerce' instead, which replaces non-numeric values such as the literal string 'id' with NaN.
import pandas as pd

df = pd.DataFrame({'col': ['id', '1.5', '2.4']})
df['col'] = df['col'].astype(float)                    # <---- ValueError: could not convert string to float: 'id'
df['col'] = df['col'].apply(lambda x: float(x))        # <---- ValueError
df['col'] = pd.to_numeric(df['col'], errors='coerce')  # <---- OK; errors='coerce' converts non-numbers to NaN
0    NaN
1    1.5
2    2.4
Name: col, dtype: float64
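Because errors='coerce' silently turns bad values into NaN, it can help to look at what was lost before overwriting the column. A minimal sketch, assuming the same sample values as above; the variable names converted and bad are only illustrative:

```python
import pandas as pd

s = pd.Series(['id', '1.5', '2.4'], name='col')

# Coerce into a separate variable so the original strings stay available
converted = pd.to_numeric(s, errors='coerce')

# Entries that were not missing originally but became NaN after coercion
bad = s[converted.isna() & s.notna()]
print(bad.tolist())
```

Inspecting bad first shows whether the offending strings are junk that can safely become NaN or values that need cleaning (for example, stray separators).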
pd.to_numeric() operates on one column (one-dimensional input) at a time, so if you need to change the dtype of multiple columns in one go (similar to how .astype(float) may be used), pass it to apply().
df = pd.DataFrame({'col1': ['id', '1.5', '2.4'], 'col2': ['10.2', '21.3', '20.6']})
df[['col1', 'col2']] = df.apply(pd.to_numeric, errors='coerce')
   col1  col2
0   NaN  10.2
1   1.5  21.3
2   2.4  20.6
Sometimes the values contain commas as thousands separators, which raises a similar error:
ValueError: could not convert string to float: '2,000.4'
In that case, removing the commas before the pd.to_numeric() call solves the issue.
df = pd.DataFrame({'col': ['id', '1.5', '2,000.4']})
df['col'] = df['col'].replace(regex=',', value='')     # <--- remove commas
df['col'] = pd.to_numeric(df['col'], errors='coerce')
0       NaN
1       1.5
2    2000.4
Name: col, dtype: float64
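If the data comes from a file, another option is to let the parser handle the separators at read time. A minimal sketch, assuming the values are loaded with pd.read_csv(); the semicolon-delimited sample text is made up for illustration, and the thousands parameter only yields a numeric column when every other value in that column is numeric too:

```python
import io
import pandas as pd

# Hypothetical file contents; ';' is the delimiter here so the commas
# inside the numbers are not mistaken for field separators.
raw = "col\n1.5\n2,000.4\n"

# thousands=',' tells the parser to ignore commas inside numeric values
df = pd.read_csv(io.StringIO(raw), sep=';', thousands=',')
print(df['col'].tolist())
print(df['col'].dtype)
```

A column that still contains a string like 'id' would stay as object dtype, so the pd.to_numeric() approach is still needed for that case.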
In scikit-learn
This error is also raised when you fit data containing strings to models that expect numeric data. Scalers such as StandardScaler() are one example. In that case, the solution is to preprocess the data by one-hot or label encoding the text input into numeric input. Below is an example where a string input is one-hot encoded first and then fed into a scaler.
from sklearn.preprocessing import StandardScaler, OneHotEncoder
data = [['a'], ['b'], ['c']]
sc = StandardScaler().fit(data) # <--- ValueError: could not convert string to float: 'a'
data = OneHotEncoder().fit_transform(data).toarray()
sc = StandardScaler().fit(data) # <--- OK
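Real datasets usually mix text and numeric columns, and only the text columns need encoding. A minimal sketch of one way to handle that with ColumnTransformer, assuming a small made-up DataFrame whose column names 'color' and 'size' are purely illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: 'color' holds strings, 'size' holds numbers
X = pd.DataFrame({'color': ['a', 'b', 'c'], 'size': [1.0, 2.0, 3.0]})

pre = ColumnTransformer(
    [
        ('cat', OneHotEncoder(), ['color']),   # strings -> 0/1 indicator columns
        ('num', StandardScaler(), ['size']),   # numbers -> zero mean, unit variance
    ],
    sparse_threshold=0,  # always return a dense array
)
X_num = pre.fit_transform(X)
print(X_num.shape)
```

Each transformer only sees the columns listed for it, so the scaler never receives the strings that trigger the ValueError.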
ValueError: could not convert string to float: can occur when reading a dataframe from a csv file and casting types with df = df[['p']].astype({'p': float}). If the csv was recorded with empty spaces, a whitespace-only cell will not be recognized as NaN. You will need to overwrite such cells with NaN using df = df.replace(r'^\s*$', np.nan, regex=True) before casting.
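A minimal sketch of that fix end to end, assuming a made-up column 'p' that looks the way it might after reading such a csv (one whitespace-only cell and one empty string):

```python
import numpy as np
import pandas as pd

# Made-up column 'p' containing blank and whitespace-only cells
df = pd.DataFrame({'p': ['1.5', ' ', '', '3']})

# Casting directly with astype(float) would raise ValueError on the blank cells,
# so replace blank or whitespace-only strings with NaN first
df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.astype({'p': float})
print(df['p'].isna().sum())
```

The pattern ^\s*$ matches only cells that are entirely whitespace (or empty), so values with surrounding text are left untouched.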