I started learning machine learning in Python using pandas and scikit-learn. I tried to use the LinearRegression().fit method:

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split 
house_data = pd.read_csv(r"C:\Users\yassine\Desktop\ml\OC-tp-ML\house_data.csv")
y = house_data[["price"]] 
x = house_data[["surface","arrondissement"]] 
X = house_data.iloc[:, 1:3].values  
x_train, x_test, y_train, y_test = train_test_split (x, y, test_size=0.25, random_state=1) 
model = LinearRegression()
model.fit(x_train, y_train) 

When I run the code, I get this message:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Can you help me, please?

  • The error tells you the problem: you have NaN values, infinite values, or extremely large values that scikit-learn can't handle. Check for rows with NaN in your data and try to remove them. Commented Dec 13, 2018 at 16:11
  • Run house_data.info() and check the non-null count of each column. Commented Dec 13, 2018 at 16:12
  • I got this from house_data.info():
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 827 entries, 0 to 826
    Data columns (total 3 columns):
    price             827 non-null int64
    surface           822 non-null float64
    arrondissement    822 non-null float64
    dtypes: float64(2), int64(1)
    memory usage: 19.5 KB
    Commented Dec 13, 2018 at 16:14
  • Please do not use the comments space for posting code & results - edit & update your post instead Commented Dec 13, 2018 at 16:24

1 Answer

Many machine learning models, including scikit-learn's LinearRegression, cannot handle missing values, so you may need to impute them as part of your data cleaning. The fitted predictions (yhat) depend on what you fill in, so I usually start by imputing the mean. If you aren't comfortable imputing the missing data, you can instead drop the observations that contain NaN, provided they make up only a small proportion of the data.
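To judge whether the proportion of NaN observations is small, you can count the missing values first. This is a minimal sketch: the small DataFrame below is a made-up stand-in for house_data.csv that only reuses the question's column names, and the variable names missing_per_column and fraction_incomplete are mine.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for house_data.csv, using the question's column names.
house_data = pd.DataFrame({
    "price": [1000, 1500, 2000, 2500],
    "surface": [30.0, np.nan, 60.0, 75.0],
    "arrondissement": [1.0, 2.0, np.nan, 4.0],
})

# Number of missing values in each column.
missing_per_column = house_data.isna().sum()
print(missing_per_column)

# Fraction of rows that contain at least one NaN.
fraction_incomplete = house_data.isna().any(axis=1).mean()
print(fraction_incomplete)
```

In the question's data, the info() output posted in the comments shows 822 non-null values out of 827 rows for both surface and arrondissement, so at most 10 rows are affected and dropping them is probably reasonable.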

Imputing the mean can look like this:

df = df.fillna(df.mean())

Imputing to zero can look like this:

df = df.fillna(0)

Imputing a custom value (my_func and args are placeholders for your own function and its arguments) can look like:

df = df.fillna(my_func(args))

Dropping altogether can look like:

df = df.dropna()

To let these methods catch inf values as well, convert them to NaN first. Note that replace returns a new DataFrame, so assign the result back:

df = df.replace([np.inf, -np.inf], np.nan)
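Putting the pieces together for the code in the question might look like the sketch below. It assumes dropping rows is acceptable and uses a small made-up DataFrame in place of house_data.csv (only the column names come from the question); the name clean is mine.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for house_data.csv, using the question's column names.
house_data = pd.DataFrame({
    "price": [1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500],
    "surface": [30.0, 45.0, np.nan, 75.0, 90.0, np.inf, 120.0, 135.0],
    "arrondissement": [1.0, 2.0, 3.0, np.nan, 1.0, 2.0, 3.0, 4.0],
})

# Turn +/-inf into NaN, then drop every row that has a NaN.
# Cleaning the whole frame before separating x and y keeps them aligned.
clean = house_data.replace([np.inf, -np.inf], np.nan).dropna()

x = clean[["surface", "arrondissement"]]
y = clean["price"]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=1
)

model = LinearRegression()
model.fit(x_train, y_train)  # the inputs are now finite, so fit() accepts them
print(model.predict(x_test))
```

The important detail is that the cleaning happens on the full DataFrame before x and y are selected. If you call dropna() on x alone, y keeps the dropped rows and the two no longer line up.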