I'm performing a linear regression on a dataset (an Excel file) that consists of a Date column, a Score column, and an additional column called Prediction, filled with NaN values, which will be used to store the predicted values.
I have found that my independent variable, X, contains timestamps, which I wasn't expecting. Perhaps I'm doing something wrong, or missing something?
Top of the original dataset:
        Date     Score
0 2019-05-01  4.607744
1 2019-05-02  4.709202
2 2019-05-03  4.132390
3 2019-05-05  4.747308
4 2019-05-07  4.745926
import numpy as np

# Create the independent data set (X)
# Convert the dataframe to a numpy array
X = np.array(df.drop(['Prediction'], axis=1))
# Remove the last '30' rows
X = X[:-forecast_out]
print(X)
Example of output:
[[Timestamp('2019-05-01 00:00:00') 4.607744342064972]
[Timestamp('2019-05-02 00:00:00') 4.709201914086133]
[Timestamp('2019-05-03 00:00:00') 4.132389742485806]
[Timestamp('2019-05-05 00:00:00') 4.74730802483691]
[Timestamp('2019-05-07 00:00:00') 4.7459264970444615]
[Timestamp('2019-05-08 00:00:00') 4.595303054619376]
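The timestamps appear to come from the Date column: dropping only `Prediction` leaves both `Date` and `Score` in the frame, so the resulting array has object dtype and holds `Timestamp` objects. A minimal sketch of this, using the first three rows shown above and assuming the `Prediction` column is simply filled with NaN:

```python
import numpy as np
import pandas as pd

# Small stand-in for the real data (first three rows of the dataset above).
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-05-01", "2019-05-02", "2019-05-03"]),
    "Score": [4.607744, 4.709202, 4.132390],
})
df["Prediction"] = np.nan  # assumption: placeholder column, as described

# Dropping only 'Prediction' keeps both 'Date' and 'Score'.
X_mixed = np.array(df.drop(columns=["Prediction"]))
print(X_mixed.dtype)        # object dtype, because the column types are mixed
print(type(X_mixed[0, 0]))  # a pandas Timestamp

# Selecting only the numeric column gives a plain float array.
X_numeric = np.array(df[["Score"]])
print(X_numeric.dtype)
```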
# Create the dependent data set (y)
# Convert the dataframe to a numpy array
y = np.array(df['Prediction'])
# Get all of the y values except the last '30' rows
y = y[:-forecast_out]
print(y)
Some of the output:
[4.63738251 4.34354486 5.12284464 4.2751933 4.53362196 4.32665058
4.77433793 4.37496465 4.31239161 4.90445026 4.81738271 3.99114536
5.21672369 4.4932632 4.46858993 3.93271862 4.55618508 4.11493084
4.02430584 4.11672606 4.19725244 4.3088558 4.98277563 4.97960989
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split the data into 80% training and 20% testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Create and train the Linear Regression Model
lr = LinearRegression()
# Train the model
lr.fit(x_train, y_train)
The error:
TypeError: float() argument must be a string or a number, not 'Timestamp'
Clearly the model can't handle the timestamps in X, and as I said, I wasn't expecting them to be there.
Any help on removing them (or perhaps I need them?) would be great. As you can see, I'm simply looking to perform a simple regression analysis.
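For reference, a minimal sketch of two possible ways around the error, not a definitive fix. It uses synthetic data in place of the Excel file, and it assumes that `Prediction` holds the score `forecast_out` rows ahead (the question does not say how that column is filled). The `Days` column is a hypothetical name introduced only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

forecast_out = 30  # number of trailing rows held back, as in the question

# Synthetic stand-in data; the real values come from the Excel file.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Date": pd.date_range("2019-05-01", periods=200, freq="D"),
    "Score": rng.normal(4.5, 0.3, size=200),
})
# Assumption: the target is the score 'forecast_out' rows ahead.
df["Prediction"] = df["Score"].shift(-forecast_out)

# Option 1: leave the date out of the features entirely.
X = df[["Score"]].to_numpy()[:-forecast_out]

# Option 2: keep the date, but as a number (days since the first row).
df["Days"] = (df["Date"] - df["Date"].min()).dt.days
X_with_days = df[["Days", "Score"]].to_numpy()[:-forecast_out]

# Target values, excluding the last 'forecast_out' rows (which are NaN).
y = df["Prediction"].to_numpy()[:-forecast_out]

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

lr = LinearRegression()
lr.fit(x_train, y_train)  # X is purely numeric, so no Timestamp reaches float()
print(lr.score(x_test, y_test))
```

Whether the date belongs in the model depends on whether a time trend is expected; if it does, `X_with_days` can be passed to `train_test_split` in place of `X`.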