
I'm performing a linear regression on a dataset (an Excel file) that consists of a Date column, a Score column, and an additional column called Prediction, filled with NaN values, which will be used to store the predicted values.

I have found that my independent variable, X, contains timestamps, which I wasn't actually expecting. Perhaps I'm doing something wrong, or missing something?

Top of the original dataset:

       Date    Score
0 2019-05-01 4.607744 
1 2019-05-02 4.709202 
2 2019-05-03 4.132390 
3 2019-05-05 4.747308 
4 2019-05-07 4.745926 

Create the independent data set (X)
Convert the dataframe to a numpy array

X = np.array(df.drop(['Prediction'],1))

Remove the last '30' rows

X = X[:-forecast_out]
print(X)

Example of output:

[[Timestamp('2019-05-01 00:00:00') 4.607744342064972]
[Timestamp('2019-05-02 00:00:00') 4.709201914086133]
[Timestamp('2019-05-03 00:00:00') 4.132389742485806]
[Timestamp('2019-05-05 00:00:00') 4.74730802483691]
[Timestamp('2019-05-07 00:00:00') 4.7459264970444615]
[Timestamp('2019-05-08 00:00:00') 4.595303054619376]

Create the dependent data set (y)
Convert the dataframe to a numpy array

y = np.array(df['Prediction'])

Get all of the y values except the last '30' rows

y = y[:-forecast_out]
print(y)

Some of the output:

[4.63738251 4.34354486 5.12284464 4.2751933  4.53362196 4.32665058
 4.77433793 4.37496465 4.31239161 4.90445026 4.81738271 3.99114536
 5.21672369 4.4932632  4.46858993 3.93271862 4.55618508 4.11493084
 4.02430584 4.11672606 4.19725244 4.3088558  4.98277563 4.97960989

Split the data into 80% training and 20% testing

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Create and train the Linear Regression Model

lr = LinearRegression()

Train the model

lr.fit(x_train, y_train)

The error:

TypeError: float() argument must be a string or a number, not 'Timestamp'

Clearly the dataset X doesn't like having the timestamp, and like I say, I wasn't really expecting it.

Any help on removing it (or perhaps I need it!?) would be great. As you can see, I'm simply looking to perform a simple regression analysis.

2
  • Are the timestamps in the same column in the Excel file as well? Commented Aug 19, 2019 at 15:32
  • @MRL drop 'Date' AND 'Prediction' during the initialization of X. Commented Aug 19, 2019 at 15:48

2 Answers

1

Do not include the timestamps (the Date column) when you create X.

The data set is already ordered by date, so do you really need the timestamps as a feature? Another option is to reassign the index so that the dates are no longer a data column. In either case, do not pass Timestamp objects to the model as input data.

Implement changes at this step:

X = np.array(df.drop(['Prediction'],1))

Do something like:

X = np.array(df.drop(['Date', 'Prediction'], axis=1))
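
For reference, here is a minimal end-to-end sketch of that change. The Excel file is not available, so the DataFrame below is synthetic, and it assumes (this is not stated in the question) that Prediction is the Score column shifted up by forecast_out rows. Column names follow the question; everything else is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

forecast_out = 30  # the '30' rows mentioned in the question

# Synthetic stand-in for the Excel data (hypothetical values)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Date": pd.date_range("2019-05-01", periods=200, freq="D"),
    "Score": rng.normal(4.5, 0.3, size=200),
})
# Assumption: Prediction is Score shifted up by forecast_out rows,
# which leaves NaN in the last forecast_out rows
df["Prediction"] = df["Score"].shift(-forecast_out)

# Drop both Date and Prediction so X contains only numeric features
X = df.drop(columns=["Date", "Prediction"]).to_numpy()[:-forecast_out]
y = df["Prediction"].to_numpy()[:-forecast_out]

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lr = LinearRegression()
lr.fit(x_train, y_train)  # no Timestamp objects in X, so no TypeError
print(lr.score(x_test, y_test))
```

With Date dropped, X is a plain float array, so the fit call no longer sees Timestamp objects.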

1 Comment

Cheers. Not sure why the timestamps were being passed as data, I've never had that issue before.
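
On why the timestamps were passed as data: dropping only Prediction leaves Date in the frame, and converting a frame that mixes datetime and float columns to a NumPy array produces an object array holding Timestamp objects. A small sketch with made-up rows modeled on the question's sample:

```python
import numpy as np
import pandas as pd

# Tiny frame modeled on the sample rows in the question
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-05-01", "2019-05-02", "2019-05-03"]),
    "Score": [4.607744, 4.709202, 4.132390],
})
df["Prediction"] = np.nan

# Only Prediction is dropped, so Date is still a column of X
X_bad = np.array(df.drop(columns=["Prediction"]))
print(X_bad.dtype)       # mixed datetime/float columns fall back to object
print(type(X_bad[0, 0])) # each date is a pandas Timestamp

# Dropping Date as well gives a purely numeric array
X_ok = np.array(df.drop(columns=["Date", "Prediction"]))
print(X_ok.dtype)
```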
0

I think the problem could be solved by using the Date column as the index instead. You can try set_index('Date') to move it out of the data columns (reset_index does the opposite and turns the index back into a column).
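
A rough sketch of the index approach, using a small made-up frame with the question's column names. The dates stay attached to each row as the index, but they are not part of the array handed to the model:

```python
import numpy as np
import pandas as pd

# Small made-up frame using the question's column names
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2019-05-01", "2019-05-02", "2019-05-03", "2019-05-05", "2019-05-07"]
    ),
    "Score": [4.607744, 4.709202, 4.132390, 4.747308, 4.745926],
})
df["Prediction"] = np.nan

# Move Date out of the columns and into the index
indexed = df.set_index("Date")

# Only Prediction needs dropping now; Date is no longer a column
X = indexed.drop(columns=["Prediction"]).to_numpy()
print(X.dtype)
print(indexed.index[:2])
```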
