Removing the Timestamp from a NumPy Array

Question

I'm performing a linear regression on a dataset (an Excel file) that consists of a Date column, a Score column, and an additional column called Prediction, filled with NaN values, which will be used to store the predicted values.

I have found that my independent variable, X, contains timestamps, which I wasn't actually expecting. Perhaps I'm doing something wrong, or missing something?

Top of the original dataset:

       Date    Score
0 2019-05-01 4.607744 
1 2019-05-02 4.709202 
2 2019-05-03 4.132390 
3 2019-05-05 4.747308 
4 2019-05-07 4.745926

Create the independent data set (X)
Convert the dataframe to a numpy array

X = np.array(df.drop(['Prediction'],1))

Remove the last '30' rows

X = X[:-forecast_out]
print(X)

Example of output:

[[Timestamp('2019-05-01 00:00:00') 4.607744342064972]
[Timestamp('2019-05-02 00:00:00') 4.709201914086133]
[Timestamp('2019-05-03 00:00:00') 4.132389742485806]
[Timestamp('2019-05-05 00:00:00') 4.74730802483691]
[Timestamp('2019-05-07 00:00:00') 4.7459264970444615]
[Timestamp('2019-05-08 00:00:00') 4.595303054619376]

Create the dependent data set (y)
Convert the dataframe to a numpy array

y = np.array(df['Prediction'])

Get all of the y values except the last '30' rows

y = y[:-forecast_out]
print(y)

Some of the output:

[4.63738251 4.34354486 5.12284464 4.2751933  4.53362196 4.32665058
 4.77433793 4.37496465 4.31239161 4.90445026 4.81738271 3.99114536
 5.21672369 4.4932632  4.46858993 3.93271862 4.55618508 4.11493084
 4.02430584 4.11672606 4.19725244 4.3088558  4.98277563 4.97960989

Split the data into 80% training and 20% testing

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Create and train the Linear Regression Model

lr = LinearRegression()

Train the model

lr.fit(x_train, y_train)

The error:

TypeError: float() argument must be a string or a number, not 'Timestamp'

Clearly the dataset X doesn't like having the timestamp, and like I say, I wasn't really expecting it.

Any help on removing it (or perhaps I need it!?) would be great. As you can see, I'm simply looking to perform a simple regression analysis.

Are the timestamps in the same column in the Excel file as well?
– Maarten Fabré, Commented Aug 19, 2019 at 15:32
@MRL drop 'Date' AND 'Prediction' during the initialization of X.
– Matthew E. Miller, Commented Aug 19, 2019 at 15:48

Matthew E. Miller · Accepted Answer · 2019-08-19 15:41:46Z

1

Do not include the Timestamps (Date) in your creation of 'X'.

The data set is already ordered, so do you really need the timestamps? Another option is to make the dates the index instead of a column. In either case, I think you should not pass Timestamps as feature data, since the model tries to convert every value in X to a float.

Implement changes at this step:

X = np.array(df.drop(['Prediction'],1))

Do something like:

 X = np.array(df.drop(['Date', 'Prediction'], axis=1))
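
A minimal, self-contained sketch of that change, under stated assumptions: the small DataFrame below is made up and stands in for the Excel file, `forecast_out = 2` replaces the question's 30, and building `Prediction` with `shift` is a guess at how the column was filled rather than something the question confirms.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the Excel data (values are made up).
df = pd.DataFrame({
    "Date": pd.date_range("2019-05-01", periods=10, freq="D"),
    "Score": [4.6, 4.7, 4.1, 4.7, 4.7, 4.6, 4.3, 5.1, 4.2, 4.5],
})

forecast_out = 2  # assumed horizon; the question uses 30
# Assumed target: Score shifted up by forecast_out rows, leaving NaN at the end.
df["Prediction"] = df["Score"].shift(-forecast_out)

# Keeping 'Date' gives an object array that still holds Timestamp objects.
X_with_dates = np.array(df.drop(columns=["Prediction"]))

# Dropping 'Date' as well leaves a purely numeric float array.
X = np.array(df.drop(columns=["Date", "Prediction"]))[:-forecast_out]
y = np.array(df["Prediction"])[:-forecast_out]

lr = LinearRegression()
lr.fit(X, y)  # every feature is a float, so the Timestamp TypeError cannot occur
```

Checking `X_with_dates.dtype` against `X.dtype` shows the difference: mixing a datetime column with a float column forces NumPy to fall back to a generic object array, which is why the Timestamps reached the model in the first place.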


1 Comment

MRL Over a year ago

Cheers. Not sure why the timestamps were being passed as data; I've never had that issue before.

Gokul Krishnan R · Accepted Answer · 2019-08-19 15:37:35Z

0

I think the problem could be solved by using the date timestamp as an index field instead. You can try set_index to move the Date column into the index.
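
A rough sketch of the index approach, assuming `set_index` is the call used to move `Date` into the index (`reset_index` does the opposite and turns the index back into a column). The three rows below are made up and stand in for the Excel data.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the Excel file.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-05-01", "2019-05-02", "2019-05-03"]),
    "Score": [4.607744, 4.709202, 4.132390],
    "Prediction": [4.132390, np.nan, np.nan],
})

# Move 'Date' out of the columns and into the index.
indexed = df.set_index("Date")

# The index is not part of the values, so only the numeric column remains.
X = np.array(indexed.drop(columns=["Prediction"]))
```

The dates stay attached to each row through `indexed.index`, so they remain available for plotting or lookups without being passed to the model as a feature.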

