Question:
- How can I tune the hyperparameters of a random forest with panel data in Python?
- Is there an existing package or function that does this?
I have looked for answers in, among other places:
- https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9
- https://stats.stackexchange.com/questions/326228/cross-validation-with-time-series
- https://stats.stackexchange.com/questions/369397/correct-cross-validation-procedure-for-single-model-applied-to-panel-data
all of which led me to the current state of my code.
Problem:
I am trying to predict the quantity of each product sold in a week. I have ~5000 products (grouped into categories) and a 1.5-year history. With so many products, fitting an individual model per product does not seem to make sense, so I want one big model that also takes the product's category into account.
I understand the idea of time-sensitive cross-validation and nested cross-validation, but I lack the ability to implement them efficiently.
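For reference, one way to make a split time-aware for panel data is to split on the distinct time ids rather than on rows, so that all products observed in the same week end up on the same side of the split. Below is a minimal sketch of that idea, assuming an integer `Time` column; the helper name `sliding_window_splits` and its `window` parameter are hypothetical and not part of any library.

```python
import numpy as np
import pandas as pd


def sliding_window_splits(time_ids, window=3):
    """Yield (train_idx, test_idx) arrays of positional row indices.

    Each split trains on `window` consecutive time ids and tests on the
    time id that immediately follows them. (Hypothetical helper.)
    """
    time_ids = np.asarray(time_ids)
    unique_times = np.sort(np.unique(time_ids))
    for start in range(len(unique_times) - window):
        train_times = unique_times[start:start + window]
        test_time = unique_times[start + window]
        # Select every row (i.e. every product) belonging to these time ids
        train_idx = np.flatnonzero(np.isin(time_ids, train_times))
        test_idx = np.flatnonzero(time_ids == test_time)
        yield train_idx, test_idx


# Toy panel: 2 products observed over 6 weeks
toy = pd.DataFrame({
    "Product": ["A", "B"] * 6,
    "Time": [t for t in range(1, 7) for _ in range(2)],
})

splits = list(sliding_window_splits(toy["Time"], window=3))
for train_idx, test_idx in splits:
    train_weeks = toy["Time"].iloc[train_idx].unique().tolist()
    test_weeks = toy["Time"].iloc[test_idx].unique().tolist()
    print("train weeks:", train_weeks, "test weeks:", test_weeks)
```

Because the split is driven by time ids, no product's future observations can leak into the training rows of a given split.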
Example data:
import pandas as pd
import numpy as np
from random import seed, randint
seed(1)
# Toy panel: 2 products (A, B) observed at 10 time points
Panel_data = pd.DataFrame({
    'Product': ["A", "B"] * 10,
    'Time': [ele for ele in range(1, 11) for i in range(2)],
    'Z': [randint(0, 10) for ele in range(1, 21)],
    'X': [randint(0, 10) for ele in range(1, 21)]})
Panel_data['Y'] = Panel_data['X'] + [randint(0, 10) for ele in range(1, 21)]
My current approach to nested sliding-window CV (as described in the link):
I have written a loop that, for each time id, trains the model on that period and evaluates it on the next one. Within each iteration I use RandomizedSearchCV from sklearn.model_selection to find the best hyperparameters on that subset. After iterating through all time ids in the data, I select the median of the chosen hyperparameters (the mode for categorical ones).
This approach is very time-consuming! Is it the correct approach, and is there a better way to do it?
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
import statistics
from statistics import mode
rf_beseline = RandomForestRegressor(n_estimators = 2, random_state = 42)
OLS_baseline = linear_model.LinearRegression()
# Search space; 1.0 (use all features) stands in for 'auto',
# which was removed for max_features in scikit-learn 1.3
random_grid = {'n_estimators': [100, 500],
               'max_features': [1.0, 'sqrt'],
               'max_depth': [3, None],
               'min_samples_split': [3, 10],
               'min_samples_leaf': [3, 10]}
# Per-window results: test MSE and the best hyperparameters found
rf_MSE = list()
n_estimators = list()
min_samples_split = list()
min_samples_leaf = list()
max_features = list()
max_depth = list()
for i in range(1, 10):
    print(i)
    # Train on time id i, test on the following time id i + 1
    X_train = Panel_data.loc[Panel_data['Time'] == i, ['X', 'Z']]
    Y_train = Panel_data.loc[Panel_data['Time'] == i, 'Y']
    X_test = Panel_data.loc[Panel_data['Time'] == i + 1, ['X', 'Z']]
    Y_test = Panel_data.loc[Panel_data['Time'] == i + 1, 'Y']
    # random forest baseline
    rf_beseline.fit(X_train, Y_train)
    y_pred = rf_beseline.predict(X_test)
    mse = mean_squared_error(Y_test, y_pred)
    rf_MSE.append(mse)
    # hyperparameter search on the current training window
    rf_rs = RandomizedSearchCV(estimator = rf_beseline, param_distributions = random_grid, n_iter = 5, cv = 2, verbose = 2, random_state = 42, n_jobs = -1)
    rf_rs.fit(X_train, Y_train)
    n_estimators.append(rf_rs.best_params_.get('n_estimators'))
    min_samples_split.append(rf_rs.best_params_.get('min_samples_split'))
    min_samples_leaf.append(rf_rs.best_params_.get('min_samples_leaf'))
    max_features.append(rf_rs.best_params_.get('max_features'))
    max_depth.append(rf_rs.best_params_.get('max_depth'))
np.mean(rf_MSE)
# selected hyperparameters (np.median returns a float, so cast back to int)
sel_n_estimators = int(np.median(n_estimators))
sel_min_samples_split = int(np.median(min_samples_split))
sel_min_samples_leaf = int(np.median(min_samples_leaf))
sel_max_features = mode(max_features)
# Treat None (unlimited depth) as 100 so a median can be taken, then map 100 back to None
median_depth = np.median([100 if v is None else v for v in max_depth])
sel_max_depth = None if median_depth == 100 else int(median_depth)
rf_best = RandomForestRegressor(n_estimators = sel_n_estimators,
random_state = 42,
min_samples_split = sel_min_samples_split,
min_samples_leaf = sel_min_samples_leaf,
max_features = sel_max_features,
max_depth = sel_max_depth,
bootstrap = True)
My expectation and hope:
I hope there is an existing, easy-to-use function, similar to RandomizedSearchCV, that works faster than the for loop I implemented.
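One possibility worth noting: the `cv` argument of RandomizedSearchCV accepts an iterable of (train, test) index arrays, so time-ordered splits can be passed to a single search instead of running one search per time id. The sketch below is only an illustration under stated assumptions: it uses `TimeSeriesSplit` on the distinct weeks (an expanding window, not the sliding window used above) and maps each week back to its rows. The toy data, the names `panel` and `cv_splits`, and the small search space are made up for the example. It covers the tuning step only; an outer evaluation loop would still be needed for fully nested CV.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Toy panel (made up): 4 products observed over 12 weeks
rng = np.random.default_rng(1)
n_products, n_weeks = 4, 12
panel = pd.DataFrame({
    "Product": np.tile(np.arange(n_products), n_weeks),
    "Time": np.repeat(np.arange(1, n_weeks + 1), n_products),
    "X": rng.integers(0, 11, n_products * n_weeks),
    "Z": rng.integers(0, 11, n_products * n_weeks),
})
panel["Y"] = panel["X"] + rng.integers(0, 11, len(panel))

# Split the distinct weeks (not the rows), then map each week back to its rows
weeks = np.sort(panel["Time"].unique())
time_values = panel["Time"].to_numpy()
cv_splits = []
for train_w, test_w in TimeSeriesSplit(n_splits=4).split(weeks):
    train_idx = np.flatnonzero(np.isin(time_values, weeks[train_w]))
    test_idx = np.flatnonzero(np.isin(time_values, weeks[test_w]))
    cv_splits.append((train_idx, test_idx))

# Deliberately small search space so the example runs quickly
param_distributions = {
    "n_estimators": [20, 50],
    "max_features": [1.0, "sqrt"],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 3],
}

# One search over all time-ordered splits; each candidate is scored on every split
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,
    cv=cv_splits,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(panel[["X", "Z"]], panel["Y"])
print(search.best_params_)
```

With this layout each candidate is scored by its average error across all time-ordered splits, so a single set of hyperparameters is chosen directly and no median over per-period results is required.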