Question:

  1. How can I tune the hyperparameters of a random forest on panel data in Python?
  2. Is there a package or function that already implements this?

I have looked for answers among others in:

  1. https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9
  2. https://stats.stackexchange.com/questions/326228/cross-validation-with-time-series
  3. https://stats.stackexchange.com/questions/369397/correct-cross-validation-procedure-for-single-model-applied-to-panel-data

which all lead me to the current state of code.

Problem:

I am trying to predict the quantity of each product sold per week. I have ~5000 products (grouped into categories) and 1.5 years of history. With so many products, fitting an individual model per product does not seem sensible, so I want one large model that also takes the product's category into account.

I understand the idea of time-sensitive cross-validation and nested cross-validation but lack the ability to implement them efficiently.

Example data:

import pandas as pd
import numpy as np
from random import seed
from random import randint

seed(1)
Panel_data = pd.DataFrame({
    'Product': ["A", "B"] * 10,
    'Time': [ele for ele in range(1, 11) for i in range(2)],
    'Z': [randint(0, 10) for ele in range(1, 21)],
    'X': [randint(0, 10) for ele in range(1, 21)]})

Panel_data['Y'] = Panel_data['X'] + [randint(0, 10) for ele in range(1, 21)]

My current approach to nested sliding window CV (as described in link)

I have written a loop that trains the model on one time period and evaluates it on the next. Within each iteration, I use RandomizedSearchCV from sklearn.model_selection to find the best hyperparameters for that subset. After iterating through all time IDs in the data, I select the median of each hyperparameter.

This approach is very time-consuming! Is it the correct approach, and is there a better way to do it?

from sklearn.metrics import mean_squared_error
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from statistics import mode

rf_beseline = RandomForestRegressor(n_estimators = 2, random_state = 42)
OLS_baseline = linear_model.LinearRegression()

random_grid = {'n_estimators': [100, 500],
               'max_features': ['auto', 'sqrt'],  # 'auto' was removed in scikit-learn 1.3; use 1.0 (all features) there
               'max_depth': [3, None],
               'min_samples_split': [3, 10],
               'min_samples_leaf': [3, 10]}

rf_MSE = list()

n_estimators = list()
min_samples_split = list()
min_samples_leaf = list()
max_features = list()
max_depth = list()

for i in range(1, 10):
    print(i)
    X_train = Panel_data.loc[Panel_data['Time'] == i, ['X', 'Z']]
    Y_train = Panel_data.loc[Panel_data['Time'] == i, 'Y']
    X_test = Panel_data.loc[Panel_data['Time'] == i + 1, ['X', 'Z']]
    Y_test = Panel_data.loc[Panel_data['Time'] == i + 1, 'Y']
    
    #random forest
    rf_beseline.fit(X_train, Y_train)
    y_pred = rf_beseline.predict(X_test) 
    mse = mean_squared_error(Y_test, y_pred)
    rf_MSE = rf_MSE + [mse]
    
    # hyperparameter search
    rf_rs = RandomizedSearchCV(estimator = rf_beseline, param_distributions = random_grid, n_iter = 5, cv = 2, verbose = 2, random_state = 42, n_jobs = -1)
    rf_rs.fit(X_train, Y_train)
    n_estimators = n_estimators + [rf_rs.best_params_.get('n_estimators')]
    min_samples_split = min_samples_split + [rf_rs.best_params_.get('min_samples_split')]
    min_samples_leaf = min_samples_leaf + [rf_rs.best_params_.get('min_samples_leaf')]
    max_features = max_features + [rf_rs.best_params_.get('max_features')]
    max_depth = max_depth + [rf_rs.best_params_.get('max_depth')]


np.mean(rf_MSE)


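# selected hyperparameters (cast to int: np.median returns a float,
# which scikit-learn does not accept as a count for these parameters)
sel_n_estimators = int(np.median(n_estimators))
sel_min_samples_split = int(np.median(min_samples_split))
sel_min_samples_leaf = int(np.median(min_samples_leaf))
sel_max_features = mode(max_features)
# treat None (unlimited depth) as 100 for the median, then map 100 back to None
median_depth = np.median([100 if v is None else v for v in max_depth])
sel_max_depth = None if median_depth == 100 else int(median_depth)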



rf_best = RandomForestRegressor(n_estimators = sel_n_estimators, 
                                random_state = 42,
                                min_samples_split = sel_min_samples_split, 
                                min_samples_leaf = sel_min_samples_leaf,
                                max_features = sel_max_features, 
                                max_depth = sel_max_depth,
                                bootstrap = True) 

My expectation and hope:

I hope there is an existing, easy-to-use function similar to RandomizedSearchCV that runs faster than the for loop I implemented.

2 Answers

1

1. Using for loop to generate data

The loop determines how the train/test data are generated; this is independent of RandomizedSearchCV. Because RandomizedSearchCV samples parameter combinations at random, it is normal for it to return good (lucky) or poor model parameters from one run to the next.
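Since the splits and the search are independent, the loop's splits can also be handed to RandomizedSearchCV directly: its cv argument accepts an iterable of (train, test) index arrays. The following is a minimal sketch under stated assumptions, not a drop-in solution. The helper name one_step_splits is hypothetical, the toy data is regenerated with NumPy rather than random.randint, the grid values are shrunk for speed, and max_features uses 1.0 in place of 'auto' so that it runs on recent scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Toy panel: 2 products observed over 10 time periods (assumed data).
rng = np.random.default_rng(1)
panel = pd.DataFrame({
    "Product": ["A", "B"] * 10,
    "Time": np.repeat(np.arange(1, 11), 2),
    "Z": rng.integers(0, 11, size=20),
    "X": rng.integers(0, 11, size=20),
})
panel["Y"] = panel["X"] + rng.integers(0, 11, size=20)


def one_step_splits(time_values):
    """Yield (train, test) positions: train on period t, test on the next period."""
    time_values = np.asarray(time_values)
    periods = np.sort(np.unique(time_values))
    for current, following in zip(periods[:-1], periods[1:]):
        train_idx = np.flatnonzero(time_values == current)
        test_idx = np.flatnonzero(time_values == following)
        yield train_idx, test_idx


splits = list(one_step_splits(panel["Time"]))

param_distributions = {
    "n_estimators": [10, 20],        # small values to keep the sketch fast
    "max_features": [1.0, "sqrt"],
    "max_depth": [3, None],
    "min_samples_split": [3, 10],
    "min_samples_leaf": [3, 10],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,
    cv=splits,                        # the time-ordered splits from the helper
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(panel[["X", "Z"]], panel["Y"])

print(search.best_params_)
print(-search.best_score_)  # mean MSE over the time-ordered folds
```

With this arrangement each candidate is scored on every time-ordered fold and a single set of best parameters comes back, so no per-fold median is needed. Note that the default refit=True retrains the best model on all rows at the end.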

Here is an example implementation that uses optuna to optimize the parameters. The data is still generated by your loop. The important part is to define an objective function that returns the MSE as the objective value.

"""
Using optuna hyperparameter optimizer.

Ref: https://github.com/optuna/optuna
"""

import time
from random import randint

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
import optuna


Panel_data = pd.DataFrame({
    'Product': ["A", "B"] * 10,
    'Time': [ele for ele in range(1, 11) for i in range(2)],
    'Z': [randint(0, 10) for ele in range(1, 21)],
    'X': [randint(0, 10) for ele in range(1, 21)]})

Panel_data['Y'] = Panel_data['X'] + [randint(0, 10) for ele in range(1, 21)]


def objective(trial):
    # Define model with init values from optuna.
    rf_model = RandomForestRegressor(
        n_estimators = trial.suggest_int('n_estimators', 100, 500),
        min_samples_split = trial.suggest_int('min_samples_split', 3, 10),
        min_samples_leaf = trial.suggest_int('min_samples_leaf', 3, 10),
        max_features = trial.suggest_categorical("max_features", ["auto", "sqrt"]),  # 'auto' was removed in scikit-learn 1.3; use 1.0 there
        max_depth = trial.suggest_int('max_depth', 3, 10),
        bootstrap = True,
        random_state = 42
    )
    
    allmse = []
    
    # Create datasets in CV scheme considering timeseries data.
    for i in range(1, 10):
        X_train = Panel_data.loc[Panel_data['Time'] == i, ['Z', 'X']]
        Y_train = Panel_data.loc[Panel_data['Time'] == i, 'Y']
        X_test = Panel_data.loc[Panel_data['Time'] == i + 1, ['Z', 'X']]
        Y_test = Panel_data.loc[Panel_data['Time'] == i + 1, 'Y']
        
        # Fit the train data.    
        rf_model.fit(X_train, Y_train)
        
        # Test the model with test data.        
        y_pred = rf_model.predict(X_test)
        
        # Save the mse.
        mse = mean_squared_error(Y_test, y_pred)
        allmse.append(mse)
        
    return np.mean(allmse)  # Send mse as feedback to optuna sampler


def optuna_tune():
    t0 = time.perf_counter()
    
    num_trials = 30  # more trials help, especially with many parameters and wide ranges
    sampler = optuna.samplers.TPESampler(seed=1)  # TPE is optuna default sampler, others cmaes, skopt, etc
    
    study = optuna.create_study(sampler=sampler, direction='minimize')
    study.optimize(objective, n_trials=num_trials)
    
    # Show the best params and mse value
    best_params = study.best_params
    print(f'best params: {study.best_params}')
    print(f'best mean value: {study.best_value}')  
    
    print(f'elapse: {time.perf_counter() - t0:0.1f}s')
    

# Start
optuna_tune()

Output:

...

[I 2021-12-08 13:29:24,229] Trial 29 finished with value: 44.890960282703766 and parameters: {'n_estimators': 272, 'min_samples_split': 3, 'min_samples_leaf': 9, 'max_features': 'auto', 'max_depth': 9}. Best is trial 3 with value: 44.515468624442995.
best params: {'n_estimators': 156, 'min_samples_split': 4, 'min_samples_leaf': 9, 'max_features': 'auto', 'max_depth': 8}
best mean value: 44.515468624442995
elapse: 74.8s

2. TimeSeriesSplit

Another way to prepare time series data is TimeSeriesSplit() from sklearn. By default it generates the folds differently: it expands the training data from fold to fold while preserving the time order. See the example below.

   Product  Time  Z   X   Y
0        A     1  2   4   9
1        B     1  2   9  12
2        A     2  5   3   3
3        B     2  2   4   5
4        A     3  2  10  11
5        B     3  9   2   3
6        A     4  8  10  18
7        B     4  4   0   6
8        A     5  3   5   7
9        B     5  8   1   8
10       A     6  6   6  12
11       B     6  7   6   8
12       A     7  7  10  17
13       B     7  8   8  15
14       A     8  8   2  10
15       B     8  4   9  18
16       A     9  2   7   7
17       B     9  8   7  16
18       A    10  9   8  11
19       B    10  8   7  16
X_train: [[2 4]]
Y_train: [9]
X_test: [[2 9]]
Y_test: [12]

For the next fold, it puts 2 samples in the training set, and so on: the training window expands.

X_train: [[2 4]
 [2 9]]
Y_train: [ 9 12]
X_test: [[5 3]]
Y_test: [3]

...

I use this scheme with the optuna optimizer in the code below.

"""
Using optuna hyperparameter optimizer and sklearn TimeSeriesSplit

Ref: 
    https://github.com/optuna/optuna
    https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
"""

import time
from random import randint

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit
import optuna


Panel_data = pd.DataFrame({
    'Product': ["A", "B"] * 10,
    'Time': [ele for ele in range(1, 11) for i in range(2)],
    'Z': [randint(0, 10) for ele in range(1, 21)],
    'X': [randint(0, 10) for ele in range(1, 21)]})

Panel_data['Y'] = Panel_data['X'] + [randint(0, 10) for ele in range(1, 21)]
print(Panel_data.to_string())


def objective(trial):
    # Define model with init values from optuna.
    rf_model = RandomForestRegressor(
        n_estimators = trial.suggest_int('n_estimators', 100, 500),
        min_samples_split = trial.suggest_int('min_samples_split', 3, 10),
        min_samples_leaf = trial.suggest_int('min_samples_leaf', 3, 10),
        max_features = trial.suggest_categorical("max_features", ["auto", "sqrt"]),  # 'auto' was removed in scikit-learn 1.3; use 1.0 there
        max_depth = trial.suggest_int('max_depth', 3, 10),
        bootstrap = True,
        random_state = 42
    )
    
    allmse = []
    
    tscv = TimeSeriesSplit(gap=0, max_train_size=None, n_splits=19, test_size=None)
    X = np.array(Panel_data[['Z', 'X']])
    y = np.array(Panel_data[['Y']])
    
    # Create datasets in CV scheme considering timeseries data.
    for train_index, test_index in tscv.split(X):
        X_train, X_test = X[train_index], X[test_index]
        Y_train, Y_test = y[train_index], y[test_index]
        
        Y_train = Y_train.ravel()
        Y_test = Y_test.ravel()
        
        print(f'X_train: {X_train}')
        print(f'Y_train: {Y_train}')
        print(f'X_test: {X_test}')
        print(f'Y_test: {Y_test}')
        
        # Fit the train data.    
        rf_model.fit(X_train, Y_train)
        
        # Test the model with test data.        
        y_pred = rf_model.predict(X_test)
        
        # Save the mse.
        mse = mean_squared_error(Y_test, y_pred)
        allmse.append(mse)
        
    return np.mean(allmse)  # Send mse as feedback to optuna sampler


def optuna_tune():
    t0 = time.perf_counter()
    
    num_trials = 20  # more trials help, especially with many parameters and wide ranges
    sampler = optuna.samplers.TPESampler(seed=1)  # TPE is optuna default sampler, others cmaes, skopt, etc
    
    study = optuna.create_study(sampler=sampler, direction='minimize')
    study.optimize(objective, n_trials=num_trials)
    
    # Show the best params and mse value
    best_params = study.best_params
    print(f'best params: {study.best_params}')
    print(f'best mean value: {study.best_value}')  
    
    print(f'elapse: {time.perf_counter() - t0:0.1f}s')
    

# Start
optuna_tune()

Output:

[I 2021-12-08 15:06:44,324] A new study created in memory with name: no-name-20410ee9-790a-4ad1-8930-50baee3faefc
   Product  Time  Z   X   Y
0        A     1  2   4   9
1        B     1  2   9  12
2        A     2  5   3   3
3        B     2  2   4   5
4        A     3  2  10  11
5        B     3  9   2   3
6        A     4  8  10  18
7        B     4  4   0   6
8        A     5  3   5   7
9        B     5  8   1   8
10       A     6  6   6  12
11       B     6  7   6   8
12       A     7  7  10  17
13       B     7  8   8  15
14       A     8  8   2  10
15       B     8  4   9  18
16       A     9  2   7   7
17       B     9  8   7  16
18       A    10  9   8  11
19       B    10  8   7  16
X_train: [[2 4]]
Y_train: [9]
X_test: [[2 9]]
Y_test: [12]
X_train: [[2 4]
 [2 9]]
Y_train: [ 9 12]
X_test: [[5 3]]
Y_test: [3]
X_train: [[2 4]
 [2 9]
 [5 3]]
Y_train: [ 9 12  3]
X_test: [[2 4]]
Y_test: [5]
X_train: [[2 4]
 [2 9]
 [5 3]
 [2 4]]
Y_train: [ 9 12  3  5]
X_test: [[ 2 10]]
Y_test: [11]

...

[I 2021-12-08 15:08:39,685] Trial 19 finished with value: 25.44589482974616 and parameters: {'n_estimators': 452, 'min_samples_split': 6, 'min_samples_leaf': 6, 'max_features': 'auto', 'max_depth': 10}. Best is trial 8 with value: 21.126358478438924.
best params: {'n_estimators': 215, 'min_samples_split': 4, 'min_samples_leaf': 3, 'max_features': 'auto', 'max_depth': 5}
best mean value: 21.126358478438924
elapse: 115.4s

Your results may vary because the panel data is generated randomly.
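One caveat with panel data: TimeSeriesSplit works on row positions, so in the folds printed above Product A of a given Time can land in the training set while Product B of the same Time lands in the test set. A possible workaround, sketched here under assumptions, is to split the unique time periods and then map the periods back to rows. The helper name panel_time_splits is hypothetical, and in this sketch max_train_size counts periods, not rows.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit


def panel_time_splits(time_values, n_splits=3, max_train_size=None):
    """Yield (train, test) row positions so that all rows of a period stay together.

    TimeSeriesSplit is applied to the sorted unique periods; each fold's
    periods are then mapped back to the rows that carry them.
    """
    time_values = np.asarray(time_values)
    periods = np.sort(np.unique(time_values))
    tscv = TimeSeriesSplit(n_splits=n_splits, max_train_size=max_train_size)
    for train_p, test_p in tscv.split(periods):
        train_idx = np.flatnonzero(np.isin(time_values, periods[train_p]))
        test_idx = np.flatnonzero(np.isin(time_values, periods[test_p]))
        yield train_idx, test_idx


# Same layout as the example data: 2 products per period, 10 periods.
time_col = np.repeat(np.arange(1, 11), 2)

for train_idx, test_idx in panel_time_splits(time_col, n_splits=3):
    print("train periods:", np.unique(time_col[train_idx]),
          "test periods:", np.unique(time_col[test_idx]))
```

With 10 periods and n_splits=3, the folds test on periods 5-6, 7-8 and 9-10, each time training on all earlier periods. Passing max_train_size=2 turns the expanding window into a sliding one that keeps only the two most recent periods for training. The generated splits can replace tscv.split(X) in the objective function, or be passed as the cv argument of a scikit-learn search.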


0

There's a fantastic package called optuna for tuning hyperparameters intelligently.

In short: you specify a range for each hyperparameter, and optuna then chooses the next set of hyperparameters to test based on the results from the previous sets, i.e. similar to Bayesian optimization.

This video gives a great overview (example starts at around 5:00 minutes) of how to use it.
