Imagine we have multiple time-series observations for multiple entities, and we want to perform hyper-parameter tuning on a single model, splitting the data in a time-series cross-validation fashion.
To my knowledge, there isn't a straightforward way to perform this hyper-parameter tuning within the scikit-learn framework. TimeSeriesSplit provides this functionality for a single time series; however, it doesn't work for multiple entities.
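For context, here is roughly what TimeSeriesSplit does on a single ordered series (a minimal illustration; the number of splits is arbitrary):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# One entity, 10 ordered periods
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Each training window expands forward in time;
    # each test window immediately follows it.
    print(train_idx, test_idx)
```

Because the splits are purely positional, stacking several entities into one frame would mix past and future observations across entities, which is why this doesn't carry over to panel data.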
As a simple example imagine we have a dataframe:
import numpy as np
import pandas as pd
from itertools import product

# create a dataframe with two countries and ten periods each
countries = ['ESP', 'FRA']
periods = list(range(10))
df = pd.DataFrame(list(product(countries, periods)), columns=['country', 'period'])
df['target'] = np.concatenate((np.repeat(1, 10), np.repeat(0, 10)))
df['a_feature'] = np.random.randn(20)
# this produces the following dataframe:
country,period,target,a_feature
ESP,0,1,0.08
ESP,1,1,-2.0
ESP,2,1,0.1
ESP,3,1,-0.59
ESP,4,1,-0.83
ESP,5,1,0.05
ESP,6,1,0.05
ESP,7,1,0.42
ESP,8,1,0.04
ESP,9,1,2.17
FRA,0,0,-0.44
FRA,1,0,-0.48
FRA,2,0,0.82
FRA,3,0,-1.64
FRA,4,0,0.19
FRA,5,0,0.6
FRA,6,0,-0.73
FRA,7,0,-0.5
FRA,8,0,1.11
FRA,9,0,-0.75
We want to train a single model across Spain and France: take all the data up to a certain period, train on it, and then use that model to predict the next period for both countries. We then want to assess which set of hyper-parameters performs best.
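To make the desired splitting scheme concrete, here is a sketch of one possible approach: a generator that yields (train, test) positional indices based on the period column, which GridSearchCV accepts through its cv parameter. The function name panel_time_series_split and the min_train_periods parameter are my own inventions, not a scikit-learn API:

```python
import numpy as np
import pandas as pd
from itertools import product
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def panel_time_series_split(periods, min_train_periods=3):
    """Yield expanding-window (train, test) positional indices over a panel:
    train on all entities up to period t, test on all entities at period t."""
    periods = np.asarray(periods)
    unique = np.sort(np.unique(periods))
    for i in range(min_train_periods, len(unique)):
        train_idx = np.where(periods < unique[i])[0]
        test_idx = np.where(periods == unique[i])[0]
        yield train_idx, test_idx

# Rebuild the example frame from the question
countries = ['ESP', 'FRA']
periods = list(range(10))
df = pd.DataFrame(list(product(countries, periods)),
                  columns=['country', 'period'])
df['target'] = np.concatenate((np.repeat(1, 10), np.repeat(0, 10)))
df['a_feature'] = np.random.randn(20)

# Each fold trains on all countries up to a period and tests on the next one
cv = list(panel_time_series_split(df['period']))
search = GridSearchCV(LogisticRegression(),
                      param_grid={'C': [0.1, 1.0, 10.0]},
                      cv=cv)
search.fit(df[['a_feature']], df['target'])
```

Whether this respects all the constraints of the problem (e.g. unbalanced panels, entities entering or leaving) is exactly what I'd like to know.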
How can I do hyper-parameter tuning with panel data in a time-series cross-validation framework?
Similar questions have been asked here: