Random forest getting mse by tuning two hyperparameters using a for loop

Question

I'm developing a model to predict a target variable using RandomForestRegressor from scikit-learn.

I have written a function to get the MSE, as shown below:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def get_mse(n_estimators, max_leaf_nodes, X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=n_estimators, max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    preds_val = model.predict(X_valid)
    # With squared=False this is the root mean squared error (RMSE)
    mse = mean_squared_error(y_valid, preds_val, squared=False)
    return mse

I would like to use a for loop to find the best MSE by combining lists of values for n_estimators and max_leaf_nodes.

Below is the code that I wrote:

n_estimators = [100,150,200,250]
max_leaf_nodes = [10, 50, 100, 200]

for n_estimators,max_leaf_nodes in zip(n_estimators,max_leaf_nodes):
    my_mse = get_mse(n_estimators,max_leaf_nodes, X_train, X_valid, y_train, y_valid)
    print("N_estimators: %d  \t\t Max leaf nodes: %d  \t\t Mean Squared Error:  %d" %(n_estimators, max_leaf_nodes, my_mse))

But when I run this for loop, it always returns an MSE of 0 for each combination of the two hyperparameters.

I have tested my function with the following call, and it returns the correct MSE:

get_mse(200, 100, X_train, X_valid, y_train, y_valid)

I'm wondering why my for loop is not working properly and always returns an MSE of 0.

Could someone help me solve this issue?

Thank you

Did you try replacing %d with %f for mse in the format string? If the mean squared error is a float between 0 and 1, using %d will always print zero.
– hilberts_drinking_problem, Commented Aug 17, 2021 at 19:05
You should consider using sklearn facilities, like GridSearch scikit-learn.org/stable/modules/generated/… or something similar.
– Rodrigo Laguna, Commented Aug 17, 2021 at 19:20

afsharov · Accepted Answer · 2021-08-18 08:35:35Z

There are mainly two things to consider:

First, do not shadow the names you already used to declare the list of values (n_estimators and max_leaf_nodes). Instead, make them clearly distinguishable:

n_estimators_list = [100, 150, 200, 250]
max_leaf_nodes_list = [10, 50, 100, 200]

for n_estimators, max_leaf_nodes in zip(n_estimators_list, max_leaf_nodes_list):
    ...
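
To illustrate why the shadowing matters, here is a minimal sketch using only the lists from the question (no model is trained). The loop works the first time because zip() receives the lists before the loop variables are rebound, but afterwards the names refer to plain integers, so rerunning the same cell fails:

```python
n_estimators = [100, 150, 200, 250]
max_leaf_nodes = [10, 50, 100, 200]

# zip() is called once, before the loop variables are rebound,
# so the first run of the loop still sees the two lists.
for n_estimators, max_leaf_nodes in zip(n_estimators, max_leaf_nodes):
    pass

# After the loop, the names hold the last pair of values, not the lists.
print(n_estimators, max_leaf_nodes)  # 250 200

# Running the same loop again (e.g. re-executing a notebook cell) now fails,
# because zip() is handed two integers instead of two lists.
rerun_error = None
try:
    for n_estimators, max_leaf_nodes in zip(n_estimators, max_leaf_nodes):
        pass
except TypeError as exc:
    rerun_error = exc
    print("Second run failed:", exc)
```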

Secondly, as pointed out in the comments above, you should replace the %d formatter for mse with %f since values between 0 and 1 would otherwise be formatted as 0:

print("N_estimators: %d  \t\t Max leaf nodes: %d  \t\t Mean Squared Error:  %f" %(n_estimators, max_leaf_nodes, my_mse))

Personally, I would recommend using one of the newer string formatting options, for example Python 3's f-strings, to avoid such mishaps:

print(f"N_estimators: {n_estimators}  \t\t Max leaf nodes: {max_leaf_nodes}  \t\t Mean Squared Error:  {my_mse}")
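
As a quick check of the formatting behaviour described above, the following sketch formats a made-up score (the value 0.4271 is purely illustrative, not a result from the question) with each option:

```python
rmse = 0.4271  # hypothetical score, for illustration only

as_int = "%d" % rmse        # %d truncates the float to an integer
as_float = "%f" % rmse      # %f prints six decimal places by default
as_fstring = f"{rmse:.4f}"  # f-string with an explicit precision

print(as_int)      # 0
print(as_float)    # 0.427100
print(as_fstring)  # 0.4271
```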

A last note, which has also been mentioned in the comments: for hyperparameter tuning, you could use GridSearchCV, a ready-made scikit-learn utility that finds the best hyperparameters through an exhaustive search over a predefined grid. Example usage:

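from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 150, 200, 250],
    'max_leaf_nodes': [10, 50, 100, 200]
}

gs = GridSearchCV(
    estimator=RandomForestRegressor(),
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error'
)

# X and y are the features and target; GridSearchCV splits them internally via cross-validation
gs.fit(X, y)
print(gs.best_params_)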

The advantage is that this implementation is battle-tested, provides many readily available values and statistics to inspect the result, and uses cross-validation. Furthermore, it explores all possible hyperparameter combinations. Your own loop does not: zip only pairs the two lists element by element, so it tries 4 of the 16 possible combinations.
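
If you prefer to keep a hand-written loop, one way to cover the whole grid is itertools.product from the standard library. The sketch below only compares the combinations produced by zip and by product; the variable names paired and full_grid are illustrative:

```python
from itertools import product

n_estimators_list = [100, 150, 200, 250]
max_leaf_nodes_list = [10, 50, 100, 200]

# zip pairs the lists element by element: one pair per position.
paired = list(zip(n_estimators_list, max_leaf_nodes_list))

# product yields every combination of the two lists.
full_grid = list(product(n_estimators_list, max_leaf_nodes_list))

print(len(paired))     # 4
print(len(full_grid))  # 16
print(full_grid[:3])   # [(100, 10), (100, 50), (100, 100)]
```

Looping over full_grid instead of the zip, and calling get_mse for each pair, would then evaluate every combination on your validation set.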

You can read more about GridSearchCV in its documentation.
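
To give an idea of how the fitted search can be inspected, here is a small sketch. It assumes synthetic data from make_regression as a stand-in for your X and y, and a deliberately reduced grid with cv=3 so that it runs quickly; none of these values come from the question. Note that the 'neg_root_mean_squared_error' scorer reports the negated RMSE, so the sign is flipped before printing:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: replace with your own X and y).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Reduced grid so the example finishes quickly.
param_grid = {"n_estimators": [10, 20], "max_leaf_nodes": [10, 50]}

gs = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
gs.fit(X, y)

# The scorer is negated so that "higher is better"; flip the sign to get the RMSE.
best_rmse = -gs.best_score_
print("Best parameters:", gs.best_params_)
print(f"Best cross-validated RMSE: {best_rmse:.3f}")

# Mean cross-validated RMSE for every combination that was tried.
for params, score in zip(gs.cv_results_["params"], gs.cv_results_["mean_test_score"]):
    print(params, f"RMSE: {-score:.3f}")
```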
