I'm developing a model to predict the target variable using the RandomForestRegressor from scikit-learn.

I have written a function to get the MSE, as shown below:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def get_mse(n_estimators, max_leaf_nodes, X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=n_estimators, max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    preds_val = model.predict(X_valid)
    # squared=False returns the square root of the MSE (the RMSE)
    mse = mean_squared_error(y_valid, preds_val, squared=False)
    return mse

I would like to use a for loop to find the best MSE score by combining a list of values for n_estimators with a list of values for max_leaf_nodes.

Below is the code that I wrote:

n_estimators = [100,150,200,250]
max_leaf_nodes = [10, 50, 100, 200]

for n_estimators,max_leaf_nodes in zip(n_estimators,max_leaf_nodes):
    my_mse = get_mse(n_estimators,max_leaf_nodes, X_train, X_valid, y_train, y_valid)
    print("N_estimators: %d  \t\t Max leaf nodes: %d  \t\t Mean Squared Error:  %d" %(n_estimators, max_leaf_nodes, my_mse))

But when I run this for loop, it always prints an MSE of 0 for each combination of the two hyperparameters.

I have tested my function with the following call, and it returns the correct MSE:

get_mse(200, 100, X_train, X_valid, y_train, y_valid)

I'm wondering why my for loop always reports an MSE of 0.

Could someone help me solve this issue?

Thank you

  • Did you try replacing %d with %f for mse in the format string? If the mean squared error is a float between 0 and 1, using %d will always print zero. (Commented Aug 17, 2021 at 19:05)
  • You should consider using sklearn facilities, like GridSearch scikit-learn.org/stable/modules/generated/… or something similar. (Commented Aug 17, 2021 at 19:20)

1 Answer

There are mainly two things to consider:

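First, do not reuse the names of the lists (n_estimators and max_leaf_nodes) as loop variables. The loop rebinds both names to single integers, so the lists are gone once it has run, and executing the same loop a second time fails. Give the lists clearly distinguishable names instead: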

n_estimators_list = [100, 150, 200, 250]
max_leaf_nodes_list = [10, 50, 100, 200]

for n_estimators, max_leaf_nodes in zip(n_estimators_list, max_leaf_nodes_list):
...
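
To see why reusing the names is fragile, here is a minimal sketch that uses only the two lists from the question and no model. The variable rerun_error is a hypothetical name introduced for illustration.

```python
n_estimators = [100, 150, 200, 250]
max_leaf_nodes = [10, 50, 100, 200]

# zip() receives the two lists before the loop variables are rebound,
# so the first run of the loop still iterates over all four pairs.
for n_estimators, max_leaf_nodes in zip(n_estimators, max_leaf_nodes):
    pass

# After the loop, both names refer to the last pair of integers, not the lists.
print(n_estimators, max_leaf_nodes)

# Running the same loop again (for example, re-executing a notebook cell)
# now passes integers to zip(), which raises a TypeError.
try:
    for n_estimators, max_leaf_nodes in zip(n_estimators, max_leaf_nodes):
        pass
    rerun_error = None
except TypeError as exc:
    rerun_error = exc

print(type(rerun_error).__name__)
```

The first run works, which is why this problem is easy to miss; it only surfaces when the loop is executed again in the same session.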

Second, as pointed out in the comments above, you should replace the %d formatter for mse with %f. The %d formatter converts the float to an integer and drops the fractional part, so any value between 0 and 1 is printed as 0:

print("N_estimators: %d  \t\t Max leaf nodes: %d  \t\t Mean Squared Error:  %f" %(n_estimators, max_leaf_nodes, my_mse))
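
The difference between the two formatters can be checked in isolation. This is a small sketch with a made-up error value (0.4273 is an assumption, not a result from the question's data):

```python
rmse = 0.4273  # hypothetical error value between 0 and 1

as_int = "%d" % rmse      # %d converts the float to an integer, dropping the fraction
as_float = "%f" % rmse    # %f keeps six decimal places by default
as_short = "%.3f" % rmse  # the precision can also be set explicitly

print(as_int, as_float, as_short)
```

Here as_int is the string "0", while the two float formatters keep the fractional part.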

Personally, I would recommend using one of the newer string formatting options, for example Python 3's f-strings, to avoid such mishaps:

print(f"N_estimators: {n_estimators}  \t\t Max leaf nodes: {max_leaf_nodes}  \t\t Mean Squared Error:  {my_mse}")
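
As a rough illustration of why f-strings help here, the sketch below uses made-up values for the three variables. An f-string accepts a precision such as .4f, and asking it to format a float with the integer code d raises an error instead of silently printing 0.

```python
# Hypothetical values standing in for one iteration of the loop.
n_estimators, max_leaf_nodes, my_mse = 200, 100, 0.4273

# A format spec after the colon controls the number of decimals.
line = f"N_estimators: {n_estimators}\tMax leaf nodes: {max_leaf_nodes}\tError: {my_mse:.4f}"
print(line)

# Unlike "%d" % my_mse, the integer format code is rejected for floats.
try:
    f"{my_mse:d}"
    format_error = None
except ValueError as exc:
    format_error = exc

print(type(format_error).__name__)
```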

One last note, also mentioned in the comments: for hyperparameter tuning, you could use GridSearchCV, which is a ready-made tool for finding the best hyperparameters through an exhaustive search over a predefined grid. Example usage:

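from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 150, 200, 250],
    'max_leaf_nodes': [10, 50, 100, 200]
}

gs = GridSearchCV(
    estimator=RandomForestRegressor(),
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error'
)

# X, y: the features and the target; GridSearchCV creates the validation splits itself
gs.fit(X, y)
print(gs.best_params_)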

The advantage is that this implementation is battle-proven, provides many readily available values and statistics to inspect the result, and uses cross-validation. Furthermore, it explores all 16 possible hyperparameter combinations. Your own loop does not: zip pairs the two lists element by element, so it only tries the 4 combinations (100, 10), (150, 50), (200, 100) and (250, 200).
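
If you prefer to keep a hand-written loop, itertools.product is one way to cover the full grid. The sketch below is not the original code: toy_error is a hypothetical stand-in for get_mse so that the example runs without any data, and its formula is made up purely for illustration.

```python
from itertools import product

n_estimators_list = [100, 150, 200, 250]
max_leaf_nodes_list = [10, 50, 100, 200]

# zip pairs the lists element by element; product builds every combination.
paired = list(zip(n_estimators_list, max_leaf_nodes_list))
full_grid = list(product(n_estimators_list, max_leaf_nodes_list))
print(len(paired), len(full_grid))


def toy_error(n_estimators, max_leaf_nodes):
    # Hypothetical stand-in for get_mse(...); swap in the real call
    # with X_train, X_valid, y_train, y_valid when using actual data.
    return abs(n_estimators - 150) / 100 + abs(max_leaf_nodes - 100) / 100


# Pick the combination with the lowest error over the full grid.
best = min(full_grid, key=lambda pair: toy_error(*pair))
print(best, best in paired)
```

With this toy error, the best combination is one that the zip-based loop never evaluates, which is the risk of pairing the lists instead of crossing them.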

You can read more about GridSearchCV in its documentation.
