```python
class sklearn.ensemble.RandomForestClassifier(n_estimators=10,
                                              criterion='gini',
                                              max_depth=None,
                                              min_samples_split=2,
                                              min_samples_leaf=1,
                                              min_weight_fraction_leaf=0.0,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              bootstrap=True,
                                              oob_score=False,
                                              n_jobs=1,
                                              random_state=None,
                                              verbose=0,
                                              warm_start=False,
                                              class_weight=None)
```

I'm using a random forest model with 9 samples and about 7000 attributes. These samples fall into 3 categories that my classifier should recognize.

I know these are far from ideal conditions, but I'm trying to figure out which attributes are the most important for the model's predictions. Which parameters would be the best to tweak for optimizing feature importance?

I tried different values of n_estimators and noticed that the number of "significant features" (i.e. nonzero values in the feature_importances_ array) increased dramatically.
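The effect can be reproduced on synthetic data of a similar shape (a toy stand-in, not the actual dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in: 9 samples, many features, 3 classes (not the real data)
X, y = make_classification(n_samples=9, n_features=1000, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

counts = {}
for n in [10, 100, 1000]:
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X, y)
    # Each shallow tree touches only a few features, so more trees
    # means more features end up with a nonzero importance value
    counts[n] = int(np.count_nonzero(rf.feature_importances_))
    print(n, counts[n])
```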

I've read through the documentation, but if anyone has experience with this, I would like to know which parameters are best to tune, and a brief explanation of why.

2 Comments

  • Why are you using something like RF for 9 samples? There are just so many things that can go wrong here. For one, you can go down the multiple-hypothesis path to explain your data. Your tree estimators will have super high diversity and horrible accuracy. I could go on. Basically, the biggest problem with RF on small data sets is that they are almost completely non-interpretable black boxes; the splits in feature space and sample space are done randomly. Commented May 1, 2019 at 19:38
  • Agreed. I would do this much differently now with more experience. Commented May 1, 2019 at 20:16

5 Answers


From my experience, there are three parameters worth exploring with the sklearn RandomForestClassifier, in order of importance:

  • n_estimators

  • max_features

  • criterion

n_estimators is not really worth optimizing. The more estimators you give it, the better it will do. 500 or 1000 is usually sufficient.

max_features is worth exploring for many different values. It may have a large impact on the behavior of the RF because it decides how many features each tree in the RF considers at each split.

criterion may have a small impact, but usually the default is fine. If you have the time, try it out.

Make sure to use sklearn's grid search (preferably GridSearchCV, though your data set may be too small for meaningful cross-validation) when trying out these parameters.

If I understand your question correctly, though, you only have 9 samples and 3 classes? Presumably 3 samples per class? It's very, very likely that your RF is going to overfit with that little data, unless the samples are good, representative records.
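With only 9 samples, plain k-fold is hard to set up, so a leave-one-out grid search is about the only viable option. A minimal sketch on synthetic stand-in data (parameter ranges are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, LeaveOneOut

# Synthetic stand-in for a tiny, wide dataset: 9 samples, 3 classes
X, y = make_classification(n_samples=9, n_features=100, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

param_grid = {
    "max_features": ["sqrt", "log2", 0.3],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid,
    cv=LeaveOneOut(),  # 9 folds of 1 held-out sample each
)
search.fit(X, y)
print(search.best_params_)
```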


1 Comment

Thanks a lot! What I was doing before was iteratively instantiating a model, taking the non-zero attributes of the feature_importances_ array, adding them to a counter, and keeping the most popular ones. Is that a naive approach? Should I base it more on variable importance?

The crucial parts are usually three elements:

  • number of estimators - usually the bigger the forest the better; there is little risk of overfitting here
  • max depth of each tree (default None, leading to fully grown trees) - reducing the maximum depth helps fight overfitting
  • max features per split (default sqrt(d)) - you might want to play around a bit, as it significantly alters the behaviour of the whole forest; the sqrt heuristic is usually a good starting point, but the actual sweet spot might be somewhere else
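To see how much max features per split matters, one can sweep a few values and compare out-of-bag scores; a quick sketch on the iris data (values chosen purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)  # d = 4 features here

# Sweep max_features around the sqrt(d) default; None means use all d features
for max_features in [1, 2, "sqrt", None]:
    rf = RandomForestClassifier(n_estimators=200, max_features=max_features,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    print(max_features, round(rf.oob_score_, 3))
```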

2 Comments

Hi, would you please tell me how the number of features affects variance and overfitting?
What is d in sqrt(d) in max features per split? @lejlot - can you please explain?

This wonderful article has a detailed explanation of tunable parameters, how to track performance vs speed trade-off, some practical tips, and how to perform grid-search.



n_estimators is a good one, as others have said. Increasing it also helps with overfitting.

But I think min_samples_split is also helpful for dealing with the overfitting that occurs in a small-sample, high-dimensional dataset.
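A sketch of that effect on synthetic small-sample, high-dimensional data (the sizes are made up for illustration): raising min_samples_split forces shallower trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Few samples, many features, as in the question's setting
X, y = make_classification(n_samples=30, n_features=500, n_informative=5,
                           random_state=0)

for mss in [2, 5, 10]:
    rf = RandomForestClassifier(n_estimators=100, min_samples_split=mss,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    # Larger min_samples_split -> splits stop earlier -> shallower trees
    max_depth = max(t.get_depth() for t in rf.estimators_)
    print(mss, max_depth, round(rf.oob_score_, 3))
```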



The most impactful parameters to tune in RandomForestClassifier for identifying feature importance and improving model generalization are:

  • n_estimators
    The number of decision trees in the forest. More trees can improve accuracy but increase training time; 100 or more is typically good. Higher numbers tend to reduce variance and give more reliable importance rankings, which is especially useful when working with noisy or high-dimensional data.


  • max_features
    One of the most important parameters to tune. It determines how many features are considered at each split. The default 'sqrt' (for classification) is a good starting point, but trying values like 'log2' or even a float (e.g. 0.4) can change which features are favored. This randomness is what makes Random Forests different from Bagged Trees:

    The key difference:

    Bagged Trees consider all features at each split.
    Random Forests consider a random subset of features at each split.


  • max_depth
    Limits how deep each tree can grow. Shallower trees can reduce overfitting. Try setting it to something like 10 or 30 instead of using the default (None).

Other useful parameters to consider:

  • min_samples_split and min_samples_leaf
    Help regularize the tree. Increasing these prevents the model from learning tiny, overfitted patterns.

  • bootstrap=True
    This is the default for Random Forests. Each tree is trained on a bootstrap sample: a dataset of the same size drawn at random with replacement from the original data, so duplicates appear and some samples are left out. If False, the whole dataset is used to build every tree. You can try setting it to False just for comparison.
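What sampling with replacement looks like can be shown in a few lines of plain NumPy (indices stand in for the 9 samples from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(9)  # indices of 9 original samples

# One bootstrap sample: draw 9 indices with replacement
boot = rng.choice(data, size=len(data), replace=True)
print(sorted(boot.tolist()))           # duplicates appear
print(set(data) - set(boot.tolist()))  # "out-of-bag" samples left out of this draw
```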


How to tune

You can use GridSearchCV or RandomizedSearchCV. A minimal tuning grid:

```python
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 'log2', 0.4],
    'max_depth': [None, 10, 30],
}
```
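With data in hand (here a synthetic stand-in, since the thread includes none), the grid above plugs into GridSearchCV like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the user's dataset
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 'log2', 0.4],
    'max_depth': [None, 10, 30],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```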

### Understanding Feature Importance

It is worth noting that Random Forests provide multiple ways to evaluate feature importance:

- **1. Mean Decrease in Impurity (MDI)**  
  Also known as Gini importance, this sums the reduction in impurity brought by each feature across all trees. It is fast to compute, but it can be biased toward high-cardinality features (features with many unique values).

```python
import numpy as np

# Assumes `reg` is a fitted RandomForestClassifier and X is a pandas DataFrame
importances = reg.feature_importances_
feature_names = X.columns
sorted_idx = np.argsort(importances)[::-1]

for i in sorted_idx:
    print(f"{feature_names[i]}: {importances[i]:.3f}")
```

- **2. Permutation Importance**  
  Measures the drop in model performance when a feature's values are randomly shuffled. This method captures interactions and correlations, making it more reliable (though slower).

```python
from sklearn.inspection import permutation_importance

# Assumes `reg` is a fitted model and X_test, y_test are held-out data
perm_importance = permutation_importance(reg, X_test, y_test, n_repeats=10, random_state=0)
sorted_idx = perm_importance.importances_mean.argsort()[::-1]

for i in sorted_idx:
    print(f"{X.columns[i]}: {perm_importance.importances_mean[i]:.3f}")
```

Images and code taken from: https://towardsdatascience.com/understanding-random-forest-using-python-scikit-learn/

