
Hi everyone, I am trying to search for the best parameter with a for loop. However, the result is really confusing me. The following two pieces of code should produce the same result, since the parameter "mtry" is the same in both.

       gender Partner   tenure Churn
3521     Male      No 0.992313   Yes
2525.1   Male      No 4.276666    No
567      Male     Yes 2.708050    No
8381   Female      No 4.202127   Yes
6258   Female      No 0.000000   Yes
6569     Male     Yes 2.079442    No
27410  Female      No 1.550804   Yes
6429   Female      No 1.791759   Yes
412    Female     Yes 3.828641    No
4655   Female     Yes 3.737670    No
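
For reference, the printed sample above can be reconstructed as a data frame so the code below is runnable (factor levels and row names are read off the printout; this reconstruction is an editorial addition, not part of the original question):

library(randomForest)

ggg = data.frame(
  gender  = factor(c("Male", "Male", "Male", "Female", "Female",
                     "Male", "Female", "Female", "Female", "Female")),
  Partner = factor(c("No", "No", "Yes", "No", "No",
                     "Yes", "No", "No", "Yes", "Yes")),
  tenure  = c(0.992313, 4.276666, 2.708050, 4.202127, 0.000000,
              2.079442, 1.550804, 1.791759, 3.828641, 3.737670),
  Churn   = factor(c("Yes", "No", "No", "Yes", "Yes",
                     "No", "Yes", "Yes", "No", "No")),
  row.names = c("3521", "2525.1", "567", "8381", "6258",
                "6569", "27410", "6429", "412", "4655")
)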

RFModel = randomForest(Churn ~ .,
                     data = ggg,
                     ntree = 30,
                     mtry = 2,
                     importance = TRUE,
                     replace = FALSE)
print(RFModel$confusion)

    No Yes class.error
No   4   1         0.2
Yes  1   4         0.2

for(i in c(2)){
   RFModel = randomForest(Churn ~ .,
                     data = Trainingds,
                     ntree = 30,
                     mtry = i,
                     importance = TRUE,
                     replace = FALSE)
   print(RFModel$confusion)
}

     No Yes class.error
No   3   2         0.4
Yes  2   3         0.4

  1. Code 1 and code 2 should produce the same output.
  • Isn't the result of a randomForest random? Commented Mar 4, 2017 at 21:10

1 Answer


You'll get slightly different results each time, because randomness is built into the algorithm. To build each tree, the algorithm resamples the data frame, and at each split it randomly selects mtry candidate columns from the resampled data. If you want models built with the same parameters (e.g., mtry, ntree) to give the same result each time, you need to set a random seed.

For example, let's run randomForest 10 times and check the mean of the mean square error from each run. Note that the mean mse is different each time:

library(randomForest)

replicate(10, mean(randomForest(mpg ~ ., data=mtcars)$mse))
[1] 5.998530 6.307782 5.791657 6.125588 5.868717 5.845616 5.427208 6.112762 5.777624 6.150021

If you run the above code, you'll get another 10 values that are different from the values above.

If you want to be able to reproduce the results of a given model run with the same parameters (e.g., mtry and ntree), you can set a random seed. For example:

set.seed(5)
mean(randomForest(mpg ~ ., data=mtcars)$mse)
[1] 6.017737

You'll get the same result if you use the same seed value, but different results otherwise. Using a larger value of ntree will reduce, but not eliminate, the variability between model runs.
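
Here's one way to see that effect (a sketch; exact numbers will vary run to run, and 20 repetitions is an arbitrary choice). The spread of the mean mse across runs should be noticeably smaller with a large ntree:

# standard deviation of the mean mse across 20 runs, small vs. large ntree
sd(replicate(20, mean(randomForest(mpg ~ ., data = mtcars, ntree = 30)$mse)))
sd(replicate(20, mean(randomForest(mpg ~ ., data = mtcars, ntree = 2000)$mse)))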

UPDATE: When I run your code with the data sample you provided, I don't get the same results each time. Even with replace = FALSE, which results in the data frame being sampled without replacement, the columns tried at each split can differ from run to run:

> randomForest(Churn ~ .,
+              data = ggg,
+              ntree = 30,
+              mtry = 2,
+              importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2,      importance = TRUE, replace = FALSE) 
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 30%
Confusion matrix:
    No Yes class.error
No   3   2         0.4
Yes  1   4         0.2
> randomForest(Churn ~ .,
+              data = ggg,
+              ntree = 30,
+              mtry = 2,
+              importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2,      importance = TRUE, replace = FALSE) 
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 20%
Confusion matrix:
    No Yes class.error
No   4   1         0.2
Yes  1   4         0.2
> randomForest(Churn ~ .,
+              data = ggg,
+              ntree = 30,
+              mtry = 2,
+              importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2,      importance = TRUE, replace = FALSE) 
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 30%
Confusion matrix:
    No Yes class.error
No   3   2         0.4
Yes  1   4         0.2

Here's a similar set of results with the built-in iris data frame:

> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Species ~ ., data = iris, ntree = 30,      mtry = 2, importance = TRUE, replace = FALSE) 
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 3.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          2        48        0.04
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Species ~ ., data = iris, ntree = 30,      mtry = 2, importance = TRUE, replace = FALSE) 
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+              replace = FALSE)

Call:
 randomForest(formula = Species ~ ., data = iris, ntree = 30,      mtry = 2, importance = TRUE, replace = FALSE) 
               Type of random forest: classification
                     Number of trees: 30
No. of variables tried at each split: 2

        OOB estimate of  error rate: 6%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          6        44        0.12

You can also look at the trees generated by each model run; in general, they will be different. For example, say I run the following code three times, storing the results in objects m1, m2, and m3:

m1 = randomForest(Churn ~ .,
                  data = ggg,
                  ntree = 30,
                  mtry = 2,
                  importance = TRUE,
                  replace = FALSE)
# run the same call twice more, assigning the results to m2 and m3

Now let's look at the first four trees for each model object, which I've pasted in below. The output is a list. You can see that the first tree is different for each model run. The second tree is the same for the first two model runs, but different for the third, and so on.

check.trees = lapply(1:4, function(i) {
  lapply(list(m1=m1,m2=m2,m3=m3), function(model) getTree(model, i, labelVar=TRUE))
  })

check.trees
[[1]]
[[1]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner    1.000000      1       <NA>
2             4              5    gender    1.000000      1       <NA>
3             0              0      <NA>    0.000000     -1         No
4             0              0      <NA>    0.000000     -1        Yes
5             6              7    tenure    2.634489      1       <NA>
6             0              0      <NA>    0.000000     -1        Yes
7             0              0      <NA>    0.000000     -1         No

[[1]]$m2
  left daughter right daughter split var split point status prediction
1             2              3    gender    1.000000      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             4              5    tenure    1.850182      1       <NA>
4             0              0      <NA>    0.000000     -1        Yes
5             0              0      <NA>    0.000000     -1         No

[[1]]$m3
  left daughter right daughter split var split point status prediction
1             2              3    tenure    2.249904      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             0              0      <NA>    0.000000     -1         No


[[2]]
[[2]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[2]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[2]]$m3
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             4              5    gender           1      1       <NA>
3             0              0      <NA>           0     -1         No
4             0              0      <NA>           0     -1        Yes
5             0              0      <NA>           0     -1         No


[[3]]
[[3]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             4              5    gender           1      1       <NA>
3             0              0      <NA>           0     -1         No
4             0              0      <NA>           0     -1        Yes
5             0              0      <NA>           0     -1        Yes

[[3]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[3]]$m3
  left daughter right daughter split var split point status prediction
1             2              3    tenure    2.129427      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             0              0      <NA>    0.000000     -1         No


[[4]]
[[4]]$m1
  left daughter right daughter split var split point status prediction
1             2              3    tenure    1.535877      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             4              5    tenure    4.015384      1       <NA>
4             0              0      <NA>    0.000000     -1         No
5             6              7    tenure    4.239396      1       <NA>
6             0              0      <NA>    0.000000     -1        Yes
7             0              0      <NA>    0.000000     -1         No

[[4]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[4]]$m3
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No
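
If you'd rather not eyeball the printed trees, you can also compare them programmatically. A minimal sketch, assuming the m1, m2, and m3 objects from above (TRUE means two runs happened to grow the same tree):

# compare the first four trees of m1 and m2 pairwise
sapply(1:4, function(i) {
  identical(getTree(m1, i, labelVar = TRUE),
            getTree(m2, i, labelVar = TRUE))
})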

3 Comments

But if I run the first code 10 times, I get the same confusion matrix.
Please provide some sample data that works with your code and reproduces the issue you're having. Use dput to provide the data sample.
You are totally right. I really appreciate your response. Once I add set.seed() on the first line in code 1 and inside the for loop in code 2, I get the same result. Thank you so much.
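
For later readers, a minimal sketch of the fix described in that comment (the seed value 123 is an arbitrary choice, and both snippets must use the same data, here ggg, for the confusion matrices to match):

set.seed(123)  # code 1: seed set once, immediately before the fit
RFModel = randomForest(Churn ~ ., data = ggg, ntree = 30,
                       mtry = 2, importance = TRUE, replace = FALSE)
print(RFModel$confusion)

for (i in c(2)) {
  set.seed(123)  # code 2: reset the seed inside the loop, before each fit
  RFModel = randomForest(Churn ~ ., data = ggg, ntree = 30,
                         mtry = i, importance = TRUE, replace = FALSE)
  print(RFModel$confusion)
}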
