You'll get slightly different results each time, because randomness is built into the algorithm. To build each tree, the algorithm bootstrap-resamples the rows of the data frame, and at each split it randomly selects mtry candidate columns from the resampled data. If you want models built with the same parameters (e.g., mtry, ntree) to give the same result each time, you need to set a random seed.
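Here's a toy sketch of those two sources of randomness using base R's sample(); it's just an illustration of the idea, not the package's actual internals:

# 1. Each tree is grown on a bootstrap resample of the rows.
boot_rows <- sample(nrow(mtcars), replace = TRUE)
# 2. Each split considers a random draw of mtry candidate columns
#    (mtry = 3 here, which happens to be the regression default of p/3).
candidates <- sample(setdiff(names(mtcars), "mpg"), size = 3)
head(boot_rows)
candidates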
For example, let's run randomForest 10 times and check the mean of the mean squared error (MSE) from each run. Note that the mean MSE is different each time:
library(randomForest)
replicate(10, mean(randomForest(mpg ~ ., data=mtcars)$mse))
[1] 5.998530 6.307782 5.791657 6.125588 5.868717 5.845616 5.427208 6.112762 5.777624 6.150021
If you run the above code, you'll get another 10 values that are different from the values above.
If you want to be able to reproduce the results of a given model run with the same parameters (e.g., mtry and ntree), you can set a random seed. For example:
set.seed(5)
mean(randomForest(mpg ~ ., data=mtcars)$mse)
[1] 6.017737
You'll get the same result if you use the same seed value, but different results otherwise. Using a larger value of ntree will reduce, but not eliminate, the variability between model runs.
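A quick, informal way to check both claims (the exact numbers will differ on your machine):

# Same seed, same forest:
set.seed(5); a <- mean(randomForest(mpg ~ ., data=mtcars)$mse)
set.seed(5); b <- mean(randomForest(mpg ~ ., data=mtcars)$mse)
identical(a, b)  # TRUE
# Run-to-run spread of the mean MSE shrinks with more trees, but isn't zero:
sd(replicate(10, mean(randomForest(mpg ~ ., data=mtcars, ntree=50)$mse)))
sd(replicate(10, mean(randomForest(mpg ~ ., data=mtcars, ntree=5000)$mse)))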
UPDATE: When I run your code with the data sample you provided, I don't always get the same results each time. Even with replace=FALSE, which results in the data frame being sampled without replacement, the columns tried at each split can still differ from run to run:
> randomForest(Churn ~ .,
+ data = ggg,
+ ntree = 30,
+ mtry = 2,
+ importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 30%
Confusion matrix:
    No Yes class.error
No   3   2         0.4
Yes  1   4         0.2
> randomForest(Churn ~ .,
+ data = ggg,
+ ntree = 30,
+ mtry = 2,
+ importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 20%
Confusion matrix:
    No Yes class.error
No   4   1         0.2
Yes  1   4         0.2
> randomForest(Churn ~ .,
+ data = ggg,
+ ntree = 30,
+ mtry = 2,
+ importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 30%
Confusion matrix:
    No Yes class.error
No   3   2         0.4
Yes  1   4         0.2
Here's a similar set of results with the built-in iris data frame:
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 3.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          2        48        0.04
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 6%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          6        44        0.12
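As in the regression example, setting the seed immediately before each call makes the runs reproducible. Here's a quick check (f1, f2, and the seed value 42 are just illustrative choices):

set.seed(42)
f1 <- randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE, replace = FALSE)
set.seed(42)
f2 <- randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE, replace = FALSE)
identical(f1$err.rate, f2$err.rate)    # TRUE: per-tree OOB error rates match
identical(f1$confusion, f2$confusion)  # TRUE: same confusion matrix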
You can also look at the trees generated by each model run; in general, they will be different. For example, say I run the following code, storing the results in objects m1, m2, and m3:
m1 <- randomForest(Churn ~ ., data = ggg, ntree = 30, mtry = 2,
                   importance = TRUE, replace = FALSE)
m2 <- randomForest(Churn ~ ., data = ggg, ntree = 30, mtry = 2,
                   importance = TRUE, replace = FALSE)
m3 <- randomForest(Churn ~ ., data = ggg, ntree = 30, mtry = 2,
                   importance = TRUE, replace = FALSE)
Now let's look at the first four trees for each model object, which I've pasted in below. The output is a list. You can see that the first tree is different for each model run. The second tree is the same for the first two model runs, but different for the third, and so on.
check.trees <- lapply(1:4, function(i) {
  lapply(list(m1 = m1, m2 = m2, m3 = m3),
         function(model) getTree(model, i, labelVar = TRUE))
})
check.trees
[[1]]
[[1]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner    1.000000      1       <NA>
2             4              5    gender    1.000000      1       <NA>
3             0              0      <NA>    0.000000     -1         No
4             0              0      <NA>    0.000000     -1        Yes
5             6              7    tenure    2.634489      1       <NA>
6             0              0      <NA>    0.000000     -1        Yes
7             0              0      <NA>    0.000000     -1         No

[[1]]$m2
  left daughter right daughter split var split point status prediction
1             2              3    gender    1.000000      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             4              5    tenure    1.850182      1       <NA>
4             0              0      <NA>    0.000000     -1        Yes
5             0              0      <NA>    0.000000     -1         No

[[1]]$m3
  left daughter right daughter split var split point status prediction
1             2              3    tenure    2.249904      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             0              0      <NA>    0.000000     -1         No
[[2]]
[[2]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[2]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[2]]$m3
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             4              5    gender           1      1       <NA>
3             0              0      <NA>           0     -1         No
4             0              0      <NA>           0     -1        Yes
5             0              0      <NA>           0     -1         No
[[3]]
[[3]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             4              5    gender           1      1       <NA>
3             0              0      <NA>           0     -1         No
4             0              0      <NA>           0     -1        Yes
5             0              0      <NA>           0     -1        Yes

[[3]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[3]]$m3
  left daughter right daughter split var split point status prediction
1             2              3    tenure    2.129427      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             0              0      <NA>    0.000000     -1         No
[[4]]
[[4]]$m1
  left daughter right daughter split var split point status prediction
1             2              3    tenure    1.535877      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             4              5    tenure    4.015384      1       <NA>
4             0              0      <NA>    0.000000     -1         No
5             6              7    tenure    4.239396      1       <NA>
6             0              0      <NA>    0.000000     -1        Yes
7             0              0      <NA>    0.000000     -1         No

[[4]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[4]]$m3
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No
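Rather than eyeballing the printed lists, you can compare the trees programmatically. Here's a small sketch (same_tree is just an illustrative helper name):

same_tree <- function(a, b, k) {
  identical(getTree(a, k, labelVar = TRUE), getTree(b, k, labelVar = TRUE))
}
same_tree(m1, m2, 1)  # FALSE: per the output above, the first trees differ
same_tree(m1, m2, 2)  # TRUE: the second trees happen to match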