I realize this may be a question only the package authors can fully answer, but I'm getting persistently different results across three different random forest implementations in R.
The three methods in question are the randomForest package, the "rf" method in caret, and the ranger package. Code is included below.
The data in question is one example; I see similar things across other specifications of similar data.
The outcome (LHS) variable is party identification (Democrat, Republican, Independent); the right-hand-side predictors are demographics. To figure out what was going on with some bizarre results from the randomForest package, I implemented the same model with the other two methods. They do NOT reproduce that particular anomaly, which is especially odd because, as far as I can tell, the "rf" method in caret is just a wrapper around the randomForest package.
The three specifications I run in each implementation are (1) the three-category classification, (2) the same with the Independent category removed, and (3) the same as (2) but with a single observation scrambled back to "Independent" so the model keeps three categories, which should produce results similar to (2). As far as I can tell, in no case should any over- or under-sampling explain the results.
I also notice the following trends:
- The randomForest package is the only one that goes completely haywire with only two categories.
- The ranger package consistently identifies more observations (both correctly and incorrectly) as Independents.
- The ranger package is always slightly worse in terms of overall predictive accuracy.
- The caret implementation is similar in overall accuracy to randomForest (slightly higher), but is consistently better at the more common class and worse at the less common one. This is strange because, as far as I can tell, I'm not implementing any over- or under-sampling in either case, and because I believe caret relies on the randomForest package.
Below I've included both code and confusion matrices showing the differences in question. Rerunning the code produces similar trends in the confusion matrices each time, so this isn't a case of any individual run happening to produce odd results.
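For anyone wanting to check the run-to-run variability themselves, here is a minimal sketch of pinning the RNG before each fit (assuming the packages and data are loaded as in the code below; `{RHS Vars}` stands in for my actual predictor list):

```r
# randomForest draws from R's global RNG stream, so call set.seed()
# immediately before the fit; ranger takes an explicit seed argument instead.
set.seed(271828)
rf_three_cat <- randomForest(party_id_3_cat ~ {RHS Vars}, data = three_cat,
                             ntree = num_trees, mtry = var_split)

ranger_3 <- ranger(party_id_3_cat ~ {RHS Vars}, data = three_cat,
                   num.trees = num_trees, mtry = var_split, seed = 271828)
```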
Does anyone have any idea why these packages consistently produce slightly different (and, in the two-category randomForest case, VERY different) results in general, or, better yet, why they differ in this particular way? For example, is there some sort of sample weighting or stratification going on in any of these packages that I should be aware of?
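For reference, here is my understanding of the sampling defaults in the two standalone packages, with a sketch of forcing them to match (parameter names are taken from each package's help pages; I may be misreading the defaults):

```r
# randomForest: bootstrap sampling with replacement by default; per-class
# stratification only happens if you pass sampsize/strata explicitly.
rf_matched <- randomForest(party_id_3_cat ~ {RHS Vars}, data = three_cat,
                           ntree = num_trees, mtry = var_split,
                           replace = TRUE)  # the default

# ranger: also bootstraps with replacement by default; sample.fraction
# defaults to 1 when replace = TRUE, i.e. n draws with replacement.
ranger_matched <- ranger(party_id_3_cat ~ {RHS Vars}, data = three_cat,
                         num.trees = num_trees, mtry = var_split,
                         replace = TRUE, sample.fraction = 1)
```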
Code:
library(randomForest)
library(ranger)
library(caret)

num_trees <- 1001
var_split <- 3

load("three_cat.Rda")

# (1) Three-category classification
rf_three_cat <- randomForest(party_id_3_cat ~ {RHS Vars},
                             data = three_cat,
                             ntree = num_trees,
                             mtry = var_split,
                             type = "classification",
                             importance = TRUE, confusion = TRUE)

# (2) Drop the Independent category entirely
two_cat <- subset(three_cat, party_id_3_cat != "2. Independents")
two_cat$party_id_3_cat <- droplevels(two_cat$party_id_3_cat)
rf_two_cat <- randomForest(party_id_3_cat ~ {RHS Vars},
                           data = two_cat,
                           ntree = num_trees,
                           mtry = var_split,
                           type = "classification",
                           importance = TRUE, confusion = TRUE)

# (3) As in (2), but flip one observation back to "Independent"
#     so the model keeps three categories
scramble_independent <- subset(three_cat, party_id_3_cat != "2. Independents")
scramble_independent[1, 19] <- "2. Independents"
scramble_independent <- data.frame(lapply(scramble_independent, as.factor),
                                   stringsAsFactors = TRUE)
rf_scramble <- randomForest(party_id_3_cat ~ {RHS Vars},
                            data = scramble_independent,
                            ntree = num_trees,
                            mtry = var_split,
                            type = "classification",
                            importance = TRUE, confusion = TRUE)

# Same three specifications in ranger
ranger_2 <- ranger(formula = party_id_3_cat ~ {RHS Vars},
                   data = two_cat,
                   num.trees = num_trees, mtry = var_split)
ranger_3 <- ranger(formula = party_id_3_cat ~ {RHS Vars},
                   data = three_cat,
                   num.trees = num_trees, mtry = var_split)
ranger_scram <- ranger(formula = party_id_3_cat ~ {RHS Vars},
                       data = scramble_independent,
                       num.trees = num_trees, mtry = var_split)
# Same three specifications via caret's "rf" method
# (method = "none" fits a single model at the supplied tuning values)
rfControl <- trainControl(method = "none", number = 1, repeats = 1)
rfGrid <- expand.grid(mtry = var_split)

rf_caret_3 <- train(party_id_3_cat ~ {RHS Vars},
                    data = three_cat,
                    method = "rf", ntree = num_trees,
                    type = "classification",
                    importance = TRUE, confusion = TRUE,
                    trControl = rfControl, tuneGrid = rfGrid)
rf_caret_2 <- train(party_id_3_cat ~ {RHS Vars},
                    data = two_cat,
                    method = "rf", ntree = num_trees,
                    type = "classification",
                    importance = TRUE, confusion = TRUE,
                    trControl = rfControl, tuneGrid = rfGrid)
rf_caret_scramble <- train(party_id_3_cat ~ {RHS Vars},
                           data = scramble_independent,
                           method = "rf", ntree = num_trees,
                           type = "classification",
                           importance = TRUE, confusion = TRUE,
                           trControl = rfControl, tuneGrid = rfGrid)
# Confusion matrices for each implementation, grouped by specification
rf_three_cat$confusion
ranger_3$confusion.matrix
rf_caret_3$finalModel["confusion"]

rf_two_cat$confusion
ranger_2$confusion.matrix
rf_caret_2$finalModel["confusion"]

rf_scramble$confusion
ranger_scram$confusion.matrix
rf_caret_scramble$finalModel["confusion"]
Results (formatting slightly modified for comparability):
> rf_three_cat$confusion
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1121 3 697 0.3844042
2. Independents 263 7 261 0.9868173
3. Republicans (including leaners) 509 9 1096 0.3209418
> ranger_3$confusion.matrix
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1128 46 647 0.3805601
2. Independents 263 23 245 0.9566855
3. Republicans (including leaners) 572 31 1011 0.3736059
> rf_caret_3$finalModel["confusion"]
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1268 0 553 0.3036793
2. Independents 304 0 227 1.0000000
3. Republicans (including leaners) 606 0 1008 0.3754647
> rf_two_cat$confusion
1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1775 46 0.0252608
3. Republicans (including leaners) 1581 33 0.9795539
> ranger_2$confusion.matrix
1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1154 667 0.3662823
3. Republicans (including leaners) 590 1024 0.3655514
> rf_caret_2$finalModel["confusion"]
1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1315 506 0.2778693
3. Republicans (including leaners) 666 948 0.4126394
> rf_scramble$confusion
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1104 0 717 0.3937397
2. Independents 0 0 1 1.0000000
3. Republicans (including leaners) 501 0 1112 0.3106014
> ranger_scram$confusion.matrix
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1159 0 662 0.3635365
2. Independents 0 0 1 1.0000000
3. Republicans (including leaners) 577 0 1036 0.3577185
> rf_caret_scramble$finalModel["confusion"]
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1315 0 506 0.2778693
2. Independents 0 0 1 1.0000000
3. Republicans (including leaners) 666 0 947 0.4128952
(Note: caret's "rf" method uses the randomForest package under the hood: topepo.github.io/caret/train-models-by-tag.html#random-forest)
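One sanity check along those lines (a sketch using the objects from my session above): if caret is really just wrapping randomForest, the fitted object it returns should be a randomForest object, and its stored call should show the arguments train() actually passed down.

```r
class(rf_caret_3$finalModel)  # should include "randomForest" if the wrapper claim holds
rf_caret_3$finalModel$call    # the arguments caret actually passed to randomForest
```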