I realize this may be a question only the package authors can fully answer, but I'm getting persistently different results across three different random forest implementations in R.
The three methods in question are the randomForest package, the "rf" method in caret, and the ranger package. Code is included below.
The data in question is one example; I see similar things across other specifications of similar data.
The outcome (LHS) variable is party identification (Democrat, Republican, Independent); the right-hand-side predictors are demographics. To figure out what was going on with some bizarre results from the randomForest package, I implemented the same model with the other two methods. They do NOT reproduce that particular anomaly, which is especially odd because, as far as I can tell, the "rf" method in caret is just a wrapper around the randomForest package.
The three specifications I run in each implementation are (1) the three-category classification, (2) the same with the Independent category removed, and (3) the same as (2) but with a single observation scrambled back to "Independent" so the model keeps three categories, which should produce results similar to (2). As far as I can tell, in no case should any over- or under-sampling explain the results.
I also notice the following trends:
- The randomForest package is the only one that goes completely haywire with only two categories.
- The ranger package consistently identifies more observations (both correctly and incorrectly) as Independents.
- The ranger package is always slightly worse in terms of overall predictive accuracy.
- The caret implementation is similar in overall accuracy to randomForest (slightly higher), but is consistently better at the more common class and worse at the less common one. This is strange because, as far as I can tell, I'm not implementing any over- or under-sampling in either case, and because I believe caret relies on the randomForest package.
Below I've included both code and confusion matrices showing the differences in question. Rerunning the code produces similar trends in the confusion matrices each time, so this isn't a case of any individual run happening to produce odd results.
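For anyone wanting to check the run-to-run variability themselves, here is a minimal sketch of pinning the RNG before each fit (assuming the packages and data are loaded as in the code below; `{RHS Vars}` stands in for my actual predictor list):

```r
# randomForest draws from R's global RNG stream, so call set.seed()
# immediately before the fit; ranger takes an explicit seed argument instead.
set.seed(271828)
rf_three_cat <- randomForest(party_id_3_cat ~ {RHS Vars}, data = three_cat,
                             ntree = num_trees, mtry = var_split)

ranger_3 <- ranger(party_id_3_cat ~ {RHS Vars}, data = three_cat,
                   num.trees = num_trees, mtry = var_split, seed = 271828)
```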
Does anyone have any idea why these packages consistently produce slightly different (and, in the two-category randomForest case, VERY different) results in general, or, better yet, why they differ in this particular way? For example, is there some sort of sample weighting or stratification going on in any of these packages that I should be aware of?
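For reference, here is my understanding of the sampling defaults in the two standalone packages, with a sketch of forcing them to match (parameter names are taken from each package's help pages; I may be misreading the defaults):

```r
# randomForest: bootstrap sampling with replacement by default; per-class
# stratification only happens if you pass sampsize/strata explicitly.
rf_matched <- randomForest(party_id_3_cat ~ {RHS Vars}, data = three_cat,
                           ntree = num_trees, mtry = var_split,
                           replace = TRUE)  # the default

# ranger: also bootstraps with replacement by default; sample.fraction
# defaults to 1 when replace = TRUE, i.e. n draws with replacement.
ranger_matched <- ranger(party_id_3_cat ~ {RHS Vars}, data = three_cat,
                         num.trees = num_trees, mtry = var_split,
                         replace = TRUE, sample.fraction = 1)
```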
Code:
library(randomForest)
library(ranger)
library(caret)

num_trees <- 1001
var_split <- 3

load("three_cat.Rda")

# (1) Three-category classification
rf_three_cat <- randomForest(party_id_3_cat ~ {RHS Vars},
                             data = three_cat,
                             ntree = num_trees,
                             mtry = var_split,
                             type = "classification",
                             importance = TRUE, confusion = TRUE)

# (2) Drop the Independent category entirely
two_cat <- subset(three_cat, party_id_3_cat != "2. Independents")
two_cat$party_id_3_cat <- droplevels(two_cat$party_id_3_cat)
rf_two_cat <- randomForest(party_id_3_cat ~ {RHS Vars},
                           data = two_cat,
                           ntree = num_trees,
                           mtry = var_split,
                           type = "classification",
                           importance = TRUE, confusion = TRUE)

# (3) As in (2), but flip one observation back to "Independent"
#     so the model keeps three categories
scramble_independent <- subset(three_cat, party_id_3_cat != "2. Independents")
scramble_independent[1, 19] <- "2. Independents"
scramble_independent <- data.frame(lapply(scramble_independent, as.factor),
                                   stringsAsFactors = TRUE)
rf_scramble <- randomForest(party_id_3_cat ~ {RHS Vars},
                            data = scramble_independent,
                            ntree = num_trees,
                            mtry = var_split,
                            type = "classification",
                            importance = TRUE, confusion = TRUE)

# Same three specifications in ranger
ranger_2 <- ranger(formula = party_id_3_cat ~ {RHS Vars},
                   data = two_cat,
                   num.trees = num_trees, mtry = var_split)
ranger_3 <- ranger(formula = party_id_3_cat ~ {RHS Vars},
                   data = three_cat,
                   num.trees = num_trees, mtry = var_split)
ranger_scram <- ranger(formula = party_id_3_cat ~ {RHS Vars},
                       data = scramble_independent,
                       num.trees = num_trees, mtry = var_split)
# Same three specifications via caret's "rf" method
# (method = "none" fits a single model at the supplied tuning values)
rfControl <- trainControl(method = "none", number = 1, repeats = 1)
rfGrid <- expand.grid(mtry = var_split)

rf_caret_3 <- train(party_id_3_cat ~ {RHS Vars},
                    data = three_cat,
                    method = "rf", ntree = num_trees,
                    type = "classification",
                    importance = TRUE, confusion = TRUE,
                    trControl = rfControl, tuneGrid = rfGrid)
rf_caret_2 <- train(party_id_3_cat ~ {RHS Vars},
                    data = two_cat,
                    method = "rf", ntree = num_trees,
                    type = "classification",
                    importance = TRUE, confusion = TRUE,
                    trControl = rfControl, tuneGrid = rfGrid)
rf_caret_scramble <- train(party_id_3_cat ~ {RHS Vars},
                           data = scramble_independent,
                           method = "rf", ntree = num_trees,
                           type = "classification",
                           importance = TRUE, confusion = TRUE,
                           trControl = rfControl, tuneGrid = rfGrid)
# Confusion matrices for each implementation, grouped by specification
rf_three_cat$confusion
ranger_3$confusion.matrix
rf_caret_3$finalModel["confusion"]

rf_two_cat$confusion
ranger_2$confusion.matrix
rf_caret_2$finalModel["confusion"]

rf_scramble$confusion
ranger_scram$confusion.matrix
rf_caret_scramble$finalModel["confusion"]
Results (formatting slightly modified for comparability):
> rf_three_cat$confusion
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1121 3 697 0.3844042
2. Independents 263 7 261 0.9868173
3. Republicans (including leaners) 509 9 1096 0.3209418
> ranger_3$confusion.matrix
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1128 46 647 0.3805601
2. Independents 263 23 245 0.9566855
3. Republicans (including leaners) 572 31 1011 0.3736059
> rf_caret_3$finalModel["confusion"]
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1268 0 553 0.3036793
2. Independents 304 0 227 1.0000000
3. Republicans (including leaners) 606 0 1008 0.3754647
> rf_two_cat$confusion
1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1775 46 0.0252608
3. Republicans (including leaners) 1581 33 0.9795539
> ranger_2$confusion.matrix
1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1154 667 0.3662823
3. Republicans (including leaners) 590 1024 0.3655514
> rf_caret_2$finalModel["confusion"]
1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1315 506 0.2778693
3. Republicans (including leaners) 666 948 0.4126394
> rf_scramble$confusion
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1104 0 717 0.3937397
2. Independents 0 0 1 1.0000000
3. Republicans (including leaners) 501 0 1112 0.3106014
> ranger_scram$confusion.matrix
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1159 0 662 0.3635365
2. Independents 0 0 1 1.0000000
3. Republicans (including leaners) 577 0 1036 0.3577185
> rf_caret_scramble$finalModel["confusion"]
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1315 0 506 0.2778693
2. Independents 0 0 1 1.0000000
3. Republicans (including leaners) 666 0 947 0.4128952
(Note: caret's "rf" method uses the randomForest package under the hood: topepo.github.io/caret/train-models-by-tag.html#random-forest)
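One sanity check along those lines (a sketch using the objects from my session above): if caret is really just wrapping randomForest, the fitted object it returns should be a randomForest object, and its stored call should show the arguments train() actually passed down.

```r
class(rf_caret_3$finalModel)  # should include "randomForest" if the wrapper claim holds
rf_caret_3$finalModel$call    # the arguments caret actually passed to randomForest
```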