Running multiple models using a for-loop in R

Question

I'm trying to run a loop that generates 5 random samples and then fits 5 different randomForest models, one per sample.

I'm having trouble with the second part (running the models): I can't access the dependent variable (nam$eR in the following code):

numS <- 5 # number of samples
dataS <- ERC3
rfModels <- list()

for(j in 1:numS) {

print(j)
set.seed(j+1)
nam <- paste("RFs", j, sep = "")
assign(nam, dataS[sample(nrow(dataS),100000),]) # Random sample of 100,000 rows.

namM <- paste("RFfit", j, sep = "")
assign(namM, randomForest(as.factor(nam$eR)~., data=nam[,-231], importance = TRUE))

rfModels[[j]] <- namM

}

Thank you in advance!

Nick Criswell · Accepted Answer · 2017-01-23 02:00:11Z

2

I'm not sure this will work exactly for your case since I don't have sample data, but here is what I think you are looking for, illustrated with the mtcars data set. First, it is probably best to keep the data frames you want to model in a list. This can be done as follows:

library(dplyr)
library(randomForest)

dfs <- list() #home for the list of dataframes on which to run a randomforest

set.seed(1)
for(i in 1:5){
  dfs[[i]] <- sample_n(mtcars, size = 10, replace = FALSE)
}

(Per the comments, a slicker way to do this would be to go with

  dfs_slicker_approach <- lapply(seq(5), 
                                 function(i) sample_n(mtcars, size = 10, replace = FALSE))

)

The dfs list now contains five data frames, each holding 10 randomly selected rows from the mtcars data set. (Obviously, you'll want to update this to fit your needs.)

Then we run the randomForest function on this list using the lapply function as follows:

rfs <- lapply(dfs, function(m) randomForest(mpg ~ ., 
                                            data = m, importance = TRUE ))

Again, change the formula to select the column you want to predict. The rfs list now contains all of our randomForest objects, and you can again work through them with lapply. For instance, to get the predicted values (subsetting to the first set of predictions to avoid printing a lot of output):

> lapply(rfs, function(m) data.frame(value = predict(m)))[1]
[[1]]
                       value
Merc 230            22.85464
Merc 450SE          17.61810
Fiat 128            22.31571
Porsche 914-2       23.95909
Valiant             21.28786
Pontiac Firebird    15.93824
Ford Pantera L      21.20373
Chrysler Imperial   14.40740
Lincoln Continental 16.43074
Mazda RX4 Wag       21.18467
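
As for why the loop in the question fails: nam holds a character string such as "RFs1", not the data frame that assign() created under that name, so nam$eR cannot reach any column. Likewise, rfModels[[j]] <- namM stores only the model's name. A minimal base-R sketch of the difference follows; mtcars and lm() are stand-ins chosen for illustration (assumptions, not the asker's ERC3 data or randomForest model):

```r
# Stand-in data: mtcars instead of the asker's ERC3 (assumption for illustration).
set.seed(1)
nam <- "RFs1"
assign(nam, mtcars[sample(nrow(mtcars), 10), ])

# nam is still just a string, so `$` cannot reach the data frame's columns.
bad <- tryCatch(nam$mpg, error = function(e) conditionMessage(e))

# get() looks the object up by its name and returns the data frame itself.
via_get <- get(nam)$mpg

# Storing the objects in lists avoids assign()/get() entirely.
samples <- list()
fits <- list()
for (j in 1:5) {
  set.seed(j + 1)
  samples[[j]] <- mtcars[sample(nrow(mtcars), 10), ]
  # lm() is a lightweight stand-in for randomForest() here.
  fits[[j]] <- lm(mpg ~ wt, data = samples[[j]])
}
length(fits)
```

With this layout, fits[[j]] is the fitted model itself rather than its name, so it can be passed straight to predict() or summary().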

edited Jan 23, 2017 at 2:00

answered Jan 22, 2017 at 14:56

Nick Criswell


9 Comments

staove7 Over a year ago

I probably missed something: how do you iterate over the 5 data sets? Also, could you please explain the function(m) part? Thanks!

Nick Criswell Over a year ago

lapply lets us apply a function on each element in a list. We made a list of data frames which we called dfs and then used lapply on that list to hit each data frame with the randomForest(mpg ~ . , data = m, importance = TRUE) function. In this case, m is just a place holder/bookkeeping tool so we know that we are passing each full data frame element from our dfs list to the data= argument of randomForest.
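
To make the role of m concrete, here is a small base-R sketch (not part of the original answer) that runs the same per-element computation once with lapply and once with an explicit loop. nrow() stands in for the randomForest call purely to keep the example light:

```r
# Three small data frames standing in for the dfs list.
set.seed(1)
dfs <- lapply(1:3, function(i) mtcars[sample(nrow(mtcars), 10), ])

# lapply() calls the anonymous function once per list element;
# inside the function, m is whichever data frame is being processed.
row_counts <- lapply(dfs, function(m) nrow(m))

# The same computation written as an explicit loop, to show the equivalence.
row_counts_loop <- vector("list", length(dfs))
for (k in seq_along(dfs)) {
  m <- dfs[[k]]
  row_counts_loop[[k]] <- nrow(m)
}

identical(row_counts, row_counts_loop)
```

Replacing nrow(m) with randomForest(mpg ~ ., data = m, importance = TRUE) gives the call used in the answer.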

staove7 Over a year ago

Great answer, thanks. I wrote the following line:

lapply(dataL, function(m) randomForest(as.factor(eR) ~ ., data = m[,-231], importance = TRUE))

but it gives me: Error in is.factor(x) : object 'eR' not found. Can you help me with that?
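
One likely cause, assuming column 231 is eR itself (the question's nam[,-231] hints at this, but the thread does not confirm it): once that column is dropped from data, the formula can no longer find eR. The sketch below reproduces the same kind of failure in base R with a hypothetical three-column data frame, using model.frame() in place of randomForest(), and shows one way around it:

```r
# Hypothetical data: the response eR sits in the last column (column 3 here).
d <- data.frame(x1 = c(1.2, 0.7, 3.1, 2.4),
                x2 = c(5, 3, 8, 6),
                eR = c(0, 1, 1, 0))

build_frame <- function(m) {
  # Dropping the response column removes eR from the data the formula sees,
  # so evaluating as.factor(eR) fails; the error message is captured as text.
  tryCatch(model.frame(as.factor(eR) ~ ., data = m[, -3]),
           error = function(e) conditionMessage(e))
}
msg <- build_frame(d)

# One fix: convert the response inside the data and keep the column.
d$eR <- as.factor(d$eR)
mf <- model.frame(eR ~ ., data = d)
names(mf)
```

If that assumption holds, the analogous randomForest call would be randomForest(eR ~ ., data = m, importance = TRUE) after converting m$eR to a factor; the `.` on the right-hand side already excludes the response.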

Parfait Over a year ago

There is no need to initialize rfs if you assign it the result of lapply, and dfs could likewise be built with an lapply call.

Parfait Over a year ago

And consider: dfs <- lapply(seq(5), function(i) sample_n(mtcars, size = 10, replace = FALSE))


Jake Kaupp · Accepted Answer · 2017-01-22 18:33:06Z

2

While not deviating from Nick's solution, here is an approach using the tidyverse workflow. Highlights are: readable code via pipes, using dplyr verbs and purrr functionals and keeping data, models and predictions in a nice tidy tibble.

library(randomForest)
library(tidyverse)

set.seed(42)

analysis <- rerun(5, sample_n(mtcars, size = 10, replace = FALSE)) %>% 
  tibble(data = .) %>% 
  rownames_to_column("model_number") %>% 
  mutate(models = map(data, ~randomForest(mpg ~ ., data = .x, importance = TRUE))) %>% 
  mutate(predict = map(models, ~predict(.x)))

You can then get what you want when you need it....

comparison <- analysis %>% 
  mutate(actual = map(data, "mpg")) %>% 
  unnest(cols = c(predict, actual))  # tidyr >= 1.0 syntax for unnesting several list-columns

comparison

# A tibble: 50 × 3
   model_number  predict actual
          <chr>    <dbl>  <dbl>
1             1 14.10348   14.7
2             1 16.78987   15.0
3             1 15.14636   17.3
4             1 15.81265   15.5
5             1 24.11492   21.5
6             1 24.24701   22.8
7             1 15.84953   10.4
8             1 21.72781   32.4
9             1 21.78105   21.0
10            1 15.58614   16.4
# ... with 40 more rows

... and see the results easily.

ggplot(comparison, aes(actual, predict)) +
  geom_point() +
  facet_wrap(~model_number, nrow = 1)

answered Jan 22, 2017 at 18:33

Jake Kaupp


2 Comments

staove7 Over a year ago

Very (very!) nice approach, I'll try it. Just one thing: do you think this will also work well for large data sets (say 100,000 observations or more)? I'm asking about the visualization part.

Jake Kaupp Over a year ago

It would get a little messy, but you could look at other methods to reduce the data to a reasonable amount to visually compare.
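
One simple way to reduce the data, sketched here in base R, is to plot a fixed-size random subset of rows per model. The table comparison_big and the sizes used are hypothetical, simulated only to mimic the shape of comparison above:

```r
# Hypothetical stand-in for a large comparison table: 3 models x 100,000 rows.
set.seed(42)
n <- 100000
comparison_big <- data.frame(
  model_number = rep(as.character(1:3), each = n),
  actual = rnorm(3 * n, mean = 20, sd = 5)
)
comparison_big$predict <- comparison_big$actual + rnorm(3 * n, sd = 2)

# Keep a fixed-size random subset of row indices from each model for plotting.
per_model <- 1000
keep <- unlist(lapply(split(seq_len(nrow(comparison_big)),
                            comparison_big$model_number),
                      function(idx) sample(idx, per_model)))
comparison_small <- comparison_big[keep, ]

table(comparison_small$model_number)
```

comparison_small can then be passed to the ggplot call in place of comparison. Binned summaries are another option, but random subsampling keeps the plotting code unchanged.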
