A little introduction to the question:
I am developing an ecophysiological model, and I use a Reference Class list called S that stores every object the model needs for input/output (e.g. meteorological data, physiological parameters, etc.).
This list contains 5 objects (see example below):
- two dataframes, S$Table_Day (the outputs of the model) and S$Met_c (the input meteorological data), which both have variables in columns and observations (input or output) in rows.
- a list of parameters S$Parameters.
- a matrix (S$PerCohortFruitDemand_c)
- a vector (S$Zero_then_One)
The model runs many functions with a daily time step; each day is computed in a for loop that runs from the first day (i = 1) to the last day (i = n). The list is passed to the functions, which often take data from S$Met_c and/or S$Parameters as input and compute something that is stored in S$Table_Day by index (the i-th day). S is a Reference Class list because Reference Classes avoid copy-on-modification, which is very important given the number of computations.
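To give an idea of the structure, here is a stripped-down sketch of that daily loop; compute_budburst() is only a placeholder name, not one of my real functions, and it assumes S has already been initialized as in the reproducible example below:

# Sketch of the daily loop; compute_budburst() is a placeholder, not real model code
compute_budburst <- function(met_day, parameters) met_day # dummy body for illustration
for (i in seq_len(nrow(S$Met_c))) {
  S$Table_Day$Bud_dd[i] <- compute_budburst(S$Met_c$DegreeDays[i], S$Parameters)
}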
The question itself:
As the model is very slow, I am trying to decrease computation time by micro-benchmarking different solutions.
Today I found something surprising when comparing two ways of storing my data: storing data by index in one of the preallocated dataframes takes longer than storing it in an undeclared vector. After reading this, I thought preallocating memory was always faster, but it seems that R performs more operations when modifying by index (probably checking the length, the type, etc.).
My question is: is there a better way to perform such operations? In other words, is there a more efficient way to use/store the inputs/outputs (in a data.frame, a list of vectors, or something else) while keeping track of all the computations of each day? For example, would it be better to use many vectors (one per variable) and regroup them into more complex objects (e.g. a list of dataframes) at the end (a sketch of what I mean is shown after the reproducible example below)?
By the way, am I right to use Reference Classes to avoid copying the big objects in S when passing it to functions and modifying it from within them?
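For illustration, this is the behaviour I am trying to avoid: with a plain list, the same kind of assignment inside a function only modifies a local copy (L and g() are just illustrative names):

# With a plain list, the modification does not propagate to the caller:
L <- list(Table_Day = data.frame(Bud_dd = rep(0, 10)))
g <- function(i) L$Table_Day$Bud_dd[i] <- 99
g(1)
L$Table_Day$Bud_dd[1] # still 0: only a local copy of L was modified inside g()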
Reproducible example for the comparison:
library(microbenchmark)
library(ggplot2) # needed for autoplot()

SimulationClass <- setRefClass("Simulation",
                               fields = list(Table_Day = "data.frame",
                                             Met_c = "data.frame",
                                             PerCohortFruitDemand_c = "matrix",
                                             Parameters = "list",
                                             Zero_then_One = "vector"))
S <- SimulationClass$new()

# Initializing the tables with dummy numbers:
S$Table_Day <- data.frame(one = 1:10000, two = rnorm(n = 10000),
                          three = runif(n = 10000), Bud_dd = rep(0, 10000))
S$Met_c <- data.frame(DegreeDays = rnorm(n = 10000, mean = 10, sd = 1))

# f1: store the result in a local, undeclared vector
f1 <- function(i){
  a <- cumsum(S$Met_c$DegreeDays[i:(i-1000)])
}

# f2: store the result by index in the preallocated data.frame
f2 <- function(i){
  S$Table_Day$Bud_dd[(i-1000):i] <- cumsum(S$Met_c$DegreeDays[i:(i-1000)])
}

res <- microbenchmark(f1(1000), f2(1000), times = 10000)
autoplot(res)
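To make the "many vectors, regroup at the end" alternative mentioned above concrete, here is a sketch of a third variant that could be added to the benchmark (f3 and Bud_dd_vec are illustrative names, not part of my current code):

# f3: write into a plain preallocated vector during the computations,
# and put it back into the data.frame only once at the end
Bud_dd_vec <- numeric(nrow(S$Table_Day))
f3 <- function(i){
  Bud_dd_vec[(i-1000):i] <<- cumsum(S$Met_c$DegreeDays[i:(i-1000)])
}
res2 <- microbenchmark(f1(1000), f2(1000), f3(1000), times = 10000)
# After the daily loop, the vector would be regrouped into the output table:
# S$Table_Day$Bud_dd <- Bud_dd_vec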
Also, if someone has experience in programming such models, I am deeply interested in any advice on model development.


Edit: I know I could use a data.table for S$Table_Day, but I kept a dataframe because I need it for complex tasks. Still, do you think using a data.table instead of a data.frame would be faster for modifying values by indexing? (I am not talking about performing computations, but rather about storing the results in the table.)
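For clarity, this is the kind of by-index modification I am thinking of; it is only a sketch assuming Table_Day were converted to a data.table (Table_Day_dt and f2_dt are illustrative names):

library(data.table)
# set() updates the column by reference instead of copying the whole table
Table_Day_dt <- as.data.table(S$Table_Day)
f2_dt <- function(i){
  rows <- (i - 999):i  # the last 1000 days
  set(Table_Day_dt, i = rows, j = "Bud_dd",
      value = cumsum(S$Met_c$DegreeDays[rows]))
}
microbenchmark(f2(1000), f2_dt(1000), times = 10000)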