
A little introduction to the question:

I am developing an ecophysiological model, and I use a reference class list called S that stores every object the model needs for input/output (e.g. meteo data, physiological parameters, etc.). This list contains 5 objects (see example below):
- two data.frames, S$Table_Day (the outputs of the model) and S$Met_c (the meteo input), which both have variables in columns and observations (input or output) in rows;
- a list of parameters, S$Parameters;
- a matrix;
- a vector.

The model runs many functions with a daily time step. Each day is computed in a for loop that runs from the first day i = 1 to the last day i = n. This list is passed to the functions, which often take data from S$Met_c and/or S$Parameters as input and compute something that is stored in S$Table_Day by index (the ith day). S is a Reference Class list because Reference Classes avoid copy-on-modification, which is very important given the number of computations.
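The daily loop described above can be sketched as follows. This is a minimal, self-contained sketch: compute_bud_dd() and the 0.5 coefficient are hypothetical placeholders, not the real model.

```r
library(methods)  # for setRefClass()

# Minimal stand-in for the real S object (only two fields kept)
SimulationClass <- setRefClass("Simulation",
                               fields = list(Table_Day = "data.frame",
                                             Met_c = "data.frame"))
S <- SimulationClass$new()
n <- 10
S$Met_c <- data.frame(DegreeDays = rnorm(n, mean = 10))
S$Table_Day <- data.frame(Bud_dd = rep(0, n))

# Hypothetical daily function: reads from S$Met_c, writes into S$Table_Day.
# Because S is a Reference Class object, the write is visible to the caller.
compute_bud_dd <- function(S, i) {
  S$Table_Day$Bud_dd[i] <- S$Met_c$DegreeDays[i] * 0.5
}

for (i in seq_len(n)) compute_bud_dd(S, i)
```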

The question itself:

As the model is very slow, I am trying to decrease computation time by micro-benchmarking different solutions.
Today I found something surprising when comparing two ways of storing my data: storing data by index in one of the preallocated data.frames is slower than storing it in an undeclared vector. After reading this, I thought preallocating memory was always faster, but it seems that R performs more operations when modifying by index (probably checking the length, type, etc.).

My question is: is there a better way to perform such operations? In other words, is there a way for me to use/store the inputs/outputs more efficiently (in a data.frame, a list of vectors, or something else) to keep track of all computations of each day? For example, would it be better to use many vectors (one for each variable) and regroup them into more complex objects (e.g. a list of data.frames) at the end?
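As a point of comparison, the "many vectors, regrouped at the end" idea could look like this (a sketch with dummy computations; the variable names are illustrative):

```r
n <- 10000
DegreeDays <- rnorm(n, mean = 10)

# One preallocated plain vector per output variable: indexed writes
# into a local atomic vector are cheap.
Bud_dd    <- numeric(n)
Other_var <- numeric(n)

for (i in seq_len(n)) {
  Bud_dd[i]    <- DegreeDays[i] * 0.5  # dummy daily computations
  Other_var[i] <- DegreeDays[i] + 1
}

# Regroup the vectors into the output table once, at the end
Table_Day <- data.frame(Bud_dd = Bud_dd, Other_var = Other_var)
```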

By the way, am I right to use Reference Classes to avoid copying the big objects in S when passing it to functions and modifying it from within them?
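A quick way to convince yourself of the copy semantics (a minimal sketch, not the model itself): modifying a Reference Class object inside a function changes the caller's object, while a plain list is copied and the caller's version stays unchanged.

```r
library(methods)  # for setRefClass()

RC <- setRefClass("RC", fields = list(x = "numeric"))
obj   <- RC$new(x = 0)
plain <- list(x = 0)

touch_rc   <- function(s) s$x <- 1          # mutates the caller's object
touch_list <- function(s) { s$x <- 1; s }   # mutates a local copy only

touch_rc(obj)
touch_list(plain)
obj$x    # 1: the Reference Class object was changed in place
plain$x  # 0: the list seen by the caller was not
```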

Reproducible example for the comparison:

# Loading the packages used by the benchmark below:
library(microbenchmark)
library(ggplot2)

SimulationClass <- setRefClass("Simulation",
                           fields = list(Table_Day = "data.frame",
                                         Met_c= "data.frame",
                                         PerCohortFruitDemand_c="matrix",
                                         Parameters= "list",
                                         Zero_then_One="vector"))
S= SimulationClass$new()
# Initializing the table with dummy numbers : 
S$Table_Day= data.frame(one= 1:10000, two= rnorm(n = 10000), three= runif(n = 10000),Bud_dd= rep(0,10000))
S$Met_c= data.frame(DegreeDays= rnorm(n=10000, mean = 10, sd = 1))


f1= function(i){
    a= cumsum(S$Met_c$DegreeDays[i:(i-1000)])
}

f2= function(i){
    S$Table_Day$Bud_dd[(i-1000):i]= cumsum(S$Met_c$DegreeDays[i:(i-1000)])
}

res= microbenchmark(f1(1000),f2(1000),times = 10000)
autoplot(res)

And the result: [Autoplot from microbenchmark]

Also if someone has any experience in programming such models, I am deeply interested in any advice for model development.

5 Comments

  • If you are interested in performance with data frames, you should have a look at the data.table library. Commented Feb 9, 2018 at 10:40
  • @JulienNavarre yes, I already use it to perform computations on the S$Table_Day data.frame for complex tasks, but do you think using a data.table instead of a data.frame is faster for modifying values by index? (I am not talking about performing computations, but rather about storing the results in the table.) Commented Feb 9, 2018 at 10:49
  • Oh yeah! Check out the data.table vignettes on this matter; you should make sure to do it the best way, but it is 155 times faster. Commented Feb 9, 2018 at 12:48
  • Is the 10,000-row range you gave in your example representative of your actual data size? The fastest method will likely differ depending on the size. Commented Feb 9, 2018 at 19:03
  • @MattSummersgill yes, absolutely. It is 10,000 to 15,000 rows long. S$Table_Day has 157 columns, and S$Met_c has 48 columns (and the same number of rows). Commented Feb 9, 2018 at 19:18

1 Answer


I read more about the question, and I'll write here for posterity some of the solutions that were proposed in other posts.

Apparently, reading and writing are both worth considering when trying to reduce the time of assignment to a data.frame by index. The sources are all found in other discussions:

Several solutions appeared relevant :

  1. Use a matrix instead of a data.frame if possible, to leverage in-place modification (Advanced R).
  2. Use a list instead of a data.frame, because [<-.data.frame is not a primitive function (Advanced R).
  3. Write functions in C++ and use Rcpp (from this source).
  4. Use .subset2 to read instead of [ (third source).
  5. Use data.table as recommended by @JulienNavarre and @Emmanuel-Lin and the different sources, and use either set for a data.frame or := if using a data.table is not a problem.
  6. Use [[ instead of [ when possible (indexing by a single value only). This one is not very effective and very restrictive, so I removed it from the following comparison.
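Points 2, 4 and 6 above can be illustrated briefly on dummy data (a sketch only; the actual benchmark follows below):

```r
df <- data.frame(DegreeDays = rnorm(100), Bud_dd = numeric(100))

# 2. A list of vectors accepts indexed writes without going through
#    the (non-primitive) [<-.data.frame method:
out <- as.list(df)
out$Bud_dd[1:10] <- cumsum(out$DegreeDays[1:10])

# 4. .subset2() reads a column without the dispatch overhead of `[`:
col <- .subset2(df, "DegreeDays")

# 6. [[ instead of [ for a single value:
x <- df[["DegreeDays"]][1]
```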

Here is the analysis of performance using the different solutions:

The code :

# Loading packages :
library(data.table)
library(microbenchmark)
library(ggplot2)

# Creating dummy data :
SimulationClass <- setRefClass("Simulation",
                               fields = list(Table_Day = "data.frame",
                                             Met_c= "data.frame",
                                             PerCohortFruitDemand_c="matrix",
                                             Parameters= "list",
                                             Zero_then_One="vector"))
S= SimulationClass$new()
S$Table_Day= data.frame(one= 1:10000, two= rnorm(n = 10000), three= runif(n = 10000),Bud_dd= rep(0,10000))
S$Met_c= data.frame(DegreeDays= rnorm(n=10000, mean = 10, sd = 1))

# Transforming data objects into simpler forms :
mat= as.matrix(S$Table_Day)
Slist= as.list(S$Table_Day)
Metlist= as.list(S$Met_c)
MetDT= as.data.table(S$Met_c)
SDT= as.data.table(S$Table_Day)

# Setting up the functions for the tests :
f1= function(i){
    S$Table_Day$Bud_dd[i]= cumsum(S$Met_c$DegreeDays[i])
}
f2= function(i){
    mat[i,4]= cumsum(S$Met_c$DegreeDays[i])
} 
f3= function(i){
    mat[i,4]= cumsum(.subset2(S$Met_c, "DegreeDays")[i])
} 
f4= function(i){
    Slist$Bud_dd[i]= cumsum(.subset2(S$Met_c, "DegreeDays")[i])
}
f5= function(i){
    Slist$Bud_dd[i]= cumsum(Metlist$DegreeDays[i])
}
f6= function(i){
    set(S$Table_Day, i=as.integer(i), j="Bud_dd", cumsum(S$Met_c$DegreeDays[i]))
}
f7= function(i){
    set(S$Table_Day, i=as.integer(i), j="Bud_dd", MetDT[i,cumsum(DegreeDays)])
}
f8= function(i){
    SDT[i,Bud_dd := MetDT[i,cumsum(DegreeDays)]]
}


i= 6000:6500
res= microbenchmark(f1(i),f3(i),f4(i),f5(i),f7(i),f8(i), times = 10000)
autoplot(res)

And the resulting autoplot:

[Autoplot of the benchmark results]

With f1 the reference base assignment, f2 using a matrix instead of a data.frame, f3 using the combination of .subset2 and a matrix, f4 using a list and .subset2, f5 using two lists (for both reading and writing), f6 using data.table::set, f7 using data.table::set plus a data.table for the cumulative sum, and f8 using data.table's :=.

As we can see, the best solution is to use lists for both reading and writing. It is pretty surprising that data.table comes out as the worst solution here. I believe I did something wrong with it, because it is supposed to be the best. If you can improve it, please tell me.


4 Comments

Very thorough! A couple of comments. 1: It looks like f2 and f6 didn't run in the results you posted. 2: Did you mean to use SDT instead of S$Table_Day as the x argument of set() in f6 and f7? 3: The as.integer() coercion isn't actually necessary here.
@Matt Summersgill, thank you very much for your comment ! In response to your comments : 1: I replaced the image (I picked the wrong one), thank you.
Sorry, missed some info on my last comment and I can't edit it. Here is the full version : 1: I replaced the image (I picked the wrong one), thank you. 2: No, I used deliberately S$Table_Day for f6 and f7 as set() is compatible with data.frames. The purpose was to test the power of set() without interfering with the data structure for this table. And then test := (which is the equivalent), but at the expense of transforming S$Table into a data.table. 3: Indeed. I used it to avoid a warning (my i was not an integer at first)
You should check your results are equivalent. See my_check in ?microbenchmark.
