Assume I have a preallocated data structure that I want to write into for performance, rather than growing the data structure over time. First I tried this using sapply:

set.seed(1)
count <- 5
pre <- numeric(count)

sapply(1:count, function(i) {
  pre[i] <- rnorm(1)
})
pre
# [1] 0 0 0 0 0


for(i in 1:count) {
  pre[i] <- rnorm(1)
}
pre
# [1] -0.8204684  0.4874291  0.7383247  0.5757814 -0.3053884

I assume this is because the anonymous function in sapply is in a different scope (or is it an environment in R?), and as a result the pre it assigns to isn't the same object. The for loop runs in the same scope/environment, so it works as expected.
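
For example, the same thing happens with any ordinary function that assigns into pre. A minimal illustration (resetting pre to zeros first):

pre <- numeric(count)
f <- function() {
  pre[1] <- 99  # the assignment creates a local copy; the global pre is untouched
  pre
}
f()
# [1] 99  0  0  0  0
pre
# [1] 0 0 0 0 0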

I've generally tried to adopt the R mechanisms for iteration with apply functions vs. for, but I don't see a way around it here. Is there something different I should be doing or a better idiom for this type of operation?

As noted, my example is highly contrived; I have no interest in generating normal deviates. My actual code deals with a 4-column, 1.5-million-row data frame. Previously I relied on growing and merging to get a final data frame; after benchmarking, I decided to avoid the merges and preallocate instead.


4 Answers


sapply isn't meant to be used like that. It already pre-allocates the result.

Regardless, the for loop is not likely the source of slow performance; it's probably because you're repeatedly subsetting a data.frame. For example:

set.seed(21)
N <- 1e4
d <- data.frame(n=1:N, s=sample(letters, N, TRUE))
l <- as.list(d)  # the same columns, held in a plain list
set.seed(21)
# element assignment into a data.frame, once per iteration
system.time(for(i in 1:N) { d$n[i] <- rnorm(1); d$s <- sample(letters,1) })
#   user  system elapsed 
#   6.12    0.00    6.17 
set.seed(21)  # re-seed so both loops draw identical values
# the same assignments, but into the list version
system.time(for(i in 1:N) { l$n[i] <- rnorm(1); l$s <- sample(letters,1) })
#   user  system elapsed 
#   0.14    0.00    0.14 
D <- as.data.frame(l, stringsAsFactors=FALSE)
identical(d,D)
# [1] TRUE

So you should loop over individual vectors and combine them into a data.frame after the loop.
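
A minimal sketch of that pattern (the column names here are made up, not from the question):

set.seed(1)
N <- 10
n <- numeric(N)    # preallocate one plain vector per future column
s <- character(N)
for (i in 1:N) {
  n[i] <- rnorm(1)            # element assignment on an atomic vector is cheap
  s[i] <- sample(letters, 1)
}
res <- data.frame(n=n, s=s, stringsAsFactors=FALSE)  # build the data.frame once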




The apply family isn't intended for side-effect-producing tasks like changing the state of a variable. These functions are meant simply to return values, which you then assign to a variable. This is consistent with the functional paradigm that R partially subscribes to. If you're using these functions as intended, preallocation doesn't come up much, and that's part of their appeal. You could easily do this without preallocating: p <- sapply(1:count, function(i) rnorm(1)). But this example is a little artificial; p <- rnorm(5) is what you would actually use.

If your actual problem is different from this one and you're having efficiency problems, look into vapply. It's just like sapply, but lets you specify the resulting data type, which gives it a speed advantage. If that fails to help, check out the data.table or ff packages.
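
A sketch of vapply() on the question's toy example; the third argument declares that each result must be a single numeric:

set.seed(1)
count <- 5
pre <- vapply(1:count, function(i) rnorm(1), numeric(1))
pre
# [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078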



Yes: you are essentially changing a copy of pre that is local to the anonymous function. The anonymous function returns the result of its last evaluation (a vector of length 1), so sapply() accumulates those length-1 vectors and returns the correct values as a vector, but it never changes the pre in the global workspace.

You can work round this by using the <<- operator:

set.seed(1)
count <- 5
pre <- numeric(count)

sapply(1:count, function(i) {
  pre[i] <<- rnorm(1)
})
pre
# [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

That does change pre, but I would avoid doing this: code that reaches out of a function to modify global state with <<- is harder to reason about and debug.

I don't think there is much to be gained from pre-allocating pre in the sapply() case anyway.


Also, for this example both are terribly inefficient; just get rnorm() to generate count random numbers. But I guess the example was just to illustrate the point?
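
That is, the whole sapply() call above collapses to a single vectorized draw, which reproduces the same values:

set.seed(1)
pre <- rnorm(count)
pre
# [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078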

1 Comment

Is it your experience that preallocating helps for sapply() in your real-world problem (i.e. the big data set you mentioned)? I'd be interested to hear about your experiences there.

I'm not sure what you're asking. The traditional idiom for sapply in this case would be

pre <- sapply(1:count, function(x) rnorm(1))

There you don't have to preallocate at all, though nothing stops you from assigning the result to a preallocated variable.

I'm guessing things would be much clearer if you put up your actual loop you want to change. You say you're having performance issues and you might get an answer here that can really optimize things a lot. There are a few answerers who love such challenges.

It also sounds like you have a long function or loop. The apply-family functions are primarily meant for expressiveness: they make it clearer where you're mixing vectorized functions with operations that can't be vectorized. Several small sapply calls mixed with vectorized functions are often much faster than one big loop in R.
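
For instance (a made-up illustration, not the poster's code): integrate() only accepts one interval per call, so a small sapply over that scalar step mixes cleanly with the vectorized code around it:

upper <- seq(0.5, 2, by=0.5)   # vectorized: build all the upper limits at once
areas <- sapply(upper, function(u) integrate(dnorm, 0, u)$value)
round(areas, 4)
# [1] 0.1915 0.3413 0.4332 0.4772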

