Assume I have a preallocated data structure that I want to write into for performance, rather than growing the data structure over time. First I tried this using sapply:

set.seed(1)
count <- 5
pre <- numeric(count)

sapply(1:count, function(i) {
  pre[i] <- rnorm(1)
})
pre
# [1] 0 0 0 0 0


for(i in 1:count) {
  pre[i] <- rnorm(1)
}
pre
# [1] -0.8204684  0.4874291  0.7383247  0.5757814 -0.3053884

I assume this is because the anonymous function in sapply is in a different scope (or is it an environment in R?), and as a result the pre it assigns to isn't the same object. The for loop runs in the same scope/environment, so it works as expected.
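
For example, the same thing happens with any ordinary function that assigns into pre. A minimal illustration (resetting pre to zeros first):

pre <- numeric(count)
f <- function() {
  pre[1] <- 99  # the assignment creates a local copy; the global pre is untouched
  pre
}
f()
# [1] 99  0  0  0  0
pre
# [1] 0 0 0 0 0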

I've generally tried to adopt the R mechanisms for iteration with apply functions vs. for, but I don't see a way around it here. Is there something different I should be doing or a better idiom for this type of operation?

As noted, my example is highly contrived; I have no interest in generating normal deviates. My actual code deals with a 4-column, 1.5-million-row data frame. Previously I relied on growing and merging to get a final data frame; after benchmarking, I decided to avoid the merges and preallocate instead.


4 Answers


sapply isn't meant to be used like that. It already pre-allocates the result.

Regardless, the for loop is not likely the source of slow performance; it's probably because you're repeatedly subsetting a data.frame. For example:

set.seed(21)
N <- 1e4
d <- data.frame(n=1:N, s=sample(letters, N, TRUE))
l <- as.list(d)  # the same columns, held in a plain list
set.seed(21)
# element assignment into a data.frame, once per iteration
system.time(for(i in 1:N) { d$n[i] <- rnorm(1); d$s <- sample(letters,1) })
#   user  system elapsed 
#   6.12    0.00    6.17 
set.seed(21)  # re-seed so both loops draw identical values
# the same assignments, but into the list version
system.time(for(i in 1:N) { l$n[i] <- rnorm(1); l$s <- sample(letters,1) })
#   user  system elapsed 
#   0.14    0.00    0.14 
D <- as.data.frame(l, stringsAsFactors=FALSE)
identical(d,D)
# [1] TRUE

So you should loop over individual vectors and combine them into a data.frame after the loop.
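
A minimal sketch of that pattern (the column names here are made up, not from the question):

set.seed(1)
N <- 10
n <- numeric(N)    # preallocate one plain vector per future column
s <- character(N)
for (i in 1:N) {
  n[i] <- rnorm(1)            # element assignment on an atomic vector is cheap
  s[i] <- sample(letters, 1)
}
res <- data.frame(n=n, s=s, stringsAsFactors=FALSE)  # build the data.frame once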




The apply family isn't intended for side-effect-producing tasks like changing the state of a variable. These functions are meant simply to return values, which you then assign to a variable. This is consistent with the functional paradigm that R partially subscribes to. If you're using these functions as intended, preallocation doesn't come up much, and that's part of their appeal. You could easily do this without preallocating: p <- sapply(1:count, function(i) rnorm(1)). But this example is a little artificial; p <- rnorm(5) is what you would actually use.

If your actual problem is different from this one and you're having efficiency problems, look into vapply. It's just like sapply, but lets you specify the resulting data type, which gives it a speed advantage. If that fails to help, check out the data.table or ff packages.
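
A sketch of vapply() on the question's toy example; the third argument declares that each result must be a single numeric:

set.seed(1)
count <- 5
pre <- vapply(1:count, function(i) rnorm(1), numeric(1))
pre
# [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078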



Yes: you are essentially changing a copy of pre that is local to the anonymous function. The anonymous function returns the result of its last evaluation (a vector of length 1), so sapply() accumulates those length-1 vectors and returns the correct values as a vector, but it never changes the pre in the global workspace.

You can work round this by using the <<- operator:

set.seed(1)
count <- 5
pre <- numeric(count)

sapply(1:count, function(i) {
  pre[i] <<- rnorm(1)
})
pre
# [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

That does change pre, but I would avoid doing this: code that reaches out of a function to modify global state with <<- is harder to reason about and debug.

I don't think there is much to be gained from pre-allocating pre in the sapply() case anyway.


Also, for this example both are terribly inefficient; just get rnorm() to generate count random numbers. But I guess the example was just to illustrate the point?
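
That is, the whole sapply() call above collapses to a single vectorized draw, which reproduces the same values:

set.seed(1)
pre <- rnorm(count)
pre
# [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078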

1 Comment

Is it your experience that preallocating helps for sapply() in your real-world problem (i.e. the big data set you mentioned)? I'd be interested to hear about your experiences there.

I'm not sure what you're asking. The traditional idiom for sapply in this case would be

pre <- sapply(1:count, function(x) rnorm(1))

There you don't have to preallocate at all, though nothing stops you from assigning the result to a preallocated variable.

I'm guessing things would be much clearer if you put up your actual loop you want to change. You say you're having performance issues and you might get an answer here that can really optimize things a lot. There are a few answerers who love such challenges.

It also sounds like you have a long function or loop. The apply-family functions are primarily meant for expressiveness: they make it clearer where you're mixing vectorized functions with operations that can't be vectorized. Several small sapply calls mixed with vectorized functions are often much faster than one big loop in R.
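
For instance (a made-up illustration, not the poster's code): integrate() only accepts one interval per call, so a small sapply over that scalar step mixes cleanly with the vectorized code around it:

upper <- seq(0.5, 2, by=0.5)   # vectorized: build all the upper limits at once
areas <- sapply(upper, function(u) integrate(dnorm, 0, u)$value)
round(areas, 4)
# [1] 0.1915 0.3413 0.4332 0.4772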

