
I'm trying to build a data frame that is entirely composed of 1s and 0s. It should be randomly built except for the fact that each column needs to add up to a specified value.

I know how to do this for a single data frame, but it needs to be built into a function that repeats the process iteratively, up to 1000 times.

  • If you can do it for one data frame, write your function to do that and then put it in a for loop or run replicate on it. You'll need to be much more specific, show what you've tried, and show sample input and desired output for this to be a good question. Commented May 18, 2015 at 17:17
  • You could use sample. For example, suppose you want to create a vector of length 10 that sums to 5, i.e. contains five 1s: v1 <- numeric(10); v1[sample(10, 5, replace=FALSE)] <- 1. Using replicate, as @Gregor suggested, this can be looped. But I am not sure whether the specified value differs between columns, so you may need to show an example to clear up the confusion. Commented May 18, 2015 at 17:27
  • I guess neither answer said it explicitly, but using a matrix instead of a data.frame is essential here. By the way, there's a "performance" tag you might consider adding to the question if it is your primary concern. Commented May 18, 2015 at 18:19
  • Not sure why this is almost closed as unclear. josilber and I seem to agree on an interpretation of it... Commented May 19, 2015 at 1:02

2 Answers


An efficient approach would be to shuffle, for each column, a vector containing the appropriate number of 1s and 0s. You could define the following function to generate a matrix with a specified number of rows and a specified number of 1s in each column:

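build.mat <- function(nrow, csums) {
  # For each target sum x, shuffle a vector of (nrow - x) 0s and x 1s;
  # sapply binds the shuffled vectors together as the columns of a matrix
  sapply(csums, function(x) sample(rep(c(0, 1), c(nrow - x, x))))
}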
set.seed(144)
build.mat(5, 0:5)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    0    0    0    0    1    1
# [2,]    0    0    0    1    0    1
# [3,]    0    0    0    0    1    1
# [4,]    0    1    1    1    1    1
# [5,]    0    0    1    1    1    1

To build a list, you might use lapply over the desired column sums for each matrix:

cslist <- list(1:3, c(4, 2))
set.seed(144)
lapply(cslist, build.mat, nrow=5)
# [[1]]
#      [,1] [,2] [,3]
# [1,]    0    1    1
# [2,]    0    0    0
# [3,]    0    0    0
# [4,]    0    1    1
# [5,]    1    0    1
# 
# [[2]]
#      [,1] [,2]
# [1,]    0    0
# [2,]    1    0
# [3,]    1    1
# [4,]    1    0
# [5,]    1    1
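
For the repeated case mentioned in the question, one option is to wrap the call in replicate. This is only a minimal sketch: it assumes the column sums stay the same across iterations, the names n_iter, target, and mats are made up for illustration, and build.mat is repeated from above so the block runs on its own.

```r
# build.mat repeated from the answer above so this block is self-contained
build.mat <- function(nrow, csums) {
  sapply(csums, function(x) sample(rep(c(0, 1), c(nrow - x, x))))
}

n_iter <- 1000           # hypothetical number of iterations
target <- c(2, 0, 5, 3)  # hypothetical column sums for a 5-row matrix

set.seed(144)
# simplify = FALSE keeps the result as a list of matrices
mats <- replicate(n_iter, build.mat(5, target), simplify = FALSE)

# check that every matrix reproduces the requested column sums
all(sapply(mats, function(m) all(colSums(m) == target)))
```

If the column sums differ between iterations, the lapply call over a list of column-sum vectors shown above is probably the better fit.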

3 Comments

Yeah, this is the right way to go, I think, though I wouldn't use nrow (since it's a function name). Also, the OP claims that this must be done many times, so if the number of rows is small and constant across these runs, it might make sense to precompute and store the vectors outside the function for quicker access: vecs <- sapply(setNames(0:n,0:n),function(x)rep(0:1, c(n-x, x))); apply(vecs[,as.character(csums)],2,sample)
@Frank good catch -- I removed the overwrite of nrow. I guess there might be some use cases where your proposal speeds things up, but for long, narrow data frames it would probably make the code much slower, since you would need to allocate a huge matrix vecs, most of which you would probably never use.
Changed my mind; I guess it depends on the situation whether akrun's approach is faster. (Added an answer explaining.)
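
For reference, the precomputing idea from the first comment could be sketched roughly as follows. It assumes the number of rows n is small and fixed across runs; the wrapper name build_from_cache, the drop = FALSE guard, and the unname call are additions for illustration and are not part of the original comment.

```r
n <- 5  # assumed fixed number of rows

# Column "k" of vecs holds (n - k) 0s followed by k 1s, for k = 0..n
vecs <- sapply(setNames(0:n, 0:n), function(x) rep(0:1, c(n - x, x)))

# Hypothetical wrapper: look up the cached columns, then shuffle each one
build_from_cache <- function(csums) {
  # drop = FALSE keeps a matrix even when csums has length 1
  out <- apply(vecs[, as.character(csums), drop = FALSE], 2, sample)
  unname(out)
}

set.seed(144)
m <- build_from_cache(c(0, 3, 5, 3))
colSums(m)
```

As the reply notes, the cache has n + 1 columns of length n, so this is only likely to pay off when n is small.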

If there are many more zeros than ones or vice versa, @akrun's approach may be faster:

build_01_mat <- function(n,n1s){
  nc        <- length(n1s)
  zerofirst <- sum(n1s) < n*nc/2

  tochange  <- if (zerofirst) n1s else n-n1s

  mat       <- matrix(if (zerofirst) 0L else 1L,n,nc)

  mat[cbind(
    unlist(c(sapply((1:nc)[tochange>0],function(col)sample(1:n,tochange[col])))),
    rep(1:nc,tochange)
  )] <- if (zerofirst) 1L else 0L
  mat
}

set.seed(1)
build_01_mat(5,c(1,3,0))
#      [,1] [,2] [,3]
# [1,]    0    0    0
# [2,]    1    1    0
# [3,]    0    1    0
# [4,]    0    1    0
# [5,]    0    0    0
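
A quick sanity check can exercise both branches of the function (starting from all 0s and starting from all 1s). This is a sketch only: the test vectors sparse and dense are made-up inputs, and build_01_mat is repeated from above, with comments added, so the block runs on its own.

```r
# build_01_mat repeated from the answer above so this block is self-contained
build_01_mat <- function(n, n1s) {
  nc        <- length(n1s)
  zerofirst <- sum(n1s) < n * nc / 2          # TRUE if 1s are the minority
  tochange  <- if (zerofirst) n1s else n - n1s # cells to flip per column
  mat       <- matrix(if (zerofirst) 0L else 1L, n, nc)
  # Row indices sampled per column, paired with the matching column indices
  mat[cbind(
    unlist(c(sapply((1:nc)[tochange > 0], function(col) sample(1:n, tochange[col])))),
    rep(1:nc, tochange)
  )] <- if (zerofirst) 1L else 0L
  mat
}

sparse <- c(1, 3, 0)  # hypothetical input, mostly zeros: starts from all 0s
dense  <- c(5, 4, 2)  # hypothetical input, mostly ones: starts from all 1s

set.seed(1)
m_sparse <- build_01_mat(5, sparse)
m_dense  <- build_01_mat(5, dense)

# both should reproduce the requested column sums
all(colSums(m_sparse) == sparse)
all(colSums(m_dense) == dense)
```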

Some benchmarks:

library(rbenchmark)  # provides benchmark()

# similar numbers of zeros and ones
benchmark(
  permute=build.mat(1e7,1e7/2),
  replace=build_01_mat(1e7,1e7/2),replications=10)[1:5]
#      test replications elapsed relative user.self
# 1 permute           10    7.68    1.126      6.59
# 2 replace           10    6.82    1.000      6.27

# many more zeros than ones
benchmark(
  permute=build.mat(1e6,rep(10,20)),
  replace=build_01_mat(1e6,rep(10,20)),replications=10)[1:5]
#      test replications elapsed relative user.self
# 1 permute           10   10.28    3.779      8.51
# 2 replace           10    2.72    1.000      2.23

# many more ones than zeros
benchmark(
  permute=build.mat(1e6,1e6-rep(10,20)),
  replace=build_01_mat(1e6,1e6-rep(10,20)),replications=10)[1:5]
#      test replications elapsed relative user.self
# 1 permute           10   10.94    4.341      9.28
# 2 replace           10    2.52    1.000      2.09

2 Comments

Hmm, I guess you could take this a step further and also handle the case where there are many more ones than zeros: check which value is more common, then either start with all 0s and replace the few 1s, or start with all 1s and replace the few 0s.
@josilber Okay, I've made that change. The benchmarks match up with intuition pretty well. You'd have to have about a 75-25 split before the "replace" method would be 2x as fast. I'm sure both methods could be sped up some, so I'd probably still go for the "permute" approach unless my data was super lopsided.
