R random binary dataframe with fixed column sums

Question

I'm trying to build a data frame that is entirely composed of 1s and 0s. It should be randomly built except for the fact that each column needs to add up to a specified value.

I would know how to do this if this was for just one data frame, but it needs to be built into a function, where in said function it will be done as an iterative process, up to 1000x.

If you can do it for one data frame, write your function to do that and then put it in a for loop or run replicate on it. You'll need to be much more specific, show what you've tried, and show sample input and desired output for this to be a good question.
– Gregor Thomas, Commented May 18, 2015 at 17:17
You could use sample. For example, suppose you want to create a vector of length 10 that sums to 5, i.e. one with five 1s: v1 <- numeric(10); v1[sample(10, 5, replace=FALSE)] <- 1. As @Gregor suggested, this can be looped with replicate. But I am not sure whether the specified value differs between columns, so you may need to show an example to clear up the confusion.
– akrun, Commented May 18, 2015 at 17:27
I guess neither answer said it explicitly, but using a matrix instead of a data.frame is essential here. By the way, there's a "performance" tag you might consider adding to the question if it is your primary concern.
– Frank, Commented May 18, 2015 at 18:19
Not sure why this is almost closed as unclear. josliber and I seem to agree on an interpretation of it...
– Frank, Commented May 19, 2015 at 1:02
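The sample suggestion from the comments can be sketched as a small helper for a single column. The name random_binary_vector and its arguments are illustrative and do not come from the thread.

```r
# Hypothetical helper: a 0/1 vector of length n containing exactly k ones.
random_binary_vector <- function(n, k) {
  stopifnot(k >= 0, k <= n)
  v <- numeric(n)
  # sample.int(n, k) draws k distinct positions from 1..n
  v[sample.int(n, k)] <- 1
  v
}

v1 <- random_binary_vector(10, 5)
sum(v1)  # 5, whichever positions were drawn
```

Because the positions are drawn without replacement, the sum is fixed even though the layout is random.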

josliber · Accepted Answer · 2015-05-18 17:46:21Z

3

An efficient approach would be to shuffle a vector with the appropriate number of 1s and 0s for each column. You could define the following function to generate a matrix with a specified number of rows and the number of 1s in each column:

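# nrow: number of rows; csums: desired number of 1s in each column
build.mat <- function(nrow, csums) {
  # For each column, lay out (nrow - x) zeros and x ones, then shuffle them
  sapply(csums, function(x) sample(rep(c(0, 1), c(nrow-x, x))))
}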
set.seed(144)
build.mat(5, 0:5)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    0    0    0    0    1    1
# [2,]    0    0    0    1    0    1
# [3,]    0    0    0    0    1    1
# [4,]    0    1    1    1    1    1
# [5,]    0    0    1    1    1    1

To build a list, you might use lapply over the desired column sums for each matrix:

cslist <- list(1:3, c(4, 2))
set.seed(144)
lapply(cslist, build.mat, nrow=5)
# [[1]]
#      [,1] [,2] [,3]
# [1,]    0    1    1
# [2,]    0    0    0
# [3,]    0    0    0
# [4,]    0    1    1
# [5,]    1    0    1
# 
# [[2]]
#      [,1] [,2]
# [1,]    0    0
# [2,]    1    0
# [3,]    1    1
# [4,]    1    0
# [5,]    1    1

edited May 18, 2015 at 17:46

answered May 18, 2015 at 17:30

josliber


3 Comments

Frank Over a year ago

Yeah, this is the right way to go, I think, though I wouldn't use nrow (since it's a function name). Also, the OP claims that this must be done many times, so if the number of rows is small and constant across these runs, it might make sense to precompute and store the vectors outside the function for quicker access: vecs <- sapply(setNames(0:n,0:n),function(x)rep(0:1, c(n-x, x))); apply(vecs[,as.character(csums)],2,sample)

josliber Over a year ago

@Frank good catch -- I removed the overwrite of nrow. I guess there might be some use cases where your proposal speeds things up, but for long, narrow data frames it would probably make the code much slower, since you would need to allocate a huge matrix vecs, most of which you would probably never use.

Frank Over a year ago

Changed my mind; I guess it depends on the situation whether akrun's approach is faster. (Added an answer explaining.)
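The precomputation idea from the comments above could look roughly like this, assuming the number of rows n is small and fixed across runs. The names templates and build_from_templates are hypothetical, and templates are looked up by position rather than by name.

```r
n <- 5  # assumed fixed number of rows

# Column k + 1 of `templates` holds n - k zeros followed by k ones (k = 0..n).
templates <- sapply(0:n, function(k) rep(0:1, c(n - k, k)))

# Hypothetical helper: pick the template for each requested sum, then
# shuffle every selected column independently.
build_from_templates <- function(csums) {
  # drop = FALSE keeps a matrix even when only one column is requested
  picked <- templates[, csums + 1, drop = FALSE]
  apply(picked, 2, sample)
}

colSums(build_from_templates(c(1, 3, 0)))  # 1 3 0
```

As noted in the reply, the template matrix has n + 1 columns of n entries each, so this only pays off when n is modest.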

Frank · Accepted Answer · 2015-05-18 18:45:53Z

2

If there are many more zeros than ones or vice versa, @akrun's approach may be faster:

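build_01_mat <- function(n,n1s){
  nc        <- length(n1s)
  # Start from all 0s when 1s are the minority, otherwise from all 1s
  zerofirst <- sum(n1s) < n*nc/2

  # Number of cells per column that must be flipped away from the fill value
  tochange  <- if (zerofirst) n1s else n-n1s

  mat       <- matrix(if (zerofirst) 0L else 1L,n,nc)

  # Index with a two-column (row, column) matrix: random rows for each column
  # that needs changes, paired with that column's index repeated tochange times
  mat[cbind(
    unlist(c(sapply((1:nc)[tochange>0],function(col)sample(1:n,tochange[col])))),
    rep(1:nc,tochange)
  )] <- if (zerofirst) 1L else 0L
  mat
}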

set.seed(1)
build_01_mat(5,c(1,3,0))
#      [,1] [,2] [,3]
# [1,]    0    0    0
# [2,]    1    1    0
# [3,]    0    1    0
# [4,]    0    1    0
# [5,]    0    0    0
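The function above relies on indexing a matrix with a two-column matrix of (row, column) pairs. A small deterministic sketch of just that step, with made-up positions, may make it easier to follow:

```r
mat <- matrix(0L, nrow = 4, ncol = 3)

# Each row of `idx` is one (row, column) pair to overwrite.
idx <- cbind(row = c(1, 3, 2), col = c(1, 1, 3))
mat[idx] <- 1L

colSums(mat)  # 2 0 1
```

Only the listed cells are touched, which is why this style can be cheaper than shuffling whole columns when few cells need to change.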

Some benchmarks:

library(rbenchmark)

# similar numbers of zeros and ones
benchmark(
  permute=build.mat(1e7,1e7/2),
  replace=build_01_mat(1e7,1e7/2),replications=10)[1:5]
#      test replications elapsed relative user.self
# 1 permute           10    7.68    1.126      6.59
# 2 replace           10    6.82    1.000      6.27

# many more zeros than ones
benchmark(
  permute=build.mat(1e6,rep(10,20)),
  replace=build_01_mat(1e6,rep(10,20)),replications=10)[1:5]
#      test replications elapsed relative user.self
# 1 permute           10   10.28    3.779      8.51
# 2 replace           10    2.72    1.000      2.23

# many more ones than zeros
benchmark(
  permute=build.mat(1e6,1e6-rep(10,20)),
  replace=build_01_mat(1e6,1e6-rep(10,20)),replications=10)[1:5]
#      test replications elapsed relative user.self
# 1 permute           10   10.94    4.341      9.28
# 2 replace           10    2.52    1.000      2.09
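If rbenchmark is not installed, a rough comparison can be made with base R's system.time. This is only a sketch: permute_mat and replace_mat are simplified stand-ins for the two functions in this thread (replace_mat covers only the start-from-zeros case), the sizes are assumptions, and timings will vary by machine.

```r
# Both generators restated compactly so this block runs on its own.
permute_mat <- function(n, n1s) {
  sapply(n1s, function(k) sample(rep(c(0L, 1L), c(n - k, k))))
}
replace_mat <- function(n, n1s) {
  mat  <- matrix(0L, n, length(n1s))
  # k random row positions for each column, paired with that column's index
  rows <- unlist(lapply(n1s, function(k) sample.int(n, k)))
  cols <- rep(seq_along(n1s), n1s)
  mat[cbind(rows, cols)] <- 1L
  mat
}

n <- 1e5
sparse_sums <- rep(10, 20)  # assumed: 10 ones in each of 20 columns

t_permute <- system.time(for (i in 1:10) permute_mat(n, sparse_sums))
t_replace <- system.time(for (i in 1:10) replace_mat(n, sparse_sums))
c(permute = t_permute[["elapsed"]], replace = t_replace[["elapsed"]])
```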

edited May 18, 2015 at 18:45

answered May 18, 2015 at 18:08

Frank


2 Comments

josliber Over a year ago

Hmm, I guess you could take this a step further and also catch the case where there are many more ones than zeros with an if statement that checks if there are more ones or zeros and then decides if it should start with all 0s and replace the few 1s or start with all 1s and replace the few 0s.

Frank Over a year ago

@josliber Okay, I've made that change. The benchmarks match up with intuition pretty well. You'd have to have about a 75-25 split before the "replace" method would be 2x as fast. I'm sure both methods could be sped up some, so I'd probably still go for the "permute" approach unless my data was super lopsided.
