Optimizing a function in R

Question

Let M be a list of character vectors of strings from a set called G, and P and Q are matrices with rows corresponding to each element of G:

M <- list(a=sample(LETTERS, 10), b=sample(LETTERS, 5), 
          c=sample(LETTERS, 15), d=sample(LETTERS, 8))
G <- LETTERS
Ncol <- 5
P <- matrix(rnorm(length(G) * Ncol), ncol=Ncol)
Q <- matrix(rnorm(length(G) * Ncol), ncol=Ncol)
rownames(P) <- rownames(Q) <- G

Let t_p and t_q be arbitrary thresholds:

t_p <- 0.5
t_q <- -0.5

For each element m of M, and each number i = 1…Ncol I would like to know how many of the values in P and Q fulfill one of the following conditions:

both P[,i] and Q[,i] are smaller than t_p and t_q, respectively
both P[,i] and Q[,i] are larger than t_p and t_q, respectively
none of the above

In other words, for the element m <- "a" and i <- 1 I need the following numbers:

i <- 1
m <- "a"
n1 <- sum(P[ M[[m]] %in% G, i ] < t_p & Q[ M[[m]] %in% G, i ] < t_q)
n2 <- sum(P[ M[[m]] %in% G, i ] > t_p & Q[ M[[m]] %in% G, i ] > t_q)

(the third number is trivially derived by subtracting n1 + n2 from length(M[[m]])).

The result should be a list with an element for each column i of P and Q, being a matrix with a row for each element of M and three columns corresponding to the numbers mentioned above.

Here is how I solved this problem:

Pl1 <- P > t_p
Pl2 <- P < t_p
Ql1 <- Q > t_q
Ql2 <- Q < t_q
cond1 <- Pl1 & Ql1
cond2 <- Pl2 & Ql2

## given m, calculate for each column i
calc_for_m <- function(m) {
  sel <- G %in% m
  Nsel <- length(m)
  sel.cond1 <- cond1[sel, ]
  res.cond1 <- colSums(sel.cond1)
  sel.cond2 <- cond2[sel, ]
  res.cond2 <- colSums(sel.cond2)
  cbind(cond1=res.cond1, cond2=res.cond2, 
       cond3=Nsel - (res.cond1 + res.cond2))
}

Yl <- lapply(M, calc_for_m)
Yl <- simplify2array(Yl)
res <- lapply(1:Ncol, function(i) t(Yl[i,,]))

However, given that in real world case G is a set of tens to hundreds of thousands items, M is a list of length of thousands with each element being a vector of thousands, the above solution appears to be somewhat on the slow side. Is there a better (more elegant and faster) way of solving this problem?

Cole · Accepted Answer · 2019-08-02 11:49:41Z

Your approach is already pretty optimized. I've made this as an answer just to give you some ideas.

Another approach is to do everything at once without any loops.

# parameters
arr_ind <- match(unlist(M), G)

cond1[arr_ind,]
cond2[arr_ind,]

This is the only part of my solution that's faster, but only barely.

calc_for_m2 <- function(m) {
  sel <- G %in% m
  sel.cond1 <- cond1[sel, ]
  sel.cond2 <- cond2[sel, ]
}

microbenchmark(
access_lapply = Yl <- lapply(M, calc_for_m2)
, access_arr_ind = {
arr_ind <- match(unlist(M), G)
cond1[arr_ind,]
cond2[arr_ind,]
}
)
Unit: microseconds
           expr  min   lq   mean median    uq   max neval
  access_lapply 27.1 28.1 30.516   28.7 29.40 144.6   100
 access_arr_ind 22.8 23.9 25.516   24.4 24.95  96.5   100

Now, unfortunately, I spend the rest of my time splitting up the data.

arr_ind <- match(unlist(M), G)
grp_ind <- rep(seq_along(M), grp_len)

res.cond1 <- xtabs(cond1[arr_ind,] ~ grp_ind)
res.cond2 <- xtabs(cond2[arr_ind,] ~ grp_ind)

grp_len <- unname(lapply(M, length))
res.cond3 <- sweep(-(res.cond1 + res.cond2), 1, unlist(grp_len), '+')

res2 <- cbind(cond1 = unlist(res.cond1), cond2 = unlist(res.cond2), cond3 = unlist(res.cond3))

All that work for something that is still 20 times slower than the original:

Unit: microseconds
     expr    min      lq     mean  median      uq    max neval
 original   98.3  106.10  192.255  120.80  128.15 7005.4   100
     cole 2113.7 2146.65 2234.289 2165.45 2205.25 5915.4   100

I was also getting into rbind during the xtabs and while promising, it's still closer to 15 times slower.

res_1_2 <- xtabs(rbind(cond1[arr_ind, ], cond2[arr_ind,]) ~ rep(grp_ind,2) + rep(1:2, each = length(grp_ind)) )

Good luck!

Collectives™ on Stack Overflow

Optimizing a function in R

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related