0

Let M be a list of character vectors of strings from a set called G, and P and Q are matrices with rows corresponding to each element of G:

M <- list(a=sample(LETTERS, 10), b=sample(LETTERS, 5), 
          c=sample(LETTERS, 15), d=sample(LETTERS, 8))
G <- LETTERS
Ncol <- 5
P <- matrix(rnorm(length(G) * Ncol), ncol=Ncol)
Q <- matrix(rnorm(length(G) * Ncol), ncol=Ncol)
rownames(P) <- rownames(Q) <- G

Let t_p and t_q be arbitrary thresholds:

t_p <- 0.5
t_q <- -0.5

For each element m of M, and each number i = 1…Ncol I would like to know how many of the values in P and Q fulfill one of the following conditions:

  • both P[,i] and Q[,i] are smaller than t_p and t_q, respectively
  • both P[,i] and Q[,i] are larger than t_p and t_q, respectively
  • none of the above

In other words, for the element m <- "a" and i <- 1 I need the following numbers:

i <- 1
m <- "a"
n1 <- sum(P[ M[[m]] %in% G, i ] < t_p & Q[ M[[m]] %in% G, i ] < t_q)
n2 <- sum(P[ M[[m]] %in% G, i ] > t_p & Q[ M[[m]] %in% G, i ] > t_q)

(the third number is trivially derived by subtracting n1 + n2 from length(M[[m]])).

The result should be a list with an element for each column i of P and Q, being a matrix with a row for each element of M and three columns corresponding to the numbers mentioned above.

Here is how I solved this problem:

Pl1 <- P > t_p
Pl2 <- P < t_p
Ql1 <- Q > t_q
Ql2 <- Q < t_q
cond1 <- Pl1 & Ql1
cond2 <- Pl2 & Ql2

## given m, calculate for each column i
calc_for_m <- function(m) {
  sel <- G %in% m
  Nsel <- length(m)
  sel.cond1 <- cond1[sel, ]
  res.cond1 <- colSums(sel.cond1)
  sel.cond2 <- cond2[sel, ]
  res.cond2 <- colSums(sel.cond2)
  cbind(cond1=res.cond1, cond2=res.cond2, 
       cond3=Nsel - (res.cond1 + res.cond2))
}

Yl <- lapply(M, calc_for_m)
Yl <- simplify2array(Yl)
res <- lapply(1:Ncol, function(i) t(Yl[i,,]))

However, given that in real world case G is a set of tens to hundreds of thousands items, M is a list of length of thousands with each element being a vector of thousands, the above solution appears to be somewhat on the slow side. Is there a better (more elegant and faster) way of solving this problem?

1 Answer 1

1

Your approach is already pretty optimized. I've made this as an answer just to give you some ideas.

Another approach is to do everything at once without any loops.

# parameters
arr_ind <- match(unlist(M), G)

cond1[arr_ind,]
cond2[arr_ind,]

This is the only part of my solution that's faster, but only barely.

calc_for_m2 <- function(m) {
  sel <- G %in% m
  sel.cond1 <- cond1[sel, ]
  sel.cond2 <- cond2[sel, ]
}

microbenchmark(
access_lapply = Yl <- lapply(M, calc_for_m2)
, access_arr_ind = {
arr_ind <- match(unlist(M), G)
cond1[arr_ind,]
cond2[arr_ind,]
}
)
Unit: microseconds
           expr  min   lq   mean median    uq   max neval
  access_lapply 27.1 28.1 30.516   28.7 29.40 144.6   100
 access_arr_ind 22.8 23.9 25.516   24.4 24.95  96.5   100

Now, unfortunately, I spend the rest of my time splitting up the data.

arr_ind <- match(unlist(M), G)
grp_ind <- rep(seq_along(M), grp_len)

res.cond1 <- xtabs(cond1[arr_ind,] ~ grp_ind)
res.cond2 <- xtabs(cond2[arr_ind,] ~ grp_ind)

grp_len <- unname(lapply(M, length))
res.cond3 <- sweep(-(res.cond1 + res.cond2), 1, unlist(grp_len), '+')

res2 <- cbind(cond1 = unlist(res.cond1), cond2 = unlist(res.cond2), cond3 = unlist(res.cond3))

All that work for something that is still 20 times slower than the original:

Unit: microseconds
     expr    min      lq     mean  median      uq    max neval
 original   98.3  106.10  192.255  120.80  128.15 7005.4   100
     cole 2113.7 2146.65 2234.289 2165.45 2205.25 5915.4   100

I was also getting into rbind during the xtabs and while promising, it's still closer to 15 times slower.

res_1_2 <- xtabs(rbind(cond1[arr_ind, ], cond2[arr_ind,]) ~ rep(grp_ind,2) + rep(1:2, each = length(grp_ind)) )

Good luck!

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.