Let M be a list of character vectors whose elements are drawn from a set G, and let P and Q be matrices with one row for each element of G:
M <- list(a=sample(LETTERS, 10), b=sample(LETTERS, 5),
          c=sample(LETTERS, 15), d=sample(LETTERS, 8))
G <- LETTERS
Ncol <- 5
P <- matrix(rnorm(length(G) * Ncol), ncol=Ncol)
Q <- matrix(rnorm(length(G) * Ncol), ncol=Ncol)
rownames(P) <- rownames(Q) <- G
Let t_p and t_q be arbitrary thresholds:
t_p <- 0.5
t_q <- -0.5
For each element m of M and each column i = 1…Ncol, I would like to know how many of the rows of P and Q selected by m fulfill each of the following conditions:
- both P[,i] and Q[,i] are smaller than t_p and t_q, respectively
- both P[,i] and Q[,i] are larger than t_p and t_q, respectively
- none of the above
In other words, for the element m <- "a" and i <- 1 I need the following numbers:
i <- 1
m <- "a"
## index rows by name: M[[m]] %in% G is all TRUE and would be recycled over every row
n1 <- sum(P[ M[[m]], i ] < t_p & Q[ M[[m]], i ] < t_q)
n2 <- sum(P[ M[[m]], i ] > t_p & Q[ M[[m]], i ] > t_q)
(the third number is trivially derived by subtracting n1 + n2 from length(M[[m]])).
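The three counts can be checked by hand on a tiny example. The sketch below is only illustrative: `count_direct` and the toy values are made up for this purpose, and it assumes that every string in the vector is a row name of P and Q.

```r
# Toy data (hypothetical values): four items, one column
toy_G <- c("A", "B", "C", "D")
toy_P <- matrix(c(0, 1, 0, 1), ncol = 1, dimnames = list(toy_G, NULL))
toy_Q <- matrix(c(-1, 0, 0, -1), ncol = 1, dimnames = list(toy_G, NULL))

# Direct count for one character vector m_vec and one column i.
# Assumes all entries of m_vec are row names of P and Q.
count_direct <- function(m_vec, P, Q, i, t_p, t_q) {
  p <- P[m_vec, i]                       # select rows by name
  q <- Q[m_vec, i]
  n1 <- sum(p < t_p & q < t_q)           # both below their thresholds
  n2 <- sum(p > t_p & q > t_q)           # both above their thresholds
  c(n1 = n1, n2 = n2, n3 = length(m_vec) - n1 - n2)
}

toy_counts <- count_direct(c("A", "B", "C"), toy_P, toy_Q,
                           i = 1, t_p = 0.5, t_q = -0.5)
print(toy_counts)
# n1 = 1 (A), n2 = 1 (B), n3 = 1 (C)
```

Values exactly equal to a threshold satisfy neither strict inequality, so they end up in the third count.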
The result should be a list with one element for each column i of P and Q. Each element is a matrix with one row per element of M and three columns holding the counts described above.
Here is how I solved this problem:
## cond1 = both smaller (n1 above), cond2 = both larger (n2 above)
Pl1 <- P < t_p
Pl2 <- P > t_p
Ql1 <- Q < t_q
Ql2 <- Q > t_q
cond1 <- Pl1 & Ql1
cond2 <- Pl2 & Ql2
## given m, calculate for each column i
calc_for_m <- function(m) {
  sel <- G %in% m
  Nsel <- length(m)
  ## drop=FALSE keeps a matrix even if only one row is selected
  sel.cond1 <- cond1[sel, , drop=FALSE]
  res.cond1 <- colSums(sel.cond1)
  sel.cond2 <- cond2[sel, , drop=FALSE]
  res.cond2 <- colSums(sel.cond2)
  cbind(cond1=res.cond1, cond2=res.cond2,
        cond3=Nsel - (res.cond1 + res.cond2))
}
Yl <- lapply(M, calc_for_m)
Yl <- simplify2array(Yl)
res <- lapply(1:Ncol, function(i) t(Yl[i,,]))
However, in the real-world case G is a set of tens to hundreds of thousands of items and M is a list of thousands of elements, each a vector of thousands of strings, so the above solution is rather slow. Is there a better (more elegant and faster) way of solving this problem?
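One direction that might be worth trying is to replace the per-element loop with a matrix product: encode M as a 0/1 membership matrix with one row per element of M and one column per element of G, and multiply it by the 0/1 condition matrices. The sketch below is untested at the real data size. `count_by_matmul` and the toy values are hypothetical, and it assumes the strings in each element of M are unique members of G. At the sizes mentioned above a dense membership matrix would probably be too large, so a sparse representation (for example from the Matrix package) would likely be needed.

```r
# Sketch: counts for all elements of M and all columns via matrix products.
# Assumes each element of M holds unique strings that all occur in G.
count_by_matmul <- function(M, G, P, Q, t_p, t_q) {
  # S[k, g] is 1 if G[g] belongs to M[[k]], else 0
  S <- t(vapply(M, function(m) as.numeric(G %in% m), numeric(length(G))))
  both_smaller <- (P < t_p & Q < t_q) * 1   # numeric 0/1 matrices
  both_larger  <- (P > t_p & Q > t_q) * 1
  n1 <- S %*% both_smaller                  # length(M) x ncol(P)
  n2 <- S %*% both_larger
  sizes <- lengths(M)
  # one matrix per column i, one row per element of M
  lapply(seq_len(ncol(P)), function(i)
    cbind(cond1 = n1[, i], cond2 = n2[, i],
          cond3 = sizes - n1[, i] - n2[, i]))
}

# Toy check with hypothetical values
toy_G <- c("A", "B", "C", "D")
toy_P <- matrix(c(0, 1, 0, 1,  1, 1, 0, 0), ncol = 2,
                dimnames = list(toy_G, NULL))
toy_Q <- matrix(c(-1, 0, 0, -1,  0, -1, -1, -1), ncol = 2,
                dimnames = list(toy_G, NULL))
toy_M <- list(a = c("A", "B", "C"), b = "D")
toy_res <- count_by_matmul(toy_M, toy_G, toy_P, toy_Q, t_p = 0.5, t_q = -0.5)
print(toy_res)
```

Here cond1 counts rows where both values are below their thresholds and cond2 rows where both are above. Whether this is actually faster than the lapply version depends on the data and would need to be benchmarked.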