I have a pairwise table of values, and I'm trying to find the fastest way to apply some function to various subsets of this table. I'm experimenting with data.table to see if it will suit my needs.
For example, I start with this vector of data points, which I convert to a pairwise distance matrix.
dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
It looks like this:
> pdist
spA spB spC spD spE spF
spA NA NA NA NA NA NA
spB 6 NA NA NA NA NA
spC 4 2 NA NA NA NA
spD 3 9 7 NA NA NA
spE 1 5 3 4 NA NA
spF 5 1 1 8 4 NA
Converting this table to a data.table
library(data.table)
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
> pdist
rn spA spB spC spD spE spF
1: spA NA NA NA NA NA NA
2: spB 6 NA NA NA NA NA
3: spC 4 2 NA NA NA NA
4: spD 3 9 7 NA NA NA
5: spE 1 5 3 4 NA NA
6: spF 5 1 1 8 4 NA
If I have some subset that I want to extract the values for,
sub <- c('spB', 'spF', 'spD')
I can do the following, which yields the submatrix that I am interested in:
> pdist[.(sub), sub, with=FALSE]
spB spF spD
1: NA NA NA
2: 1 NA 8
3: 9 NA NA
Now, how can I apply a function, for example taking the mean (but potentially a custom function), of all values in this subset? I can do it this way, but I wonder if there are better ways in line with data.table manipulation.
> mean(unlist(pdist[.(sub), sub, with=FALSE]), na.rm=TRUE)
[1] 6
UPDATE
Following up on this, I decided to see how different in performance a matrix vs a data.table approach would be:
dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)
spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)
# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
setkey(pdistDT, sp)
matMethod <- function(pdist, sub) {
return(mean(pdist[sub, sub], na.rm=TRUE))
}
dtMethod <- function(pdistDT, sub) {
return(mean(unlist(pdistDT[.(sub), sub, with=FALSE]), na.rm=TRUE))
}
> system.time(q1 <- lapply(spSub, function(x) matMethod(pdist, x)))
user system elapsed
18.116 0.154 18.317
> system.time(q2 <- lapply(spSub, function(x) dtMethod(pdistDT, x)))
user system elapsed
795.456 13.357 806.820
It appears that going through the data.table step here is leading to a big performance cost.
mean,mean(m[sub,sub], na.rm=TRUE). I don't think you gain anything by putting this in a data.table.RcppArmadilloorRcppEigenfor operating on subsets of matrices.