I am trying to code the computation of the KDN (k-Disagreeing Neighbors) complexity measure efficiently. For each instance, KDN is the fraction of its k nearest neighbours that have a different class label (for example, if 2 of an instance's 5 neighbours disagree with its label, its KDN value is 2/5 = 0.4), so the computation involves a loop over all the rows of a distance matrix.
I am trying to parallelize this code with the foreach and %dopar% functions, but I am not achieving any reduction in running time. I am aware that some parallel computations are inefficient because of memory management, but I do not know whether that is my case or whether I am doing something else wrong.
Here is a reproducible example using the digits data from the rsvd package.
First, I load the necessary packages, read the digits data and get some basic information from it.
#######################
### NEEDED PACKAGES ###
#######################
library(dplyr)
library(parallelDist)
# for Parallel Processing
library(doParallel)
library(foreach)
# for digits data
library(rsvd)
############
### DATA ###
############
data(digits)
data = as.data.frame(digits)
# Dividing into X variables and Y target
dataX = data %>%
dplyr::select(-label)
dataY = data %>%
mutate(label=factor(label)) %>%
pull(label)
## number of observations
n=nrow(dataX)
Then, I do some preliminary computations before the KDN loop that I want to parallelize.
################################
### PRELIMINARY COMPUTATIONS ###
################################
## number of available data in each class
n_data_classes=table(factor(dataY))
## proportion of each class to be considered as neighbours
k=0.05
## number of data to be considered as neighbours in each class
k_neighbours_classes=ceiling(n_data_classes*k)
## DISTANCE MATRIX COMPUTATION
# this is time consuming but I'm not concerned about this
distance_matrix=as.matrix(parDist(scale(dataX)))
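Since my concern is memory management, it is worth keeping in mind how large this matrix is: with a PSOCK cluster, anything the loop body references gets serialized and copied to every worker. A quick size check with base R:
# size of the full n x n distance matrix in memory
format(object.size(distance_matrix), units = "MB")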
The KDN computation without parallelization is the following one, and it takes about 12 seconds.
#########################################
### COMPUTING KDN: NO PARALLELIZATION ###
#########################################
## KDN instance-level computation
# initialization of a vector to store the KDN instance-level values
kdn_instance=numeric(n)
system.time(
for (ix in 1:n){
## Getting the class of the ix-th data point
class_ix=dataY[ix]
## number of data to be considered as neighbours in this class
k_value=k_neighbours_classes[class_ix]
# we get the set of the k_value nearest neighbours of ix
distances_ix=distance_matrix[ix,]
distances_ix_ordered=order(distances_ix,decreasing=FALSE)
# position 1 of the ordering is ix itself (distance 0), so we skip it
knn_set_ix=distances_ix_ordered[2:(k_value+1)]
# Y value of the k_neighbors_set_ix
Y_value_knn_set_ix=dataY[knn_set_ix]
# Y value of ix data
Y_value_ix=dataY[ix]
# number of data in knn_set_ix with a different Y value than ix
knn_set_ix_different_Y_value=length(Y_value_knn_set_ix[Y_value_knn_set_ix!=Y_value_ix])
kdn_instance[ix]=knn_set_ix_different_Y_value/k_value
}
)
# user system elapsed
# 12.29 0.37 12.67 secs
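For reference, the loop body can also be pulled out into a one-argument function of the row index. This is only a refactor of the same computation (the name kdn_one_instance is mine), but it makes it easy to swap the for loop for sapply() or a parallel equivalent:
# same computation as the loop above, isolated per instance
kdn_one_instance <- function(ix) {
  k_value <- k_neighbours_classes[dataY[ix]]
  # position 1 of the ordering is ix itself (distance 0), so skip it
  knn_set_ix <- order(distance_matrix[ix, ])[2:(k_value + 1)]
  unname(sum(dataY[knn_set_ix] != dataY[ix]) / k_value)
}
# should reproduce the loop result:
# kdn_instance_2 <- sapply(1:n, kdn_one_instance)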
My attempt to parallelize that loop is the following one, using foreach and %dopar%; it takes 35 seconds.
######################################
### COMPUTING KDN: PARALLELIZATION ###
######################################
## Preparing for paralleling
# number of cores to use
n.cores <- parallel::detectCores() - 1
# we define the cluster
my.cluster <- parallel::makeCluster(n.cores,type = "PSOCK")
# register it to be used by %dopar%
doParallel::registerDoParallel(cl = my.cluster)
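# optional sanity check: the registered backend should report n.cores workers
foreach::getDoParWorkers()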
## KDN instance level computation
kdn_instance= NULL
# iterator over the rows of the distance matrix (paired below with icount over the row indices)
itx <- iter(distance_matrix, by = 'row')
system.time(
kdn_instance <- foreach(
distances_ix = itx,
ix = icount(n),
.combine = 'c'
) %dopar% {
## Getting the class of the ix-th data point
class_ix=dataY[ix]
## number of data to be considered as neighbours in this class
k_value=k_neighbours_classes[class_ix]
# we get the set of the k_value nearest neighbours of ix from its row of distances,
# without referencing distance_matrix inside the loop (so it is not shipped to workers)
distances_ix_ordered=order(distances_ix,decreasing=FALSE)
# position 1 of the ordering is ix itself (distance 0), so we skip it
knn_set_ix=distances_ix_ordered[2:(k_value+1)]
# Y value of the k_neighbors_set_ix
Y_value_knn_set_ix=dataY[knn_set_ix]
# Y value of ix data
Y_value_ix=dataY[ix]
# number of data in knn_set_ix with a different Y value than ix
knn_set_ix_different_Y_value=length(Y_value_knn_set_ix[Y_value_knn_set_ix!=Y_value_ix])
knn_set_ix_different_Y_value/k_value
}
)
parallel::stopCluster(cl = my.cluster)
# user system elapsed
# 12.38 4.64 35.14 secs
As can be seen, the parallel computation takes more time than the non-parallelized one.
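From what I have read, the per-task overhead of foreach can dominate when each task is as small as one row. One variant I have sketched, but not benchmarked, hands each worker a single block of rows with parallel::clusterMap, so data transfer and scheduling happen once per worker rather than once per row (the chunking scheme and all names below are my own):
library(parallel)
# split the row indices into one block per core
chunks <- split(1:n, cut(1:n, n.cores, labels = FALSE))
# pre-slice the distance matrix so each worker only receives its own rows
row_blocks <- lapply(chunks, function(ixs) distance_matrix[ixs, , drop = FALSE])
cl <- makeCluster(n.cores, type = "PSOCK")
# the small lookup objects are exported once to every worker
clusterExport(cl, c("dataY", "k_neighbours_classes"))
kdn_list <- clusterMap(cl, function(ixs, block) {
  sapply(seq_along(ixs), function(j) {
    ix <- ixs[j]
    k_value <- k_neighbours_classes[dataY[ix]]
    # position 1 of the ordering is ix itself (distance 0), so skip it
    knn <- order(block[j, ])[2:(k_value + 1)]
    sum(dataY[knn] != dataY[ix]) / k_value
  })
}, chunks, row_blocks)
stopCluster(cl)
kdn_instance_chunked <- unlist(kdn_list, use.names = FALSE)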
My question is: is there something wrong with my parallel processing code? Is there a better way to do this? Maybe it should be done with another package.