I have two similar data sets:
library(tidyverse)  # provides tribble(), case_when() and %>%
d1 <- tribble(
~individual, ~X1, ~X2, ~X3,
"p1", "XX", "XY", "YY",
"p2", "XY", "XY", "YY",
"p3", "YY", "XX", "XX"
)
d2 <- tribble(
~individual, ~X1, ~X2, ~X3,
"p1", "XX", "XY", "YY",
"p2", "XY", "XY", "YY",
"p3", "YY", "XX", "XX",
"p4", "YY", "XX", "XX",
"p5", "YY", "XX", "XX"
)
I wrote a function to compare d1 to d2. It takes each individual in d1 and compares it to every individual in d2, column by corresponding column. Each column comparison gets a score, and the mean of those scores is returned for every pair of individuals (one row per d1/d2 pair).
scoreData <- function(d1, d2) {
  require(tidyverse)
  output <- data.frame()
  colNames <- names(d1)[-1]  # genotype columns; the first column holds the individual's name
  for (i in seq_len(nrow(d1))) {
    name1 <- d1$individual[i]
    for (j in seq_len(nrow(d2))) {
      name2 <- d2$individual[j]
      scores <- NULL
      for (k in seq_along(colNames)) {
        col <- colNames[k]
        g1 <- d1[[col]][i]  # genotype of individual i in d1
        g2 <- d2[[col]][j]  # genotype of individual j in d2
        score <- case_when(
          g1 == "XX" && g2 == "XX" ~ 1.0,
          g1 == "XX" && g2 == "XY" ~ 0.5,
          g1 == "XX" && g2 == "YY" ~ 0.0,
          g1 == "YY" && g2 == "XX" ~ 0.0,
          g1 == "YY" && g2 == "XY" ~ 0.5,
          g1 == "YY" && g2 == "YY" ~ 1.0,
          g1 == "XY" && g2 == "XX" ~ 0.5,
          g1 == "XY" && g2 == "XY" ~ 0.5,
          g1 == "XY" && g2 == "YY" ~ 0.5
        )
        scores <- append(scores, score)
      }
      meanScore <- mean(scores, na.rm = TRUE)
      # data.frame() keeps meanScore numeric; cbind() would coerce it to character
      output <- rbind(output, data.frame(name1 = name1, name2 = name2, meanScore = meanScore))
    }
  }
  return(output)
}
The problem is that my real datasets are very large, and I need to make my code more efficient. I understand that the apply() family of functions is often more efficient than for loops in R, but I am not sure how to use it to replicate this nested for loop. Eventually, I would like to parallelize the apply calls to speed up the scoring function further. Any ideas or help would be greatly appreciated.
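One way the nested loop might be restructured is sketched below, in base R only. It assumes that every genotype is one of "XX", "XY", or "YY" with no missing values, and the names scoreTable, scoreOne, scoreDataApply, and the cores argument are hypothetical helpers introduced for illustration. The inner two loops are replaced by a lookup in a 3 x 3 score table, and the outer loop becomes an mclapply() call from the parallel package, which behaves like lapply() when mc.cores = 1.

```r
library(parallel)  # ships with R; mclapply() forks, so use cores = 1 on Windows

# Score lookup table: rows index the genotype in d1, columns the genotype in d2.
genotypes <- c("XX", "XY", "YY")
scoreTable <- matrix(
  c(1.0, 0.5, 0.0,
    0.5, 0.5, 0.5,
    0.0, 0.5, 1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(genotypes, genotypes)
)

# Hypothetical helper: score row i of d1 against every row of d2 at once.
scoreOne <- function(i, d1, d2, cols) {
  perCol <- lapply(cols, function(col) {
    unname(scoreTable[d1[[col]][i], d2[[col]]])
  })
  data.frame(
    name1 = d1$individual[i],
    name2 = d2$individual,
    meanScore = rowMeans(do.call(cbind, perCol))
  )
}

# Hypothetical wrapper: one task per row of d1, optionally spread over cores.
scoreDataApply <- function(d1, d2, cores = 1L) {
  cols <- names(d1)[-1]
  pieces <- mclapply(seq_len(nrow(d1)), scoreOne,
                     d1 = d1, d2 = d2, cols = cols, mc.cores = cores)
  do.call(rbind, pieces)
}

# Example data mirroring the question, built with base data.frame().
d1 <- data.frame(individual = c("p1", "p2", "p3"),
                 X1 = c("XX", "XY", "YY"),
                 X2 = c("XY", "XY", "XX"),
                 X3 = c("YY", "YY", "XX"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(individual = c("p1", "p2", "p3", "p4", "p5"),
                 X1 = c("XX", "XY", "YY", "YY", "YY"),
                 X2 = c("XY", "XY", "XX", "XX", "XX"),
                 X3 = c("YY", "YY", "XX", "XX", "XX"),
                 stringsAsFactors = FALSE)

res <- scoreDataApply(d1, d2, cores = 1L)
```

Because each row of d1 is scored independently, raising cores on a Unix-like system should spread the work without changing the result. Whether that pays off depends on how large d2 is relative to the overhead of forking.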
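case_when() is vectorized when the conditions use the elementwise & instead of &&: case_when(d1[[col]] == "XX" & d2[[col]] == "XX" ~ 1.0, ...) scores whole vectors at once, provided the two vectors have matching lengths, so there is no need to loop over individual elements.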
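The vectorized idea in this comment could look roughly like the sketch below. It first pairs every row of d1 with every row of d2 via expand.grid(), so both genotype vectors have the same length, and then scores a whole column in a single case_when() call. The helper name scoreColumn and the condensed three-branch rule are assumptions made for illustration. The rule reproduces the nine cases in the question only if the data contain no codes other than "XX", "XY", and "YY" and no missing values.

```r
library(dplyr)  # for case_when()

# Hypothetical helper: score two equal-length genotype vectors elementwise.
scoreColumn <- function(g1, g2) {
  case_when(
    g1 == "XY" | g2 == "XY" ~ 0.5,  # any comparison involving XY
    g1 == g2               ~ 1.0,   # XX vs XX, or YY vs YY
    TRUE                   ~ 0.0    # XX vs YY, or YY vs XX
  )
}

# Example data mirroring the question.
d1 <- data.frame(individual = c("p1", "p2", "p3"),
                 X1 = c("XX", "XY", "YY"),
                 X2 = c("XY", "XY", "XX"),
                 X3 = c("YY", "YY", "XX"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(individual = c("p1", "p2", "p3", "p4", "p5"),
                 X1 = c("XX", "XY", "YY", "YY", "YY"),
                 X2 = c("XY", "XY", "XX", "XX", "XX"),
                 X3 = c("YY", "YY", "XX", "XX", "XX"),
                 stringsAsFactors = FALSE)

# Every (row of d1, row of d2) pair; j varies fastest, like the inner loop.
pairs <- expand.grid(j = seq_len(nrow(d2)), i = seq_len(nrow(d1)))
cols <- names(d1)[-1]

# One column of scores per genotype column, one row per pair.
scoreMatrix <- do.call(cbind, lapply(cols, function(col) {
  scoreColumn(d1[[col]][pairs$i], d2[[col]][pairs$j])
}))

result <- data.frame(
  name1 = d1$individual[pairs$i],
  name2 = d2$individual[pairs$j],
  meanScore = rowMeans(scoreMatrix)
)
```

The trade-off is memory: the paired vectors have nrow(d1) * nrow(d2) elements, so for very large inputs it may be necessary to process d1 in chunks.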