I have 2 similar data sets.

library(tibble)

d1 <- tribble(
  ~individual, ~X1, ~X2, ~X3,
  "p1", "XX", "XY", "YY",
  "p2", "XY", "XY", "YY",
  "p3", "YY", "XX", "XX"
)

d2 <- tribble(
  ~individual, ~X1, ~X2, ~X3,
  "p1", "XX", "XY", "YY",
  "p2", "XY", "XY", "YY",
  "p3", "YY", "XX", "XX",
  "p4", "YY", "XX", "XX",
  "p5", "YY", "XX", "XX"
)

I wrote a function to compare d1 to d2. It takes each individual in d1 and compares it to every individual in d2, column by corresponding column. Each column comparison is given a score, and the mean of those scores is returned for each pair of individuals.

scoreData <- function(d1, d2) {
 require(tidyverse)
 # Start empty; one row is appended per pair of individuals
 output <- data.frame()

 # Every column except the first (individual) holds a genotype
 colNames <- names(d1)[-1]

 for(i in seq_len(nrow(d1))){
  name1 <- d1$individual[i]
  for(j in seq_len(nrow(d2))){
   name2 <- d2$individual[j]
   scores <- NULL
   for(k in seq_along(colNames)){
    col <- colNames[k]
    score <- case_when(
     d1[i,col] == "XX" && d2[j,col] == "XX" ~ 1.0,
     d1[i,col] == "XX" && d2[j,col] == "XY" ~ 0.5,
     d1[i,col] == "XX" && d2[j,col] == "YY" ~ 0.0,
     d1[i,col] == "YY" && d2[j,col] == "XX" ~ 0.0,
     d1[i,col] == "YY" && d2[j,col] == "XY" ~ 0.5,
     d1[i,col] == "YY" && d2[j,col] == "YY" ~ 1.0,
     d1[i,col] == "XY" && d2[j,col] == "XX" ~ 0.5,
     d1[i,col] == "XY" && d2[j,col] == "XY" ~ 0.5,
     d1[i,col] == "XY" && d2[j,col] == "YY" ~ 0.5
     )
    scores <- append(scores, score)
    }
   meanScore <- mean(scores, na.rm = TRUE)
   # data.frame() keeps meanScore numeric; cbind() would coerce it to character
   output <- rbind(output, data.frame(name1, name2, meanScore))
   }
  }
 return(output)
 }

The problem is that my real data sets are very large, and I need to make my code more efficient. I understand that the apply() family of functions is more efficient than for loops in R, but I am not sure how to use them to replicate this nested for loop. Eventually, I would like to parallelize the apply functions to make this scoring function faster. Any ideas or help would be greatly appreciated.

  • First, try to avoid the for loops by using vectorization. Almost everything in R is vectorized, i.e. use case_when(d1[,col] == "XX" & d2[,col] == "XX" ~ 1.0, ...) with the element-wise & on whole columns; there is no need to loop over individual elements. Commented Oct 21, 2022 at 5:37
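
To make the comment's suggestion concrete, here is a minimal sketch using two made-up, equal-length genotype vectors (a and b are illustrative names, not taken from the question). It relies on the element-wise & operator, since && is meant for length-one conditions only:

```r
library(dplyr)

# Two aligned genotype vectors (illustrative values, not the question's data)
a <- c("XX", "XY", "YY")
b <- c("XX", "XY", "XX")

# `&` compares element by element, so a single call scores every position
score <- case_when(
  a == "XX" & b == "XX" ~ 1,
  a == "YY" & b == "YY" ~ 1,
  a == "XX" & b == "YY" ~ 0,
  a == "YY" & b == "XX" ~ 0,
  TRUE ~ 0.5  # every combination involving "XY"
)
score
```

This only works directly when both vectors have the same length, so the rows of d1 and d2 would first have to be paired up.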

1 Answer

One option to avoid the for loops is to approach the task via a cross join, which lets you vectorize both the comparisons and the computation of the means:

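library(dplyr)

d1 |>
  # by = NULL gives the Cartesian product: every d1 row paired with every d2 row
  merge(d2, by = NULL, suffixes = c("", ".y")) |> 
  mutate(across(matches("^X\\d$"), list(score = function(x) {
    # Look up the matching d2 column, e.g. X1.y for X1
    # (cur_data() is deprecated as of dplyr 1.1.0 in favour of pick())
    y <- cur_data()[[paste0(cur_column(), ".y")]]
    
    case_when(
      x == "XX" & y == "XX" ~ 1,
      x == "XX" & y == "YY" ~ 0,
      x == "YY" & y == "XX" ~ 0,
      x == "YY" & y == "YY" ~ 1,
      TRUE ~ .5
    )
  }))) |> 
  # Average the per-column scores for each pair
  mutate(meanScore = rowMeans(across(ends_with("score")))) |> 
  select(name1 = individual, name2 = individual.y, meanScore)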
#> # A tibble: 15 × 3
#>    name1 name2 meanScore
#>    <chr> <chr>     <dbl>
#>  1 p1    p1        0.833
#>  2 p1    p2        0.667
#>  3 p1    p3        0.167
#>  4 p1    p4        0.167
#>  5 p1    p5        0.167
#>  6 p2    p1        0.667
#>  7 p2    p2        0.667
#>  8 p2    p3        0.333
#>  9 p2    p4        0.333
#> 10 p2    p5        0.333
#> 11 p3    p1        0.167
#> 12 p3    p2        0.333
#> 13 p3    p3        1    
#> 14 p3    p4        1    
#> 15 p3    p5        1
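
If the dplyr pipeline is still too slow, the same cross-join idea can be sketched in base R with a score lookup matrix. This is only a sketch under some assumptions: score_pairs and lookup are hypothetical names, the lookup values are copied from the question's case_when rules, and every genotype is assumed to be exactly one of "XX", "XY" or "YY" (missing values are not handled).

```r
# Example data from the question, rebuilt as plain data frames
d1 <- data.frame(individual = c("p1", "p2", "p3"),
                 X1 = c("XX", "XY", "YY"),
                 X2 = c("XY", "XY", "XX"),
                 X3 = c("YY", "YY", "XX"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(individual = paste0("p", 1:5),
                 X1 = c("XX", "XY", "YY", "YY", "YY"),
                 X2 = c("XY", "XY", "XX", "XX", "XX"),
                 X3 = c("YY", "YY", "XX", "XX", "XX"),
                 stringsAsFactors = FALSE)

# Score table: rows are the d1 genotype, columns the d2 genotype
genotypes <- c("XX", "XY", "YY")
lookup <- matrix(c(1.0, 0.5, 0.0,
                   0.5, 0.5, 0.5,
                   0.0, 0.5, 1.0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(genotypes, genotypes))

# score_pairs is a hypothetical helper, not part of any package
score_pairs <- function(d1, d2) {
  cols <- setdiff(names(d1), "individual")
  # All (i, j) row pairs; j varies fastest so rows are grouped by d1 individual
  pairs <- expand.grid(j = seq_len(nrow(d2)), i = seq_len(nrow(d1)))
  # One score column per genotype column, looked up by (d1 value, d2 value)
  scores <- do.call(cbind, lapply(cols, function(col) {
    lookup[cbind(d1[[col]][pairs$i], d2[[col]][pairs$j])]
  }))
  data.frame(name1 = d1$individual[pairs$i],
             name2 = d2$individual[pairs$j],
             meanScore = rowMeans(scores))
}

res <- score_pairs(d1, d2)
```

Like the join above, this builds nrow(d1) * nrow(d2) rows at once, so for very large inputs it may be necessary to process d1 in chunks. Since the chunks are independent of each other, they could then be handed to something like parallel::mclapply().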