1
\$\begingroup\$

I have a dataframe of cases that score 0-1 on a group of binary attributes.

What I want to do is extract all possible ombinations of attribute triplets (e.g. A/B/C, A/B/D... out of A-E) and then sum for each possible combination triplet the number of times a case in the original dataframe matched those attributes.

Using dplyr logic as well as lapply I can solve this problem but the performance is very bad, especially for bigger dataframes and more possible attributes. My real dataframe leads to a test matrix of >1000 possible triplets and the function performs very bad on this.

Please help me optimize the code while ideally staying within the dplyr framework as much as possible.

library(tidyverse)


# Create a test data frame and vector of relevant variables
test_df <- data.frame(ID = c(1,2,3,4), Target = c(1,1,0,0),F_A = c(1,0,0,1),F_B = c(0,1,0,1),F_C = c(1,1,0,0),F_D = c(0,1,1,0),F_E = c(1,0,0,1))
invars = c("F_A","F_B","F_C","F_D","F_E")
NumOfElements = 3

# Create a full matrix of all relevant variables in NumOfElements-combinations
combn(invars,NumOfElements) %>%
  t() %>%
  as.data.frame() %>%
  rowid_to_column("ID") %>%
  select(ID, T1 = V1, T2 = V2, T3 = V3) %>%
  unite("Test",starts_with("T"),sep = "|",remove = FALSE,na.rm = TRUE) %>%
  {.} -> test_matrix


# Brute Force Function to calculate number of all IDs that fullfill the test rules
bruteForce_size = function(rule_iterator,source_df,invars){
  source_df %>%
    pivot_longer(cols = c(-ID,-Target), names_to = "Affinity", values_to = "Value") %>%
    mutate(Value = ifelse(Value ==1, Affinity,NA_character_)) %>%
    pivot_wider(names_from = Affinity, values_from = Value) %>%
    unite("Test",invars,sep = "|",remove = FALSE,na.rm = TRUE) %>%
    mutate(Size = as.numeric(rule_iterator == Test)) %$%
    sum(Size)
}

# Calculate and attach sizes to test_matrix
test_matrix %>%
  mutate(Size = unlist(lapply(Test, bruteForce_size, test_df)))
\$\endgroup\$

1 Answer 1

2
\$\begingroup\$

Maybe something like this, with base functions:

rez <- apply(test_matrix[, c('T1', 'T2', 'T3')], 1, function(x) {
  y <- test_df[, x]
  sum(rowSums(y) == 3)
}, simplify = T)
rez # vector

test_matrix$Size <- rez
test_matrix
\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.