Create a character variable with values conditional on previous variables (both their names and their values) with dplyr

Question

I have a dataframe with ~ 100 variables and several thousand observations. Some of these observations are disqualified from further analysis because of the values in some of their variables. Instead of just deleting these disqualified observations I want to create a character variable that indicates if the observation has been disqualified and if so, because of which variables (one observation can be disqualified because of multiple variables).

Most variables are numeric and can have one of the following values: -1, 0, 1. In addition to that, the variables that could potentially disqualify an observation can also take the value 99, which means disqualified.

# create example data
df <- data.frame(id = c(1:6),
                 AA_B = c(1, 0, NA, 1, -1, 99),
                 A_B_C = c(0, 0, 0, -1, 1, NA),
                 A_BB = c(-1, 99, 0, 0, -1, NA),
                 B_C = c(99, NA, 1, 99, 0, 99),
                 D_AC = c(1, 1, 1, 1, -1, -1))

If an observation is disqualified, the new variable "disqualify" should be something like Disqualified because of A_BB or Disqualified because of AA_B and B_C (depending on which variables caused the disqualification), otherwise it could be any other string or just NA. So, the result should look like the following:

> df
  id AA_B A_B_C A_BB B_C D_AC                           disqualify
1  1    1     0   -1  99    1          Disqualified because of B_C
2  2    0     0   99  NA    1         Disqualified because of A_BB
3  3   NA     0    0   1    1                                 <NA>
4  4    1    -1    0  99    1          Disqualified because of B_C
5  5   -1     1   -1   0   -1                                 <NA>
6  6   99    NA   NA  99   -1 Disqualified because of AA_B and B_C

I'm struggling to find a way to automatically include the variable names of those variables that caused the disqualification in the string of "disqualify". So far I have found the following solution, but this is a horrible peace of code and I am sure that there must be a better way to do this.

df <-
  df %>%
  mutate(disqualify = case_when(AA_B == 99 |
                                  A_BB == 99 | 
                                  B_C == 99 ~ paste("Disqualified because of",
                                                    case_when(AA_B == 99 & (is.na(A_BB) | A_BB != 99) & (is.na(B_C) | B_C != 99) ~ deparse(substitute(AA_B)),
                                                              AA_B == 99 & A_BB == 99 & (is.na(B_C) | B_C != 99) ~ paste(deparse(substitute(AA_B)), deparse(substitute(A_BB)), sep = " and "),
                                                              AA_B == 99 & A_BB == 99 & B_C == 99 ~ paste(deparse(substitute(AA_B)), deparse(substitute(A_BB)), deparse(substitute(B_C)), sep = " and "),
                                                              AA_B == 99 & (is.na(A_BB) | A_BB != 99) & B_C == 99 ~ paste(deparse(substitute(AA_B)), deparse(substitute(B_C)), sep = " and "),
                                                              (is.na(AA_B) | AA_B != 99) & A_BB == 99 & B_C == 99 ~ paste(deparse(substitute(A_BB)), deparse(substitute(B_C)), sep = " and "),
                                                              (is.na(AA_B) | AA_B != 99) & A_BB == 99 & (is.na(B_C) | B_C != 99) ~ deparse(substitute(A_BB)),
                                                              (is.na(AA_B) | AA_B != 99) & (is.na(A_BB) | A_BB != 99) & B_C == 99 ~ deparse(substitute(B_C))
                                                              ))))

If possible, I'd favour a dplyr solution that allows me to call the disqualifying variables by their variables names (no indexing).

And, as a cherry on top, it would be fantastic if there was a way to replace the variable names in my output variable by another string. So Disqualified because of A_BB could become Disqualified because of Weather.

Any help is appreciated!

A. Suliman · Accepted Answer · 2019-11-06 18:47:12Z

2

library(dplyr)
df %>%
   #Check for 99 in specific columns
   mutate(disqualify = apply(.[,c('AA_B','A_B_C','A_BB','B_C')], 1, function(x) ifelse(any(x==99), 
                                   paste0("Disqualified because of ", paste(names(x[!is.na(x) & x==99]), collapse = " and ")), 
                                   NA)))

  id AA_B A_B_C A_BB B_C D_AC                           disqualify
1  1    1     0   -1  99    1          Disqualified because of B_C
2  2    0     0   99  NA    1         Disqualified because of A_BB
3  3   NA     0    0   1    1                                 <NA>
4  4    1    -1    0  99    1          Disqualified because of B_C
5  5   -1     1   -1   0   -1                                 <NA>
6  6   99    NA   NA  99   -1 Disqualified because of AA_B and B_C

#Base R
df$disqualify <- apply(df[,c('AA_B','A_B_C','A_BB','B_C')], 1, function(x) ifelse(any(x==99), 
                                                             paste0("Disqualified because of ", paste(names(x[!is.na(x) & x==99]), collapse = " and ")), 
                                                             NA))

In base R we can apply a function over dataframe rows/columns depending on whether you pass 1 or 2. Here we need to apply a function into every row hence we used 1. see ?apply for more details.

edited Nov 6, 2019 at 18:47

answered Nov 6, 2019 at 11:55

A. Suliman

13.2k6 gold badges27 silver badges42 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

MarkusG Over a year ago

Great, thanks a lot! Is there a way to apply the function only to some specified variables? Because actually only four variables out of ~100 can potentially disqualify an observation. And in addition to that, some of the other variables may have a value of 99. (I could of course simply change the disqualifier value to something like -999999, but in order to improve my understanding of R I would like to learn how to apply this function to a predefined set of variables only.)

A. Suliman Over a year ago

@MarkusG sorry for the late reply, please check my update.

MarkusG Over a year ago

That's exactly what I was looking for. Thanks a lot the code and the explanation!

tmfmnk · Accepted Answer · 2019-11-06 11:43:59Z

One dplyr and tidyr option could be:

df %>%
 left_join(df %>%
            pivot_longer(names_to = "variables", values_to = "values", -id, values_drop_na = TRUE) %>%
            group_by(id) %>%
            summarise(disqualify = if_else(all(values != 99), 
                                           NA_character_, 
                                           paste("Disqualified because of", paste0(variables[values == 99], collapse = " and ")))),
           by = c("id" = "id"))

  id AA_B A_B_C A_BB B_C D_AC                           disqualify
1  1    1     0   -1  99    1          Disqualified because of B_C
2  2    0     0   99  NA    1         Disqualified because of A_BB
3  3   NA     0    0   1    1                                 <NA>
4  4    1    -1    0  99    1          Disqualified because of B_C
5  5   -1     1   -1   0   -1                                 <NA>
6  6   99    NA   NA  99   -1 Disqualified because of AA_B and B_C

Zahiro Mor · Accepted Answer · 2019-11-06 12:04:15Z

0

a base one-liner :

df$disqualify <- apply(df,1,function(x)paste(names(which(x==99)),collapse = " and "))
> df
  id AA_B A_B_C A_BB B_C D_AC      disqualify
1  1    1     0   -1  99    1             B_C
2  2    0     0   99  NA    1            A_BB
3  3   NA     0    0   1    1                
4  4    1    -1    0  99    1             B_C
5  5   -1     1   -1   0   -1                
6  6   99    NA   NA  99   -1   AA_B and B_C

to have it exactly the way you wanted it - you can add:

df$disqualify <- ifelse(test = df$disqualify=="", yes = NA, no = paste('Disqualified because of ',df$disqualify))

> df
  id AA_B A_B_C A_BB B_C D_AC                            disqualify
1  1    1     0   -1  99    1          Disqualified because of  B_C
2  2    0     0   99  NA    1         Disqualified because of  A_BB
3  3   NA     0    0   1    1                                  <NA>
4  4    1    -1    0  99    1          Disqualified because of  B_C
5  5   -1     1   -1   0   -1                                  <NA>
6  6   99    NA   NA  99   -1 Disqualified because of  AA_B and B_C

if you want to change the names of the columns - why don't do that before this operation with
names(df) <- c("id","Weather", "Climate","Name3"...)

edited Nov 6, 2019 at 12:04

answered Nov 6, 2019 at 11:49

Zahiro Mor

1,7181 gold badge17 silver badges30 bronze badges

1 Comment

tmfmnk Over a year ago

I think this one is not solving the problem. The problem is to create the disqualify column based on other columns.

Collectives™ on Stack Overflow

Create a character variable with values conditional on previous variables (both their names and their values) with dplyr

3 Answers 3

3 Comments

Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

3 Comments

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related