3

I have a dataframe with ~ 100 variables and several thousand observations. Some of these observations are disqualified from further analysis because of the values in some of their variables. Instead of just deleting these disqualified observations I want to create a character variable that indicates if the observation has been disqualified and if so, because of which variables (one observation can be disqualified because of multiple variables).

Most variables are numeric and can have one of the following values: -1, 0, 1. In addition to that, the variables that could potentially disqualify an observation can also take the value 99, which means disqualified.

# create example data
df <- data.frame(id = c(1:6),
                 AA_B = c(1, 0, NA, 1, -1, 99),
                 A_B_C = c(0, 0, 0, -1, 1, NA),
                 A_BB = c(-1, 99, 0, 0, -1, NA),
                 B_C = c(99, NA, 1, 99, 0, 99),
                 D_AC = c(1, 1, 1, 1, -1, -1))

If an observation is disqualified, the new variable "disqualify" should be something like Disqualified because of A_BB or Disqualified because of AA_B and B_C (depending on which variables caused the disqualification), otherwise it could be any other string or just NA. So, the result should look like the following:

> df
  id AA_B A_B_C A_BB B_C D_AC                           disqualify
1  1    1     0   -1  99    1          Disqualified because of B_C
2  2    0     0   99  NA    1         Disqualified because of A_BB
3  3   NA     0    0   1    1                                 <NA>
4  4    1    -1    0  99    1          Disqualified because of B_C
5  5   -1     1   -1   0   -1                                 <NA>
6  6   99    NA   NA  99   -1 Disqualified because of AA_B and B_C

I'm struggling to find a way to automatically include the variable names of those variables that caused the disqualification in the string of "disqualify". So far I have found the following solution, but this is a horrible peace of code and I am sure that there must be a better way to do this.

df <-
  df %>%
  mutate(disqualify = case_when(AA_B == 99 |
                                  A_BB == 99 | 
                                  B_C == 99 ~ paste("Disqualified because of",
                                                    case_when(AA_B == 99 & (is.na(A_BB) | A_BB != 99) & (is.na(B_C) | B_C != 99) ~ deparse(substitute(AA_B)),
                                                              AA_B == 99 & A_BB == 99 & (is.na(B_C) | B_C != 99) ~ paste(deparse(substitute(AA_B)), deparse(substitute(A_BB)), sep = " and "),
                                                              AA_B == 99 & A_BB == 99 & B_C == 99 ~ paste(deparse(substitute(AA_B)), deparse(substitute(A_BB)), deparse(substitute(B_C)), sep = " and "),
                                                              AA_B == 99 & (is.na(A_BB) | A_BB != 99) & B_C == 99 ~ paste(deparse(substitute(AA_B)), deparse(substitute(B_C)), sep = " and "),
                                                              (is.na(AA_B) | AA_B != 99) & A_BB == 99 & B_C == 99 ~ paste(deparse(substitute(A_BB)), deparse(substitute(B_C)), sep = " and "),
                                                              (is.na(AA_B) | AA_B != 99) & A_BB == 99 & (is.na(B_C) | B_C != 99) ~ deparse(substitute(A_BB)),
                                                              (is.na(AA_B) | AA_B != 99) & (is.na(A_BB) | A_BB != 99) & B_C == 99 ~ deparse(substitute(B_C))
                                                              ))))

If possible, I'd favour a dplyr solution that allows me to call the disqualifying variables by their variables names (no indexing).

And, as a cherry on top, it would be fantastic if there was a way to replace the variable names in my output variable by another string. So Disqualified because of A_BB could become Disqualified because of Weather.

Any help is appreciated!

0

3 Answers 3

2
library(dplyr)
df %>%
   #Check for 99 in specific columns
   mutate(disqualify = apply(.[,c('AA_B','A_B_C','A_BB','B_C')], 1, function(x) ifelse(any(x==99), 
                                   paste0("Disqualified because of ", paste(names(x[!is.na(x) & x==99]), collapse = " and ")), 
                                   NA)))

  id AA_B A_B_C A_BB B_C D_AC                           disqualify
1  1    1     0   -1  99    1          Disqualified because of B_C
2  2    0     0   99  NA    1         Disqualified because of A_BB
3  3   NA     0    0   1    1                                 <NA>
4  4    1    -1    0  99    1          Disqualified because of B_C
5  5   -1     1   -1   0   -1                                 <NA>
6  6   99    NA   NA  99   -1 Disqualified because of AA_B and B_C

#Base R
df$disqualify <- apply(df[,c('AA_B','A_B_C','A_BB','B_C')], 1, function(x) ifelse(any(x==99), 
                                                             paste0("Disqualified because of ", paste(names(x[!is.na(x) & x==99]), collapse = " and ")), 
                                                             NA))

In base R we can apply a function over dataframe rows/columns depending on whether you pass 1 or 2. Here we need to apply a function into every row hence we used 1. see ?apply for more details.

Sign up to request clarification or add additional context in comments.

3 Comments

Great, thanks a lot! Is there a way to apply the function only to some specified variables? Because actually only four variables out of ~100 can potentially disqualify an observation. And in addition to that, some of the other variables may have a value of 99. (I could of course simply change the disqualifier value to something like -999999, but in order to improve my understanding of R I would like to learn how to apply this function to a predefined set of variables only.)
@MarkusG sorry for the late reply, please check my update.
That's exactly what I was looking for. Thanks a lot the code and the explanation!
2

One dplyr and tidyr option could be:

df %>%
 left_join(df %>%
            pivot_longer(names_to = "variables", values_to = "values", -id, values_drop_na = TRUE) %>%
            group_by(id) %>%
            summarise(disqualify = if_else(all(values != 99), 
                                           NA_character_, 
                                           paste("Disqualified because of", paste0(variables[values == 99], collapse = " and ")))),
           by = c("id" = "id"))

  id AA_B A_B_C A_BB B_C D_AC                           disqualify
1  1    1     0   -1  99    1          Disqualified because of B_C
2  2    0     0   99  NA    1         Disqualified because of A_BB
3  3   NA     0    0   1    1                                 <NA>
4  4    1    -1    0  99    1          Disqualified because of B_C
5  5   -1     1   -1   0   -1                                 <NA>
6  6   99    NA   NA  99   -1 Disqualified because of AA_B and B_C

Comments

0

a base one-liner :

df$disqualify <- apply(df,1,function(x)paste(names(which(x==99)),collapse = " and "))
> df
  id AA_B A_B_C A_BB B_C D_AC      disqualify
1  1    1     0   -1  99    1             B_C
2  2    0     0   99  NA    1            A_BB
3  3   NA     0    0   1    1                
4  4    1    -1    0  99    1             B_C
5  5   -1     1   -1   0   -1                
6  6   99    NA   NA  99   -1   AA_B and B_C

to have it exactly the way you wanted it - you can add:

df$disqualify <- ifelse(test = df$disqualify=="", yes = NA, no = paste('Disqualified because of ',df$disqualify))

> df
  id AA_B A_B_C A_BB B_C D_AC                            disqualify
1  1    1     0   -1  99    1          Disqualified because of  B_C
2  2    0     0   99  NA    1         Disqualified because of  A_BB
3  3   NA     0    0   1    1                                  <NA>
4  4    1    -1    0  99    1          Disqualified because of  B_C
5  5   -1     1   -1   0   -1                                  <NA>
6  6   99    NA   NA  99   -1 Disqualified because of  AA_B and B_C

if you want to change the names of the columns - why don't do that before this operation with
names(df) <- c("id","Weather", "Climate","Name3"...)

1 Comment

I think this one is not solving the problem. The problem is to create the disqualify column based on other columns.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.