I need to create a dummy (binary) variable from a character (string) variable. The data I have looks like this:

library(tidyverse)  # provides tribble(), %>%, mutate(), fct_collapse()

dat <- tribble(
    ~pat_id, ~icd9_1, ~icd9_2,
    1, "414.01", "414.01",
    2, "411.89", NA,
    3, NA, "410.71",
    4, NA, NA,
    5, NA, "410.51",
    6, NA, "272.0, 410.71"
)
dat



# A tibble: 6 x 3
#         pat_id icd9_1        icd9_2
#          <dbl>  <chr>         <chr>
#              1 414.01        414.01
#              2 411.89          <NA>
#              3   <NA>        410.71
#              4   <NA>          <NA>
#              5   <NA>        410.51
#              6   <NA> 272.0, 410.71

I want to create three new binary variables:

icd9_bin_1 == binary (0/1) for icd9_1
icd9_bin_2 == binary (0/1) for icd9_2
icd9_bin == binary for either icd9_1 OR icd9_2

What is the fastest way to create these binary variables?

I've replaced the NAs with 0, converted the column to a factor, and then recoded it, but that took forever.

# get structure
dat$icd9_1 %>% str()
# get rid of NAs (replace with "0"; the column is character)
dat$icd9_1[is.na(dat$icd9_1)] <- "0"
# turn into factor
dat$icd9_1 <- factor(dat$icd9_1)
# get levels 
dat$icd9_1 %>% levels()
# use fct_collapse
dat %>%
    mutate(icd9_bin_1 = fct_collapse(
        icd9_1,
        `icd9` = c("411.89","414.01"),
        `no icd9 dx` = c("0")))
# A tibble: 6 x 4
#   pat_id icd9_1        icd9_2 icd9_bin_1
#    <dbl> <fctr>         <chr>     <fctr>
#        1 414.01        414.01       icd9
#        2 411.89          <NA>       icd9
#        3      0        410.71 no icd9 dx
#        4      0          <NA> no icd9 dx
#        5      0        410.51 no icd9 dx
#        6      0 272.0, 410.71 no icd9 dx

I'm looking for a more elegant solution. Ideas?

  • The first row should be the binary for either since it has both non-na columns. You have labeled it the same as the second row indicating column 9_1 only. Commented Jun 14, 2017 at 17:12
  • Do you just need dat$icd9_bin_1 <- if_else(is.na(dat$icd9_1), "no icd9 dx", "icd9")? I'm tired, so I'm probably missing something... Commented Jun 14, 2017 at 17:14
  • @PierreLafortune sorry about that--I was just giving an example of how I was creating the first binary variable, icd9_bin_1. After these two are created, I use mutate and if_else to create the binary for either icd9_1 or icd9_2 Commented Jun 14, 2017 at 17:19
  • Try dat[c('icd9_bin_1', 'icd9_bin_2')] <- paste(c('yes', 'no')[is.na(dat[-1]) + 1L], rep(names(dat[-1]), each=nrow(dat)), sep='-') Commented Jun 14, 2017 at 17:23
  • @Phil, yes that works (and is way fewer lines of code). I guess I was hoping for a dplyr solution that let me create all three variables in one pipe? The actual data has up to 50 different icd9 levels across several variables. Commented Jun 14, 2017 at 17:27

2 Answers

To create the binary values manually, apply a function to each column to flag non-missing values, then take the logical OR of the resulting columns to find rows where at least one value is not NA.

is_not_na <- function(...) Negate(is.na)(...)

dat %>%
  mutate(icd9_bin_1 = icd9_1 %>% is_not_na() %>% as.numeric(),
         icd9_bin_2 = icd9_2 %>% is_not_na() %>% as.numeric(),
         icd9_bin = as.numeric(icd9_bin_1 | icd9_bin_2))
#> # A tibble: 6 x 6
#>   pat_id icd9_1        icd9_2 icd9_bin_1 icd9_bin_2 icd9_bin
#>    <dbl>  <chr>         <chr>      <dbl>      <dbl>    <dbl>
#> 1      1 414.01        414.01          1          1        1
#> 2      2 411.89          <NA>          1          0        1
#> 3      3   <NA>        410.71          0          1        1
#> 4      4   <NA>          <NA>          0          0        0
#> 5      5   <NA>        410.51          0          1        1
#> 6      6   <NA> 272.0, 410.71          0          1        1

If you had many, many of these columns, you could use mutate_at().

is_not_na_num <- function(...) as.numeric(Negate(is.na)(...))

# Make up a new column
dat$icd9_3 <- rev(dat$icd9_1)

# To use pattern matching...
data_auto <- dat %>%
  mutate_at(vars(matches("icd9")), funs(bin = is_not_na_num))
data_auto
#> # A tibble: 6 x 7
#>   pat_id icd9_1        icd9_2 icd9_3 icd9_1_bin icd9_2_bin icd9_3_bin
#>    <dbl>  <chr>         <chr>  <chr>      <dbl>      <dbl>      <dbl>
#> 1      1 414.01        414.01   <NA>          1          1          0
#> 2      2 411.89          <NA>   <NA>          1          0          0
#> 3      3   <NA>        410.71   <NA>          0          1          0
#> 4      4   <NA>          <NA>   <NA>          0          0          0
#> 5      5   <NA>        410.51 411.89          0          1          1
#> 6      6   <NA> 272.0, 410.71 414.01          0          1          1
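
On newer dplyr releases (1.0.0 and later, as far as I know), funs() is deprecated and mutate_at() is superseded by across(). A minimal sketch of the same pattern-matching step, assuming a recent dplyr is installed; the lambda and the `.names` template are just one way to reproduce the `_bin` suffix used above:

```r
library(dplyr)
library(tibble)

# Same example data as in the question, plus the made-up third column
dat <- tribble(
  ~pat_id, ~icd9_1, ~icd9_2,
  1, "414.01", "414.01",
  2, "411.89", NA,
  3, NA, "410.71",
  4, NA, NA,
  5, NA, "410.51",
  6, NA, "272.0, 410.71"
)
dat$icd9_3 <- rev(dat$icd9_1)

# For every column whose name matches "icd9", add a 0/1 column
# named <original name>_bin that is 1 when the value is not NA
data_auto <- dat %>%
  mutate(across(matches("icd9"),
                ~ as.numeric(!is.na(.x)),
                .names = "{.col}_bin"))
```

This should give the same icd9_1_bin, icd9_2_bin and icd9_3_bin columns as the mutate_at() call.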

To automate that final OR across all of the binary columns, you could use reduce():

bin_any <- data_auto %>%
  select(matches("_bin")) %>%
  purrr::reduce(~ as.numeric(.x | .y))
data_auto$icd9_bin <- bin_any
data_auto["icd9_bin"]
#> # A tibble: 6 x 1
#>   icd9_bin
#>      <dbl>
#> 1        1
#> 2        1
#> 3        1
#> 4        0
#> 5        1
#> 6        1
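
If only the combined indicator is needed, the per-column binaries can be skipped by counting non-missing values per row. This is a sketch rather than a drop-in solution: the regular expression `^icd9_[0-9]+$` is an assumption about how the source columns are named, chosen so that it does not pick up any `_bin` columns that already exist.

```r
library(dplyr)
library(tibble)

# Same example data as in the question
dat <- tribble(
  ~pat_id, ~icd9_1, ~icd9_2,
  1, "414.01", "414.01",
  2, "411.89", NA,
  3, NA, "410.71",
  4, NA, NA,
  5, NA, "410.51",
  6, NA, "272.0, 410.71"
)

# Pick only the raw code columns (assumed naming: icd9_<number>)
icd9_cols <- dat %>% select(matches("^icd9_[0-9]+$"))

# 1 if at least one code column is non-NA in the row, otherwise 0
dat_any <- dat %>%
  mutate(icd9_bin = as.numeric(rowSums(!is.na(icd9_cols)) > 0))
```

Because rowSums() works on the whole logical matrix at once, this avoids building one intermediate column per variable.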

1 Comment

Thank you! I went with the pattern matching because the actual data is coded according to specific medical conditions (e.g. hypertension is htn_icd9_plst, htn_icd9_enc, etc.). This function will be very useful! I also wanted to share the dummyVars function from the caret package.

As per your comments, if_else() is a dplyr function that plays well with mutate() if that's what you need:

dat <- dat %>%
  mutate(icd9_bin_1 = if_else(is.na(icd9_1), "no icd9 dx", "icd9")
         # more...
         )

1 Comment

Yes, @Phil, this is similar to what I am currently using. It requires multiple steps (i.e. a binary for each variable, then a binary for either of the binary variables). I was hoping for a solution that skips the first step, looks across a set of variables (because technically they are lists?), and returns 1 if at least one icd9 code is present in any of these variables, or 0 if all of them are NA.
