I have a dataset with two continuous variables, X1 and X2, both of which have missing values that I need to impute. I am working with the mice package in R. The trouble is that the values in one column are constrained by the other: X1 >= X2 must hold in every row. However, when I run mice, it imputes values that violate this condition.

Here is a minimal working example:

library(MASS)
library(tidyverse)
library(mice)

p1 <- 0.7
p2 <- 0.65

sample_size <- 100
sample_meanvector <- c(5, 5)
# a covariance matrix must be symmetric, so both off-diagonal entries are 5
sample_covariance_matrix <- matrix(c(10, 5, 5, 9), ncol = 2)

mvrnorm(
        n = sample_size,
        mu = sample_meanvector, 
        Sigma = sample_covariance_matrix) %>%
    data.frame() %>%
    as_tibble() %>%
    mutate(R1 = rbinom(sample_size, 1, p1)) %>%
    mutate(R2 = rbinom(sample_size, 1, p2)) %>%
    mutate(X1 = ifelse(R1 == 1, X1, NA)) %>%
    mutate(X2 = ifelse(R2 == 1, X2, NA)) %>%
    dplyr::select(X1, X2) %>%
    filter(X1 >= X2 | is.na(X1) | is.na(X2)) -> sample_data

sample_data %>% 
    ggplot(aes(x=X1,y=X2)) + 
        geom_point() + 
        geom_abline(slope = 1, intercept = 0, color = 'red')

[Figure: scatter plot of the unimputed data]

mice(sample_data, m=1) -> mids

complete(mids, 1) -> imputed_data

imputed_data %>%
    ggplot(aes(x=X1,y=X2)) + 
        geom_point() + 
        geom_abline(slope = 1, intercept = 0, color = 'red')

[Figure: scatter plot of the imputed data]

I understand that I need to use the post argument somehow, but I cannot find documentation that is detailed enough, particularly for the case where imputed values are constrained by other imputed values in the same dataset. Please help.

1 Answer

The easiest solution to your problem is to use a different R package: smcfcs. For example:

library(smcfcs)
library(mice)  # provides pool()

# `pop` is not defined in this snippet; it is assumed to be a complete
# data frame that already contains the columns age, hgt, wgt and hc
data <- pop
data[sample(nrow(data), size = 100), "wgt"] <- NA
data[sample(nrow(data), size = 100), "hgt"] <- NA
data$whr <- 100 * data$wgt / data$hgt
meth <- c("", "norm", "norm", "", "", "norm")
imps <- smcfcs(originaldata = data, meth = meth, smtype = "lm",
               smformula = "hc ~ age + hgt + wgt + whr")
fit <- lapply(imps$impDatasets, lm,
              formula = hc ~ age + hgt + wgt + whr)
summary(pool(fit))

If you do want to use mice, what is the specific conditioning that you need? The conditional imputation example in FIMD (Flexible Imputation of Missing Data) squeezes the imputed values into a fixed range, here 1 to 200, as follows:

library(mice)
data <- airquality[, 1:2]
post <- make.post(data)
post["Ozone"] <-
  "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(1, 200))"
imp <- mice(data, method = "norm.nob", m = 1,
            maxit = 1, seed = 1, post = post)

Otherwise, take a look at the mice postprocessing vignette or this answer.
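
For the X1 >= X2 constraint in the question, the same post mechanism can probably be adapted. What follows is only a minimal sketch, not a vetted method. It assumes that, inside mice's sampler, `data` holds the currently completed data and `r` is the observed-value indicator matrix, so that `data$X2[!r[, j]]` lines up with the imputations of variable j (this mirrors the pattern used in the FIMD conditional imputation examples). The simulated data, the "norm" method and the seed are illustrative choices. Note that clamping piles imputations up on the X1 = X2 boundary, so it can distort the imputed distribution.

```r
library(MASS)
library(mice)

set.seed(1)

# Simulate two correlated variables and keep only rows with X1 >= X2
xy <- as.data.frame(mvrnorm(n = 100, mu = c(5, 5),
                            Sigma = matrix(c(10, 5, 5, 9), ncol = 2)))
names(xy) <- c("X1", "X2")
xy <- xy[xy$X1 >= xy$X2, ]

# Knock out values in both columns
xy$X1[rbinom(nrow(xy), 1, 0.30) == 1] <- NA
xy$X2[rbinom(nrow(xy), 1, 0.35) == 1] <- NA

post <- make.post(xy)
# Assumption: `data` and `r` are visible where the post string is evaluated.
# After each draw, clamp the imputations against the other variable's
# current value in the same rows.
post["X1"] <- "imp[[j]][, i] <- pmax(imp[[j]][, i], data$X2[!r[, j]])"
post["X2"] <- "imp[[j]][, i] <- pmin(imp[[j]][, i], data$X1[!r[, j]])"

imp <- mice(xy, m = 1, method = "norm", post = post,
            seed = 1, printFlag = FALSE)
completed <- complete(imp, 1)

# Number of rows that still violate the constraint
n_violations <- sum(completed$X1 < completed$X2)
print(n_violations)
```

Because X2 is visited after X1 in each iteration, the last clamp applied to any row is made against the final value of the other variable, which is why the completed data should respect the constraint. If the boundary pile-up is a concern, a model-based approach such as smcfcs is likely the better choice.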
