1

I want to impute missing values for a few sets of columns. For numeric variables I want to replace NA with the median, and for categorical variables I want to replace NA with the mode. I searched for how to impute separately for different sets of columns but did not find anything.

My data is large, with many columns, so I keep it in a data.table. Since I am not sure how to do this in data.table, I tried the base R code below, but I seem to be getting the column-name identification wrong.

I store the names of the numeric variables in the vector var_num and the names of the categorical variables in the vector var_chr.

Please see my sample code below -

library(data.table)
set.seed(1200)
id <- 1:100
bills <- sample(c(1:20,NA),100,replace = T)
nos <- sample(c(1:80,NA),100,replace = T)
stru <- sample(c("A","B","C","D",NA),100,replace = T)
type <- sample(c(1:7,NA),100,replace = T)
value <- sample(c(100:1000,NA),100,replace = T)

df1 <- as.data.table(data.frame(id,bills,nos,stru,type,value))
class(df1)

var_num <- c("bills","nos","value")
var_chr <- c("stru","type")

impute <- function(x){
  #print(x)
  if(colnames(x) %in% var_num){
    x[is.na(x)] = median(x,na.rm = T)
  } else if (colnames(x) %in% var_chr){
    x[is.na(x)] = mode(x)
  } else {
    x #if not part of var_num and var_chr then nothing needs to be done and return the original value
  }
  return(x)
}


df1_imp_med <- data.frame(apply(df1,2,impute))

When I run the above, I get this error:

Error in if (colnames(x) %in% var_num) { : argument is of length zero

Please help me understand how I can correct this and achieve my requirement.
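For readers hitting the same error, here is a minimal base R sketch of what seems to go wrong. apply(df1, 2, impute) first converts the table to a matrix (a character matrix here, because stru is character) and then hands each column to the function as a plain vector. A plain vector has no colnames, so the condition inside if() has length zero. Separately, mode() in R reports the storage mode of an object rather than its most frequent value. The vector col below is a hypothetical stand-in for one column.

```r
# A stand-in for what apply() passes to the function: a bare vector
col <- c(3, NA, 7)

# A bare vector carries no column names
no_names <- is.null(colnames(col))

# NULL %in% something yields a zero-length logical, which if() rejects
cond_length <- length(colnames(col) %in% c("bills", "nos", "value"))

# mode() returns the storage mode, not the statistical mode
storage_mode <- mode(c("A", "B", "A"))

print(no_names)
print(cond_length)
print(storage_mode)
```

Iterating over column names instead of column values, as the answers do, avoids the problem because the name is then available directly.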

    If you're using data.table you should consider taking advantage of its capabilities, like update-by-reference using := assignment or, possibly better suited in this case, for + set to iterate over several columns. Commented Jul 17, 2018 at 10:43

4 Answers

7

As suggested in the comments, you can use the for + set combination in data.table for faster imputation:

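for (k in names(df1)) {
  if (k %in% var_num) {
    # impute numeric variables with the median
    med <- median(df1[[k]], na.rm = TRUE)
    set(x = df1, i = which(is.na(df1[[k]])), j = k, value = med)
  } else if (k %in% var_chr) {
    # impute categorical variables with the mode (most frequent value);
    # names() returns a character string, so the column should be character or factor
    mode_val <- names(which.max(table(df1[[k]])))
    set(x = df1, i = which(is.na(df1[[k]])), j = k, value = mode_val)
  }
}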

4 Comments

Thank you for your answer. I want to refer only to the variables specified in var_num and var_chr; your solution would do the imputation for all columns. But yes, it will be a good reference.
And here's the general solution I couldn't immediately come up with ;) Note for OP, with this method the "type" column of df1 needs to be changed to a factor or character to calculate the mode instead of median (as it is a numeric vector, but mode is desired)
@user1412 I made it more generic previously such that you don't need to hardcode the column names, just updated the answer.
@YOLO Thank you !!
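The type issue raised in the comments can also be handled inside the loop. The following is a sketch, not the answerer's code: it assumes a small toy table dt, and the helper stat_mode is a hypothetical name. The helper returns the most frequent value in the column's own type, and integer columns are converted to double before a possibly fractional median is written into them.

```r
library(data.table)

# Toy data (assumed for illustration): two integer columns with NAs
dt <- data.table(
  bills = c(4L, NA, 7L, 10L, 3L),
  type  = c(2L, 2L, NA, 5L, 2L)
)
var_num <- "bills"
var_chr <- "type"

# Hypothetical helper: most frequent non-NA value, same type as the input
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

for (k in var_num) {
  med <- median(dt[[k]], na.rm = TRUE)
  # The median of an integer column can be fractional, so widen the column first
  if (is.integer(dt[[k]])) set(dt, j = k, value = as.numeric(dt[[k]]))
  set(dt, i = which(is.na(dt[[k]])), j = k, value = med)
}

for (k in var_chr) {
  set(dt, i = which(is.na(dt[[k]])), j = k, value = stat_mode(dt[[k]]))
}

print(dt)
```

With this approach the type column stays integer, so no conversion to character or factor is needed beforehand.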
3

It may or may not be worth your time coding up a single function for both of your use cases. A direct (but specific) solution is below. Note that mode() may not behave as you expect: it does not compute the statistical mode; see ?mode.

library(data.table)

set.seed(1200)
df1 <- data.table(
  id    = 1:100,
  bills = sample(c(1:20, NA), 100, replace = TRUE),
  nos   = sample(c(1:80, NA), 100, replace = TRUE),
  stru  = sample(c("A", "B", "C", "D", NA), 100, replace = TRUE),
  type  = sample(c(as.character(1:7), NA), 100, replace = TRUE),
  value = sample(c(100:1000, NA), 100, replace = TRUE)
)

# Function to calculate the most frequent object in a vector:
getMode <- function(myvector) {
    mytable <- table(myvector)
    return(names(mytable)[which.max(mytable)])
}

# replace na values by reference, with `:=`
df1[is.na(bills), bills := median(df1[,bills], na.rm=T)]
df1[is.na(nos), nos := median(df1[,nos], na.rm=T)]
df1[is.na(value), value := median(df1[,value], na.rm=T)]
df1[is.na(stru), stru := getMode(df1[,stru])]
df1[is.na(type), type := getMode(df1[,type])]

1 Comment

Thank you for your answer. Yes, I realised that mode means something different in R, and I am using the combination of names(which(table.... See my answer. As you mentioned, this would be a lengthy way of doing it, as there are many variables.
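When there are many variables, the per-column lines can be collapsed with .SDcols. This is a sketch under assumptions: the toy table dt and the helper names get_mode and fill_with are illustrative only, with get_mode mirroring the getMode function above. Because whole columns are reassigned (no i argument), a column's type may change, for example from integer to double when the median is fractional.

```r
library(data.table)

# Toy data (assumed for illustration)
dt <- data.table(
  id    = 1:6,
  bills = c(4L, NA, 7L, 10L, NA, 2L),
  stru  = c("A", NA, "B", "A", "C", NA)
)
var_num <- "bills"
var_chr <- "stru"

# Most frequent value, returned as a character string
get_mode <- function(v) {
  tab <- table(v)
  names(tab)[which.max(tab)]
}

# Replace the NAs in x with a single value
fill_with <- function(x, value) {
  x[is.na(x)] <- value
  x
}

# Update all numeric columns, then all categorical columns, by reference
dt[, (var_num) := lapply(.SD, function(x) fill_with(x, median(x, na.rm = TRUE))),
   .SDcols = var_num]
dt[, (var_chr) := lapply(.SD, function(x) fill_with(x, get_mode(x))),
   .SDcols = var_chr]

print(dt)
```

Columns not listed in var_num or var_chr, such as id here, are left untouched.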
0

I managed to get a working solution. One of the key things was to refer to the variables specified in var_num and var_chr for numeric and categorical imputation. Variables that are not specified in these vectors need not be imputed.

The challenge I faced was referring to them inside the function. I dropped the idea of writing a function and wrote a for loop instead:

df1 <- as.data.frame(df1)

for (var in 1:ncol(df1)) {
  if (names(df1[var]) %in% var_num) {
    df1[is.na(df1[,var]),var] <- median(df1[,var], na.rm = TRUE)
  } else if (names(df1[var]) %in% var_chr) {
    df1[is.na(df1[,var]),var] <- names(which.max(table(df1[,var])))
  }
}

This for loop does the needed imputation.

If there is a simpler and more concise way of achieving this, do let me know. Maybe a function from the apply family would do the trick.
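One possible apply-family version in base R is sketched below. It assumes a plain data.frame; the toy data and the helper names impute_median and impute_mode are illustrative. lapply() runs over the selected columns only, and the result is assigned back to the same columns.

```r
# Toy data (assumed for illustration)
df <- data.frame(
  id    = 1:6,
  bills = c(4, NA, 7, 10, NA, 3),
  stru  = c("A", NA, "B", "A", "C", NA),
  stringsAsFactors = FALSE
)
var_num <- "bills"
var_chr <- "stru"

# Replace NAs with the column median
impute_median <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))

# Replace NAs with the most frequent value; names() yields a character
# string, so a non-character column would be coerced to character
impute_mode <- function(x) replace(x, is.na(x), names(which.max(table(x))))

df[var_num] <- lapply(df[var_num], impute_median)
df[var_chr] <- lapply(df[var_chr], impute_mode)

print(df)
```

Indexing with df[var_num] keeps the result a list of columns, so this also works when var_num holds a single name.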

Comments

0

Another option using lapply

lapply(c(var_num, var_chr), function(x){ 
  imp.fun <- ifelse(x %in% var_num
                   , function(x) median(x, na.rm = T) 
                   , function(x) names(which.max(table(x))))
  df1[is.na(df1[[x]]), (x) := imp.fun(df1[[x]])]})

1 Comment

That code doesn't run, but there's the similar imp = df1[, c(lapply(.SD[, ..var_num], median, na.rm = TRUE), lapply(.SD[, ..var_chr], getMode))]; for (k in c(var_num, var_chr)) df1[is.na(get(k)), (k) := imp[[k]]][] (getMode borrowed from caw's answer)
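A likely reason the lapply version fails is that ifelse() is vectorised over atomic values and, as far as I can tell, cannot return a function; a plain if / else can. The sketch below keeps the answer's structure with that one change. The toy table dt and the variable names are assumptions for illustration.

```r
library(data.table)

# Toy data (assumed for illustration)
dt <- data.table(
  a = c(1, NA, 3, 5),
  b = c("x", NA, "y", "x")
)
var_num <- "a"
var_chr <- "b"

invisible(lapply(c(var_num, var_chr), function(col) {
  # Pick the imputation function with if / else rather than ifelse()
  imp_fun <- if (col %in% var_num) {
    function(v) median(v, na.rm = TRUE)
  } else {
    function(v) names(which.max(table(v)))
  }
  # Update only the NA rows of this column, by reference
  dt[is.na(dt[[col]]), (col) := imp_fun(dt[[col]])]
}))

print(dt)
```

Because this form assigns into a subset of rows, the imputed value has to match the column's existing type; here a is double and b is character, so both assignments fit.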
