r data.table: impute missing values for multiple sets of columns

Question

I want to impute missing values for several sets of columns. For numeric variables I want to replace NA with the median, and for categorical variables I want to replace NA with the mode. I searched for how to impute separately for different sets of columns and did not find anything.

My data is large, with many columns, so I keep it in a data.table. Since I am not sure how to do this in data.table, I tried the base R code below, but I seem to be getting the column name identification wrong.

I store the names of the numeric variables in the vector var_num and the names of the categorical variables in the vector var_chr.

Please see my sample code below:

library(data.table)
set.seed(1200)
id <- 1:100
bills <- sample(c(1:20,NA),100,replace = T)
nos <- sample(c(1:80,NA),100,replace = T)
stru <- sample(c("A","B","C","D",NA),100,replace = T)
type <- sample(c(1:7,NA),100,replace = T)
value <- sample(c(100:1000,NA),100,replace = T)

df1 <- as.data.table(data.frame(id,bills,nos,stru,type,value))
class(df1)

var_num <- c("bills","nos","value")
var_chr <- c("stru","type")

impute <- function(x){
  #print(x)
  if(colnames(x) %in% var_num){
    x[is.na(x)] = median(x,na.rm = T)
  } else if (colnames(x) %in% var_chr){
    x[is.na(x)] = mode(x)
  } else {
    x #if not part of var_num and var_chr then nothing needs to be done and return the original value
  }
  return(x)
}


df1_imp_med <- data.frame(apply(df1,2,impute))

When I run the above, I get this error: Error in if (colnames(x) %in% var_num) { : argument is of length zero
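
Note on the error: apply() converts the table to a matrix and passes each column to the function as a plain vector, so colnames(x) is NULL inside impute() and the if condition has length zero. Because stru is character, every column also arrives as character strings. A minimal sketch reproducing the failing condition in isolation (the vector x is made up for illustration):

```r
var_num <- c("bills", "nos", "value")

# Roughly what apply(df1, 2, impute) hands to impute(): a bare vector
x <- c(3, NA, 7)

cn <- colnames(x)          # NULL, because a plain vector has no dimnames
cond <- cn %in% var_num    # logical(0): there is nothing to test

# `if` cannot branch on a zero-length condition, so it raises an error
res <- tryCatch(if (cond) "numeric" else "other",
                error = function(e) conditionMessage(e))
print(res)
```

Looping over column names instead, as the answers do, avoids the problem because the name is then available directly.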

Please help me understand how I can correct this and achieve my requirement.

If you're using data.table, you should consider taking advantage of its capabilities, such as update-by-reference using := assignment or, possibly better suited to this case, for + set to iterate over several columns.
– talat, Commented Jul 17, 2018 at 10:43

YOLO · Accepted Answer · 2018-07-17 12:44:46Z

7

As suggested in comments, you can use for-set combination in data.table for a faster imputation:

for (k in names(df1)) {

  if (k %in% var_num) {

    # impute numeric variables with the median
    med <- median(df1[[k]], na.rm = TRUE)
    set(x = df1, i = which(is.na(df1[[k]])), j = k, value = med)

  } else if (k %in% var_chr) {

    # impute categorical variables with the mode (most frequent value)
    most_freq <- names(which.max(table(df1[[k]])))
    set(x = df1, i = which(is.na(df1[[k]])), j = k, value = most_freq)
  }
}
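
For a self-contained run, here is a sketch of the same for + set idea on a small made-up table. Two things are assumptions added here, not part of the answer above: numeric columns are converted to double first (the median of integers can be fractional, such as 3.5), and the mode comes from a hypothetical helper stat_mode() that keeps the column's type instead of going through names(table(...)), which always returns character.

```r
library(data.table)

# Hypothetical helper: most frequent non-NA value, same type as the input
stat_mode <- function(x) {
  vals <- x[!is.na(x)]
  ux <- unique(vals)
  ux[which.max(tabulate(match(vals, ux)))]
}

# Small made-up table for illustration
dt <- data.table(
  id    = 1:6,
  bills = c(1L, 2L, NA, 5L, 6L, NA),
  stru  = c("A", "B", "B", NA, "C", "B"),
  type  = c(1L, 1L, 2L, NA, 3L, 1L)
)
var_num <- "bills"
var_chr <- c("stru", "type")

for (k in c(var_num, var_chr)) {
  if (k %in% var_num) {
    # store as double so a fractional median fits the column
    set(dt, j = k, value = as.numeric(dt[[k]]))
    fill <- median(dt[[k]], na.rm = TRUE)
  } else {
    fill <- stat_mode(dt[[k]])
  }
  # overwrite only the NA rows, by reference
  set(dt, i = which(is.na(dt[[k]])), j = k, value = fill)
}
print(dt)
```

Looping over c(var_num, var_chr) rather than names(df1) leaves columns such as id untouched without needing an extra branch.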

edited Jul 17, 2018 at 12:44

answered Jul 17, 2018 at 12:31


4 Comments

user1412 Over a year ago

Thank you for your answer. I want to refer only to the variables specified in var_num and var_chr, whereas your solution would do the imputation for all columns. But yes, it will be a good reference.

caw5cv Over a year ago

And here's the general solution I couldn't immediately come up with ;) Note for OP, with this method the "type" column of df1 needs to be changed to a factor or character to calculate the mode instead of median (as it is a numeric vector, but mode is desired)

YOLO Over a year ago

@user1412 I made it more generic previously such that you don't need to hardcode the column names, just updated the answer.

user1412 Over a year ago

@YOLO Thank you !!

caw5cv · Accepted Answer · 2018-07-17 12:35:32Z

3

It may or may not be worth your time coding up a single function for both of your use cases. A direct (but specific) solution is below. Note that mode() probably does not do what you expect: it returns an object's storage mode (such as "numeric" or "character"), not the most frequent value; see ?mode.

library(data.table)

set.seed(1200)
df1 <- data.table(
  id = 1:100,
  bills = sample(c(1:20, NA), 100, replace = TRUE),
  nos = sample(c(1:80, NA), 100, replace = TRUE),
  stru = sample(c("A", "B", "C", "D", NA), 100, replace = TRUE),
  type = sample(c(as.character(1:7), NA), 100, replace = TRUE),
  value = sample(c(100:1000, NA), 100, replace = TRUE)
)

# Function to calculate the most frequent object in a vector:
getMode <- function(myvector) {
    mytable <- table(myvector)
    return(names(mytable)[which.max(mytable)])
}

# replace NA values by reference, with `:=`
# (df1[, bills] is the full column; a bare `bills` in j would see only the rows selected by is.na(bills))
df1[is.na(bills), bills := median(df1[, bills], na.rm = TRUE)]
df1[is.na(nos), nos := median(df1[, nos], na.rm = TRUE)]
df1[is.na(value), value := median(df1[, value], na.rm = TRUE)]
df1[is.na(stru), stru := getMode(df1[, stru])]
df1[is.na(type), type := getMode(df1[, type])]
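
On the remark about mode above: base R's mode() reports how an object is stored, not its most frequent value. A short illustration on made-up vectors:

```r
mode(c("A", "B", "B"))   # "character"
mode(c(1L, 2L, 2L))      # "numeric"

# The most frequent value has to be computed explicitly, e.g. via table()
x <- c("A", "B", "B", NA)
tab <- table(x)                        # NA is dropped by default
most_freq <- names(tab)[which.max(tab)]
print(most_freq)                       # "B"
```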

answered Jul 17, 2018 at 12:35


1 Comment

user1412 Over a year ago

Thank you for your answer. Yes, I understood that mode means something different in R, and I am using the combination of names(which(table.... instead; see my answer. As you mentioned, this would be a lengthy way of doing it, as there are many variables.

user1412 · Accepted Answer · 2018-07-17 12:40:15Z

I managed to get a working solution. The key requirement was to refer to the variables listed in var_num and var_chr for numeric and categorical imputation; variables not listed in these vectors need not be imputed.

The challenge I was facing was how to refer to them inside the function. I dropped the idea of writing a function and wrote a for loop instead:

df1 <- as.data.frame(df1)

for (var in 1:ncol(df1)) {
  if (names(df1[var]) %in% var_num) {
    df1[is.na(df1[,var]),var] <- median(df1[,var], na.rm = TRUE)
  } else if (names(df1[var]) %in% var_chr) {
    df1[is.na(df1[,var]),var] <- names(which.max(table(df1[,var])))
  }
}

This for loop does the needed imputation.

If there is a simpler and more concise way of achieving this, do let me know. Maybe something from the apply family would do the trick.
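
One more compact possibility, offered as a sketch and not as a tested drop-in for the real data: data.table can update several columns at once with := and .SDcols. The helpers fill_median() and fill_mode() and the small table below are made up for illustration.

```r
library(data.table)

# Hypothetical helpers: return the input vector with its NAs filled in
fill_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
fill_mode <- function(x) {
  vals <- x[!is.na(x)]
  ux <- unique(vals)
  x[is.na(x)] <- ux[which.max(tabulate(match(vals, ux)))]
  x
}

# Small made-up table for illustration
dt <- data.table(
  id    = 1:5,
  bills = c(10, NA, 30, 40, NA),
  nos   = c(NA, 2, 2, 8, 9),
  stru  = c("A", NA, "C", "C", "B"),
  type  = c("1", "2", "2", NA, "3")
)
var_num <- c("bills", "nos")
var_chr <- c("stru", "type")

# Each call replaces only the listed columns; id is left alone
dt[, (var_num) := lapply(.SD, fill_median), .SDcols = var_num]
dt[, (var_chr) := lapply(.SD, fill_mode),   .SDcols = var_chr]
print(dt)
```

Unlike for + set, this rewrites each listed column in full, which may matter for very large tables.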

IceCreamToucan · Accepted Answer · 2018-07-17 13:18:03Z

0

Another option using lapply

lapply(c(var_num, var_chr), function(x){ 
  imp.fun <- ifelse(x %in% var_num
                   , function(x) median(x, na.rm = T) 
                   , function(x) names(which.max(table(x))))
  df1[is.na(df1[[x]]), (x) := imp.fun(df1[[x]])]})

edited Jul 17, 2018 at 13:18

answered Jul 17, 2018 at 12:51


1 Comment

Frank Over a year ago

That code doesn't run, but there's the similar

imp = df1[, c(lapply(.SD[, ..var_num], median, na.rm = TRUE), lapply(.SD[, ..var_chr], getMode))]
for (k in c(var_num, var_chr)) df1[is.na(get(k)), (k) := imp[[k]]][]

(getMode borrowed from caw's answer)
