R - Replacing missing values in a data table

Question

I have a data table like dt below. It's mostly complete, but has a few missing values that I'm trying to fill in a reasonable way.

set.seed(2015)
require(data.table)
dt<-data.table(id=1:10, x=sample(letters[1:3],10,replace=TRUE), y=sample(letters[4:6],10,replace=TRUE), key="id")
dt[sample(10,3), y:=""]
dt
    id x y
 1:  1 a f
 2:  2 c  
 3:  3 a d
 4:  4 a  
 5:  5 a f
 6:  6 b f
 7:  7 b  
 8:  8 a d
 9:  9 b f
10: 10 b e

For each missing y, I would like to set the y value equal to the most frequent (non-blank) y value for its class in x. In the case of a tie, choose one of the tied winners at random. If no winner exists, leave y blank. In this example my data table should get transformed to

    id x y
 1:  1 a f
 2:  2 c  
 3:  3 a d
 4:  4 a d
 5:  5 a f
 6:  6 b f
 7:  7 b f
 8:  8 a d
 9:  9 b f
10: 10 b e

or

    id x y
 1:  1 a f
 2:  2 c  
 3:  3 a d
 4:  4 a f
 5:  5 a f
 6:  6 b f
 7:  7 b f
 8:  8 a d
 9:  9 b f
10: 10 b e

(the y value in row 4 could become d or f)

I couldn't figure out an efficient way to do this.

Arun · Accepted Answer · 2014-07-17 12:45:48Z

4

I'd first compute, for each value of x, the y value to use as the replacement:

idt = dt[, .N, by="x,y"][, list(y=sample(y[N %in% max(N)], 1L)), by=x]
#    x y
# 1: a d
# 2: c  
# 3: b f

and then replace the missing y values by reference, using a binary subset of idt on each x:

setkey(idt, x)
dt[y == "", y := idt[x]$y]
#     id x y
#  1:  1 a f
#  2:  2 c  
#  3:  3 a d
#  4:  4 a d
#  5:  5 a f
#  6:  6 b f
#  7:  7 b f
#  8:  8 a d
#  9:  9 b f
# 10: 10 b e
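
To make the binary-subset step a little more concrete, here is a minimal sketch of what a keyed lookup returns. The lookup table is typed in by hand from the idt output above, and the key "z" is a made-up value used only to show what happens when there is no match:

```r
library(data.table)

# Lookup table copied by hand from the idt output above.
idt <- data.table(x = c("a", "c", "b"), y = c("d", "", "f"))
setkey(idt, x)

# A character vector in i is matched against the key column.
# The result has one row per lookup value, in the order requested.
looked_up <- idt[c("a", "b", "b")]$y

# A value that is not present in the key gives a row of NA
# (the default nomatch behaviour), not an empty string.
missing_key <- idt["z"]$y
```

Because the result lines up one-to-one with the lookup values, it can be assigned straight back with := to the rows selected by y == "".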

edited Jul 17, 2014 at 12:45

answered Jul 17, 2014 at 11:52


1 Comment

Ben Over a year ago

This looks good except for one issue: if the most frequent y value for a given class in x is the empty string, this appears to assign the empty string to the missing y values in that class. I think I fixed this issue by excluding blanks before counting:

idt = dt[y!="", .N, by="x,y"][, list(y=sample(y[N %in% max(N)], 1L)), by=x]

Thanks for your help, Arun.
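
One thing to watch with the filtered idt: a class with no non-blank y (class c in the example) no longer has a row in idt, so a keyed lookup such as idt[x]$y would, as far as I can tell, return NA for it rather than a blank. The following is only a sketch of one way around that, not the answer's own method. It swaps the keyed lookup for an update join, so unmatched rows are simply left alone. The column name fill is made up for illustration, and fifelse() assumes data.table 1.12.4 or later. The table is typed in by hand from the question so the example does not depend on the seed.

```r
library(data.table)

# The example table from the question, entered by hand.
dt <- data.table(
  id = 1:10,
  x  = c("a", "c", "a", "a", "a", "b", "b", "a", "b", "b"),
  y  = c("f", "",  "d", "",  "f", "f", "",  "d", "f", "e"),
  key = "id"
)

# Count only the non-blank y values, then keep one of the most frequent
# values per x. Indexing with sample.int() picks among ties at random.
idt <- dt[y != "", .N, by = .(x, y)][
  , {
    winners <- y[N == max(N)]
    .(fill = winners[sample.int(length(winners), 1L)])
  },
  by = x
]

# Update join: only rows of dt whose x appears in idt are touched.
# Blank y values take the fill value; everything else is kept as is.
# Class "c" has no row in idt, so its blank stays "".
dt[idt, on = "x", y := fifelse(y == "", i.fill, y)]
```

With this version row 4 ends up as either d or f, row 7 as f, and row 2 stays blank.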

kohske · Answer · 2014-07-17 01:19:10Z

2

I'm not sure if this is the fastest way, but you can do it with by:

dt[, z := ifelse(y!="", y, if(length(el <- sort(table(y[y!=""]), decreasing = TRUE)) > 0 ) {names(el)[1]} else {""}),by=x]

then you will get

> dt
    id x y z
 1:  1 a f f
 2:  2 c    
 3:  3 a d d
 4:  4 a   d
 5:  5 a f f
 6:  6 b f f
 7:  7 b   f
 8:  8 a d d
 9:  9 b f f
10: 10 b e e

answered Jul 17, 2014 at 1:19


1 Comment

Ben Over a year ago

I don't think this breaks ties at random, which would introduce bias into my dataset. For example, I think this method will always set the y value of row 4 to 'd' instead of possibly setting it to 'f'.
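
For what it's worth, the by-group approach can be adapted to break ties at random. This is a rough sketch rather than a tested replacement: pick_mode is a hypothetical helper name, not part of data.table, and the table is typed in by hand from the question. It draws once per group, so all blanks within the same class of x receive the same value.

```r
library(data.table)

# Hypothetical helper: return one of the most frequent non-blank values,
# chosen at random among ties, or "" when there are no non-blank values.
pick_mode <- function(v) {
  v <- v[v != ""]
  if (length(v) == 0L) return("")
  counts <- table(v)
  winners <- names(counts)[as.vector(counts) == max(counts)]
  winners[sample.int(length(winners), 1L)]
}

# The example table from the question, entered by hand.
dt <- data.table(
  id = 1:10,
  x  = c("a", "c", "a", "a", "a", "b", "b", "a", "b", "b"),
  y  = c("f", "",  "d", "",  "f", "f", "",  "d", "f", "e"),
  key = "id"
)

# Within each class of x, keep existing values and fill the blanks
# with the value chosen by the helper.
dt[, z := {
  fill <- pick_mode(y)
  ifelse(y != "", y, fill)
}, by = x]
```

If each blank should get its own independent draw, the helper would need to be called once per blank row instead of once per group.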
