I have a data table like dt below. It's mostly complete, but has a few missing values that I'm trying to fill in a reasonable way.

set.seed(2015)
require(data.table)
dt<-data.table(id=1:10, x=sample(letters[1:3],10,replace=TRUE), y=sample(letters[4:6],10,replace=TRUE), key="id")
dt[sample(10,3), y:=""]
dt
    id x y
 1:  1 a f
 2:  2 c  
 3:  3 a d
 4:  4 a  
 5:  5 a f
 6:  6 b f
 7:  7 b  
 8:  8 a d
 9:  9 b f
10: 10 b e

For each missing y, I would like to set y to the most frequent non-blank y value within its class of x. In case of a tie, choose one of the tied winners at random. If no winner exists, leave y blank. In this example, my data table should be transformed to either

    id x y
 1:  1 a f
 2:  2 c  
 3:  3 a d
 4:  4 a d
 5:  5 a f
 6:  6 b f
 7:  7 b f
 8:  8 a d
 9:  9 b f
10: 10 b e

or

    id x y
 1:  1 a f
 2:  2 c  
 3:  3 a d
 4:  4 a f
 5:  5 a f
 6:  6 b f
 7:  7 b f
 8:  8 a d
 9:  9 b f
10: 10 b e

(the y value in row 4 could become d or f)

I couldn't figure out an efficient way to do this.

2 Answers

I'd first compute, for each value of x, the y value to use as the replacement:

idt = dt[, .N, by="x,y"][, list(y=sample(y[N %in% max(N)], 1L)), by=x]
#    x y
# 1: a d
# 2: c  
# 3: b f

and then replace the missing y values by reference, using a binary subset (keyed lookup) on idt for each x:

setkey(idt, x)
dt[y == "", y := idt[x]$y]
#     id x y
#  1:  1 a f
#  2:  2 c  
#  3:  3 a d
#  4:  4 a d
#  5:  5 a f
#  6:  6 b f
#  7:  7 b f
#  8:  8 a d
#  9:  9 b f
# 10: 10 b e

1 Comment

This looks good except for one issue: if the most frequent y for a given class in x is the empty string, this appears to assign the empty string to the missing y values in that class. I think I fixed this by filtering out the blanks first:

idt = dt[y!="", .N, by="x,y"][, list(y=sample(y[N %in% max(N)], 1L)), by=x]

Thanks for your help, Arun.
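One thing to watch with the blank filter: a class of x that has no non-blank y at all (like c in the question) drops out of idt entirely, so a keyed lookup for it would most likely return NA instead of a blank. Below is a minimal sketch of one way to guard against that, swapping the keyed subset for a plain match() lookup. The toy data is made up for illustration and is not the table from the question.

```r
library(data.table)

# Toy data (hypothetical): class "c" has no non-blank y at all.
dt <- data.table(id = 1:6,
                 x  = c("a", "a", "a", "b", "b", "c"),
                 y  = c("d", "d", "",  "f", "",  ""))

# Most frequent non-blank y per class of x; ties are broken at random.
idt <- dt[y != "", .N, by = .(x, y)][, .(y = sample(y[N == max(N)], 1L)), by = x]

# Fill only blank rows whose class has a winner in idt.
# Classes missing from idt (here "c") are left blank rather than set to NA.
dt[y == "" & x %in% idt$x, y := idt$y[match(x, idt$x)]]
dt
```

Restricting the update to `x %in% idt$x` is what keeps the "no winner, leave blank" rule from the question intact.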

I'm not sure this is the fastest approach, but you can do it by group:

dt[, z := ifelse(y != "", y,
                 if (length(el <- sort(table(y[y != ""]), decreasing = TRUE)) > 0) names(el)[1] else ""),
   by = x]

then you will get

> dt
    id x y z
 1:  1 a f f
 2:  2 c    
 3:  3 a d d
 4:  4 a   d
 5:  5 a f f
 6:  6 b f f
 7:  7 b   f
 8:  8 a d d
 9:  9 b f f
10: 10 b e e

1 Comment

I don't think this breaks ties at random, and that would introduce bias into my dataset. For example, I think this method will always set the y value of row 4 to 'd' and never to 'f'.
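If the by-group style is preferred, the tie-breaking could be made random with a small helper. This is only a sketch: random_mode is a hypothetical name, the toy data is made up, and it assumes y holds no NA values (only empty strings for missing entries).

```r
library(data.table)

# Hypothetical helper: most frequent non-blank value of v,
# with ties broken uniformly at random; "" if there is no winner.
random_mode <- function(v) {
  v <- v[v != ""]
  if (length(v) == 0L) return("")
  counts  <- table(v)
  winners <- names(counts)[counts == max(counts)]
  winners[sample.int(length(winners), 1L)]
}

# Toy data (hypothetical): "d" and "f" are tied within class "a".
dt <- data.table(id = 1:5,
                 x  = c("a", "a", "a", "a", "a"),
                 y  = c("d", "d", "f", "f", ""))

# Keep existing y values; fill blanks with the randomly chosen winner.
dt[, z := ifelse(y != "", y, random_mode(y)), by = x]
dt
```

Indexing with sample.int() rather than calling sample() on the winners avoids depending on how sample() treats a length-one input. Because random_mode is called once per group, all blanks within a class receive the same draw.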
