Remove rows from data.table in R based on values of several columns

Question

I have a data.table in R which has several ids and a value. For each combination of ids, there are several rows. If one of these rows contains NA in the column 'value', I would like to remove all rows with this combination of ids. For example, in the table below, I would like to remove all rows for which id1 == 2 and id2 == 1.

If I had only one id, I would do dat[!(id1 %in% dat[is.na(value),id1])]. In the example, this would remove all rows where id1 == 2. However, I could not work out how to extend this to several id columns.

library(data.table)

dat <- data.table(id1 = c(1,1,2,2,2,2),
                  id2 = c(1,2,1,2,3,1),
                  value = c(5,3,NA,6,7,3))
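
To make the problem concrete, here is a minimal sketch of the single-id filter described above, run on the example data. The names bad_id1 and res_single are illustrative only. It shows that filtering on id1 alone also drops the rows with id1 == 2 whose id2 group has no NA.

```r
library(data.table)

dat <- data.table(id1 = c(1, 1, 2, 2, 2, 2),
                  id2 = c(1, 2, 1, 2, 3, 1),
                  value = c(5, 3, NA, 6, 7, 3))

# id1 values that occur in at least one row with a missing value
bad_id1 <- dat[is.na(value), id1]

# Filtering on id1 alone removes every row with id1 == 2,
# including the (2, 2) and (2, 3) groups that contain no NA
res_single <- dat[!(id1 %in% bad_id1)]
print(res_single)
```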

Try dat[!(id1==2 & id2==1)] or setkey(dat, id1, id2)[!J(2,1)]
– akrun, Commented Jan 17, 2015 at 17:45
I know that this would work in the simple example above. However, the question is meant to be more general as there might be a large number of rows with NAs.
– lilaf, Commented Jan 17, 2015 at 17:49
I think he is looking for dat[, if(all(!is.na(value))) .SD, .(id1, id2)]
– David Arenburg, Commented Jan 17, 2015 at 17:49
@lilaf Okay, just now read the part about the NA. My comment was based on "I would like to remove all rows for which id1 == 2 and id2 == 1".
– akrun, Commented Jan 17, 2015 at 17:54
@akrun Anyway thank you for your answer, I'll try to be clearer next time.
– lilaf, Commented Jan 17, 2015 at 18:01

David Arenburg · Accepted Answer · 2017-04-08 17:35:55Z

If you want to check, per combination of id1 and id2, whether any value is NA and then remove that whole combination, you can put an if statement in j and return the group's rows (using .SD) only when the condition is TRUE. Groups for which the condition is FALSE return nothing and are dropped from the result.

dat[, if(!anyNA(value)) .SD, by = .(id1, id2)]
#    id1 id2 value
# 1:   1   1     5
# 2:   1   2     3
# 3:   2   2     6
# 4:   2   3     7

Or similarly,

dat[, if(all(!is.na(value))) .SD, by = .(id1, id2)]

edited Apr 8, 2017 at 17:35

answered Jan 17, 2015 at 17:55

3 Comments

Frank Over a year ago

It might be costly to split dat into all those .SD and stack them. An alternative (maybe generally faster?) approach would be to select rows to keep dat[dat[,!any(is.na(value)),by="id1,id2"]$V1]

Frank Over a year ago

Ah, you're right. I did test it, but somehow convinced myself that what I saw was the right answer. The alternative I should have mentioned is: dat[dat[,.I[!any(is.na(value))],by="id1,id2"]$V1]

David Arenburg Over a year ago

@Frank that's a nice option too.
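
For reference, a minimal runnable sketch of the .I variant from the comment above, using the question's data. The first form in the comments yields one logical value per group (four here) rather than one per row (six), so it cannot be used directly as a row filter; .I returns row numbers instead. The names keep and res_idx are illustrative, and by = .(id1, id2) is used in place of the equivalent by = "id1,id2".

```r
library(data.table)

dat <- data.table(id1 = c(1, 1, 2, 2, 2, 2),
                  id2 = c(1, 2, 1, 2, 3, 1),
                  value = c(5, 3, NA, 6, 7, 3))

# .I holds the row numbers (within dat) of the current group.
# Indexing it with a single TRUE keeps all of them; indexing with a single
# FALSE gives integer(0), so a group containing an NA contributes nothing.
keep <- dat[, .I[!any(is.na(value))], by = .(id1, id2)]$V1

# Subset the original table once by the collected row numbers
res_idx <- dat[keep]
print(res_idx)
```

This avoids materialising .SD for every group, which is the cost the comment is concerned about; whether it is actually faster depends on the data and was not measured here.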
