I want to remove certain rows from a data frame based on the values in other columns of that data frame. For example, if I have a data frame that looks like this:
asd <- data.frame(
  var_1 = as.factor(c("a1", "a2", "a3", "a1", "a2", "a3", "a1", "a2", "a3")),
  var_2 = as.factor(c("a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a3")),
  var_3 = c("NO", "YES", "YES", "YES", "NO", "YES", "YES", "YES", "NO"),
  var_4 = c(0, 2, 4, 2, 0, 7, 4, 7, 0)
)
> asd
  var_1 var_2 var_3 var_4
1    a1    a1    NO     0
2    a2    a1   YES     2
3    a3    a1   YES     4
4    a1    a2   YES     2
5    a2    a2    NO     0
6    a3    a2   YES     7
7    a1    a3   YES     4
8    a2    a3   YES     7
9    a3    a3    NO     0
I want to remove every row that has NO in the var_3 column (luckily, the NO rows are always equally spaced, so I can use that fact to help remove them).
I also want to remove any duplicates. By duplicates I mean rows that contain the same pair of values in var_1 and var_2 regardless of order: for example, row 2 has a2 and a1 while row 4 has a1 and a2, so these two rows are duplicates of each other.
To achieve this I was using the following code:
# This line removes all the rows with NO
asdf <- asd[-seq(1, NROW(asd), by = 4), ]
# This line removes the duplicate rows
asdf <- asdf[!duplicated(t(apply(asdf, 1, sort))), ]
This results in:
> asdf
  var_1 var_2 var_3 var_4
2    a2    a1   YES     2
3    a3    a1   YES     4
6    a3    a2   YES     7
This is exactly the result I want, but I was wondering if there is an easier, less messy way of achieving it (preferably using base R, although this isn't an unbreakable rule)?
Any suggestions are greatly appreciated.
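For comparison, here is a minimal base R sketch of a value-based variant of the same two steps. It filters on var_3 directly instead of relying on the spacing of the NO rows. It also assumes that only var_1 and var_2 decide whether two rows are duplicates (my code above sorts every column of the row instead). The names kept and pair_key are just illustrative:

```r
asd <- data.frame(
  var_1 = as.factor(c("a1", "a2", "a3", "a1", "a2", "a3", "a1", "a2", "a3")),
  var_2 = as.factor(c("a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a3")),
  var_3 = c("NO", "YES", "YES", "YES", "NO", "YES", "YES", "YES", "NO"),
  var_4 = c(0, 2, 4, 2, 0, 7, 4, 7, 0)
)

# Step 1: drop the NO rows by value rather than by position
kept <- asd[asd$var_3 != "NO", ]

# Step 2: build an order-independent key from var_1 and var_2.
# Assumption: only these two columns define a duplicate.
# Factors are converted to character so pmin()/pmax() can compare them.
v1 <- as.character(kept$var_1)
v2 <- as.character(kept$var_2)
pair_key <- paste(pmin(v1, v2), pmax(v1, v2))

# Keep the first occurrence of each unordered pair
asdf <- kept[!duplicated(pair_key), ]
asdf
```

On the example data this keeps rows 2, 3 and 6, the same rows as my original code. I'm not sure it counts as less messy, but it would not break if the NO rows stopped being equally spaced.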