I want to remove certain rows from a data frame based on the values in other columns. For example, if I have a data frame that looks like this:

asd <- data.frame(
  var_1 = as.factor(c("a1", "a2", "a3", "a1", "a2", "a3", "a1", "a2", "a3")),
  var_2 = as.factor(c("a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a3")),
  var_3 = c("NO", "YES","YES","YES","NO", "YES","YES","YES","NO"),
  var_4 = c(0, 2, 4, 2, 0, 7, 4, 7, 0)
)
> asd
  var_1 var_2 var_3 var_4
1    a1    a1    NO     0
2    a2    a1   YES     2
3    a3    a1   YES     4
4    a1    a2   YES     2
5    a2    a2    NO     0
6    a3    a2   YES     7
7    a1    a3   YES     4
8    a2    a3   YES     7
9    a3    a3    NO     0

I want to remove every row that has NO in the var_3 column (luckily, the NO rows are always equally spaced, so I can use that fact to help remove them)

... and I also want to remove any duplicates. By duplicates I mean, for example, that row 2 has a2 and a1 while row 4 has a1 and a2; these rows are duplicates of each other.

To achieve this I was using the following code:

# This line removes all the rows with NO 
asdf <- asd[-seq(1, NROW(asd), by = 4), ]
# This line removes the duplicate rows
asdf <- asdf[!duplicated(t(apply(asdf, 1, sort))), ] 

This results in:

> asdf
  var_1 var_2 var_3 var_4
2    a2    a1   YES     2
3    a3    a1   YES     4
6    a3    a2   YES     7

This is the exact result I would like, but I was wondering if there is an easier, less messy way of achieving it (preferably using base R, though this isn't an unbreakable rule)?

Any suggestions are greatly appreciated.

1 Answer

A base R way that avoids apply:

pmin/pmax perform a rowwise sort of each var_1/var_2 pair. With duplicated we then drop the repeated pairs, and we also remove the rows where var_3 is 'NO'.

result <- transform(asd, var_1 = pmin(var_1, var_2), var_2 = pmax(var_1, var_2))
subset(result, !duplicated(result[1:2]) & var_3 != 'NO')

#  var_1 var_2 var_3 var_4
#2    a1    a2   YES     2
#3    a1    a3   YES     4
#6    a2    a3   YES     7
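
To illustrate why pmin/pmax make swapped pairs comparable, here is a small sketch on two made-up character vectors (x, y, lo and hi are names chosen only for this illustration; they are not from the question):

```r
# Hypothetical toy vectors: positions 1 and 2 hold the same pair in swapped order.
x <- c("a2", "a1", "a3")
y <- c("a1", "a2", "a1")

# pmin/pmax compare element by element, so each pair ends up in sorted order.
lo <- pmin(x, y)
hi <- pmax(x, y)

# After sorting, the swapped pair shows up as an ordinary duplicate.
duplicated(data.frame(lo, hi))
```

Once each pair is in a canonical order, duplicated() can flag the second occurrence without any rowwise apply.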

2 Comments

This is an excellent answer. However, I made a mistake and forgot to include a detail in my original question: in my data set, var_1 and var_2 are factors. I will edit the question to reflect this.
In that case you need to convert those columns to character for this answer to work.
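
A minimal sketch of that conversion, assuming the asd data frame from the question with var_1 and var_2 stored as factors. The as.character() step is one possible way to do what the comment suggests, not necessarily the exact code the answerer had in mind:

```r
# Example data from the question, with var_1 and var_2 stored as factors.
asd <- data.frame(
  var_1 = as.factor(c("a1", "a2", "a3", "a1", "a2", "a3", "a1", "a2", "a3")),
  var_2 = as.factor(c("a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a3")),
  var_3 = c("NO", "YES", "YES", "YES", "NO", "YES", "YES", "YES", "NO"),
  var_4 = c(0, 2, 4, 2, 0, 7, 4, 7, 0)
)

# pmin/pmax do not give meaningful results on unordered factors,
# so convert both columns to character first.
asd$var_1 <- as.character(asd$var_1)
asd$var_2 <- as.character(asd$var_2)

# Same two steps as in the answer: sort each pair, then filter.
result <- transform(asd, var_1 = pmin(var_1, var_2), var_2 = pmax(var_1, var_2))
result <- subset(result, !duplicated(result[1:2]) & var_3 != "NO")
result
```

This keeps rows 2, 3 and 6, as in the answer. Note that var_1 and var_2 come back in sorted order within each row, so row 2 reads a1, a2 rather than a2, a1 as in the question's expected output.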
