How to remove certain rows from data frame based on other columns in R?

Question

I want to remove certain rows from a data frame based on the entries in other columns of that data frame. For example, if I have a data frame that looks like this:

asd <- data.frame(
  var_1 = as.factor(c("a1", "a2", "a3", "a1", "a2", "a3", "a1", "a2", "a3")),
  var_2 = as.factor(c("a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a3")),
  var_3 = c("NO", "YES","YES","YES","NO", "YES","YES","YES","NO"),
  var_4 = c(0, 2, 4, 2, 0, 7, 4, 7, 0)
)
> asd
  var_1 var_2 var_3 var_4
1    a1    a1    NO     0
2    a2    a1   YES     2
3    a3    a1   YES     4
4    a1    a2   YES     2
5    a2    a2    NO     0
6    a3    a2   YES     7
7    a1    a3   YES     4
8    a2    a3   YES     7
9    a3    a3    NO     0

I want to remove every row that has NO in the var_3 column. (Luckily, the NO rows are always equally spaced, so I can use that fact to remove them.)

I also want to remove any duplicates. By duplicates I mean, for example, that row 2 has a2 and a1 while row 4 has a1 and a2; these rows are duplicates of each other.

To achieve this I was using the following code:

# This line removes all the rows with NO 
asdf <- asd[-seq(1, NROW(asd), by = 4), ]
# This line removes the duplicate rows
asdf <- asdf[!duplicated(t(apply(asdf, 1, sort))), ]

This results in:

> asdf
  var_1 var_2 var_3 var_4
2    a2    a1   YES     2
3    a3    a1   YES     4
6    a3    a2   YES     7

This is the exact result I would like... but I was wondering if there is an easier, less messy way of achieving this result (preferably using base R... but this isn't an unbreakable rule)?

Any suggestions are greatly appreciated.

Ronak Shah · Accepted Answer · 2021-02-23 02:27:02Z

A base R way that avoids apply:

pmin/pmax perform a row-wise sort of var_1 and var_2. duplicated then drops the repeated pairs, and the var_3 != 'NO' condition removes the rows marked NO.

result <- transform(asd, var_1 = pmin(var_1, var_2), var_2 = pmax(var_1, var_2))
subset(result, (!duplicated(result[1:2])) &  var_3 != 'NO')

#  var_1 var_2 var_3 var_4
#2    a1    a2   YES     2
#3    a1    a3   YES     4
#6    a2    a3   YES     7
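
For illustration, here is a minimal sketch of the same idea with the filter applied by value first, so it does not depend on where the NO rows sit. It assumes var_1 and var_2 are character vectors; the data frame `pairs` and the names `kept`, `lo` and `hi` are made up for this example and are not from the question.

```r
# Small example with character columns (not factors), so pmin/pmax apply.
pairs <- data.frame(
  var_1 = c("a2", "a1", "a3", "a1"),
  var_2 = c("a1", "a2", "a1", "a1"),
  var_3 = c("YES", "YES", "YES", "NO"),
  stringsAsFactors = FALSE
)

# Filter by value rather than by row position.
kept <- pairs[pairs$var_3 != "NO", ]

# pmin/pmax put the smaller label first within each row,
# which amounts to sorting each (var_1, var_2) pair.
lo <- pmin(kept$var_1, kept$var_2)
hi <- pmax(kept$var_1, kept$var_2)

# Keep only the first occurrence of each unordered pair.
kept <- kept[!duplicated(data.frame(lo, hi)), ]
print(kept)
```

Filtering before de-duplicating means a NO row can never cause a later YES row with the same pair to be dropped; whether that ordering matters depends on the data.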

2 Comments

Electrino Over a year ago

This is an excellent answer. However, I made a mistake and forgot to include a detail in my original question. In my data set, var_1 and var_2 are factors... I will edit the question to reflect this.

Ronak Shah Over a year ago

In that case you need to change the columns to characters for this answer to work.
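
One way the conversion mentioned above might look, as a sketch rather than the answerer's exact code: the factor columns are converted with as.character before pmin/pmax are applied. The helper vectors `v1` and `v2` are names introduced here for illustration.

```r
# The data frame from the question, with var_1 and var_2 stored as factors.
asd <- data.frame(
  var_1 = as.factor(c("a1", "a2", "a3", "a1", "a2", "a3", "a1", "a2", "a3")),
  var_2 = as.factor(c("a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a3")),
  var_3 = c("NO", "YES", "YES", "YES", "NO", "YES", "YES", "YES", "NO"),
  var_4 = c(0, 2, 4, 2, 0, 7, 4, 7, 0)
)

# pmin/pmax are not defined for factors, so work on character copies.
v1 <- as.character(asd$var_1)
v2 <- as.character(asd$var_2)

# Sort each pair row-wise, then drop repeated pairs and the NO rows.
result <- transform(asd, var_1 = pmin(v1, v2), var_2 = pmax(v1, v2))
result <- subset(result, !duplicated(result[1:2]) & var_3 != "NO")
print(result)
```

After this, var_1 and var_2 in result are character columns; wrap them in factor() again if later code expects factors.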
