Remove duplicate rows based on 2 columns and a condition in a third column

Question

I'm having some trouble cleaning up a compiled dataset. Here's what the data look like:

   site unique_id      date latitude longitude depth name    count
1  L012    L012_1   no data 18.17606 -65.10571    40 dat1        0
2  L012    L012_1   no data 18.17606 -65.10571    40 dat2        5
3  L012    L012_1   no data 18.17606 -65.10571    40 dat3        4
4  B197    B197_1   no data 18.21543 -65.04415    43 dat2        5
5   S56     S56_1 9/16/2016 18.24459 -65.11549   999 dat4        5
6 N9040   N9040_1 7/16/2013 18.26385 -64.90385    25 dat5        1
7    SC      SC_1 7/19/2006 18.26267 -64.87237    24 dat6        0
8    SC      SC_2 7/19/2006 18.26267 -64.87237    24 dat6        0

I need to remove duplicate rows based on the latitude and longitude columns, but only those duplicates whose count is greater than 0. What should remain is the row for that lat/long with a 0 in the count column. This is the case for the first three rows of this df: rows 2 and 3 should be dropped and row 1 kept.

At the same time, I need to keep any rows whose lat/long is unique (rows 4, 5, 6), even though their count is greater than 0. I also need to keep duplicate rows that share a lat/long if they have a 0 in the count column (rows 7 and 8).

Ideally, I want the resulting data frame to look like this:

   site unique_id      date latitude longitude depth name    count
1  L012    L012_1   no data 18.17606 -65.10571    40 dat1        0
4  B197    B197_1   no data 18.21543 -65.04415    43 dat2        5
5   S56     S56_1 9/16/2016 18.24459 -65.11549   999 dat4        5
6 N9040   N9040_1 7/16/2013 18.26385 -64.90385    25 dat5        1
7    SC      SC_1 7/19/2006 18.26267 -64.87237    24 dat6        0
8    SC      SC_2 7/19/2006 18.26267 -64.87237    24 dat6        0

The original data frame is much larger than this and contains many more rows with values such as 4 in the count column, so I can't simply remove rows by a specific count value.

DatamineR · Accepted Answer · 2017-10-23 20:44:38Z

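What about this? Group the rows by latitude and longitude, then keep a row if it is the only one in its group or if its count is 0:

library(dplyr)
# n() is the number of rows in the current latitude/longitude group
df %>% group_by(latitude, longitude) %>% filter(n() == 1 | count == 0)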
Source: local data frame [6 x 8]
Groups: latitude, longitude [5]

   site unique_id      date latitude longitude depth  name count
  <chr>     <chr>     <chr>    <dbl>     <dbl> <int> <chr> <int>
1  L012    L012_1    nodata 18.17606 -65.10571    40  dat1     0
2  B197    B197_1    nodata 18.21543 -65.04415    43  dat2     5
3   S56     S56_1 9/16/2016 18.24459 -65.11549   999  dat4     5
4 N9040   N9040_1 7/16/2013 18.26385 -64.90385    25  dat5     1
5    SC      SC_1 7/19/2006 18.26267 -64.87237    24  dat6     0
6    SC      SC_2 7/19/2006 18.26267 -64.87237    24  dat6     0
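
If you would rather not depend on dplyr, the same rule can be sketched in base R. This is only an illustration: the data frame below is retyped from the question's sample (a few columns are left out for brevity), and the names group_size and result are just placeholders.

```r
# Sample rows retyped from the question (date, depth and name omitted)
df <- data.frame(
  site      = c("L012", "L012", "L012", "B197", "S56", "N9040", "SC", "SC"),
  unique_id = c("L012_1", "L012_1", "L012_1", "B197_1", "S56_1",
                "N9040_1", "SC_1", "SC_2"),
  latitude  = c(18.17606, 18.17606, 18.17606, 18.21543, 18.24459,
                18.26385, 18.26267, 18.26267),
  longitude = c(-65.10571, -65.10571, -65.10571, -65.04415, -65.11549,
                -64.90385, -64.87237, -64.87237),
  count     = c(0, 5, 4, 5, 5, 1, 0, 0),
  stringsAsFactors = FALSE
)

# For every row, the number of rows sharing its latitude/longitude pair
group_size <- ave(seq_len(nrow(df)), df$latitude, df$longitude, FUN = length)

# Same rule as the dplyr filter: keep unique locations, or rows with count 0
result <- df[group_size == 1 | df$count == 0, ]
result
```

One thing to watch with either version: if a duplicated lat/long has no row with count 0 at all, every row for that location is dropped. Whether that is what you want depends on your data.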
