Remove duplicate rows based on the value of another variable

Question

I have duplicated dates that I want to remove based on the value of another variable. If one of the dmean values for a duplicated date is NA, I want to drop that row. If both dmean values for a date are NA, I would like to keep either one of the rows. Sample data is shown below. I have tried

subset(df1, !duplicated(date))

but this removes all duplicates regardless of the value of dmean. For example, for date 2010-12-23 I would like to keep the row with the dmean value 28.38250 instead of the one with NA.

df1 <- structure(list(date = c("2010-12-22", "2010-12-22", "2010-12-23", 
"2010-12-23", "2010-12-24", "2010-12-24", "2010-12-25", "2010-12-25", 
"2010-12-26", "2010-12-26", "2010-12-27", "2010-12-27", "2010-12-28", 
"2010-12-28"), dmean = c(NA, NA, NA, 28.3825, 35.54625, NA, 75.27625, 
NA, NA, 75.225, NA, 41.75, NA, 37.98375)), .Names = c("date", 
"dmean"), class = "data.frame", row.names = c(NA, -14L))

juba · Accepted Answer · 2013-10-11 08:53:03Z

1

Here is a solution with plyr:

library(plyr)
# One row per date: NA if every dmean is NA, otherwise the maximum non-NA value
ddply(df1, .(date), summarize,
      dmean = ifelse(all(is.na(dmean)), NA, max(dmean, na.rm = TRUE)))

Which gives:

        date    dmean
1 2010-12-22       NA
2 2010-12-23 28.38250
3 2010-12-24 35.54625
4 2010-12-25 75.27625
5 2010-12-26 75.22500
6 2010-12-27 41.75000
7 2010-12-28 37.98375

Note that it is easy to change the function call if you want the mean, the min, or any other statistic of your dmean values.
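As a rough illustration of swapping the statistic, here is a base R sketch that needs no extra packages. It uses aggregate() rather than ddply(), and summarise_or_na is a hypothetical helper name introduced only for this example:

```r
# Sample data from the question
df1 <- data.frame(
  date = rep(c("2010-12-22", "2010-12-23", "2010-12-24", "2010-12-25",
               "2010-12-26", "2010-12-27", "2010-12-28"), each = 2),
  dmean = c(NA, NA, NA, 28.3825, 35.54625, NA, 75.27625,
            NA, NA, 75.225, NA, 41.75, NA, 37.98375),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply `stat` to the non-NA values,
# or return NA when a group has no non-NA values at all
summarise_or_na <- function(x, stat = max) {
  if (all(is.na(x))) NA_real_ else stat(x, na.rm = TRUE)
}

# Group by date and summarise dmean, here with the mean instead of the max
res_mean <- aggregate(df1["dmean"], by = df1["date"],
                      FUN = summarise_or_na, stat = mean)
res_mean
```

Passing stat = min or stat = median instead changes the summary without touching the grouping logic.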

You can do the same with data.table, too:

library(data.table)
dt <- data.table(df1)
dt[, list(dmean = ifelse(all(is.na(dmean)), NA_real_, max(dmean, na.rm = TRUE))), by = date]

edited Oct 11, 2013 at 8:53

answered Oct 11, 2013 at 8:46

juba


Sven Hohenstein · Accepted Answer · 2013-10-11 08:41:20Z

1

It will work if you order the data frame by date and dmean first:

df1_sorted <- df1[order(df1$date, df1$dmean), ]

After the reordering, the NAs in dmean sit below the numeric values for each date.
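The reason is that order() places NA last by default (its na.last argument defaults to TRUE). A tiny sketch on a made-up vector, not taken from the question's data:

```r
# Illustrative vector: one NA followed by two numbers
x <- c(NA, 3, 1)

# order() returns the indices that sort x, with NA positions placed at the end
idx <- order(x)
x[idx]
```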

Now, you can exclude the rows with duplicated dates:

subset(df1_sorted, !duplicated(date))

The result:

         date    dmean
1  2010-12-22       NA
4  2010-12-23 28.38250
5  2010-12-24 35.54625
7  2010-12-25 75.27625
10 2010-12-26 75.22500
12 2010-12-27 41.75000
14 2010-12-28 37.98375
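If a date could carry more than one non-NA dmean, sorting and then dropping duplicated dates would still keep only one row per date. A minimal base R sketch that drops only the NA rows, while keeping a single row for dates where every dmean is NA, might look like the following. The names n_valid and keep are illustrative and not part of the original answer:

```r
# Sample data from the question
df1 <- data.frame(
  date = rep(c("2010-12-22", "2010-12-23", "2010-12-24", "2010-12-25",
               "2010-12-26", "2010-12-27", "2010-12-28"), each = 2),
  dmean = c(NA, NA, NA, 28.3825, 35.54625, NA, 75.27625,
            NA, NA, 75.225, NA, 41.75, NA, 37.98375),
  stringsAsFactors = FALSE
)

# Number of non-NA dmean values per date, repeated on every row of that date
n_valid <- ave(as.numeric(!is.na(df1$dmean)), df1$date, FUN = sum)

# Keep every non-NA row, plus the first row of any date that has only NA values
keep <- !is.na(df1$dmean) | (n_valid == 0 & !duplicated(df1$date))
res <- df1[keep, ]
res
```

On the sample data this selects the same rows as the sort-based approach; the two differ only when a date has several non-NA values.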

answered Oct 11, 2013 at 8:41

Sven Hohenstein


3 Comments

Backlin Over a year ago

Beware that if neither copy of a date has an NA dmean, one of them will still be dropped by this solution. Can that happen, @Meso?

Sven Hohenstein Over a year ago

@Backlin You are right. I assumed that each date has at most one non-NA dmean value. This is the case in the example.

Meso Over a year ago

@Backlin, in my data one dmean of each duplicated date is always NA. But there are also cases in which both are NA.
