Remove duplicate rows in R data frame, based on a date field and another field

Question

I'm new to R and learning to handle database data, and I've hit a wall.

I want to remove duplicate rows/observations from a table based on two fields: a user ID and a date indicating the last time the user's record changed. For each user ID, I want to keep only the row with the most recent date.

My truncated data set would look like the following:

UID    | DateLastChange
1      |  01/01/2016
1      |  01/03/2016
2      |  01/14/2015
3      |  02/15/2014
3      |  03/15/2016

I would like to end up with:

UID    | DateLastChange
1      |  01/03/2016
2      |  01/14/2015
3      |  03/15/2016

I have tried duplicated and unique, but neither seems to let me choose which of the duplicate rows to keep. I can imagine building a new table of unique UIDs and then left joining in some way to match only the most recent date.

Any advice would be much appreciated. Scott

If the rows are already in the order shown, this is just a duplicated() operation: dat[!duplicated(dat$UID, fromLast=TRUE),]
– thelatemail, Commented Jan 4, 2017 at 3:28
Thanks for the edit to the post. As you may have read, that was my first post to SO, so I haven't really figured out how to end up with neat tables. TY. SW.
– Scottieie, Commented Jan 13, 2017 at 17:47
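
The duplicated() one-liner in the comment above relies on the rows already being sorted by date within each UID. A minimal base R sketch that drops that assumption is shown here; it assumes the data frame is called dat, that the dates are month/day/year strings, and it shuffles the question's sample rows on purpose:

```r
# Sample data from the question, deliberately shuffled so row order cannot be relied on
dat <- data.frame(
  UID = c(3, 1, 2, 1, 3),
  DateLastChange = c("03/15/2016", "01/01/2016", "01/14/2015",
                     "01/03/2016", "02/15/2014"),
  stringsAsFactors = FALSE
)

# Parse the text dates so they compare chronologically rather than as strings
parsed <- as.Date(dat$DateLastChange, format = "%m/%d/%Y")

# Sort by UID, then by date ascending; the last row per UID is then the most recent
sorted <- dat[order(dat$UID, parsed), ]

# fromLast = TRUE flags every row except the last one per UID as a duplicate
latest <- sorted[!duplicated(sorted$UID, fromLast = TRUE), ]
latest
```

Sorting on the parsed dates matters: left as text, "01/14/2015" would sort after "01/03/2016".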

akrun · Accepted Answer · 2017-01-04 03:54:57Z

6

We can use data.table

library(data.table)
setDT(df1)[order(UID, -as.IDate(DateLastChange, "%m/%d/%Y")), head(.SD, 1), by = UID]
#     UID DateLastChange
#1:   1     01/03/2016
#2:   2     01/14/2015
#3:   3     03/15/2016

Or using duplicated

setDT(df1)[order(UID, -as.IDate(DateLastChange, "%m/%d/%Y"))][!duplicated(UID)]


2 Comments

questionmark Over a year ago

Does this work for randomly ordered DateLastChange entries or do they have to be in chronological order, as in OP's example?

akrun Over a year ago

@questionmark In the i, we are ordering the column 'DateLastChange' after converting it to Date class. It should work for randomly ordered data. The only change that may be needed is the format %m/%d/%Y, if the dates are not in month/day/year order.

Andrew Lavers · Accepted Answer · 2017-01-04 03:49:52Z

1

Using dplyr - data can be in any order

library(dplyr)
# strptime() returns POSIXlt, which dplyr cannot group; convert to POSIXct
dat$DateLastChange <- as.POSIXct(strptime(dat$DateLastChange, "%m/%d/%Y"))
dat %>% group_by(UID) %>% summarize(DateLastChange = max(DateLastChange))


4 Comments

Scottieie Over a year ago

Epi99, thanks for the quick response. It returns an error indicating that the date format is wrong: "Error in grouped_df_impl(data, unname(vars), drop) : column 'EmploymentStatusChangeDate' has unsupported class : POSIXlt, POSIXt". I tried to find a way to specify this as POSIXct, but have yet to find the appropriate usage. I see where you are heading with this, however, and will keep plugging away.

Andrew Lavers Over a year ago

Your sample data is in plain text, so the strptime() call is there to parse the text dates into datetime objects so that max() can do a valid comparison. Your data frame may already be in a date or time format, in which case you don't need the line that includes strptime(). This is why I generally recommend using dput to show your example data: your reader can then recreate exactly the data you have.
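
As a rough illustration of the dput() suggestion, the sketch below prints a small data frame as R code and rebuilds it from that text. The data frame is a cut-down, assumed version of the question's sample, and the rebuilt variable exists only to show the round trip:

```r
# Assumed cut-down version of the question's data, with dates still stored as text
dat <- data.frame(
  UID = c(1, 1, 2),
  DateLastChange = c("01/01/2016", "01/03/2016", "01/14/2015"),
  stringsAsFactors = FALSE
)

# dput() prints an R expression that recreates the object, column classes included
dput(dat)

# Round trip: capture that printed expression and evaluate it to rebuild the data
dumped <- paste(capture.output(dput(dat)), collapse = "\n")
rebuilt <- eval(parse(text = dumped))

# Column classes survive, so a reader can see whether the dates are text or Date
sapply(rebuilt, class)
```

Pasting the dput() output into a question shows readers whether DateLastChange is character, Date, or POSIXct, which decides whether any parsing step is needed.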

Scottieie Over a year ago

I will explore dput so we can all have the same apples/apples comparison on the data. Yesterday I worked through converting the date from character to a date and the max() comparison actually worked, so thank you.

Leo Over a year ago

This works only for the given example. If you have additional columns that you want to keep, use a filter instead: dat %>% group_by(UID) %>% filter(Date == max(Date)).
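
A small sketch of the filter() variant from the comment above, assuming dplyr is installed. The Status column is invented here purely to stand in for "additional columns", and the dates are parsed with as.Date() up front:

```r
library(dplyr)

# Question's sample data plus a hypothetical Status column (not in the original post)
dat <- data.frame(
  UID = c(1, 1, 2, 3, 3),
  DateLastChange = as.Date(
    c("01/01/2016", "01/03/2016", "01/14/2015", "02/15/2014", "03/15/2016"),
    format = "%m/%d/%Y"
  ),
  Status = c("active", "inactive", "active", "active", "inactive"),
  stringsAsFactors = FALSE
)

# filter() keeps whole rows, so Status is retained alongside the latest date per UID
latest <- dat %>%
  group_by(UID) %>%
  filter(DateLastChange == max(DateLastChange)) %>%
  ungroup()
latest
```

If two rows for the same UID share the maximum date, this keeps both of them, so a further tie-breaking step may be needed when exactly one row per UID is required.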
