New to R, but learning to handle db data and hit a wall.

I want to remove duplicate rows/observations from a table based on two criteria: a user ID field and a date field that indicates the last time there was a change to the user. For each user, I want to keep only the row with the most recent date.

My truncated data set would look like the following:

UID    | DateLastChange
1      |  01/01/2016
1      |  01/03/2016
2      |  01/14/2015
3      |  02/15/2014
3      |  03/15/2016

I would like to end up with:

UID    | DateLastChange
1      |  01/03/2016
2      |  01/14/2015
3      |  03/15/2016

I have attempted to use duplicated or unique, but they don't seem to let me be selective about which row is kept. I can imagine building a new table of unique UIDs and then left joining in some way to match only the most recent date.

Any advice would be much appreciated. Scott

  • This is just a duplicated operation if the data is in the order shown: dat[!duplicated(dat$UID, fromLast=TRUE),] Commented Jan 4, 2017 at 3:28
  • Thanks for the edit to the post. As you may have read, that was my first post to SO, so I haven't really figured out how to end up with neat tables. TY. SW. Commented Jan 13, 2017 at 17:47
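
The base R approach from the first comment can be sketched as below. This is a minimal illustration, not a definitive implementation: the data frame name dat follows the comment, the sample rows are copied from the question, and the explicit sort is an added assumption so the result does not depend on the rows already being in chronological order.

```r
# Sample data from the question; `dat` is the name used in the comment above
dat <- data.frame(
  UID = c(1, 1, 2, 3, 3),
  DateLastChange = c("01/01/2016", "01/03/2016", "01/14/2015",
                     "02/15/2014", "03/15/2016"),
  stringsAsFactors = FALSE
)

# Parse the text dates so they sort chronologically rather than alphabetically
dat$DateLastChange <- as.Date(dat$DateLastChange, format = "%m/%d/%Y")

# Sort by UID, then by date ascending, so the newest row per UID comes last
# (added step: the comment assumes the data is already in this order)
dat <- dat[order(dat$UID, dat$DateLastChange), ]

# Keep only the last row seen for each UID
result <- dat[!duplicated(dat$UID, fromLast = TRUE), ]
print(result)
```

With fromLast = TRUE, duplicated() marks every row except the last occurrence of each UID, which is why the ascending sort matters.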

2 Answers

We can use data.table

library(data.table)
setDT(df1)[order(UID, -as.IDate(DateLastChange, "%m/%d/%Y")), head(.SD, 1), by = UID]
#     UID DateLastChange
#1:   1     01/03/2016
#2:   2     01/14/2015
#3:   3     03/15/2016

Or using duplicated

setDT(df1)[order(UID, -as.IDate(DateLastChange, "%m/%d/%Y"))][!duplicated(UID)]

2 Comments

Does this work for randomly ordered DateLastChange entries or do they have to be in chronological order, as in OP's example?
@questionmark In the i, we are ordering the column 'DateLastChange' after converting it to Date class, so it should work for randomly ordered rows. The only change that may be needed is the format %m/%d/%Y, if the dates are not in month/day/year order.
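
To check the point about row order, here is a small sketch that runs the duplicated variant on deliberately shuffled rows. The sample values come from the question; the shuffled order and the name latest are assumptions made for illustration.

```r
library(data.table)

# Same rows as in the question, deliberately shuffled
df1 <- data.table(
  UID = c(3, 1, 2, 1, 3),
  DateLastChange = c("02/15/2014", "01/03/2016", "01/14/2015",
                     "01/01/2016", "03/15/2016")
)

# Convert the text dates once, so later steps work on real dates
df1[, DateLastChange := as.IDate(DateLastChange, format = "%m/%d/%Y")]

# Sort by UID, newest date first, then keep the first row of each UID
latest <- df1[order(UID, -DateLastChange)][!duplicated(UID)]
print(latest)
```

Because the sort happens inside the call, the original order of the rows should not affect which row is kept for each UID.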

Using dplyr - data can be in any order

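library(dplyr)
# Parse the text dates; convert to POSIXct because dplyr does not support POSIXlt columns
dat$DateLastChange <- as.POSIXct(strptime(dat$DateLastChange, "%m/%d/%Y"))
dat %>% group_by(UID) %>% summarize(DateLastChange = max(DateLastChange))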

4 Comments

Epi99, thanks for the quick response. It returns an error indicating that the date format is wrong: "Error in grouped_df_impl(data, unname(vars), drop) : column 'EmploymentStatusChangeDate' has unsupported class : POSIXlt, POSIXt". I attempted to find a way to specify this as POSIXct, but have yet to find the appropriate usage. I see where you are heading with this, however, and will keep plugging away.
Your sample data is in plain text, so the strptime() is there to parse the text date format into a datetime object, so that max() can do a valid comparison. Your data frame may already be in a date or time format, in which case you don't need the line that includes strptime(). This is why I generally recommend using dput to show your example data: your reader can then recreate exactly the data you have.
I will explore dput so we can all have the same apples/apples comparison on the data. Yesterday I worked through converting the date from character to a date and the max() comparison actually worked, so thank you.
This works only for the given example. If you have additional columns that you want to keep, use a filter instead: dat %>% group_by(UID) %>% filter(Date == max(Date)).
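
The filter suggestion in the last comment might look like this on the question's data. This is only a sketch: the Status column is a hypothetical extra column added to show that whole rows are kept, and the column is called DateLastChange here rather than Date to match the question.

```r
library(dplyr)

# Question's data plus a hypothetical extra column (Status) for illustration
dat <- data.frame(
  UID = c(1, 1, 2, 3, 3),
  DateLastChange = as.Date(c("01/01/2016", "01/03/2016", "01/14/2015",
                             "02/15/2014", "03/15/2016"),
                           format = "%m/%d/%Y"),
  Status = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Keep the full row with the latest date in each UID group
latest <- dat %>%
  group_by(UID) %>%
  filter(DateLastChange == max(DateLastChange)) %>%
  ungroup()

print(latest)
```

Note that filter() keeps every row that ties for the latest date within a UID. If exactly one row per UID is required, slice_max(DateLastChange, n = 1, with_ties = FALSE) in dplyr 1.0.0 or later is one alternative.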
