
I have a data frame with columns

    shipment_id     created_at    picked_at   packed_at   shipped_at
    CSDJKH231BN     2019-02-03    2019-02-03    
    CSDJKH231BN     2019-02-03    2019-02-03  2019-02-04  2019-02-05
    CSDJKH2KFJ3     2019-02-01    2019-02-04  2019-02-07  

The database is uploaded to rServer via Google Drive, and it is constantly being updated.

    library(RCurl)  # provides getURL()

    u1 <- "https://docs.google.com/spreadsheets/d/e/<link>"
    tc1 <- getURL(u1, ssl.verifypeer = FALSE)
    x <- read.csv(textConnection(tc1))

Suppose in the first update shipment_id CSDJKH231BN had only reached picked_at, and in the second update from Google Drive we get CSDJKH231BN up to shipped_at. How do I keep only the rows that go up to shipped_at, while also keeping shipment_ids like CSDJKH2KFJ3 that are still being processed and have not shipped yet?

Basically I just want to delete the duplicate entries, but this code is not working for me.

    df <- df[!duplicated(df), ]
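For what it's worth, a small sketch (reconstructing the sample rows above as a hypothetical data frame) of why `duplicated(df)` alone doesn't help: the two CSDJKH231BN rows differ in packed_at and shipped_at, so neither row is an exact duplicate of the other.

```r
# Hypothetical reconstruction of the sample data shown above
df <- data.frame(
  shipment_id = c("CSDJKH231BN", "CSDJKH231BN", "CSDJKH2KFJ3"),
  created_at  = c("2019-02-03", "2019-02-03", "2019-02-01"),
  picked_at   = c("2019-02-03", "2019-02-03", "2019-02-04"),
  packed_at   = c(NA, "2019-02-04", "2019-02-07"),
  shipped_at  = c(NA, "2019-02-05", NA),
  stringsAsFactors = FALSE
)

# The two CSDJKH231BN rows differ in packed_at/shipped_at,
# so duplicated() finds no fully identical rows:
sum(duplicated(df))  # 0, so nothing is removed
```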

Any help would be appreciated.

1 Answer
I think you just need to specify that you're looking for duplicates in shipment_id. However, `duplicated` keeps only the first occurrence of each id, which in your case would be the row with nothing in the shipped_at column. So you should first sort by the shipped_at and packed_at columns (in decreasing order, so that the most complete rows come first and NA values fall to the bottom). Does this work?

    # Sort so rows with later shipped_at/packed_at dates come first
    # (NAs are placed last by default)
    df <- df[order(df[,'shipped_at'], df[,'packed_at'], decreasing=TRUE), ]
    # Keep the first (most complete) row for each shipment_id
    df <- df[!duplicated(df$shipment_id), ]
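On a hypothetical reconstruction of the sample data from the question, this reduces the frame to one row per shipment_id, keeping the fully shipped CSDJKH231BN record as well as the still-unshipped CSDJKH2KFJ3 record:

```r
# Hypothetical sample data matching the question
df <- data.frame(
  shipment_id = c("CSDJKH231BN", "CSDJKH231BN", "CSDJKH2KFJ3"),
  created_at  = c("2019-02-03", "2019-02-03", "2019-02-01"),
  picked_at   = c("2019-02-03", "2019-02-03", "2019-02-04"),
  packed_at   = c(NA, "2019-02-04", "2019-02-07"),
  shipped_at  = c(NA, "2019-02-05", NA),
  stringsAsFactors = FALSE
)

# Sort so the row with a shipped_at date comes first for each id,
# then keep the first row per shipment_id
df <- df[order(df[,'shipped_at'], df[,'packed_at'], decreasing = TRUE), ]
df <- df[!duplicated(df$shipment_id), ]

# Two rows remain: the shipped CSDJKH231BN record and the
# in-progress CSDJKH2KFJ3 record
df
```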
