filtering out rows of data using dates within each row using R

Question

Representative dataset:

ID,drug_start,date_of_bloods,result
1234,NA,12-10-2010,80
1234,NA,12-10-2011,50
1234,NA,12-10-2013,10
4532,05-03-2015,01-01-2013,80
4532,05-03-2015,01-01-2014,60
4532,05-03-2015,01-01-2016,40
7894,04-09-2016,03-08-2012,40
7894,04-09-2016,03-08-2014,38
7894,04-09-2016,02-06-2015,30
7894,04-09-2016,29-10-2016,27
7894,04-09-2016,10-10-2017,26

I would like to filter this to result >20 and those rows where the date_of_bloods is < drug_start within each row. Where NA is in the drug_start column I would like them just filtered on result >20. Desired outcome would be:

ID,drug_start,date_of_bloods,result
1234,NA,12-10-2010,80
1234,NA,12-10-2011,50
4532,05-03-2015,01-01-2013,80
4532,05-03-2015,01-01-2014,60
7894,04-09-2016,03-08-2012,40
7894,04-09-2016,03-08-2014,38
7894,04-09-2016,02-06-2015,30

So far I have tried :

table <- readxl::read_excel("data.xlsx", sheet = "bloods")
table <- as.data.frame(table)
table$result <- as.numeric(table$result)
table<- subset(table, result >20)
table$date_of_bloods <- as.Date(table$date_of_bloods, format="%d/%m/%Y")
table$drug_start <- as.Date(table$drug_start, format="%d/%m/%Y")
table[!(table$drug_start >= table$date_of_bloods),]

The final line does not work and just turns all the rows where drug_start is NA to all columns reading NA and the dates not having filtered.

Your help would be greatly welcomed

Please provide minimal and reproducible example(s). Use dput() for data... — s_baldur
– s_baldur, Commented Jul 2, 2020 at 8:46

Ronak Shah · Accepted Answer · 2020-07-02 08:43:12Z

2

You can convert the dates to actual and date and use subset to select rows where result is greater than 20 and date_of_bloods < drug_start or drug_start is NA.

table$date_of_bloods <- as.Date(table$date_of_bloods, format="%d-%m-%Y")
table$drug_start <- as.Date(table$drug_start, format="%d-%m-%Y")

subset(table, result > 20 & (date_of_bloods < drug_start | is.na(drug_start)))

#    ID drug_start date_of_bloods result
#1 1234       <NA>     2010-10-12     80
#2 1234       <NA>     2011-10-12     50
#4 4532 2015-03-05     2013-01-01     80
#5 4532 2015-03-05     2014-01-01     60
#7 7894 2016-09-04     2012-08-03     40
#8 7894 2016-09-04     2014-08-03     38
#9 7894 2016-09-04     2015-06-02     30

The same logic using dplyr :

library(dplyr)

table %>%
  mutate(across(c(drug_start, date_of_bloods), lubridate::dmy)) %>%
  filter(result > 20 & (date_of_bloods < drug_start | is.na(drug_start)))

answered Jul 2, 2020 at 8:43

Ronak Shah

391k20 gold badges173 silver badges237 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

tacrolimus Over a year ago

thanks Ronak that worked perfectly. To clarify for my understanding when I was using as.Date in my attempt I should have been using '-' instead of '/'? Further to that my filtering didn't work as I was wasn't using subset?

Ronak Shah Over a year ago

In format of as.Date you should specify what format you have. In your data you have format "%d-%m-%Y" with "-" bu you specified %d/%m/%Y" with "/" which turned all the dates to NA and so your next line did not work since all the dates were NA.

tacrolimus Over a year ago

Thank you again for taking the time to help, makes perfect sense! Out of interest would table[!(table$drug_start >= table$date_of_bloods),] then work if dates had been formatted properly?

Ronak Shah Over a year ago

Yes, it would but then it would not select the first 2 NA rows for that you need to add another condition as mentioned in my answer using is.na.

tacrolimus Over a year ago

Thanks again for your help!

Collectives™ on Stack Overflow

filtering out rows of data using dates within each row using R

1 Answer 1

5 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Your Answer

Sign up or log in

Post as a guest

Related