
I have a dataset with the start and end dates of service use for individuals (one row per episode). Sometimes these periods overlap, sometimes they don't. I want to count the number of unique days in a year on which each person has touched a service (using R). I tried the ivs package but ran into issues, because it struggles with rows where the start and end date fall on the same day. How do I count distinct days when the same person has single-day episodes as well as multi-day episodes?

eg_data <- data.frame(
id = c(1,1,1,  2,2,  3,3,3,3,3,3,  4,4,  5,5,5,5),
start_dt = c("01/01/2016", "12/02/2016", "03/12/2017",  "02/01/2016", 
"03/04/2016",  "01/01/2016", "03/05/2016", "05/07/2016", "07/01/2016", 
"09/04/2016", "10/10/2016",  "01/01/2016", "05/28/2016",  "01/01/2016", 
"06/05/2016", "08/25/2016", "11/01/2016"),  
end_dt =   c("12/01/2016", "12/02/2016", "05/15/2017",  "05/15/2016", 
"12/29/2016",  "03/02/2016", "04/29/2016", "06/29/2016", "08/31/2016", 
"03/04/2016", "11/29/2016",  "05/31/2016", "08/19/2016",  "06/10/2016", 
"07/25/2016", "08/25/2016", "12/30/2016"))
eg_data$row_n <- 1:nrow(eg_data)

I tried:

ab <- a %>%
  mutate(
    start_dt = as.Date(ActivityStartDate, format = "%m/%d/%Y"),
    end_dt = as.Date(ActivityEndDate, format = "%m/%d/%Y")
  ) %>%
  mutate(
    range = iv(start_dt, end_dt),
    .keep = "unused"
  )


c <-ab %>%
  group_by(ID) %>%
  mutate(group = iv_identify_group(range)) %>%
  group_by(group, .add = TRUE)

But this doesn't work for records where the start and end date are on the same day. I'd also like the output to be a data frame with date variables, not a vector, so that I can calculate the total number of days with activity (without counting the same day more than once).

3 Answers


One approach is to filter the data for each id, build the date sequence for each row, combine the sequences, and count the unique dates. I'm not sure what you meant about needing the output as a data frame with date variables, but I converted the result to a data frame in the hope that it is close to what you are after. Note that row 10 of your data has a start date after its end date, which needs fixing before the following will work. I assumed the two dates were back-to-front.

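library(dplyr)

DayTotals <- sapply(seq_along(unique(eg_data$id)), function(id_index) {
  Current_id <- unique(eg_data$id)[id_index]
  Current_id_data <- eg_data %>% filter(id == Current_id)
  # One daily date sequence per episode (row) of this id
  Current_id_dates <- apply(Current_id_data, 1, function(row) {
    seq.Date(from = as.Date(row['start_dt'], format = "%m/%d/%Y"),
             to = as.Date(row['end_dt'], format = "%m/%d/%Y"),
             by = "day")
  })
  # as.vector() guards against apply() returning a matrix when all sequences
  # have the same length; unique() would then compare rows, not single dates
  Current_id_dates %>% unlist %>% as.vector %>% unique %>% length
})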

DayTotalsDF <- data.frame(id=unique(eg_data$id),
                          NoOfDays=DayTotals)

> DayTotalsDF
  id NoOfDays
1  1      402
2  2      333
3  3      298
4  4      232
5  5      268
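If a data frame with an actual date column is what's wanted, here is a minimal base-R sketch of the same idea: expand each episode into one row per day, drop repeated id/day pairs, and then count per calendar year. The names `eg_small`, `expand_days()`, `days_long` and `n_days` are made up for illustration, and the data is only the first five rows of the example, already converted to dates.

```r
# Illustrative subset of the example data (ids 1 and 2 only).
eg_small <- data.frame(
  id       = c(1, 1, 1, 2, 2),
  start_dt = as.Date(c("2016-01-01", "2016-12-02", "2017-03-12",
                       "2016-02-01", "2016-03-04")),
  end_dt   = as.Date(c("2016-12-01", "2016-12-02", "2017-05-15",
                       "2016-05-15", "2016-12-29"))
)

# Hypothetical helper: one row per id and day, with overlapping days kept once.
expand_days <- function(d) {
  per_row <- lapply(seq_len(nrow(d)), function(i) {
    data.frame(id  = d$id[i],
               day = seq(d$start_dt[i], d$end_dt[i], by = "day"))
  })
  unique(do.call(rbind, per_row))
}

days_long <- expand_days(eg_small)             # `day` is still a Date column
days_long$year   <- format(days_long$day, "%Y")
days_long$n_days <- 1L

# Distinct days of service per id and calendar year.
per_year <- aggregate(n_days ~ id + year, data = days_long, FUN = sum)
per_year
```

For this subset, id 1 ends up with 337 distinct days in 2016 and 65 in 2017 (402 in total, matching the table above), and id 2 with 333 days in 2016. Expanding to one row per day is simple but memory-hungry for long episodes, so treat it as a sketch rather than the recommended route for large data.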

The min and max of each row's two dates are taken first, so that start <= end always holds. mapply() then calls seq.Date() to generate a daily sequence for each row. The sequences are combined into a single vector with unlist(), and duplicate dates are removed with unique(). The length of that vector is the total number of days with activity for each id.

see: https://www.mycompiler.io/view/Jau8tbboisq

library(dplyr)

eg_data <- data.frame(
id = c(1,1,1,  2,2,  3,3,3,3,3,3,  4,4,  5,5,5,5),
start_dt = c("01/01/2016", "12/02/2016", "03/12/2017",  "02/01/2016", 
"03/04/2016",  "01/01/2016", "03/05/2016", "05/07/2016", "07/01/2016", 
"09/04/2016", "10/10/2016",  "01/01/2016", "05/28/2016",  "01/01/2016", 
"06/05/2016", "08/25/2016", "11/01/2016"),  
end_dt =   c("12/01/2016", "12/02/2016", "05/15/2017",  "05/15/2016", 
"12/29/2016",  "03/02/2016", "04/29/2016", "06/29/2016", "08/31/2016", 
"03/04/2016", "11/29/2016",  "05/31/2016", "08/19/2016",  "06/10/2016", 
"07/25/2016", "08/25/2016", "12/30/2016"))
eg_data$row_n <- 1:nrow(eg_data)

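eg_data %>%
  mutate(
    start_dt = as.Date(start_dt, format = "%m/%d/%Y"),
    end_dt = as.Date(end_dt, format = "%m/%d/%Y"),
    # Swap reversed dates so that min_date <= max_date on every row
    min_date = pmin(start_dt, end_dt),
    max_date = pmax(start_dt, end_dt)
  ) %>%
  group_by(id) %>%
  summarize(
    # SIMPLIFY = FALSE keeps a list even when all sequences have equal length;
    # a simplified matrix would make unique() compare rows instead of dates
    total_days = length(unique(unlist(
      mapply(seq.Date, min_date, max_date, by = "day", SIMPLIFY = FALSE)
    )))
  )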

The result is:

     id total_days
  <dbl>      <int>
1     1        402
2     2        333
3     3        298
4     4        232
5     5        268

If this isn't the result you wanted, please provide the expected result and, if possible, explain how you arrived at it (from the sample data only).

2 Comments

This solution has been working great for me. If I now wanted to do it by subgroup, how would I add this? When I do the following, it doesn't seem to count unique days anymore but all days:

eg_data %>%
  mutate(
    start_dt = as.Date(start_dt, format = "%m/%d/%Y"),
    end_dt = as.Date(end_dt, format = "%m/%d/%Y"),
    min_date = pmin(start_dt, end_dt),
    max_date = pmax(start_dt, end_dt)
  ) %>%
  group_by(id) %>%
  summarize(
    total_days = length(unique(unlist(mapply(seq.Date, min_date, max_date, by = "day")))[ActivityTypeCode=='T01'])
  )
btw: you might be able to amend/use mycompiler.io/view/Jau8tbboisq to help explain what you mean by "subgroup"
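One possible reading of "subgroup" is a count per id and activity type. In the snippet from the comment, the `[ActivityTypeCode=='T01']` index is applied to the vector of unique dates, which has a different length from the group's rows, so it does not select T01 episodes. A hedged sketch of an alternative follows: either add the subgroup column to `group_by()`, or `filter()` before grouping. The data `eg_sub`, its `ActivityTypeCode` values and the helper `count_days()` are invented for illustration.

```r
library(dplyr)

# Hypothetical data: ActivityTypeCode is assumed to be a column of the real data.
eg_sub <- data.frame(
  id               = c(1, 1, 1),
  ActivityTypeCode = c("T01", "T01", "T02"),
  start_dt         = as.Date(c("2016-01-01", "2016-01-03", "2016-01-04")),
  end_dt           = as.Date(c("2016-01-05", "2016-01-08", "2016-01-06"))
)

# Illustrative helper: number of distinct days covered by a set of episodes.
count_days <- function(from, to) {
  days <- mapply(seq.Date, from, to, by = "day", SIMPLIFY = FALSE)
  length(unique(unlist(days)))
}

# Distinct days per id AND activity type.
by_type <- eg_sub %>%
  group_by(id, ActivityTypeCode) %>%
  summarize(total_days = count_days(start_dt, end_dt), .groups = "drop")

# Distinct days for a single activity type: subset the rows first.
t01_only <- eg_sub %>%
  filter(ActivityTypeCode == "T01") %>%
  group_by(id) %>%
  summarize(total_days = count_days(start_dt, end_dt))
```

With this toy data the two T01 episodes overlap and cover 8 distinct days, while T02 covers 3.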

I think this is actually a fantastic case for ivs; you just need to shift your thinking from closed intervals like [ ] to half-open ones like [ ). All you need to do is add 1 to your end dates, which "just works" in this case.

Using half-open intervals also ends up making the math work out nicely too.
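To make the arithmetic concrete, here is a small base-R sketch (no ivs needed) using two episodes from the example data, written in ISO format. The variable names are illustrative only.

```r
# Closed form [start, end]: both endpoints are days of service.
start <- as.Date(c("2016-12-02", "2016-01-01"))
end   <- as.Date(c("2016-12-02", "2016-12-01"))

# Half-open form [start, end + 1): the end is the first day NOT covered.
end_open <- end + 1L

# The length in days is now a plain difference, and the single-day
# episode has length 1 instead of 0.
as.integer(end_open - start)
#> [1]   1 336
```

With closed endpoints the same subtraction would give 0 and 335, which is why a single-day episode looks like an empty interval until the end date is shifted by one.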

(This requires dplyr 1.1.0 or higher)

library(dplyr, warn.conflicts = FALSE)
library(ivs)

df <- tibble(
  id = c(1,1,1,  2,2,  3,3,3,3,3,3,  4,4,  5,5,5,5),
  start_dt = c(
    "01/01/2016", "12/02/2016", "03/12/2017", "02/01/2016", 
    "03/04/2016", "01/01/2016", "03/05/2016", "05/07/2016", 
    "07/01/2016", "09/04/2016", "10/10/2016", "01/01/2016",
    "05/28/2016", "01/01/2016", "06/05/2016", "08/25/2016", 
    "11/01/2016"
  ),  
  end_dt = c(
    "12/01/2016", "12/02/2016", "05/15/2017", "05/15/2016", 
    "12/29/2016", "03/02/2016", "04/29/2016", "06/29/2016",
    "08/31/2016", "09/04/2016", "11/29/2016", "05/31/2016", 
    "08/19/2016", "06/10/2016", "07/25/2016", "08/25/2016", 
    "12/30/2016"
  )
)

df <- df %>%
  mutate(
    start_dt = as.Date(start_dt, format = "%m/%d/%Y"),
    end_dt = as.Date(end_dt, format = "%m/%d/%Y") + 1L
  ) %>%
  mutate(
    range = iv(start_dt, end_dt),
    .keep = "unused"
  )

df %>%
  reframe(range = iv_groups(range), .by = id) %>%
  mutate(days = as.integer(iv_end(range) - iv_start(range))) %>%
  summarise(count = sum(days), .by = id)
#> # A tibble: 5 × 2
#>      id count
#>   <dbl> <int>
#> 1     1   402
#> 2     2   333
#> 3     3   286
#> 4     4   232
#> 5     5   268

You'll notice that my answer for id 3 is different from other solutions. That is because I think you have a typo in your original dataset in row 10, where the end date is significantly before the start date:

df[10,]
#> # A tibble: 1 × 3
#>      id start_dt   end_dt    
#>   <dbl> <chr>      <chr>     
#> 1     3 09/04/2016 03/04/2016

ivs detected this automatically for me:

#> Error in `mutate()`:
#> ℹ In argument: `range = iv(start_dt, end_dt)`.
#> Caused by error in `iv()`:
#> ! `start` must be less than `end`.
#> ℹ `start` is not less than `end` at locations: `10`.
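If you would rather find such rows before building the intervals, a plain comparison of the two date vectors is enough. This is a minimal sketch on three of the id 3 episodes, including the reversed one; the `fixed_*` names are made up, and swapping the dates is only one possible repair (an assumption about what the data entry meant, which may not be right for your data).

```r
start_dt <- as.Date(c("07/01/2016", "09/04/2016", "10/10/2016"), format = "%m/%d/%Y")
end_dt   <- as.Date(c("08/31/2016", "03/04/2016", "11/29/2016"), format = "%m/%d/%Y")

# Positions where the end date precedes the start date.
reversed <- which(end_dt < start_dt)
reversed
#> [1] 2

# One possible repair, assuming the two dates were simply entered back-to-front.
fixed_start <- pmin(start_dt, end_dt)
fixed_end   <- pmax(start_dt, end_dt)
```

Whether to swap, correct by hand, or drop the row depends on what the source records say, which is why the count for id 3 differs between answers.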
