Identify number of unique days of service use by group

Question

I have a dataset with the start and end dates of service use for individuals, one row per episode. Sometimes these periods overlap, sometimes they don't. I want to count the number of unique days in a year on which each person used a service (using R). I tried the ivs package but ran into issues, because it struggles with rows where the start and end date fall on the same day. How do I count distinct days when the same person has single-day episodes as well as multi-day episodes?

eg_data <- data.frame(
id = c(1,1,1,  2,2,  3,3,3,3,3,3,  4,4,  5,5,5,5),
start_dt = c("01/01/2016", "12/02/2016", "03/12/2017",  "02/01/2016", 
"03/04/2016",  "01/01/2016", "03/05/2016", "05/07/2016", "07/01/2016", 
"09/04/2016", "10/10/2016",  "01/01/2016", "05/28/2016",  "01/01/2016", 
"06/05/2016", "08/25/2016", "11/01/2016"),  
end_dt =   c("12/01/2016", "12/02/2016", "05/15/2017",  "05/15/2016", 
"12/29/2016",  "03/02/2016", "04/29/2016", "06/29/2016", "08/31/2016", 
"03/04/2016", "11/29/2016",  "05/31/2016", "08/19/2016",  "06/10/2016", 
"07/25/2016", "08/25/2016", "12/30/2016"))
eg_data$row_n <- 1:nrow(eg_data)

Tried

ab <- a %>%
  mutate(
    start_dt = as.Date(ActivityStartDate, format = "%m/%d/%Y"),
    end_dt = as.Date(ActivityEndDate, format = "%m/%d/%Y")
  ) %>%
  mutate(
    range = iv(start_dt, end_dt),
    .keep = "unused"
  )


c <-ab %>%
  group_by(ID) %>%
  mutate(group = iv_identify_group(range)) %>%
  group_by(group, .add = TRUE)

But this doesn't work for records where the start and end dates are on the same day. I also want the output to be a data frame with date variables, not a vector, so that I can calculate the total number of days with activity without counting the same day more than once.

Rainfall.NZ · Accepted Answer · 2023-03-22 06:39:12Z

One approach is to filter the data for each id, build the date sequence for each row, combine the sequences, and count the unique dates. I'm not sure what you meant about needing the output as a data frame with date variables, but I converted the result to a data frame in the hope that it is close to what you are after. Note that in your data, row 10 has a start date after its end date, so that needs fixing before the following will work. I assumed the two dates were back to front.

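library(dplyr)

# For each id, expand every episode to its daily dates and count the distinct days.
DayTotals <- sapply(seq_along(unique(eg_data$id)), function(id_index) {
  Current_id <- unique(eg_data$id)[id_index]
  Current_id_data <- eg_data %>% filter(id == Current_id)
  # One date sequence per episode; lapply() always returns a list, whereas
  # apply() returns a matrix when all episodes have the same length.
  Current_id_dates <- lapply(seq_len(nrow(Current_id_data)), function(i) {
    seq.Date(from = as.Date(Current_id_data$start_dt[i], format = "%m/%d/%Y"),
             to = as.Date(Current_id_data$end_dt[i], format = "%m/%d/%Y"),
             by = "day")})
  # unlist() turns the dates into day numbers, which is enough for counting.
  Current_id_No_Of_Days <- Current_id_dates %>% unlist %>% unique %>% length
})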

DayTotalsDF <- data.frame(id=unique(eg_data$id),
                          NoOfDays=DayTotals)

> DayTotalsDF
  id NoOfDays
1  1      402
2  2      333
3  3      298
4  4      232
5  5      268

Paul Maxwell · Accepted Answer · 2023-03-22 07:24:52Z

The minimum and maximum of each date pair are computed so that start <= end always holds. mapply() then calls seq.Date() to generate a daily sequence for each episode. The sequences are combined into one vector with unlist(), and duplicate dates are removed with unique(). The length of the resulting vector is the total number of days with activity for each id.

see: https://www.mycompiler.io/view/Jau8tbboisq

library(dplyr)

eg_data <- data.frame(
id = c(1,1,1,  2,2,  3,3,3,3,3,3,  4,4,  5,5,5,5),
start_dt = c("01/01/2016", "12/02/2016", "03/12/2017",  "02/01/2016", 
"03/04/2016",  "01/01/2016", "03/05/2016", "05/07/2016", "07/01/2016", 
"09/04/2016", "10/10/2016",  "01/01/2016", "05/28/2016",  "01/01/2016", 
"06/05/2016", "08/25/2016", "11/01/2016"),  
end_dt =   c("12/01/2016", "12/02/2016", "05/15/2017",  "05/15/2016", 
"12/29/2016",  "03/02/2016", "04/29/2016", "06/29/2016", "08/31/2016", 
"03/04/2016", "11/29/2016",  "05/31/2016", "08/19/2016",  "06/10/2016", 
"07/25/2016", "08/25/2016", "12/30/2016"))
eg_data$row_n <- 1:nrow(eg_data)

eg_data %>%
  mutate(
    start_dt = as.Date(start_dt, format = "%m/%d/%Y"),
    end_dt = as.Date(end_dt, format = "%m/%d/%Y"),
    min_date = pmin(start_dt, end_dt),
    max_date = pmax(start_dt, end_dt)
  ) %>%
  group_by(id) %>%
  summarize(
    total_days = length(unique(unlist(mapply(seq.Date, min_date, max_date, by = "day"))))
  )

The result is:

     id total_days
  <dbl>      <int>
1     1        402
2     2        333
3     3        298
4     4        232
5     5        268

If this isn't the result you want, please provide the expected result and, if possible, explain how you arrived at it (using the sample data only).

edited Mar 22, 2023 at 7:24

answered Mar 22, 2023 at 6:39

2 Comments

Linda P Over a year ago

This solution has been working great for me. If I now wanted to do it by subgroup, how would I add that? When I try the following, it no longer seems to count unique days but all days:

eg_data %>%
  mutate(
    start_dt = as.Date(start_dt, format = "%m/%d/%Y"),
    end_dt = as.Date(end_dt, format = "%m/%d/%Y"),
    min_date = pmin(start_dt, end_dt),
    max_date = pmax(start_dt, end_dt)
  ) %>%
  group_by(id) %>%
  summarize(
    total_days = length(unique(unlist(mapply(seq.Date, min_date, max_date, by = "day")))[ActivityTypeCode=='T01'])
  )

Paul Maxwell Over a year ago

btw: you might be able to amend/use mycompiler.io/view/Jau8tbboisq to help explain what you mean by "subgroup"
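
If "subgroup" means a further column such as ActivityTypeCode, one possible reading is to add that column to group_by() rather than indexing the result. In the attempt above, the [ActivityTypeCode=='T01'] index is applied to the vector of days, not to the episodes, so it does not drop any rows before the days are generated. The sketch below is only an illustration: ActivityTypeCode is named in the comment but is not in the sample data, so the sub_data frame and its values are invented.

```r
library(dplyr)

# Hypothetical data: ActivityTypeCode comes from the comment; the values are made up.
sub_data <- data.frame(
  id = c(1, 1, 1, 2, 2),
  ActivityTypeCode = c("T01", "T01", "T02", "T01", "T02"),
  start_dt = c("01/01/2016", "01/05/2016", "01/03/2016", "02/01/2016", "02/01/2016"),
  end_dt   = c("01/10/2016", "01/12/2016", "01/03/2016", "02/03/2016", "02/02/2016")
)

subgroup_days <- sub_data %>%
  mutate(
    start_dt = as.Date(start_dt, format = "%m/%d/%Y"),
    end_dt   = as.Date(end_dt, format = "%m/%d/%Y"),
    min_date = pmin(start_dt, end_dt),
    max_date = pmax(start_dt, end_dt)
  ) %>%
  # Group by person and subgroup, so days are de-duplicated within each pair.
  group_by(id, ActivityTypeCode) %>%
  summarize(
    total_days = length(unique(unlist(
      mapply(seq.Date, min_date, max_date, by = "day", SIMPLIFY = FALSE)
    ))),
    .groups = "drop"
  )

subgroup_days
```

To count days for a single code only, filter(ActivityTypeCode == "T01") before group_by(id) should have the same effect for that code.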

Davis Vaughan · Accepted Answer · 2023-03-22 14:26:54Z

I think this is actually a fantastic case for ivs, you just need to adjust your thinking a little bit to shift from closed intervals like [ ] to half-open ones like [ ). All you need to do is add 1 to your end dates, which "just works" in this case.

Using half-open intervals also ends up making the math work out nicely too.
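
To see why the arithmetic gets simpler, here is a small base R check (no packages needed) on a single-day episode like row 2 of the sample data. The variable names are only for illustration.

```r
# A single-day episode on 2 December 2016, as in row 2 of the sample data.
start <- as.Date("12/02/2016", format = "%m/%d/%Y")
end   <- as.Date("12/02/2016", format = "%m/%d/%Y")

# Closed interval [start, end]: the day count needs an explicit + 1.
closed_days <- as.integer(end - start) + 1L

# Half-open interval [start, end + 1): the day count is a plain difference,
# and start is strictly less than the new end, so the interval is not empty.
half_open_end  <- end + 1L
half_open_days <- as.integer(half_open_end - start)

c(closed = closed_days, half_open = half_open_days)
```

Both counts come out as one day, but only the half-open form gets there without a correction term.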

(This requires dplyr 1.1.0 or higher)

library(dplyr, warn.conflicts = FALSE)
library(ivs)

df <- tibble(
  id = c(1,1,1,  2,2,  3,3,3,3,3,3,  4,4,  5,5,5,5),
  start_dt = c(
    "01/01/2016", "12/02/2016", "03/12/2017", "02/01/2016", 
    "03/04/2016", "01/01/2016", "03/05/2016", "05/07/2016", 
    "07/01/2016", "09/04/2016", "10/10/2016", "01/01/2016",
    "05/28/2016", "01/01/2016", "06/05/2016", "08/25/2016", 
    "11/01/2016"
  ),  
  end_dt = c(
    "12/01/2016", "12/02/2016", "05/15/2017", "05/15/2016", 
    "12/29/2016", "03/02/2016", "04/29/2016", "06/29/2016",
    "08/31/2016", "09/04/2016", "11/29/2016", "05/31/2016", 
    "08/19/2016", "06/10/2016", "07/25/2016", "08/25/2016", 
    "12/30/2016"
  )
)

df <- df %>%
  mutate(
    start_dt = as.Date(start_dt, format = "%m/%d/%Y"),
    end_dt = as.Date(end_dt, format = "%m/%d/%Y") + 1L
  ) %>%
  mutate(
    range = iv(start_dt, end_dt),
    .keep = "unused"
  )

df %>%
  reframe(range = iv_groups(range), .by = id) %>%
  mutate(days = as.integer(iv_end(range) - iv_start(range))) %>%
  summarise(count = sum(days), .by = id)
#> # A tibble: 5 × 2
#>      id count
#>   <dbl> <int>
#> 1     1   402
#> 2     2   333
#> 3     3   286
#> 4     4   232
#> 5     5   268

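You'll notice that my answer for id 3 is different from the other solutions. That is because I think you have a typo in row 10 of your original dataset, where the end date is significantly before the start date. In the data above I set that end date to 09/04/2016 (a single-day episode), whereas the other answers swap the two dates. The original row looks like this: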

df[10,]
#> # A tibble: 1 × 3
#>      id start_dt   end_dt    
#>   <dbl> <chr>      <chr>     
#> 1     3 09/04/2016 03/04/2016

ivs detected this automatically for me:

#> Error in `mutate()`:
#> ℹ In argument: `range = iv(start_dt, end_dt)`.
#> Caused by error in `iv()`:
#> ! `start` must be less than `end`.
#> ℹ `start` is not less than `end` at locations: `10`.
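
The same kind of check can be done by hand in base R before building any intervals. This is a minimal sketch using the question's original sample values; whether a reversed pair should be swapped or corrected is a decision about the data, not something the code can settle.

```r
# Original sample values from the question (row 10 is left as posted).
start_dt <- c("01/01/2016", "12/02/2016", "03/12/2017", "02/01/2016",
              "03/04/2016", "01/01/2016", "03/05/2016", "05/07/2016",
              "07/01/2016", "09/04/2016", "10/10/2016", "01/01/2016",
              "05/28/2016", "01/01/2016", "06/05/2016", "08/25/2016",
              "11/01/2016")
end_dt   <- c("12/01/2016", "12/02/2016", "05/15/2017", "05/15/2016",
              "12/29/2016", "03/02/2016", "04/29/2016", "06/29/2016",
              "08/31/2016", "03/04/2016", "11/29/2016", "05/31/2016",
              "08/19/2016", "06/10/2016", "07/25/2016", "08/25/2016",
              "12/30/2016")

start <- as.Date(start_dt, format = "%m/%d/%Y")
end   <- as.Date(end_dt, format = "%m/%d/%Y")

# Row numbers where the episode ends before it starts.
reversed <- which(start > end)
reversed
```

Same-day episodes (start equal to end) are not flagged here, since they are valid closed intervals.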
