R filter a variable by the count of another variable, but only counted within a day interval max

Question

Here is the dataframe that I am working with:

df <- tribble(
  ~Patient, ~date, ~Doctor,
  "A", "2020-01-01", "A",
  "A", "2020-03-01", "A",
  "A", "2020-04-30", "B",
  "A", "2020-06-29", "C",
  "A", "2020-08-28", "A",
  "B", "2020-01-01", "A",
  "B", "2020-03-01","B",
  "B", "2020-04-30","B",
  "B", "2020-06-29","B",
  "B", "2020-08-28","C",
  "C", "2020-04-30","A",
  "C", "2020-06-29","A",
  "C", "2020-08-28","B",
  "C", "2020-10-27","C",
  "C", "2020-12-26","A",
)

As you can see, there are three columns: Patient, date, and Doctor.

Here is the desired dataframe that I am working towards.

desired_df <- tribble(
  ~Patient, ~Number_of_Diff_Doctors_within_180_days, 
  "A", "3", 
  "B", "2", 
  "C", "3", 
)

Here is the logic: I'm trying to return a dataframe with one unique value for each patient and the number of doctors that that patient has seen in a 180-day window. This 180-day period is like a moving window, and the job is to figure out the maximum number of doctors seen during any 180-day window for the patient.

In the example, Patient A sees three different doctors (A, B, and C) between 2020-03-01 and 2020-06-29, a span of less than 180 days, so this patient gets a value of 3. Patient B also sees three doctors overall, but sees Doctor A on 2020-01-01 and Doctor C on 2020-08-28, which are more than 180 days apart, so Patient B has at most two doctors in any 180-day window. Patient C follows the same pattern as Patient A, except the dates are shifted forward.
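
As a quick check of the date arithmetic in the example, the two gaps can be computed in base R. The variable names gap_a and gap_b are only illustrative:

```r
# Days between Patient A's visits to Doctor A (2020-03-01) and Doctor C (2020-06-29)
gap_a <- as.numeric(as.Date("2020-06-29") - as.Date("2020-03-01"))

# Days between Patient B's visits to Doctor A (2020-01-01) and Doctor C (2020-08-28)
gap_b <- as.numeric(as.Date("2020-08-28") - as.Date("2020-01-01"))

print(c(gap_a = gap_a, gap_b = gap_b))  # 120 and 240
```

The first gap fits inside a 180-day window and the second does not.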

Here is my attempt so far. It doesn't do anything about the date logic because I didn't know what I was doing with all that.

attempt <- df %>%
  dplyr::select(Patient, Doctor) %>%
  dplyr::group_by(Patient, Doctor) %>%
  distinct() %>%
  dplyr::group_by(Patient) %>%
  tally() %>%
  filter(n > 1)

Please don't post screenshots of data - rather, post data frame/tibble excerpts inline. – andrew_reece, May 26, 2022 at 0:32
Can you state the constraints you're trying to achieve more clearly? Even pseudo-code or a logical expression would be helpful. It's a little hard to understand from your written explanation. – andrew_reece, May 26, 2022 at 0:33
Hi Andrew, I fixed the entire post. I removed the screenshots and replaced them with tables, changed the variables in the data so the problem is more realistic, and rewrote the text, so I hope it's more clearly explained. – hachiko, May 26, 2022 at 18:31

Michael Dewar · Accepted Answer · 2022-05-27 05:16:12Z

Use the runner package for rolling window computations like this. It's wonderful.

library(tidyverse)
library(lubridate)
library(runner)


df <- tribble(
    ~Patient, ~date, ~Doctor,
    "A", "2020-01-01", "A",
    "A", "2020-03-01", "A",
    "A", "2020-04-30", "B",
    "A", "2020-06-29", "C",
    "A", "2020-08-28", "A",
    "B", "2020-01-01", "A",
    "B", "2020-03-01","B",
    "B", "2020-04-30","B",
    "B", "2020-06-29","B",
    "B", "2020-08-28","C",
    "C", "2020-04-30","A",
    "C", "2020-06-29","A",
    "C", "2020-08-28","B",
    "C", "2020-10-27","C",
    "C", "2020-12-26","A",
) %>% 
    mutate(date = ymd(date))

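# Note: runner requires idx (here, date) to be in ascending order within each
# group; if your data is unsorted, run arrange(Patient, date) first.
df %>% 
    group_by(Patient) %>% 
    # Count distinct doctors in a 180-day window indexed by date, at each visit
    mutate(num_docs = runner(Doctor, n_distinct, k = 180, idx = date)) %>% 
    # Keep the largest window count per patient
    summarize(num_docs = max(num_docs))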

# A tibble: 3 × 2
  Patient num_docs
  <chr>      <int>
1 A              3
2 B              2
3 C              3
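
To make the rolling-window idea concrete without extra packages, here is a minimal base R sketch. It assumes the window is the 180 days ending on each visit, inclusive of that visit; runner's exact boundary convention may differ. The helper max_docs_in_window and the visits data frame are hypothetical names used only for illustration:

```r
# Hypothetical helper (not part of runner): for each visit, count the distinct
# doctors seen in the k days ending on that visit, then return the maximum.
max_docs_in_window <- function(dates, doctors, k = 180) {
  dates <- as.Date(dates)
  counts <- vapply(seq_along(dates), function(i) {
    # Assumed window: visits dated in (dates[i] - k, dates[i]]
    in_window <- dates > dates[i] - k & dates <= dates[i]
    length(unique(doctors[in_window]))
  }, integer(1))
  max(counts)
}

# Same example data as in the question
visits <- data.frame(
  Patient = rep(c("A", "B", "C"), each = 5),
  date = as.Date(c(
    "2020-01-01", "2020-03-01", "2020-04-30", "2020-06-29", "2020-08-28",
    "2020-01-01", "2020-03-01", "2020-04-30", "2020-06-29", "2020-08-28",
    "2020-04-30", "2020-06-29", "2020-08-28", "2020-10-27", "2020-12-26"
  )),
  Doctor = c("A", "A", "B", "C", "A",
             "A", "B", "B", "B", "C",
             "A", "A", "B", "C", "A")
)

# Apply the helper per patient; returns a named integer vector
result <- sapply(split(visits, visits$Patient), function(p) {
  max_docs_in_window(p$date, p$Doctor)
})
print(result)
```

This loops over every visit and rescans the patient's rows each time, so it is only meant to show the logic on small data, not to replace runner on tens of millions of rows.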

answered May 27, 2022 at 5:16


10 Comments

hachiko Over a year ago

Hi Michael thanks for the suggestion. Do you think it’s fast? Without exaggeration my file is tens of millions of observations

Michael Dewar Over a year ago

Yes, I think it's fast. It's faster than other rolling window packages I've tried. I've used it successfully to refactor slow code which I applied to 20 million rows of data.

Michael Dewar Over a year ago

You should probably sort your data first. It helps the CPU cache be more efficient.

hachiko Over a year ago

This was a great solution. The only edit I would make is that I had to call arrange(Patient, date) before the group_by; otherwise I got this error: Caused by error in window_run(): ! idx have to be in ascending order

hachiko Over a year ago

Fast, actually, I was impressed!


andrew_reece · Accepted Answer · 2022-05-27 02:09:10Z

Updated solution per OP edits.

First let's get a tidy data frame with cumulative days across a patient's visits:

df2 <- df %>% 
  mutate(date = as.Date(date)) %>% 
  group_by(Patient) %>% 
  mutate(days_btwn = replace_na(day(days(date - lag(date))), 0),
         cum_days = cumsum(days_btwn)) %>% 
  ungroup

Sample df2 output:

# A tibble: 15 × 5
   Patient date       Doctor days_btwn cum_days
   <chr>   <date>     <chr>      <dbl>    <dbl>
 1 A       2020-01-01 A              0        0
 2 A       2020-03-01 A             60       60
 3 A       2020-04-30 B             60      120
 4 A       2020-06-29 C             60      180
 5 A       2020-08-28 A             60      240
 6 B       2020-01-01 A              0        0
#...

Next, we can loop over each Patient (basically a group-by operation), and iteratively sample the rolling windows of visit periods. Compute the max number of unique Doctor values in each window where the total number of days is <= 180, and combine all patients' results in one data frame.


unique(df2$Patient) %>% 
  map_dfr(function(pat) {
    this_pat <- df2 %>% filter(Patient == pat)
    n_obs <- nrow(this_pat)
    max_docs <- n_distinct(this_pat$Doctor)
    n_docs <- 0
    max_win_docs <- 0
    for (i in 1:n_obs) {
      for (j in 1:n_obs) {
        win_days <- abs(this_pat$cum_days[j] - this_pat$cum_days[i])
        if (win_days <= 180) {
          n_docs <- n_distinct(this_pat %>% slice(i:j) %>% select(Doctor))
          if (n_docs > max_win_docs) max_win_docs <- n_docs
          if (max_win_docs == max_docs) next
        }
      }
    }
    list(patient = pat, n_diff_docs_within_180 = max_win_docs)
  }
)

Output

# A tibble: 3 × 2
  patient n_diff_docs_within_180
  <chr>                    <int>
1 A                            3
2 B                            2
3 C                            3

edited May 27, 2022 at 2:09

answered May 23, 2022 at 23:19


2 Comments

hachiko Over a year ago

Hi Andrew, thanks so much for the quick reply. I went through your solution carefully and then I had to change my post just a bit to be a little more clear. I added another three lines to the small data frame example and updated the explanation. The problem is, there are actually multiple customer_id values with multiple sales_id values -- these pairings can appear dozens of times. I need to figure out a way to include the customer id value anytime the customer_id appears with a different sales_id within 100 days of each other

andrew_reece Over a year ago

Updated solution. There may be a nicer way to do this with either the zoo or RcppRoll packages, but the conditional on max window size, paired with the unique string count, made it easier to just write out old-fashioned nested loops.

jlhoward · Accepted Answer · 2022-05-27 03:28:20Z

It's a little ambiguous what you mean by "within 180 days". Within 180 days of what date?

This determines the number of distinct doctors visited by each patient within 180 days of each visit.

library(data.table)
setDT(df)[, date:=as.Date(date)]
df[, date.hi:=date+180]
result <- df[df, on=.(Patient, date>=date, date<=date.hi)]
result[, .(count=uniqueN(Doctor)), by=.(Patient, date)]
##     Patient       date count
##  1:       A 2020-01-01     3
##  2:       A 2020-03-01     3
##  3:       A 2020-04-30     3
##  4:       A 2020-06-29     2
##  5:       A 2020-08-28     1
##  6:       B 2020-01-01     2
##  7:       B 2020-03-01     2
##  8:       B 2020-04-30     2
##  9:       B 2020-06-29     2
## 10:       B 2020-08-28     1
## 11:       C 2020-04-30     3
## 12:       C 2020-06-29     3
## 13:       C 2020-08-28     3
## 14:       C 2020-10-27     2
## 15:       C 2020-12-26     1

So, Patient A visited 3 doctors within 180 days of 2020-01-01 (row 1), but only 2 doctors within 180 days of 2020-06-29 (row 4). Obviously, if the dataset ends less than 180 days after a given date we really don't know the number of visits that will occur in that time frame.

The expected result in your question seems to be based off the first visit for each patient. We can extract that as follows:

result[, .(count=uniqueN(Doctor)), by=.(Patient, date)][, .SD[1], by=.(Patient)]
##    Patient       date count
## 1:       A 2020-01-01     3
## 2:       B 2020-01-01     2
## 3:       C 2020-04-30     3

EDIT: based on OP comment. Max count for each patient is given by

result[, .(count=uniqueN(Doctor)), by=.(Patient, date)][
       , .(maxCount=max(count)),   by=.(Patient)]
##    Patient maxCount
## 1:       A        3
## 2:       B        2
## 3:       C        3

Thanks so much for the response. Just to clarify, what I'm looking for is the max number of visits in any 180-day interval.
