
I'm looking for a vectorized solution to the following problem. Each customer can hold one of two products, x or y, at a time. I would like to flag every product x row that is directly followed by a product y row for the same customer. In that case, the to_date of the product x row equals the from_date of the product y row. Here is an example:

customerid = c(rep(1,2),rep(2,3))
product = c("x", "y", "x", "x", "y")
from_date = as.Date(c("2000-01-01", "2000-06-07","2001-02-01","2005-01-01","2005-11-01"))
to_date = as.Date(c("2000-06-07", "2000-10-31","2002-04-01","2005-11-01","2006-01-01"))

df <- data.frame(customerid, product, from_date, to_date)
df

  customerid product  from_date    to_date
1          1       x 2000-01-01 2000-06-07
2          1       y 2000-06-07 2000-10-31
3          2       x 2001-02-01 2002-04-01
4          2       x 2005-01-01 2005-11-01
5          2       y 2005-11-01 2006-01-01

The desired output would look like:

  customerid product  from_date    to_date followed_by_y
1          1       x 2000-01-01 2000-06-07           yes
2          1       y 2000-06-07 2000-10-31            no
3          2       x 2001-02-01 2002-04-01            no
4          2       x 2005-01-01 2005-11-01           yes
5          2       y 2005-11-01 2006-01-01            no

My approach so far is to group the data.frame by customerid with dplyr, but I do not know how to check whether a row's to_date equals the from_date of the next row.

1 Answer

You can check all three conditions at once within each customer group, using lead() to look at the next row:

library(dplyr)

df %>%
  group_by(customerid) %>%
  mutate(followed_by_y = c('no', 'yes')[(product == 'x' &
                                         lead(product) == 'y' &
                                         to_date == lead(from_date)) + 1])

Output:

# A tibble: 5 x 5
# Groups:   customerid [2]
  customerid product from_date  to_date    followed_by_y
       <dbl> <fct>   <date>     <date>     <chr>        
1          1 x       2000-01-01 2000-06-07 yes          
2          1 y       2000-06-07 2000-10-31 no           
3          2 x       2001-02-01 2002-04-01 no           
4          2 x       2005-01-01 2005-11-01 yes          
5          2 y       2005-11-01 2006-01-01 no   
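
The c('no', 'yes')[... + 1] indexing in the mutate() call may look cryptic at first. Here is a minimal base R sketch of what it does, using a made-up logical vector named cond (an illustration only, not part of the original data):

```r
# Hypothetical condition vector: TRUE where a row meets all the checks
cond <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# Adding 1 coerces the logicals to numbers: FALSE + 1 is 1, TRUE + 1 is 2
idx <- cond + 1

# Those numbers index into c('no', 'yes'): 1 picks 'no', 2 picks 'yes'
labels <- c('no', 'yes')[idx]
labels
```

Because the lookup is a single vector indexing operation, it stays fully vectorized.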

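Note that the mutate() call above is essentially the same as the case_when() version below. The one difference is that when a customer's last row has product x, lead() returns NA for it, so the indexing version gives NA while case_when() falls through to 'no':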

library(dplyr)

df %>%
  group_by(customerid) %>%
  mutate(followed_by_y = case_when(
    product == 'x' & lead(product) == 'y' & to_date == lead(from_date) ~ 'yes',
    TRUE ~ 'no')
  )
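
To see how the two versions treat a customer's last row, here is a small sketch on made-up data. The data frame df2 and the column names cond, by_index and by_case_when are assumptions for illustration. Customer 3 has a single x row, so lead() has no next row to return:

```r
library(dplyr)

# Hypothetical data: customer 3 has only one row, with product x
df2 <- data.frame(
  customerid = c(1, 1, 3),
  product    = c("x", "y", "x"),
  from_date  = as.Date(c("2000-01-01", "2000-06-07", "2001-02-01")),
  to_date    = as.Date(c("2000-06-07", "2000-10-31", "2002-04-01"))
)

res <- df2 %>%
  group_by(customerid) %>%
  mutate(
    # NA for customer 3, because lead() returns NA past the end of a group
    cond         = product == 'x' & lead(product) == 'y' & to_date == lead(from_date),
    # Indexing with NA propagates the NA
    by_index     = c('no', 'yes')[cond + 1],
    # case_when() treats an NA condition as not matched, so TRUE ~ 'no' applies
    by_case_when = case_when(cond ~ 'yes', TRUE ~ 'no')
  ) %>%
  ungroup()

res
```

If an x row at the end of a customer's history should count as 'no', the case_when() form is probably the safer choice.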