Add rows based on missing dates within a group

Question

I am trying to add rows to a data frame based on the minimum and maximum date within each group. Suppose this is my original data frame:

df = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01", "2018-02-01","2017-12-01", "2018-02-01")),
            Group = c(1,1,2,2,2,3,3),
            Value = c(100, 200, 150, 125, 200, 150, 175))

Notice that Group 1 has 2 consecutive dates, Group 2 has 3 consecutive dates, and Group 3 is missing the date in the middle (2018-01-01). I'd like to complete the data frame by adding rows for the missing dates, but only for dates that fall between the minimum and maximum date within each group. The completed data frame would look like this:

df_complete = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01", "2018-02-01","2017-12-01","2018-01-01", "2018-02-01")),
            Group = c(1,1,2,2,2,3,3,3),
            Value = c(100, 200, 150, 125, 200, 150,NA, 175))

Only one row was added because Group 3 was missing one date. There was no date added for Group 1 because it had all the dates between its minimum (2017-12-01) and maximum date (2018-01-01).

MKR · Accepted Answer · 2018-03-16 22:26:29Z

13

You can use tidyr::complete together with dplyr. The interval between consecutive dates appears to be one month, so the approach is:

library(dplyr)
library(tidyr)

df %>% group_by(Group) %>%
  complete(Group, Date = seq.Date(min(Date), max(Date), by = "month"))

# A tibble: 8 x 3
# Groups: Group [3]
#   Group Date       Value
#   <dbl> <date>     <dbl>
# 1  1.00 2017-12-01   100
# 2  1.00 2018-01-01   200
# 3  2.00 2017-12-01   150
# 4  2.00 2018-01-01   125
# 5  2.00 2018-02-01   200
# 6  3.00 2017-12-01   150
# 7  3.00 2018-01-01    NA
# 8  3.00 2018-02-01   175
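
To see where the extra row for Group 3 comes from, it may help to look at what seq.Date generates on its own. The following is a minimal sketch using only base R; the variable name `dates` is just for illustration, and the endpoints are Group 3's minimum and maximum dates from the example data.

```r
# Monthly sequence between Group 3's first and last date.
# complete() fills in any of these dates that the group lacks.
dates <- seq.Date(as.Date("2017-12-01"), as.Date("2018-02-01"), by = "month")
print(dates)
# [1] "2017-12-01" "2018-01-01" "2018-02-01"
```

Because 2018-01-01 is in the sequence but not in Group 3's rows, complete() adds it with Value set to NA.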

Data

df = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01",
               "2018-02-01","2017-12-01", "2018-02-01")),
                Group = c(1,1,2,2,2,3,3),
                Value = c(100, 200, 150, 125, 200, 150, 175))


1 Comment

0range Over a year ago

This will only work for numeric group columns that are typecast as numeric double values. If the "Group" column holds, e.g., character strings, it will be typecast as factors and the complete() operation results in a tibble with a row for every factor/time combination for each group.

0range · Accepted Answer · 2020-03-24 19:43:30Z

2

@MKR's approach of using tidyr::complete with dplyr is good, but it fails if the group column is not numeric. The column is then typecast to a factor, and complete() returns a tibble with a row for every factor/time combination for each group.

complete() does not need the group variable as its first argument, so the solution is:

library(dplyr)
library(tidyr)

df %>% group_by(Group) %>%
  complete(Date = seq.Date(min(Date), max(Date), by = "month"))
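
As a rough check of the non-numeric case, the sketch below reruns the same pipeline on a copy of the example data whose Group column holds character labels. The name `df_chr` and the labels "a", "b", "c" are made up for illustration, and the trailing ungroup() is an optional addition that is not part of the original solution.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, but with character group labels
# (hypothetical labels, used only for this illustration).
df_chr <- data.frame(
  Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01",
                   "2018-02-01", "2017-12-01", "2018-02-01")),
  Group = c("a", "a", "b", "b", "b", "c", "c"),
  Value = c(100, 200, 150, 125, 200, 150, 175),
  stringsAsFactors = FALSE
)

# Complete the dates within each group; Group is not passed to complete().
df_chr_complete <- df_chr %>%
  group_by(Group) %>%
  complete(Date = seq.Date(min(Date), max(Date), by = "month")) %>%
  ungroup()

nrow(df_chr_complete)              # 8 rows: one added for group "c"
sum(is.na(df_chr_complete$Value))  # 1 missing Value, for 2018-01-01
```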
