I am trying to add rows to a data frame based on the minimum and maximum date within each group. Suppose this is my original data frame:

df = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01", "2018-02-01","2017-12-01", "2018-02-01")),
            Group = c(1,1,2,2,2,3,3),
            Value = c(100, 200, 150, 125, 200, 150, 175))

Notice that Group 1 has two consecutive dates, Group 2 has three consecutive dates, and Group 3 is missing the date in the middle (2018-01-01). I'd like to complete the data frame by adding rows for missing dates, but only for dates that fall between the minimum and maximum date within each group. The completed data frame would look like this:

df_complete = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01", "2018-02-01","2017-12-01","2018-01-01", "2018-02-01")),
            Group = c(1,1,2,2,2,3,3,3),
            Value = c(100, 200, 150, 125, 200, 150,NA, 175))

Only one row was added because Group 3 was missing one date. There was no date added for Group 1 because it had all the dates between its minimum (2017-12-01) and maximum date (2018-01-01).

2 Answers

You can use tidyr::complete together with dplyr. The interval between consecutive dates appears to be one month, so the approach is:

library(dplyr)
library(tidyr)

df %>% group_by(Group) %>%
  complete(Group, Date = seq.Date(min(Date), max(Date), by = "month"))

# A tibble: 8 x 3
# Groups: Group [3]
#   Group Date       Value
#   <dbl> <date>     <dbl>
# 1  1.00 2017-12-01   100
# 2  1.00 2018-01-01   200
# 3  2.00 2017-12-01   150
# 4  2.00 2018-01-01   125
# 5  2.00 2018-02-01   200
# 6  3.00 2017-12-01   150
# 7  3.00 2018-01-01    NA
# 8  3.00 2018-02-01   175

Data

df = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01",
               "2018-02-01","2017-12-01", "2018-02-01")),
                Group = c(1,1,2,2,2,3,3),
                Value = c(100, 200, 150, 125, 200, 150, 175))
1 Comment

This will only work for numeric group columns that are typecast as numeric double values. If the "Group" column holds, e.g., character strings, it will be typecast as factors and the complete() operation results in a tibble with a row for every factor/time combination for each group.

@MKR's approach of using tidyr::complete with dplyr is good, but will fail if the group column is not numeric. It will then be typecast as factors and the complete() operation will then result in a tibble with a row for every factor/time combination for each group.

On a grouped data frame, complete() does not need the grouping variable as an argument, so the solution is:

library(dplyr)
library(tidyr)

df %>% group_by(Group) %>%
  complete(Date = seq.Date(min(Date), max(Date), by = "month"))
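
To illustrate the non-numeric case, here is a minimal sketch that applies the same call to a variant of the question's data. The data frame df_chr and its character labels "a", "b", "c" are hypothetical and introduced only for this example; a recent version of dplyr and tidyr is assumed.

```r
library(dplyr)
library(tidyr)

# Hypothetical variant of the question's data: Group holds character labels
df_chr <- data.frame(
  Date  = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01",
                    "2018-02-01", "2017-12-01", "2018-02-01")),
  Group = c("a", "a", "b", "b", "b", "c", "c"),
  Value = c(100, 200, 150, 125, 200, 150, 175)
)

# Complete the monthly sequence between each group's own min and max date
result <- df_chr %>%
  group_by(Group) %>%
  complete(Date = seq.Date(min(Date), max(Date), by = "month")) %>%
  ungroup()

# Rows per group after completion
count(result, Group)
```

If this behaves as described above, only group "c" should gain a row (for 2018-01-01, with Value set to NA), while groups "a" and "b" stay unchanged.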
