0

I have a panel dataset in R, which includes observations per group over time (month). The following dataframe is a snapshot of the complete dataframe:

df <- data.frame(group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),month = c("January", "January", "January", "February", "February", "February", "March", "March", "March", "January", "January", "February", "February", "March", "March"),first_value = c("A","BC","D", NA,NA,NA, "D","G","H", "K","L", NA,NA, "DE","GH"),second_value = c(1,5,7, NA,NA,NA, 2,3,9, 7,1, NA,NA, 4,4))

The dataset is already arranged by group and time. As you can see, observations ("first_value*"* and *"*second_value") can be completely empty for a group in a given month (here February, but can be any month except the first and the last month for every group). What I want to achieve is that the empty months are filled with the last non-empty previous month within a group.

I want to get the following dataframe:

df_filled <- data.frame(group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),month = c("January", "January", "January", "February", "February", "February", "March", "March", "March", "January", "January", "February", "February", "March", "March"),first_value = c("A","BC","D", "A","BC","D", "D","G","H", "K","L", "K","L", "DE","GH"),second_value = c(1,5,7, 1,5,7, 2,3,9, 7,1, 7,1, 4,4))

Please note that, by construction, the last non-empty previous month always has the same number of observations than the following empty months.

I tried different commands with fill() from the dplyr package and na.locf () from the zoo package but all I achieved was filling down the last row of the last non-empty previous month, so that

df_filled <- data.frame(group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), month = c("January", "January", "January", "February", "February", "February", "March", "March", "March", "January", "January", "February", "February", "March", "March"), first_value = c("A","BC","D", "D","D","D", "D","G","H", "K","L", "L","L", "DE","GH"), second_value = c(1,5,7, 7,7,7, 2,3,9, 7,1, 1,1, 4,4))

Looking forward to your suggestions. thanks.

1 Answer 1

0

An approach using row_number, assuming there are no 2 consecutive month with NA.

library(dplyr)

df %>% 
  mutate(n_na = sum(is.na(first_value)), .by = c(group, month)) %>% 
  mutate(across(ends_with("_value"), ~ 
           if_else(is.na(.x), .x[row_number() - n_na], .x)), .by = group, 
         n_na = NULL)
   group    month first_value second_value
1      1  January           A            1
2      1  January          BC            5
3      1  January           D            7
4      1 February           A            1
5      1 February          BC            5
6      1 February           D            7
7      1    March           D            2
8      1    March           G            3
9      1    March           H            9
10     2  January           K            7
11     2  January           L            1
12     2 February           K            7
13     2 February           L            1
14     2    March          DE            4
15     2    March          GH            4

If consecutive NA months are possible with this approach it takes a bit of group juggling

df %>% 
  mutate(n_na = sum(is.na(first_value)), .by = c(group, month)) %>% 
  mutate(n_lag = lag(n_na, default=0), .by = group) %>% 
  mutate(n_lag = n_na + first(n_lag), 
         n_na = if_else(n_na != 0, n_lag, n_na), 
         n_lag = NULL, .by = c(group, month)) %>% 
  mutate(across(ends_with("_value"), ~ 
           if_else(is.na(.x), .x[row_number() - n_na], .x)), .by = group, 
         n_na = NULL)
Sign up to request clarification or add additional context in comments.

2 Comments

thanks for your answer. Maybe I should have been more precise in my description, there are 2 and more consecutive month with only NAs. Moreover, when running your code, I get the following error message: Error in mutate(): ! Problem while computing .by = c(group, month). x .by must be size 15 or 1, not 30. Run rlang::last_error() to see where the error occurred.
@tobibow Regarding the error it may be you're running dplyr < 1.1.0. Use %>% group_by(group, month) %>% ... %>% ungroup() %>% ... in this case, or simply upgrade. 2 or more NA months should work with the second code.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.