Replace missing values with previous value for multiple columns by grouping

Question

I have a dataframe of 6 variables. For each column, data is same for the same group with some missing values. I want to fill these missing values by duplicating value of the same group for each variable. In case all values are missing for a particular group, it should fill value of above group. So, I want the result as df_complete.

Here is what I have tried but it fails when first observation of any group is missing. Unable to figure out what is wrong in it.

set.seed(123)
df <- data.frame(matrix(rnorm(100), ncol = 5))
df$Group <- letters[1:20]
df <- df[rep(seq_len(nrow(df)), sample(1:10, 20, replace = T)),]
df_complete <- df
df$X1[sample(1:nrow(df), 15)] <- NA
df$X2[sample(1:nrow(df), 10)] <- NA
df$X3[sample(1:nrow(df), 25)] <- NA
df$X4[sample(1:nrow(df), 10)] <- NA
df$X5[sample(1:nrow(df), 15)] <- NA

lvcf <- function(x)
{
  miss_ind <- which(is.na(x))

  if(length(miss_ind) != 0)
  {
    if(miss_ind[1]==1)
    {
      ind1 <- which(!is.na(x))[1]
      x[1] <- x[ind1]
      miss_ind <- which(is.na(x))
    }

    for(i in 1:length(miss_ind))
    {
      x[miss_ind[i]] <- x[miss_ind[i]-1]
    }
  }      
  return(x)
}

df_complete <- df %>%
  group_by(Group) %>%
  sapply(lvcf)

Rui Barradas · Accepted Answer · 2017-10-10 07:24:35Z

2

Package zoo has a function to handle that kind of problem, na.locf, for last observation carried forward.

library(zoo)
df_complete <- df %>%
  group_by(Group) %>%
  na.locf(., na.rm = FALSE)

head(df_complete)
## A tibble: 6 x 6
## Groups:   Group [2]
#           X1          X2          X3          X4           X5 Group
#        <chr>       <chr>       <chr>       <chr>        <chr> <chr>
#1 -0.56047565 -1.06782371 -0.69470698        <NA>  0.005764186     a
#2 -0.56047565 -1.06782371 -0.69470698  0.37963948  0.005764186     a
#3 -0.56047565 -1.06782371 -0.69470698  0.37963948  0.005764186     a
#4 -0.23017749 -0.21797491 -0.20791728 -0.50232345  0.385280401     b
#5 -0.23017749 -0.21797491 -0.20791728 -0.50232345  0.385280401     b
#6 -0.23017749 -0.21797491 -0.20791728 -0.50232345  0.385280401     b

Note the <NA> in column X4.

EDIT.
Following the OP's comment below and G. Grothendieck's answer, the following removes all NA values. Just use a second na.locf with argument fromLast = TRUE.

df_complete <- df %>%
  group_by(Group) %>%
  na.locf(., na.rm = FALSE) %>%
  na.locf(., fromLast = TRUE)

head(df_complete)
## A tibble: 6 x 6
## Groups:   Group [2]
#           X1          X2          X3          X4           X5 Group
#        <chr>       <chr>       <chr>       <chr>        <chr> <chr>
#1 -0.56047565 -1.06782371 -0.69470698  0.37963948  0.005764186     a
#2 -0.56047565 -1.06782371 -0.69470698  0.37963948  0.005764186     a
#3 -0.56047565 -1.06782371 -0.69470698  0.37963948  0.005764186     a
#4 -0.23017749 -0.21797491 -0.20791728 -0.50232345  0.385280401     b
#5 -0.23017749 -0.21797491 -0.20791728 -0.50232345  0.385280401     b
#6 -0.23017749 -0.21797491 -0.20791728 -0.50232345  0.385280401     b

EDIT 2
Following the bug found out by the OP, here's a solution using base R only. I will make a new df with NA values starting each group but the first, i.e., group a.

set.seed(123)

df2 <- data.frame(X1 = rnorm(20),
                  X2 = rnorm(20),
                  Group = rep(letters[1:4], each = 5))
df2[c(6, 11, 16), 1:2] <- NA

df2_complete <- lapply(split(df2, df2$Group), function(x){
        k <- which(names(x) == "Group")
        x[-k] <- sapply(x[-k], na.locf, na.rm = FALSE)
        x[-k] <- sapply(x[-k], na.locf, fromLast = TRUE)
        x
})
df2_complete <- do.call(rbind, df2_complete)
row.names(df2_complete) <- NULL
df2_complete

edited Oct 10, 2017 at 7:24

answered Oct 9, 2017 at 11:30

Rui Barradas

78k8 gold badges41 silver badges75 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Rajan Over a year ago

That's why I have made custom function where NA in 1st row will be replaced by next available value.

G. Grothendieck Over a year ago

@Rajan, You can use na.fill(na.locf0(x), "extend") to extend the series at the ends or can run na.locf in both directions:na.locf(na.locf0(x), fromLast = TRUE) `

Rajan Over a year ago

@Grothendieck, still it doesnot solve my problem. When NA appears in 1st observation of any group, it is replacing it value of previous group rather than that group. For e.g. if a value of X2 is missing for 1st observation of group b, i.e. row number 4, then it is replacing with value of group a. This is same problem I was facing in my code.

Rui Barradas Over a year ago

@Rajan Doesn't my edit solve the problem? I see the first NA being replaced by the next non missing value of the same group.

Rajan Over a year ago

@Rui, it do replace for 1st group i.e. a, but for other groups it is filling the value of previous group.

Collectives™ on Stack Overflow

Replace missing values with previous value for multiple columns by grouping

1 Answer 1

5 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Your Answer

Sign up or log in

Post as a guest

Related