0

I have a dataframe of 6 variables. For each column, data is same for the same group with some missing values. I want to fill these missing values by duplicating value of the same group for each variable. In case all values are missing for a particular group, it should fill value of above group. So, I want the result as df_complete.

Here is what I have tried but it fails when first observation of any group is missing. Unable to figure out what is wrong in it.

set.seed(123)
df <- data.frame(matrix(rnorm(100), ncol = 5))
df$Group <- letters[1:20]
df <- df[rep(seq_len(nrow(df)), sample(1:10, 20, replace = T)),]
df_complete <- df
df$X1[sample(1:nrow(df), 15)] <- NA
df$X2[sample(1:nrow(df), 10)] <- NA
df$X3[sample(1:nrow(df), 25)] <- NA
df$X4[sample(1:nrow(df), 10)] <- NA
df$X5[sample(1:nrow(df), 15)] <- NA

lvcf <- function(x)
{
  miss_ind <- which(is.na(x))

  if(length(miss_ind) != 0)
  {
    if(miss_ind[1]==1)
    {
      ind1 <- which(!is.na(x))[1]
      x[1] <- x[ind1]
      miss_ind <- which(is.na(x))
    }

    for(i in 1:length(miss_ind))
    {
      x[miss_ind[i]] <- x[miss_ind[i]-1]
    }
  }      
  return(x)
}

df_complete <- df %>%
  group_by(Group) %>%
  sapply(lvcf)

1 Answer 1

2

Package zoo has a function to handle that kind of problem, na.locf, for last observation carried forward.

library(zoo)
df_complete <- df %>%
  group_by(Group) %>%
  na.locf(., na.rm = FALSE)

head(df_complete)
## A tibble: 6 x 6
## Groups:   Group [2]
#           X1          X2          X3          X4           X5 Group
#        <chr>       <chr>       <chr>       <chr>        <chr> <chr>
#1 -0.56047565 -1.06782371 -0.69470698        <NA>  0.005764186     a
#2 -0.56047565 -1.06782371 -0.69470698  0.37963948  0.005764186     a
#3 -0.56047565 -1.06782371 -0.69470698  0.37963948  0.005764186     a
#4 -0.23017749 -0.21797491 -0.20791728 -0.50232345  0.385280401     b
#5 -0.23017749 -0.21797491 -0.20791728 -0.50232345  0.385280401     b
#6 -0.23017749 -0.21797491 -0.20791728 -0.50232345  0.385280401     b

Note the <NA> in column X4.

EDIT.
Following the OP's comment below and G. Grothendieck's answer, the following removes all NA values. Just use a second na.locf with argument fromLast = TRUE.

df_complete <- df %>%
  group_by(Group) %>%
  na.locf(., na.rm = FALSE) %>%
  na.locf(., fromLast = TRUE)

head(df_complete)
## A tibble: 6 x 6
## Groups:   Group [2]
#           X1          X2          X3          X4           X5 Group
#        <chr>       <chr>       <chr>       <chr>        <chr> <chr>
#1 -0.56047565 -1.06782371 -0.69470698  0.37963948  0.005764186     a
#2 -0.56047565 -1.06782371 -0.69470698  0.37963948  0.005764186     a
#3 -0.56047565 -1.06782371 -0.69470698  0.37963948  0.005764186     a
#4 -0.23017749 -0.21797491 -0.20791728 -0.50232345  0.385280401     b
#5 -0.23017749 -0.21797491 -0.20791728 -0.50232345  0.385280401     b
#6 -0.23017749 -0.21797491 -0.20791728 -0.50232345  0.385280401     b

EDIT 2
Following the bug found out by the OP, here's a solution using base R only. I will make a new df with NA values starting each group but the first, i.e., group a.

set.seed(123)

df2 <- data.frame(X1 = rnorm(20),
                  X2 = rnorm(20),
                  Group = rep(letters[1:4], each = 5))
df2[c(6, 11, 16), 1:2] <- NA

df2_complete <- lapply(split(df2, df2$Group), function(x){
        k <- which(names(x) == "Group")
        x[-k] <- sapply(x[-k], na.locf, na.rm = FALSE)
        x[-k] <- sapply(x[-k], na.locf, fromLast = TRUE)
        x
})
df2_complete <- do.call(rbind, df2_complete)
row.names(df2_complete) <- NULL
df2_complete
Sign up to request clarification or add additional context in comments.

5 Comments

That's why I have made custom function where NA in 1st row will be replaced by next available value.
@Rajan, You can use na.fill(na.locf0(x), "extend") to extend the series at the ends or can run na.locf in both directions:na.locf(na.locf0(x), fromLast = TRUE) `
@Grothendieck, still it doesnot solve my problem. When NA appears in 1st observation of any group, it is replacing it value of previous group rather than that group. For e.g. if a value of X2 is missing for 1st observation of group b, i.e. row number 4, then it is replacing with value of group a. This is same problem I was facing in my code.
@Rajan Doesn't my edit solve the problem? I see the first NA being replaced by the next non missing value of the same group.
@Rui, it do replace for 1st group i.e. a, but for other groups it is filling the value of previous group.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.