
I have a data.frame dt with some duplicate keys and missing data, e.g.

Name     Height     Weight   Age
Alice    180        NA       35
Bob      NA         80       27
Alice    NA         70       NA
Charles  170        75       NA

In this case the key is the name, and I would like to apply to each column a function like

f <- function(x){
  x <- x[!is.na(x)]
  x <- x[1]
  return(x)
  }

while aggregating by the key (i.e., the "Name" column), so as to obtain as a result

Name     Height     Weight   Age
Alice    180        70       35
Bob      NA         80       27
Charles  170        75       NA

I tried

dt_agg <- aggregate(. ~ Name,
                    data = dt,
                    FUN = f)

and I got some errors, then I tried the following

dt_agg_1 <- aggregate(Height ~ Name,
                      data = dt,
                      FUN = f)

dt_agg_2 <- aggregate(Weight ~ Name,
                      data = dt,
                      FUN = f)

and this time it worked.

Since I have 50 columns, this second approach is quite cumbersome for me. Is there a way to fix the first approach?

Thanks for help!

5 Answers

3

You were very close with aggregate(); you just need to change how it handles NA (from the default na.omit to na.pass). The formula method applies na.action to the data before aggregating, so with na.omit every row that has an NA in any of the columns used is dropped first, rather than NAs being removed column by column as aggregate iterates over the columns to be aggregated. Since your example data frame has an NA in every row, you end up with a 0-row data frame, which is the cause of the error I got when running your code. I tested this by removing all but one NA, and your code then works as-is. So we set na.action = na.pass to pass the NAs through.

dt_agg <- aggregate(. ~ Name,
                    data = dt,
                    FUN = f, na.action = "na.pass")
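
To make the row-dropping behaviour concrete, here is a minimal sketch that rebuilds the example data and compares the two settings. The numeric column types are an assumption, since the question only shows the printed table.

```r
# Rebuild the example data from the question (numeric columns are assumed).
dt <- data.frame(
  Name   = c("Alice", "Bob", "Alice", "Charles"),
  Height = c(180, NA, NA, 170),
  Weight = c(NA, 80, 70, 75),
  Age    = c(35, 27, NA, NA)
)

# First non-missing value, as defined in the question.
f <- function(x) {
  x <- x[!is.na(x)]
  x[1]
}

# What the default na.action (na.omit) leaves to aggregate:
# every row has at least one NA, so no rows survive.
rows_left <- nrow(na.omit(dt))

# A single-column formula only checks Height and Name,
# so it runs, but silently drops the rows where Height is NA.
dt_height <- aggregate(Height ~ Name, data = dt, FUN = f)

# With na.pass all rows are kept and f() deals with the NAs per column.
dt_agg <- aggregate(. ~ Name, data = dt, FUN = f, na.action = na.pass)
dt_agg
```

This also suggests why the one-column-at-a-time calls in the question appeared to work: each of them only discards the rows that are missing in that particular column, so a name whose only value is NA (Bob for Height) disappears from that result.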

Original answer:

dt_agg <- aggregate(dt[, -1], 
                    by = list(dt$Name),
                    FUN = f)
dt_agg
# Group.1 Height Weight Age
# 1   Alice    180     70  35
# 2     Bob     NA     80  27
# 3 Charles    170     75  NA

2

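You can do this with dplyr. Note that sort() drops NAs and orders the values, so this picks the smallest non-missing value per group, which equals the first non-missing value only when each group has at most one of them (as in the example):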

library(dplyr)
df %>%
  group_by(Name) %>%
  summarize_all(funs(sort(.)[1]))

Result:

# A tibble: 3 x 4
     Name Height Weight   Age
   <fctr>  <int>  <int> <int>
1   Alice    180     70    35
2     Bob     NA     80    27
3 Charles    170     75    NA

Data:

df = read.table(text = "Name     Height     Weight   Age
Alice    180        NA       35
Bob      NA         80       27
Alice    NA         70       NA
Charles  170        75       NA", header = TRUE)
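
Since funs() has been deprecated in more recent dplyr releases, the same idea can be sketched with across(), here taking the first non-missing value rather than the smallest. This assumes dplyr 1.0.0 or later; the name res is only used for illustration.

```r
library(dplyr)

# Example data, read the same way as above.
df <- read.table(text = "Name     Height     Weight   Age
Alice    180        NA       35
Bob      NA         80       27
Alice    NA         70       NA
Charles  170        75       NA", header = TRUE)

# For every non-grouping column, keep the first non-missing value per Name.
# Indexing an empty vector with [1] yields NA, so all-NA groups stay NA.
res <- df %>%
  group_by(Name) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))

res
```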

2

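Here is an option with data.table. As sort() drops NAs and orders the values, head(sort(x), 1) returns the smallest non-missing value in each group: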

library(data.table)
setDT(df)[, lapply(.SD, function(x) head(sort(x), 1)), Name]
#      Name Height Weight Age
#1:   Alice    180     70  35
#2:     Bob     NA     80  27
#3: Charles    170     75  NA
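
If the first non-missing value in row order is wanted instead of the smallest, a variant along these lines should work. The helper name first_non_na is hypothetical and not part of data.table.

```r
library(data.table)

# Example data, as in the question.
df <- read.table(text = "Name     Height     Weight   Age
Alice    180        NA       35
Bob      NA         80       27
Alice    NA         70       NA
Charles  170        75       NA", header = TRUE)

# Hypothetical helper: first non-missing element, NA if there is none.
first_non_na <- function(x) x[!is.na(x)][1]

# Apply the helper to every non-key column within each Name group.
res <- setDT(df)[, lapply(.SD, first_non_na), by = Name]
res
```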

2

Simply add na.action = na.pass to the aggregate() call:

aggdf <- aggregate(.~Name, data=df, FUN=f, na.action=na.pass)
#      Name Height Weight Age
# 1   Alice    180     70  35
# 2     Bob     NA     80  27
# 3 Charles    170     75  NA

1

Add an ifelse() to your function so that it returns NA when all values in a group are NA:

f <- function(x) {
  x <- x[!is.na(x)]
  ifelse(length(x) == 0, NA, x)
}

You can use dplyr to aggregate:

library(dplyr)
dt %>% group_by(Name) %>% summarise_all(funs(f))

This returns:

# A tibble: 3 x 4
     Name Height Weight   Age
   <fctr>  <dbl>  <dbl> <dbl>
1   Alice    180     70    35
2     Bob     NA     80    27
3 Charles    170     75    NA
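
As a side note, a quick check in base R suggests that the function from the question already returns NA for an all-NA input, because indexing an empty vector with [1] yields NA. The sketch below only exercises that function on two made-up vectors.

```r
# The function from the question, unchanged in behaviour.
f <- function(x) {
  x <- x[!is.na(x)]
  x[1]
}

# Made-up inputs for illustration.
some_missing <- f(c(NA, 70, 65))    # first non-missing value
all_missing  <- f(c(NA_real_, NA))  # nothing left after filtering

some_missing
all_missing
```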
