
I am relatively new to R and probably the solution to this problem is rather simple.

I have a dataframe that looks like this:

id1    id2    v1    v2    v3    ...    v100
  A      X     1    NA    NA    ...       1
  B      Y     1     3     4    ...       1
  C      X     1     3     4    ...       1
  D      X     1     3     4    ...       1
  E      Y     1     3     4    ...       1
  A      X    NA     3     4    ...      NA 

What I would like to do is 'merge' observations that share the same ids (id1 and id2) into one observation. The missing values of one observation should be replaced by the values of the other observation.

For example, in the data frame above these are 'observation 1' and 'observation 6', and the result should look something like this:

id1    id2    v1    v2    v3    ...    v100
  A      X     1     3     4    ...       1
  B      Y     1     3     4    ...       1
  C      X     1     3     4    ...       1
  D      X     1     3     4    ...       1
  E      Y     1     3     4    ...       1

Currently I am using loops for this and I know it is very slow and probably not the best solution. I have more than 1000 observations with approximately 100 duplicate observations and a few thousand variables. If anyone could provide an idea how to speed up things, I would be really happy.

Many thanks in advance!

Edit: 03/10/2014

Many thanks for all the helpful comments! The answer by David Armstrong is what I wanted! Thank you so much!

I am sorry for not being precise enough in my first post, so here are some clarifications.

Observations with identical ids can occur multiple times in the dataset and not only twice.

Further, of all those identical observations, only one observation will have a non-missing value per variable (if at all). It can also be the case that all observations of a variable are missing, but it can never be the case that two observations have a non-missing value. The following example might make things clearer.

id1    id2    v1    v2    v3    v4    v5    v6    v7
  A      X     6     9     3     1     2     1     1
  B      X     2     2     1     4     2     3     3
  C      X     1     6     7     1     3     4     5
  D      X     4     2     9     2     3     6     2
  E      X    NA     3    NA    NA    NA    NA    NA
  E      X    NA    NA     4    NA    NA    NA    NA
  E      X    NA    NA    NA     3    NA    NA    NA
  E      X    NA    NA    NA    NA     6    NA    NA
  E      X    NA    NA    NA    NA    NA     4    NA
  E      X    NA    NA    NA    NA    NA    NA     1

And the result I would like to have would be:

id1    id2    v1    v2    v3    v4    v5    v6    v7
  A      X     6     9     3     1     2     1     1
  B      X     2     2     1     4     2     3     3
  C      X     1     6     7     1     3     4     5
  D      X     4     2     9     2     3     6     2
  E      X    NA     3     4     3     6     4     1

I hope this helps.

Thank you very much!

  • Can we assume that there are always pairs of observations with missing values such that missing values of one observation are always values in the other observation and the other way around? E.g., can we do something like x[is.na(x)] <- na.omit(y)? Commented Oct 2, 2014 at 13:56
  • @vandm It is not clear about how you want to summarise the rows with the same groups that have non-missing values. In the example you provided, the values are just identical, which may not be the case in your original dataset. What if there are triplicates etc.? Commented Oct 2, 2014 at 15:28
  • @vandm, you don't need to create a completely new account in here. Just add another account to your already existing CrossValidated account Commented Oct 4, 2014 at 23:39

3 Answers


Also, maybe

library(data.table)
setDT(df)[, lapply(.SD, na.omit), by = list(id1, id2)]
#    id1 id2 v1 v2 v3 v100
# 1:   A   X  1  3  4    1
# 2:   B   Y  1  3  4    1
# 3:   C   X  1  3  4    1
# 4:   D   X  1  3  4    1
# 5:   E   Y  1  3  4    1

If we can't always assume that there are missing values (as mentioned in @Roland's comment), you can add unique (if you always want only one row per id pair). Something like

unique(setDT(df)[, lapply(.SD, na.omit), by = list(id1, id2)])
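A small base R sketch of what `na.omit` does to a single column within one group may help show when this approach collapses a group to exactly one row. The vectors and variable names below are made up for illustration and are not taken from the question's data.

```r
# Hypothetical column values for one (id1, id2) group
one_value  <- c(NA, 3, NA)   # exactly one non-missing value
all_na     <- c(NA, NA, NA)  # nothing observed in this group
two_values <- c(1, NA, 1)    # the same value recorded twice

# na.omit() drops the NAs, so the remaining length is the number
# of rows this column would contribute for the group
len_one  <- length(na.omit(one_value))
len_none <- length(na.omit(all_na))
len_two  <- length(na.omit(two_values))

print(c(one = len_one, none = len_none, two = len_two))
```

Only the first case reduces to a single value. The second case leaves nothing for that column, so an all-missing column probably needs separate handling, and the third case is where `unique` helps, provided the repeated values are identical.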

1 Comment

Thanks @akrun, it is actually hard to tell what they exactly want, so added unique too

Try:

library(dplyr) 
df %>%
    group_by(id1, id2) %>%
    summarise_each(funs(mean=mean(., na.rm=TRUE)))

#    id1 id2 v1 v2 v3
# 1   A   X  1  3  4
# 2   B   Y  1  3  4
# 3   C   X  1  3  4
# 4   D   X  1  3  4
# 5   E   Y  1  3  4
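One thing worth checking before relying on the mean: how it behaves for the edge cases described in the question's edit. The following base R sketch uses made-up vectors (the names `m_one`, `m_none`, and `m_two` are hypothetical) to show what `mean(..., na.rm = TRUE)` returns for a single column within one group.

```r
# Hypothetical column values within one group

# One non-missing value: the mean is just that value
m_one <- mean(c(NA, 3), na.rm = TRUE)

# All values missing: the mean of nothing is NaN rather than NA
m_none <- mean(c(NA_real_, NA_real_), na.rm = TRUE)

# Two different non-missing values: they are averaged, not kept apart
m_two <- mean(c(2, 4, NA), na.rm = TRUE)

print(c(m_one, m_none, m_two))
```

So the mean works as a "pick the only value" shortcut when each group has at most one non-missing value per variable, but a fully missing variable comes back as `NaN`, which may need converting to `NA` afterwards.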

Or perhaps

df %>% 
    group_by(id1, id2) %>%
    mutate_each(funs(replace(., is.na(.), stats::na.omit(.)))) %>%
    unique()

data

df <- structure(list(id1 = c("A", "B", "C", "D", "E", "A"), id2 = c("X", 
"Y", "X", "X", "Y", "X"), v1 = c(1L, 1L, 1L, 1L, 1L, NA), v2 = c(NA, 
3L, 3L, 3L, 3L, 3L), v3 = c(NA, 4L, 4L, 4L, 4L, 4L)), .Names = c("id1", 
"id2", "v1", "v2", "v3"), class = "data.frame", row.names = c(NA, 
-6L))


If ddf is your data frame:

> t(sapply(split(ddf, paste(ddf$id1, ddf$id2)), 
           function(x) sapply(x[3:ncol(ddf)], sum, na.rm=TRUE)))
    v1 v2 v3 v4
A X  1  3  4  1
B Y  1  3  4  1
C X  1  3  4  1
D X  1  3  4  1
E Y  1  3  4  1
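Note that summing with `na.rm = TRUE` returns 0 for a group whose values are all missing, whereas the question's edited example expects `NA` there. A possible base R variation, sketched here under the assumption stated in the question (at most one non-missing value per variable within a group), is to take the first non-missing value instead. The helper `first_non_na` and the small data frame are hypothetical and only mirror part of the question's second example.

```r
# Hypothetical helper: first non-missing value, or NA if all are missing
first_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) > 0) keep[1] else x[1]  # x[1] is an NA of the right type
}

# Cut-down version of the question's second example
ddf <- data.frame(
  id1 = c("A", "B", "E", "E", "E"),
  id2 = c("X", "X", "X", "X", "X"),
  v1  = c(6, 2, NA, NA, NA),
  v2  = c(9, 2, 3, NA, NA),
  v3  = c(3, 1, NA, 4, NA),
  stringsAsFactors = FALSE
)

# Collapse every value column within each (id1, id2) group
merged <- aggregate(ddf[, -(1:2)],
                    by  = ddf[, c("id1", "id2")],
                    FUN = first_non_na)
print(merged)
```

Here the three `E X` rows collapse into one row, and `v1` stays `NA` for that group because no observation carries a value for it.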
