Replace missing values by merging with another row

Question

I read data from different sources into a data.table. Two sources supply different variables for the same time step.

How can I replace the missing variables by the other source (row)?

   Here is a minimal example: 
   (gg = data.table(SourceCode = c(1,1,2,2), time = c(1,2,1,2), LE = c(10,10,NA,NA), R = c(NA,NA,20,30)))
   SourceCode time LE  R
1:          1    1 10 NA
2:          1    2 10 NA
3:          2    1 NA 20
4:          2    2 NA 30

> # rename SourceCode
> gg[SourceCode == 1, SourceCode := 2 ]
> gg
   SourceCode time LE  R
1:          2    1 10 NA
2:          2    2 10 NA
3:          2    1 NA 20
4:          2    2 NA 30

Desired output:

   SourceCode time LE  R
1:          2    1 10 20
2:          2    2 10 30

Please, provide code that you have already tried to do. For now there is no any issue in your post. — aleksandrbel
– aleksandrbel, Commented Jul 25, 2018 at 12:46

s_baldur · Accepted Answer · 2018-07-25 12:53:36Z

1

I had a similar question recently and discovered dplyr::coalesce():

The simple solution would be:

library(dplyr)
coalesce(
  filter(gg, SourceCode == 2),
  filter(gg, SourceCode == 1)
)
  SourceCode time LE  R
1          2    1 10 20
2          2    2 10 30

But more generalisable:

do.call(coalesce, split(gg, gg$SourceCode))
   SourceCode time LE  R
1:          1    1 10 20
2:          1    2 10 30

If you want to base of the second source (or the last source) you could do:

do.call(coalesce, rev(split(gg, gg$SourceCode)))
   SourceCode time LE  R
1:          2    1 10 20
2:          2    2 10 30

edited Jul 25, 2018 at 12:53

answered Jul 25, 2018 at 12:46

s_baldur

34.6k4 gold badges43 silver badges80 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Maurits Evers · Accepted Answer · 2018-07-25 13:05:03Z

0

Since you seem to be working with data.tables, here is a data.table solution

unique(gg[, `:=`(LE = LE[!is.na(LE)], R = R[!is.na(R)]), by = time], by = "time")
#   SourceCode time LE  R
#1:          1    1 10 20
#2:          1    2 10 30

Or for the second source

unique(gg[, `:=`(LE = LE[!is.na(LE)], R = R[!is.na(R)]), by = time], by = "time", fromLast = T)
#   SourceCode time LE  R
#1:          2    1 10 20
#2:          2    2 10 30

Since SourceCode seems not relevant anymore (you summarise across different SourceCodes) you could also do

gg[, lapply(.SD, function(x) x[!is.na(x)]), by = time, .SDcols = 3:4]
#   time LE  R
#1:    1 10 20
#2:    2 10 30

edited Jul 25, 2018 at 13:05

answered Jul 25, 2018 at 12:56

Maurits Evers

51k4 gold badges53 silver badges75 bronze badges

Comments

AndS. · Accepted Answer · 2018-07-25 13:16:53Z

0

An option:

library(tidyverse)
dd %>% 
    gather(var, val, -SourceCode, -time) %>% 
    na.omit(val) %>% 
    spread(var, val)

#   SourceCode time LE  R
# 1          2    1 10 20
# 2          2    2 10 30

Or another option based on grouping

dd %>% 
    group_by(SourceCode, time) %>% 
    summarise_at(vars(LE:R), .funs = funs(.[which(!is.na(.))]))

#   SourceCode time LE  R
# 1          2    1 10 20
# 2          2    2 10 30

Note that I only add the SourceCode in the group_by call to keep it in the summary. If you do not need that column, you could omit that.

edited Jul 25, 2018 at 13:16

answered Jul 25, 2018 at 13:02

AndS.

8,1702 gold badges14 silver badges18 bronze badges

Collectives™ on Stack Overflow

Replace missing values by merging with another row

3 Answers 3

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related