1

I read data from different sources into a data.table. Two sources supply different variables for the same time step.

How can I replace the missing variables by the other source (row)?

   Here is a minimal example: 
   (gg = data.table(SourceCode = c(1,1,2,2), time = c(1,2,1,2), LE = c(10,10,NA,NA), R = c(NA,NA,20,30)))
   SourceCode time LE  R
1:          1    1 10 NA
2:          1    2 10 NA
3:          2    1 NA 20
4:          2    2 NA 30

> # rename SourceCode
> gg[SourceCode == 1, SourceCode := 2 ]
> gg
   SourceCode time LE  R
1:          2    1 10 NA
2:          2    2 10 NA
3:          2    1 NA 20
4:          2    2 NA 30

Desired output:

   SourceCode time LE  R
1:          2    1 10 20
2:          2    2 10 30
1
  • 1
    Please, provide code that you have already tried to do. For now there is no any issue in your post. Commented Jul 25, 2018 at 12:46

3 Answers 3

1

I had a similar question recently and discovered dplyr::coalesce():

The simple solution would be:

library(dplyr)
coalesce(
  filter(gg, SourceCode == 2),
  filter(gg, SourceCode == 1)
)
  SourceCode time LE  R
1          2    1 10 20
2          2    2 10 30

But more generalisable:

do.call(coalesce, split(gg, gg$SourceCode))
   SourceCode time LE  R
1:          1    1 10 20
2:          1    2 10 30

If you want to base of the second source (or the last source) you could do:

do.call(coalesce, rev(split(gg, gg$SourceCode)))
   SourceCode time LE  R
1:          2    1 10 20
2:          2    2 10 30
Sign up to request clarification or add additional context in comments.

Comments

0

Since you seem to be working with data.tables, here is a data.table solution

unique(gg[, `:=`(LE = LE[!is.na(LE)], R = R[!is.na(R)]), by = time], by = "time")
#   SourceCode time LE  R
#1:          1    1 10 20
#2:          1    2 10 30

Or for the second source

unique(gg[, `:=`(LE = LE[!is.na(LE)], R = R[!is.na(R)]), by = time], by = "time", fromLast = T)
#   SourceCode time LE  R
#1:          2    1 10 20
#2:          2    2 10 30

Since SourceCode seems not relevant anymore (you summarise across different SourceCodes) you could also do

gg[, lapply(.SD, function(x) x[!is.na(x)]), by = time, .SDcols = 3:4]
#   time LE  R
#1:    1 10 20
#2:    2 10 30

Comments

0

An option:

library(tidyverse)
dd %>% 
    gather(var, val, -SourceCode, -time) %>% 
    na.omit(val) %>% 
    spread(var, val)

#   SourceCode time LE  R
# 1          2    1 10 20
# 2          2    2 10 30

Or another option based on grouping

dd %>% 
    group_by(SourceCode, time) %>% 
    summarise_at(vars(LE:R), .funs = funs(.[which(!is.na(.))]))

#   SourceCode time LE  R
# 1          2    1 10 20
# 2          2    2 10 30

Note that I only add the SourceCode in the group_by call to keep it in the summary. If you do not need that column, you could omit that.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.