2

Let see the following example in dplyr style.

# 1. Data set
df <- data.table(
  g1 = c(1, 1, 2, 1, 2, 2, 1),
  g2 = c(2, 1, 3, 3, 1, 1, 2),
  status = c(1, 0, 1, 0, 0, 1, 1),
  date_obs = as.Date(c("2019-01-01", "2019-01-02", "2019-01-12", "2019-01-15",
           "2019-01-20", "2019-01-24", "2019-01-30")))

# 2. Arrange data
df <- df %>% 
  arrange(g1, g2, date_obs)

# 3. Populate missing 'date_obs' and 'status' values
df_filled <- df %>%
  group_by(g1, g2) %>%
  complete(date_obs = seq.Date(min(date_obs), max(date_obs), by = "day")) %>%
  fill(status) %>%
  arrange(g1, g2, date_obs) %>% 
  ungroup()

How can I make the same operations by using data.table R syntax?

Thanks!

1
  • 1
    Can you please share the complete() definition? Commented Jun 1, 2019 at 10:38

2 Answers 2

5

Another option using a rolling join.

setkey(DT, g1, g2, date_obs)

out <- DT[DT[, .(date_obs = seq(first(date_obs),
                                last(date_obs),
                                by = "day")), by=.(g1, g2)],
          on=.(g1, g2, date_obs),
          roll = TRUE]
out
#    g1 g2 status   date_obs
# 1:  1  1      0 2019-01-02
# 2:  1  2      1 2019-01-01
# 3:  1  2      1 2019-01-02
# 4:  1  2      1 2019-01-03
# 5:  1  2      1 2019-01-04
# 6:  1  2      1 2019-01-05
# 7:  1  2      1 2019-01-06
# 8:  1  2      1 2019-01-07
# 9:  1  2      1 2019-01-08
#10:  1  2      1 2019-01-09
#11:  1  2      1 2019-01-10
#12:  1  2      1 2019-01-11
#13:  1  2      1 2019-01-12
#14:  1  2      1 2019-01-13
#15:  1  2      1 2019-01-14
#16:  1  2      1 2019-01-15
#17:  1  2      1 2019-01-16
#18:  1  2      1 2019-01-17
#19:  1  2      1 2019-01-18
#20:  1  2      1 2019-01-19
#21:  1  2      1 2019-01-20
#22:  1  2      1 2019-01-21
#23:  1  2      1 2019-01-22
#24:  1  2      1 2019-01-23
#25:  1  2      1 2019-01-24
#26:  1  2      1 2019-01-25
#27:  1  2      1 2019-01-26
#28:  1  2      1 2019-01-27
#29:  1  2      1 2019-01-28
#30:  1  2      1 2019-01-29
#31:  1  2      1 2019-01-30
#32:  1  3      0 2019-01-15
#33:  2  1      0 2019-01-20
#34:  2  1      0 2019-01-21
#35:  2  1      0 2019-01-22
#36:  2  1      0 2019-01-23
#37:  2  1      1 2019-01-24
#38:  2  3      1 2019-01-12
#    g1 g2 status   date_obs

data

DT <- data.table(
  g1 = c(1, 1, 2, 1, 2, 2, 1),
  g2 = c(2, 1, 3, 3, 1, 1, 2),
  status = c(1, 0, 1, 0, 0, 1, 1),
  date_obs = as.Date(c("2019-01-01", "2019-01-02", "2019-01-12", "2019-01-15",
                       "2019-01-20", "2019-01-24", "2019-01-30")))
Sign up to request clarification or add additional context in comments.

3 Comments

Does the setkey provide any improvements? Could you skip the on = .(...) argument?
@Cole I thought that it allows to skip the on but then I ended up with an error. I am still in the phase of learning all the data.table tricks.
I really like data.table but there are still some mysteries for me. Thanks for your comments and solution - it's a cleaner way to fill the NAs with the rolling join.
2

The method is to generate a "complete" data.table with all of the date ranges by variables g1 and g2. Then we do a left join to the original data.table and then fill in the NA status.

This still relies on another package to do the fill method. I couldn't get tidyr::fill to work but zoo::na.locf worked like a charm. There are a lot of other options for filling NA values.

Replace NA with last non-NA in data.table by using only data.table

R data.table join/ subsetting/ match by group and by a condition

Replacing NAs with latest non-NA value

setorder(dt, g1, g2, date_obs)

dt_complete <- dt[, .(date_obs = seq.Date(min(date_obs), max(date_obs), by = "day")), by = .(g1, g2)]

dt[dt_complete
    , on = c('date_obs', 'g1', 'g2')
    , .(g1, g2, date_obs, status = zoo::na.locf(status))
    ]

    g1 g2   date_obs status
 1:  1  1 2019-01-02      0
 2:  1  2 2019-01-01      1
 3:  1  2 2019-01-02      1
 4:  1  2 2019-01-03      1
 5:  1  2 2019-01-04      1
 6:  1  2 2019-01-05      1
 7:  1  2 2019-01-06      1
 8:  1  2 2019-01-07      1
 9:  1  2 2019-01-08      1
... 38 total rows...

Data which is simply OP:

library(data.table)
library(tidyverse)
# 1. Data set
dt <- data.table(
  g1 = c(1, 1, 2, 1, 2, 2, 1),
  g2 = c(2, 1, 3, 3, 1, 1, 2),
  status = c(1, 0, 1, 0, 0, 1, 1),
  date_obs = as.Date(c("2019-01-01", "2019-01-02", "2019-01-12", "2019-01-15",
                       "2019-01-20", "2019-01-24", "2019-01-30")))

1 Comment

The current dev version of data.table provides the new function nafill. See NEWS

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.