
I have a large dataset (~800M rows) stored as a data.table. The dataset consists of equidistant time-series data for thousands of IDs. My problem is that missing values were not encoded in the original data; the corresponding rows are simply absent. So I would like to add the rows with the missing data. I know that for each ID the same timestamps should be present.

Given the size of the dataset, my initial idea was to create one data.table that includes every timestep the data should contain, and then merge it with all = TRUE for each ID of the main data.table. However, so far I have only managed to do that when my data.table with all timesteps (complete_dt) also includes the ID column. This creates a lot of redundant information, as each ID should have the same timesteps.

I made a MWE. For simplicity, since my data is equidistant, I have replaced the POSIXct column with a simple integer column:

library(data.table)

# My main dataset 
set.seed(123)
main_dt <- data.table(id = as.factor(rep(1:3, c(5, 4, 3))),
                      pseudo_time = c(1, 3, 4, 6, 7, 1, 3, 4, 5, 3, 5, 6),
                      value = runif(12))

# Assuming that I should have the pseudo timesteps 1:7 for each ID
# Given the size of my real data I would like to create the pseudo time not for each ID but only once
complete_dt <- main_dt[, list(pseudo_time = 1:7), by = id]

# The data.table I need to get in the end
result_dt <- merge.data.table(main_dt, complete_dt, all = TRUE)
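
For reference, result_dt should contain one row per (id, pseudo_time) pair, i.e. 3 x 7 = 21 rows: the 12 observed ones plus 9 filled rows with NA in value. A quick sanity check:

nrow(result_dt)          # 21 = 3 IDs x 7 timesteps
result_dt[is.na(value)]  # the 9 rows that were added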

I have seen the somewhat similar question Merge (full join) recursively one data.table with each group of another data.table, but I have not managed to apply it to my problem.

Any help towards a more efficient solution than mine would be much appreciated.

  • You say "grouped" as if the range of intended pseudo_time should be determined individually for each group, but then you blanket assign them all indiscriminately. If that's the case, why not just merge(main_dt, data.table(pseudo_time=1:7), all=TRUE)? Commented Oct 31, 2022 at 12:27
  • r2evans, your suggestion leads to a completely different result. I need to obtain result_dt, so result_dt in my MWE is correct. The question for me is only whether that result can be obtained more efficiently, i.e., without creating such a large complete_dt. Commented Oct 31, 2022 at 12:33
  • I see, okay. Glad you got an answer. Commented Oct 31, 2022 at 13:19
  • I would try not to expand the data but to operate on it as a sparse data object. Depending on how dense/sparse the data are, you can easily run out of memory when trying to expand it. Commented Oct 31, 2022 at 19:25
  • @jangorecki Yes, that's of course true, it will use more memory. But at least in my case I need a dataset without missing values, so I need to impute, and that is easier for me if I encode the missing values as NA first. Commented Nov 1, 2022 at 6:59

1 Answer

Here is an alternative, though probably not much more efficient:

setkey(main_dt, id, pseudo_time)  # key so the join matches on (id, pseudo_time)
main_dt[CJ(id, pseudo_time = 1:7, unique = TRUE)]  # full-grid join; value is NA where missing
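
If re-keying (and thereby reordering) main_dt is undesirable, the same join can be written with on=. As a sketch, the time grid below is derived from the observed range instead of being hard-coded; this assumes the series is equidistant with step 1 and that the smallest and largest timestamps actually occur somewhere in the data. For a real POSIXct column you would build the grid with seq(min, max, by = <interval>) instead.

# Full grid built once: all IDs x all timesteps in the observed range
full_time <- main_dt[, seq(min(pseudo_time), max(pseudo_time))]
grid <- CJ(id = unique(main_dt$id), pseudo_time = full_time)

# Right join: every grid row is kept, value is NA where main_dt has no match
res <- main_dt[grid, on = .(id, pseudo_time)]

This is essentially the same operation as the CJ() join above, but it leaves the row order of main_dt untouched and avoids hard-coding 1:7.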