0

Thanks in advance for any help that is provided.

Long story short: I am working with hourly time series data from a measurement device (exported from SQL then imported in to R in order to properly format the date time ) - the time series contains missing data, sometimes in groups, and I need to locate these missing rows/indices and insert a new row for each instance that holds an NA value.

Related Questions that did not solve my problem:

how to insert missing observations on a data frame

Adding row to a data frame with missing values

Problem Data

The dataset that I am working with in this case is fairly large and varies depending on the measurement device I select. As a test case, I have one time series that contains 17469 hourly observations. I located a small section of the dataset that may be used for testing purposes. Here it is:

> snip
                   date Reading
408 2015-12-15 00:00:00    4.40
409 2015-12-14 23:00:00    4.62
410 2015-12-14 22:00:00    4.61
411 2015-12-14 21:00:00    6.15
412 2015-12-14 20:00:00    6.06
413 2015-12-14 19:00:00    7.04
414 2015-12-14 18:00:00    8.57
415 2015-12-14 11:00:00    4.12
416 2015-12-14 10:00:00    3.73

We can see that observations are missing for 2015-12-14 12:00:00 to 2015-12-14 17:00:00. I would like to first locate then populate the time series with these date times and input NA for the Reading column in these positions. I would also like to return the indices that are missing in an additional vector.

How can this be done?

So far I have tried the following code (as suggested here, how to add a missing dates and remove repeated dates in hourly time series), but all I end up with is NA values when I perform the merge function and still need to identify where the missing indices are located.

Here is the result:

> grid = data.frame(date=seq.POSIXt(min(snip[,1]), to=max(snip[,1]), by="1 hours"));
> dat = merge(grid, snip, by="date", all.x=TRUE)
> dat
                  date Reading
1  2015-12-14 10:00:00      NA
2  2015-12-14 11:00:00      NA
3  2015-12-14 12:00:00      NA
4  2015-12-14 13:00:00      NA
5  2015-12-14 14:00:00      NA
6  2015-12-14 15:00:00      NA
7  2015-12-14 16:00:00      NA
8  2015-12-14 17:00:00      NA
9  2015-12-14 18:00:00      NA
10 2015-12-14 19:00:00      NA
11 2015-12-14 20:00:00      NA
12 2015-12-14 21:00:00      NA
13 2015-12-14 22:00:00      NA
14 2015-12-14 23:00:00      NA
15 2015-12-15 00:00:00      NA

What am I missing here? Is it because grid and snip$date are in reverse order? For additional information here is what the date time format looks like (in case this is from where my issue stems):

> snip[2,1]
[1] "2015-12-14 23:00:00 GMT"

The result of the dput(snip) command is as follows (thanks for the suggestion @42):

> dput(snip)
structure(list(date = structure(list(sec = c(0, 0, 0, 0, 0, 0, 
0, 0, 0), min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), hour = c(0L, 
23L, 22L, 21L, 20L, 19L, 18L, 11L, 10L), mday = c(15L, 14L, 14L, 
14L, 14L, 14L, 14L, 14L, 14L), mon = c(11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L), year = c(115L, 115L, 115L, 115L, 115L, 115L, 
115L, 115L, 115L), wday = c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), yday = c(348L, 347L, 347L, 347L, 347L, 347L, 347L, 347L, 347L
), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Reading = c(4.4, 
4.62, 4.61, 6.15, 6.06, 7.04, 8.57, 4.12, 3.73)), .Names = c("date", 
"Reading"), row.names = 408:416, class = "data.frame")
1
  • POSIXct values are numeric and not fully displayed with the print method. We would need to see dput(snip) to have any chance of know what the underlying values actually were. You could merge on their formatted (text) values Commented Feb 2, 2016 at 21:38

1 Answer 1

0

Here's how I was able to do it with some help from na.locf documentation. Does it help?

dat<- dget("yoursample")
require(xts)
datxts<- as.xts(dat[,-1],order.by = dat$date,frequency = 24)
tzn<-tzone(datxts)
g<- seq(start(datxts), end(datxts), "hour")
gxts<- xts(rep(NA,length(g)),order.by = as.POSIXct(g), tzone = tzn)

merge(datxts,gxts,all = T)$datxts

Edit: And also, your method works if you add a column of NA's to generated dataframe

dates=seq.POSIXt(min(snip[,1]), to=max(snip[,1]), by="1 hours")
grid = data.frame(date=dates,dummydata=rep(NA,length(dates)));
dat = merge(grid, snip, by="date", all=T)
Sign up to request clarification or add additional context in comments.

8 Comments

beautiful, your first solution works very nicely. However, the second solution simply appends an NA value for each date time in the original list, while it places NA in the missing rows, it also populates an NA for available observations.
is there a simple way to create a vector of the missing observations, perhaps by doing a set difference (sorry, I work primarily in MATLAB) on the dates contained in datxts and gxts?
The column dat$reading in the second one should be the same as the first answer.. are you getting any warnings regarding mismatching timezones? And the second question-- you just want dates for which data is missing, right?
Sorry, here's the rest of it , to get the missing times : as.POSIXct(setdiff(g, dat$date), origin="1970-01-01", tz="GMT") ..There may be a shorter way though..
first comment: I do not receive any errors, dat turns into a 25x3 data frame, appending NA in the $Reading column for dates that are missing as well as for those dates where observations are already existing. It should be 15x3 with NA only present on those dates with missing observations
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.