After reading about benchmarks and speed comparisons of R methods, I am in the process of converting to the speedy data.table package for data manipulation on my large data sets.
I am having trouble with a particular task:
For a certain observed variable, I want to check, for each station, if the absolute lagged difference (with lag 1) is greater than a certain threshold. If it is, I want to replace it with NA, else do nothing.
I can do this for the entire data.table using the set command, but I need to do this operation by station.
Example:
# Example data. Assume the columns are ordered by date.
set.seed(1)
DT <- data.table(station=sample.int(n=3, size=1e6, replace=TRUE),
wind=rgamma(n=1e6, shape=1.5, rate=1/10),
other=rnorm(n=1.6),
key="station")
# My attempt
max_rate <- 35
set(DT, i=which(c(NA, abs(diff(DT[['wind']]))) > max_rate),
j=which(names(DT)=='wind'), value=NA)
# The results
summary(DT)
The trouble with my implementation is that I need to do this by station, and I do not want to get the lagged difference between the last reading in station 1 and the first reading of station 2.
I tried to use the by=station operator within the [ ], but I am not sure how to do this.