After reading about benchmarks and speed comparisons of R methods, I am in the process of converting to the speedy data.table package for data manipulation on my large data sets.

I am having trouble with a particular task:

For a certain observed variable, I want to check, for each station, if the absolute lagged difference (with lag 1) is greater than a certain threshold. If it is, I want to replace it with NA, else do nothing.

I can do this for the entire data.table using the set command, but I need to do this operation by station.

Example:

library(data.table)

# Example data. Assume the rows are ordered by date within each station.
set.seed(1)
DT <- data.table(station=sample.int(n=3, size=1e6, replace=TRUE), 
                 wind=rgamma(n=1e6, shape=1.5, rate=1/10),
                 other=rnorm(n=1e6),
                 key="station")

# My attempt
max_rate <- 35
set(DT, i=which(c(NA, abs(diff(DT[['wind']]))) > max_rate), 
    j=which(names(DT)=='wind'), value=NA)
# The results
summary(DT)

The trouble with my implementation is that I need to do this by station, and I do not want to get the lagged difference between the last reading in station 1 and the first reading of station 2.
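To make the boundary problem concrete, here is a minimal sketch on made-up values (the vectors and the names `global_flags` and `per_station_flags` are hypothetical, not part of my real data). It uses only base R, with `ave()` standing in for the grouped computation I am after:

```r
# Toy values, assumed to be ordered by date within each station
station  <- c(1L, 1L, 1L, 2L, 2L, 2L)
wind     <- c(10, 20, 15, 90, 95, 85)
max_rate <- 35

# Global lagged difference: compares row 4 (station 2) with row 3 (station 1)
global_flags <- which(abs(c(NA, diff(wind))) > max_rate)

# Lagged difference computed separately per station: each station starts with NA
lag_diff_by_station <- ave(wind, station, FUN = function(w) abs(c(NA, diff(w))))
per_station_flags   <- which(lag_diff_by_station > max_rate)

global_flags       # row 4 is flagged only because of the station change
per_station_flags  # nothing is flagged within either station
```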

I tried to use the by=station argument inside [ ], but I am not sure how to apply it here.

1 Answer

One way is to get the row numbers you need to replace using the special variable .I, and then assign NA to those rows by reference using the := operator (or set).

# get the row numbers
idx = DT[, .I[which(abs(c(NA, diff(wind))) > 35)], by=station][, V1]
# then assign by reference
DT[idx, wind := NA_real_]
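As a small worked illustration of the same two steps, here is a sketch on a six-row table. The table `toy` and its values are made up for this example. Within each group, .I holds the row numbers in the full table, so the collected indices can be used directly in `i`:

```r
library(data.table)

# Hypothetical toy data, rows ordered by date within each station
toy <- data.table(station = c(1L, 1L, 1L, 2L, 2L, 2L),
                  wind    = c(10, 50, 12, 90, 95, 20))
max_rate <- 35

# Step 1: per station, find the rows whose absolute lag-1 difference exceeds
# the threshold; .I maps them back to row numbers of the whole table
idx <- toy[, .I[which(abs(c(NA, diff(wind))) > max_rate)], by = station]$V1
idx

# Step 2: overwrite those rows by reference
toy[idx, wind := NA_real_]
toy
```

Rows 2, 3 and 6 are set to NA. Row 4 keeps its value even though it differs from row 3 by 78, because those two rows belong to different stations.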

FR #2793, filed by @eddi, would, if implemented, offer a more natural way to accomplish this task: the expression that yields the indices goes on the LHS and the replacement value on the RHS. That is, in the future, we should be able to do:

# in the future - a more natural way of doing the same operation shown above.
DT[, wind[which(abs(c(NA, diff(wind))) > 35)] := NA_real_, by=station]
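Independently of that feature request, the whole operation can also be written as one grouped call by building the replaced column with base R's `replace()` and assigning it with :=. This is a sketch of an alternative rather than the syntax proposed in the FR, and the table `toy` is made up for illustration:

```r
library(data.table)

# Hypothetical toy data, rows ordered by date within each station
toy <- data.table(station = c(1L, 1L, 1L, 2L, 2L, 2L),
                  wind    = c(10, 50, 12, 90, 95, 20))
max_rate <- 35

# For each station, rebuild wind with the offending positions set to NA,
# then assign the result back by reference
toy[, wind := replace(wind,
                      which(abs(c(NA, diff(wind))) > max_rate),
                      NA_real_),
    by = station]
toy
```

The trade-off is that this rewrites the whole column for every group, whereas the .I approach only touches the rows that change.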

3 Comments

Thank you very much. I am liking the data.table package very much so far, but there is indeed a learning curve coming from plyr. Do you know where I can find documentation for the data.table special variables (e.g. .I, .SD, NA_real_, etc)?
Also, a +1 from me for that feature request! That is similar to what I tried/expected before failing miserably.
Gladly. NA_real_ is R's way of representing NA for numeric types. The other special variables are documented in ?data.table. Make sure to go through example(data.table) (also in ?data.table) and run them one by one and understand what's happening.
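To see that NA_real_ is an ordinary base R constant rather than a data.table special variable, a quick check of the typed NA constants may help (the name `na_types` is just for this illustration):

```r
# Each typed NA constant carries a different storage type
na_types <- c(plain     = typeof(NA),
              real      = typeof(NA_real_),
              integer   = typeof(NA_integer_),
              character = typeof(NA_character_))
na_types
# plain is "logical", real is "double", integer is "integer", character is "character"
```

Since wind is a double column, NA_real_ is the NA whose type matches it.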
