Imagine we have a data set called df that consists of two variables, year and x1:

year <- c(2000, 2001, 2002, 2003, 2004)
x1 <- c(7, 8, 6, 3, 3)
df <- data.frame(year, x1)

My task is to compute two new variables from x1. The first, cSum, must be the sum of the values of x1 over the last two years (the current year and the one before it). The second, cMax, must be the highest value of x1 over the last three years (the current year and the two before it).

The outcome should be as follows:

year  x1  cSum  cMax
2000   7     
2001   8    15     
2002   6    14     8
2003   3     9     8
2004   3     6     6

How can I compute the cSum and cMax variables above?

Thanks!

2 Answers

Using data.table:

library(data.table)
setDT(df)

First, a convoluted way; since transpose is optimized, this may be faster (untested):

df[ , cSum := transpose(lapply(transpose(shift(x1, 0:1)), sum))]
df[ , cMax := transpose(lapply(transpose(shift(x1, 0:2)), max))]

shift is essentially a lag operator; we want lags 0, 1, and (for cMax) 2 to get the current and prior 1 (or 2) periods.
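To make the lag idea concrete, here is a small sketch of what shift returns when several lags are requested at once. It assumes only the x1 vector from the question:

```r
library(data.table)

x1 <- c(7, 8, 6, 3, 3)

# Requesting several lags returns a list with one vector per lag,
# each the same length as x1 and padded with NA at the start.
lags <- shift(x1, 0:2)

lags[[1]]  # lag 0: 7 8 6 3 3
lags[[2]]  # lag 1: NA 7 8 6 3
lags[[3]]  # lag 2: NA NA 7 8 6
```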

Alternatively:

df[ , cSum := rowSums(do.call(cbind, shift(x1, 0:1)))]
df[ , cMax := do.call(pmax, shift(x1, 0:2))]

Both give the same output:

df
#    year x1 cSum cMax
# 1: 2000  7   NA   NA
# 2: 2001  8   15   NA
# 3: 2002  6   14    8
# 4: 2003  3    9    8
# 5: 2004  3    6    6

What makes this messy is that when shift returns more than one lag, it returns a list, and unfortunately this list is the transpose of what we need: we are doing a row-wise operation, but the list holds one vector per lag (column-wise). The first option transposes the list into a more manageable form, does the row-wise operation, and then transposes back into columnar form.
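As an illustration of that transposition, the sketch below looks at the intermediate objects for the two-period sum. The variable names by_lag and by_row are made up for this example:

```r
library(data.table)

x1 <- c(7, 8, 6, 3, 3)

by_lag <- shift(x1, 0:1)     # list of 2 vectors: one per lag
by_row <- transpose(by_lag)  # list of 5 vectors: one per row of the data

by_row[[2]]  # current and previous value for the second row: 8 7

# Summing each per-row vector gives the two-period sum
row_sums <- sapply(by_row, sum)
row_sums  # NA 15 14 9 6
```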

The second option binds the lags into a matrix with cbind and applies rowSums to it for cSum; for cMax, pmax computes the element-wise maximum across the lagged vectors directly.
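If the window length needs to vary, the second option can be wrapped in small helpers. This is only a sketch: lag_sum and lag_max are hypothetical names, and k is assumed to be at least 2, because shift returns a plain vector rather than a list when a single lag is requested.

```r
library(data.table)

# Hypothetical helpers: sum / max over the current and previous k - 1 values
lag_sum <- function(x, k) {
  stopifnot(k >= 2)
  rowSums(do.call(cbind, shift(x, 0:(k - 1))))
}

lag_max <- function(x, k) {
  stopifnot(k >= 2)
  do.call(pmax, shift(x, 0:(k - 1)))
}

x1 <- c(7, 8, 6, 3, 3)
lag_sum(x1, 2)  # NA 15 14 9 6
lag_max(x1, 3)  # NA NA 8 8 6
```

Inside a data.table these could be used as df[ , cSum := lag_sum(x1, 2)].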

7 Comments

Is transpose needed? This would achieve the same for cSum: df[, cSum := x1 + shift(x1, 1, type = "lag")]
@Bg1850 I was actually going to add that, thanks for pointing it out. That approach is not very extensible (summing 10 periods, e.g.), but it is certainly more pleasant in this case.
Thanks! One more thing, if possible: how should I edit the code if I want to do this without the lag? That is, in a way that results in the NA values going to the bottom of the column rather than the top?
You mean leading instead of lagging? Simply negate the indices.
I tried, but negative numbers (0:1) return an error.
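For the leading variant asked about here, one option that avoids negative indices is the type argument of shift. A minimal sketch, assuming the same x1 as in the question (the names lead_sum and lead_max are made up for this example):

```r
library(data.table)

x1 <- c(7, 8, 6, 3, 3)

# type = "lead" looks forward instead of backward,
# so the NA padding ends up at the bottom
lead_sum <- rowSums(do.call(cbind, shift(x1, 0:1, type = "lead")))
lead_max <- do.call(pmax, shift(x1, 0:2, type = "lead"))

lead_sum  # 15 14 9 6 NA
lead_max  # 8 8 6 NA NA
```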

Here is an approach using a lag operator. Essentially, I augment your data with lagged columns so that no for loops are needed, at the cost of using more memory. This approach may make sense if you are going to do more time series analysis with this data set. The answer uses the zoo package, which is my favorite time series package, but there are others, such as ts and xts (which is generally faster than zoo).

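library(zoo)

year <- c(2000, 2001, 2002, 2003, 2004, 2005)
x1 <- c(7, 8, 6, 3, 3, 6)
df <- data.frame(year, x1)

# Build a zoo series of x1 indexed by year
dfZ <- zoo(df[, -1], order.by = df[, 1])

# Add the series lagged by one and two periods as extra columns
dfZ <- merge(dfZ, lag(dfZ, seq(-1, -2)))

# L0 is the current value; L1 and L2 are the first and second lags
names(dfZ) <- paste0("L", seq(0, 2))

dfZ$cSum <- rowSums(dfZ[, c("L0", "L1")])
dfZ$cMax <- apply(dfZ[, c("L0", "L1", "L2")], 1, max)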
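If the extra lag columns are not needed afterwards, zoo's rolling-window function rollapplyr is a possible alternative. This swaps in a different technique from the lag-and-merge code above, and is shown only as a sketch on the same six-row data:

```r
library(zoo)

year <- c(2000, 2001, 2002, 2003, 2004, 2005)
x1 <- c(7, 8, 6, 3, 3, 6)
df <- data.frame(year, x1)

# Right-aligned windows: each value covers the current and previous periods;
# fill = NA pads the rows where the window is incomplete
df$cSum <- rollapplyr(df$x1, width = 2, FUN = sum, fill = NA)
df$cMax <- rollapplyr(df$x1, width = 3, FUN = max, fill = NA)

df$cSum  # NA 15 14 9 6 9
df$cMax  # NA NA 8 8 6 6
```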
