Define variable iteratively in data.table in R

Question

I am trying to find a faster solution to defining a variable iteratively, i.e., the next row of the variable depends on the previous row. For example, suppose I have the following data.table:

tmp <- data.table(type = c("A", "A", "A", "B", "B", "B"), 
                  year = c(2011, 2012, 2013, 2011, 2012, 2013), 
                  alpha = c(1,1,1,2,2,2), 
                  beta = c(3,3,3,4,4,4), 
                  pred = c(1,NA,NA,2,NA, NA))

For each type (A and then B), I want to solve for pred going forward, where pred for type A for the year 2012 is:

pred_2012_A = alpha + beta * pred_2011_A

and the pred for 2013 for type A continues:

pred_2013_A = alpha + beta * pred_2012_A

I have a solution that uses a for loop over type, stores the previous value in a temporary variable, and uses data.table's "by" argument to step through the years:

for(i in c("A", "B")){
  tmp.val <- tmp[type == i & year == 2011]$pred # initial value for type i
  tmp[year > 2011 & type == i, pred := {
    tmp.val <- alpha + beta * tmp.val
  }, by = year]
}

The original data table looks like:

   type year alpha beta pred
1:    A 2011     1    3    1
2:    A 2012     1    3   NA
3:    A 2013     1    3   NA
4:    B 2011     2    4    2
5:    B 2012     2    4   NA
6:    B 2013     2    4   NA

And the updated table looks like:

   type year alpha beta pred
1:    A 2011     1    3    1
2:    A 2012     1    3    4
3:    A 2013     1    3   13
4:    B 2011     2    4    2
5:    B 2012     2    4   10
6:    B 2013     2    4   42

My question here is if there is a faster way to implement this without the for loop. Is there a way to implement this routine in one data table statement that is faster than using the for loop? My real usage has many more types and many more years to compute, so a faster implementation would be greatly appreciated.

Thank you.

Thank you for your solution, Frank. However, it seems that I made this example too easy. What if the predicted values do not have a nice closed-form solution based on the initial values? Ultimately, I'm trying to find the fastest way to access the previous value of pred when calculating the next value of pred, without storing it in a temporary variable and using the for loop. I think this might be the case if alpha and beta change each year. Does this make sense?
– naveendaftari, Commented Jul 25, 2016 at 22:30
If your solution has to be iterative, there is no way (in R or any language, I guess) to do it apart from in a loop. Such a for loop would probably be pretty slow in R, but you could translate it to C++ and use the Rcpp library, which would probably help a lot.
– Frank, Commented Jul 25, 2016 at 22:59
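
For the case raised in these comments, where alpha and beta may change each year, one option is to keep the loop but move it into a small function that is called once per type. The sketch below is only an illustration: iterate_pred is a hypothetical helper name, and it assumes that each row's own alpha and beta are used to compute that row's pred, that rows are sorted by year within type, and that the first row of each type holds the known starting value.

```r
library(data.table)

tmp <- data.table(
  type  = c("A", "A", "A", "B", "B", "B"),
  year  = c(2011, 2012, 2013, 2011, 2012, 2013),
  alpha = c(1, 1, 1, 2, 2, 2),
  beta  = c(3, 3, 3, 4, 4, 4),
  pred  = c(1, NA, NA, 2, NA, NA)
)

# Hypothetical helper (not from the original post):
# computes pred[i] = alpha[i] + beta[i] * pred[i - 1], starting from pred0.
iterate_pred <- function(pred0, alpha, beta) {
  out <- numeric(length(alpha))
  out[1] <- pred0
  for (i in seq_along(alpha)[-1]) {
    out[i] <- alpha[i] + beta[i] * out[i - 1]
  }
  out
}

# Assumption: the recurrence runs in year order within each type.
setorder(tmp, type, year)
tmp[, pred := iterate_pred(pred[1], alpha, beta), by = type]

tmp$pred
# [1]  1  4 13  2 10 42
```

The loop still runs in R, but only once per type over plain vectors, with no data.table subsetting or grouping by year inside it. If that is still too slow, the same function body is the part that could be ported to C++ with Rcpp, as suggested above.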

Frank · Accepted Answer · 2016-07-25 21:19:18Z

You can just do the math:

tmp[, pred := pred[1]*beta^(1:.N-1) + alpha*cumsum(c(0, beta[1]^(0:(.N-2)))), by=type]

#    type year alpha beta pred
# 1:    A 2011     1    3    1
# 2:    A 2012     1    3    4
# 3:    A 2013     1    3   13
# 4:    B 2011     2    4    2
# 5:    B 2012     2    4   10
# 6:    B 2013     2    4   42
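
For reference, here is a sketch of the math behind that one-liner, assuming alpha and beta are constant within a type. The symbols p_n (the pred value n years after the first row), a (alpha) and b (beta) are notation introduced here, not in the original post.

```latex
\begin{align*}
p_n &= a + b\,p_{n-1} \\
    &= a + b\,(a + b\,p_{n-2}) \\
    &= \dots \\
    &= b^{n}\,p_0 + a \sum_{k=0}^{n-1} b^{k}, \qquad n \ge 1 .
\end{align*}
% For b \neq 1 the sum equals (b^n - 1)/(b - 1).
% Check with type A (a = 1, b = 3, p_0 = 1):
%   p_2 = 3^2 \cdot 1 + 1 \cdot (1 + 3) = 13.
```

In the code, beta^(1:.N-1) supplies the b^n factor for n = 0, ..., .N-1, and cumsum(c(0, beta[1]^(0:(.N-2)))) supplies the running sum, which is 0 for the first row.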

Comment. In my opinion, the data structure in the OP is flawed. Alpha and beta are clearly attributes of the type, not something that is varying from row to row. It should start with:

typeDT = data.table(
  type=c("A","B"), 
  year.start = 2011L, 
  year.end=2013, 
  a = 1:2, 
  b = 3:4,
  pred0 = 1:2
)

#    type year.start year.end a b pred0
# 1:    A       2011     2013 1 3     1
# 2:    B       2011     2013 2 4     2

With this structure, you could expand to your data set naturally:

typeDT[, {
  year = year.start:year.end
  n    = length(year)
  p    = pred0*b^(0:(n-1)) + a*cumsum(c(0, b^(0:(n-2))))
  .(year = year, pred = p)
}, by=type]

#    type year pred
# 1:    A 2011    1
# 2:    A 2012    4
# 3:    A 2013   13
# 4:    B 2011    2
# 5:    B 2012   10
# 6:    B 2013   42

shayaa · Answer · 2016-07-25 21:23:28Z

A bit hacky but bear with me, it only takes two iterations.

df <- read.table(text = "type year alpha beta pred
1:    A 2011     1    3    1
2:    A 2012     1    3   NA
3:    A 2013     1    3   NA
4:    B 2011     2    4    2
5:    B 2012     2    4   NA
6:    B 2013     2    4   NA", header = TRUE)

df2 <- df

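while (any(is.na(df2$pred))) {
  # dplyr::lag() shifts by one row; base stats::lag() would not shift a plain vector
  df2$pred <- df2$alpha + df2$beta * dplyr::lag(df2$pred)
  # restore the known starting values overwritten by the line above
  known <- !is.na(df$pred)
  df2$pred[known] <- df$pred[known]
}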

This gives the expected result:

df2
   type year alpha beta pred
1:    A 2011     1    3    1
2:    A 2012     1    3    4
3:    A 2013     1    3   13
4:    B 2011     2    4    2
5:    B 2012     2    4   10
6:    B 2013     2    4   42

3 Comments

Frank Over a year ago

I'm confused. How does it work without accounting for grouping by type?

shayaa Over a year ago

...because he had the first pred value for each type. Not sure this is a fair assumption, then again, the question didn't exactly fit the example very well, as you well noted. It is not really the worst assumption because without the first value, you resort to carrying forth a bunch of NAs.

Frank Over a year ago

Ok. I suspect it also depends on having the same number of rows per group, though haven't really thought it out.
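
To probe the question in these comments, here is a small sketch that runs the same fill-and-restore loop on made-up data with unequal group sizes (three rows for A, two for B). It uses a base R stand-in for the one-row shift, named shift1 here purely for illustration, and it assumes rows are sorted by type and year with a known pred on the first row of every type.

```r
# Base R stand-in for a one-row lag: shift a vector down by one position.
shift1 <- function(x) c(NA, head(x, -1))

# Made-up data with unequal group sizes (not from the original post).
df <- data.frame(
  type  = c("A", "A", "A", "B", "B"),
  year  = c(2011, 2012, 2013, 2011, 2012),
  alpha = c(1, 1, 1, 2, 2),
  beta  = c(3, 3, 3, 4, 4),
  pred  = c(1, NA, NA, 2, NA)
)

known  <- which(!is.na(df$pred))  # rows holding the given starting values
df2    <- df
n_iter <- 0
while (any(is.na(df2$pred))) {
  # Every row is computed from the row above it, ignoring type ...
  df2$pred <- df2$alpha + df2$beta * shift1(df2$pred)
  # ... then the first row of each type is reset to its known value,
  # which discards whatever leaked across the type boundary.
  df2$pred[known] <- df$pred[known]
  n_iter <- n_iter + 1
}

df2$pred
# [1]  1  4 13  2 10
n_iter
# [1] 2
```

In this sketch the loop still converges with unequal group sizes, because the reset step is what stops values from crossing from one type into the next. Each pass fills in one more year per type, so the number of passes is one less than the length of the longest group.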
