How to find linear regression between groups using data.table?

Question

I am trying to find the linear regression between all available groups of the following dataset.

library(data.table)
dt <- data.table(time = c(rep(rep(1:100, times = 1), 4), rep(1:30, times = 1)),
                   group = c(rep(c("a","b","c","d"), each = 100), rep("e", 30)), 
                   value = rnorm(430))
dt[]
      time group      value
  1:    1     a  0.1625954
  2:    2     a -1.2288462
  3:    3     a -0.1628570
  4:    4     a  1.0597886
  5:    5     a -1.1828334
 ---                      
426:   26     e -1.3762654
427:   27     e  0.3761436
428:   28     e -1.6982330
429:   29     e  0.1940263
430:   30     e -0.4631258

The output should be something like

group1     group2      regression
a           b           1.2
a           c           0.3
b           c           0.5
d           a           4.3
...

I am looking for a solution using data.table library only.

Linear regression of all the combinations of groups should be found. That includes cases a~b and b~a as the regression for each of these cases will be different.
Since the size of some groups is different, the time variables should be used to find the common rows between any set of groups.
The solution will require finding all combinations of groups.
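
A minimal sketch of the two requirements above, on made-up toy data (the names `toy`, `pairs`, and `common` are illustrative assumptions, not part of the original dataset). `CJ` enumerates every ordered pair of distinct groups, and a join on `time` with `nomatch = NULL` keeps only the time points that two groups share.

```r
library(data.table)

# Toy data (assumed for illustration): a and b cover times 1:5, e only 1:3
toy <- data.table(
  time  = c(1:5, 1:5, 1:3),
  group = c(rep("a", 5), rep("b", 5), rep("e", 3)),
  value = c(1, 2, 3, 4, 5,  2, 4, 5, 4, 5,  3, 1, 2)
)

# All ordered pairs of distinct groups, so both (a, b) and (b, a) appear
groups <- unique(toy$group)
pairs  <- CJ(group1 = groups, group2 = groups)[group1 != group2]

# Inner join on time: only time points present in both groups survive.
# 'value' comes from group a, 'i.value' from group e.
common <- toy[group == "a"][toy[group == "e"], on = .(time), nomatch = NULL]
```

Here `pairs` has one row per ordered pair, and `common` is restricted to the times that groups a and e have in common, which is the input a pairwise `lm` call would need.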

I am looking for both (a,b) and (b,a), as regression of each will give different results.
– Saurabh, Commented Jun 24, 2021 at 18:45
One more thing, the size of these groups might be different. The data.table will be like: dt <- data.table(group = c(rep(1:4, 100), rep(5, 30)), a = rnorm(430))
– Saurabh, Commented Jun 24, 2021 at 18:54
Thanks @Arun for handling the case where li < 0. Please post this answer.
– Saurabh, Commented Jun 24, 2021 at 19:43

akrun · Accepted Answer · 2021-06-24 22:35:57Z

With the new data, we can split the data by 'group' into a list. Then use combn on the names of the list to get the pairwise combinations, extract the two list elements (s1, s2), and check whether they share any 'time' values (intersect). If they do, join the two on 'time', fit lm on the corresponding 'value' columns (i.value from s2 regressed on value from s1), and return a one-row data.table with the slope coefficient and the group names; otherwise return NA for that pair. Finally, rbindlist combines the list elements.

library(data.table)

# One data.table per group
lst1 <- split(dt, dt$group)

# Unordered pairs of groups; each call returns a one-row data.table
rbindlist(combn(names(lst1), 2, FUN = function(x) {
  s1 <- lst1[[x[1]]]
  s2 <- lst1[[x[2]]]
  # time points present in both groups
  i1 <- intersect(s1$time, s2$time)
  if (length(i1) > 0) {
    # join on time: 'value' comes from s1, 'i.value' from s2
    na.omit(s1[s2, on = .(time)][,
      .(group1 = first(s1$group), group2 = first(s2$group),
        regression = coef(lm(i.value ~ value))[2])])
  } else {
    data.table(group1 = first(s1$group), group2 = first(s2$group),
               regression = NA_real_)
  }
}, simplify = FALSE))

-output

     group1 group2  regression
 1:      a      b  0.03033996
 2:      a      c  0.06391242
 3:      a      d -0.09138112
 4:      a      e -0.27738183
 5:      b      c  0.05663270
 6:      b      d  0.05481604
 7:      b      e  0.27789495
 8:      c      d -0.13987978
 9:      c      e  0.16388299
10:      d      e  0.12380720

If we want the full set of ordered combinations, use either expand.grid or CJ (from data.table):

# All ordered pairs of distinct groups
dt2 <- CJ(group1 = names(lst1), group2 = names(lst1))[group1 != group2]
dt2[, rbindlist(Map(function(x, y) {
  s1 <- lst1[[x]]
  s2 <- lst1[[y]]
  i1 <- intersect(s1$time, s2$time)
  if (length(i1) > 0) {
    na.omit(s1[s2, on = .(time)][,
      data.table(group1 = x, group2 = y,
                 regression = coef(lm(i.value ~ value))[2])])
  } else {
    data.table(group1 = x, group2 = y, regression = NA_real_)
  }
}, group1, group2))]

-output

    group1 group2  regression
 1:      a      b  0.03033996
 2:      a      c  0.06391242
 3:      a      d -0.09138112
 4:      a      e -0.27738183
 5:      b      a  0.03247826
 6:      b      c  0.05663270
 7:      b      d  0.05481604
 8:      b      e  0.27789495
 9:      c      a  0.07488082
10:      c      b  0.06198333
11:      c      d -0.13987978
12:      c      e  0.16388299
13:      d      a -0.09295215
14:      d      b  0.05208743
15:      d      c -0.12144302
16:      d      e  0.12380720
17:      e      a -0.25136439
18:      e      b  0.34052322
19:      e      c  0.28677255
20:      e      d  0.21435666
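
As a side note on why the a~b and b~a rows differ: the slope of y ~ x equals cov(x, y) / var(x), so swapping the two variables changes the denominator. A small sketch on made-up vectors (x and y are illustrative assumptions, not taken from dt):

```r
# Illustrative vectors, not from the question's data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Slope of y regressed on x: cov(x, y) / var(x) = 1.5 / 2.5 = 0.6
slope_yx <- unname(coef(lm(y ~ x))[2])

# Slope of x regressed on y: cov(x, y) / var(y) = 1.5 / 1.5 = 1
slope_xy <- unname(coef(lm(x ~ y))[2])
```

The two slopes coincide only when var(x) equals var(y), which is why both orderings have to be computed separately.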

edited Jun 24, 2021 at 22:35

answered Jun 24, 2021 at 19:48

akrun

7 Comments

Saurabh Over a year ago

Thanks @Arun! It does handle cases such as a~b, but the other scenario of b~a is not produced. Is it possible for you to update the code to handle b~a scenario?

akrun Over a year ago

@Saurabh For that, we need expand.grid or CJ, and then remove the ones that are same

Saurabh Over a year ago

Okay, I will try to do that. Thanks!

akrun Over a year ago

@Saurabh I didn't use it earlier because you mentioned the data is really big with lots of groups

Saurabh Over a year ago

Thanks @Arun for the brilliant answer.
