I am trying to fit a linear regression between every pair of groups in the following dataset.

library(data.table)
dt <- data.table(time = c(rep(rep(1:100, times = 1), 4), rep(1:30, times = 1)),
                   group = c(rep(c("a","b","c","d"), each = 100), rep("e", 30)), 
                   value = rnorm(430))
dt[]
      time group      value
  1:    1     a  0.1625954
  2:    2     a -1.2288462
  3:    3     a -0.1628570
  4:    4     a  1.0597886
  5:    5     a -1.1828334
 ---                      
426:   26     e -1.3762654
427:   27     e  0.3761436
428:   28     e -1.6982330
429:   29     e  0.1940263
430:   30     e -0.4631258

The output should be something like

group1     group2      regression
a           b           1.2
a           c           0.3
b           c           0.5
d           a           4.3
...

I am looking for a solution that uses only the data.table package.

  1. Linear regression of all the combinations of groups should be found. That includes cases a~b and b~a as the regression for each of these cases will be different.
  2. Since the size of some groups is different, the time variables should be used to find the common rows between any set of groups.
  3. The solution will require finding all combinations of groups.
  • I guess you are looking for only a, b and not b, a right? Commented Jun 24, 2021 at 18:45
  • I am looking for both (a,b) and (b,a), as regression of each will give different results. Commented Jun 24, 2021 at 18:45
  • One more thing, the size of these groups might be different. The data.table will be like - dt <- data.table(group = c(rep(1:4, 100), rep(5, 30)), a = rnorm(430)) Commented Jun 24, 2021 at 18:54
  • Thanks @Arun for handling the case where li < 0. Please post this answer. Commented Jun 24, 2021 at 19:43
  • Sure thing, I will try foreach. Thanks for the tip! Commented Jun 24, 2021 at 19:55

1 Answer

With the new data, we can split the data by 'group' into a list. Then use combn on the names of the list to get the pairwise combinations, extract the two list elements (s1, s2), and check whether they share any 'time' values (intersect). If they do, join the two on 'time' and apply lm to the matched 'value' columns; in the join, 'value' comes from s1 and 'i.value' from s2, so the slope is for group2 regressed on group1. Each pair returns a one-row data.table with the slope and the group names (NA when there is no common 'time'), and rbindlist binds the list elements together.

library(data.table)
lst1 <- split(dt, dt$group)
rbindlist(combn(names(lst1), 2, FUN = function(x) {
      s1 <- lst1[[x[1]]]
      s2 <- lst1[[x[2]]]
      # 'time' values present in both groups
      i1 <- intersect(s1$time, s2$time)
      # join on 'time': 'value' comes from s1, 'i.value' from s2
      if(length(i1) > 0) na.omit(s1[s2, on = .(time)][, 
        .(group1 = first(s1$group), group2 = first(s2$group), 
          regression = coef(lm(i.value ~ value))[2])]) 
       else
         data.table(group1 = first(s1$group), group2 = first(s2$group), 
         regression = NA_real_)}, simplify = FALSE))

-output

     group1 group2  regression
 1:      a      b  0.03033996
 2:      a      c  0.06391242
 3:      a      d -0.09138112
 4:      a      e -0.27738183
 5:      b      c  0.05663270
 6:      b      d  0.05481604
 7:      b      e  0.27789495
 8:      c      d -0.13987978
 9:      c      e  0.16388299
10:      d      e  0.12380720
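
To see which direction each slope runs in, a small check on toy data may help. This is only a sketch: the two tables below are hypothetical and built so that the second group is exactly 2 * (first group) + 1, which makes the expected slopes easy to reason about.

```r
library(data.table)

# Hypothetical toy groups: 'a' covers time 1:5, 'b' only time 1:3
s1 <- data.table(time = 1:5, group = "a", value = c(1, 2, 3, 4, 5))
s2 <- data.table(time = 1:3, group = "b", value = c(3, 5, 7))  # b = 2 * a + 1

# s1[s2, on = ...] keeps every row of s2; clashing columns from s2 get an "i." prefix
joined <- s1[s2, on = .(time)]
names(joined)

# Slope of group2 (i.value, from s2) regressed on group1 (value, from s1)
slope_b_on_a <- coef(lm(i.value ~ value, data = joined))[["value"]]

# Swapping the two sides of the formula gives the other direction
slope_a_on_b <- coef(lm(value ~ i.value, data = joined))[["i.value"]]
```

Only the three shared 'time' values enter the fit, and the two slopes differ, which is why a~b and b~a have to be computed separately.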

If we want all ordered combinations (both a, b and b, a), use either expand.grid or CJ (from data.table) and drop the pairs where both groups are the same

# all ordered pairs of distinct groups
dt2 <- CJ(group1 = names(lst1), group2 = names(lst1))[group1 != group2]
dt2[, rbindlist(Map(function(x, y) {
       s1 <- lst1[[x]]
       s2 <- lst1[[y]]
       i1 <- intersect(s1$time, s2$time)
       # same column name in both branches so rbindlist lines them up
       if(length(i1) > 0) na.omit(s1[s2, on = .(time)][,
           data.table(group1 = x, group2 = y, 
          regression = coef(lm(i.value ~ value))[2])]) else 
           data.table(group1 = x, group2 = y, regression = NA_real_)
        }, group1, group2))]

-output

    group1 group2  regression
 1:      a      b  0.03033996
 2:      a      c  0.06391242
 3:      a      d -0.09138112
 4:      a      e -0.27738183
 5:      b      a  0.03247826
 6:      b      c  0.05663270
 7:      b      d  0.05481604
 8:      b      e  0.27789495
 9:      c      a  0.07488082
10:      c      b  0.06198333
11:      c      d -0.13987978
12:      c      e  0.16388299
13:      d      a -0.09295215
14:      d      b  0.05208743
15:      d      c -0.12144302
16:      d      e  0.12380720
17:      e      a -0.25136439
18:      e      b  0.34052322
19:      e      c  0.28677255
20:      e      d  0.21435666
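
As a possible alternative that stays inside data.table, the pairing can also be done with a single self-join on 'time' followed by a grouped lm. This is a sketch under assumptions rather than part of the answer above: the toy data is hypothetical, pairs with no common 'time' simply do not appear (instead of getting NA), and the join materialises every pair of rows that share a 'time', so it may use a lot of memory when there are many groups.

```r
library(data.table)

# Hypothetical toy data with exact linear relationships so the slopes are easy to check
toy <- rbind(
  data.table(time = 1:10, group = "a", value = as.numeric(1:10)),
  data.table(time = 1:10, group = "b", value = 2 * (1:10) + 1),
  data.table(time = 1:4,  group = "c", value = -(1:4) + 0.5)
)

# Self-join on 'time': each row is paired with every row that shares its time,
# then pairs of a group with itself are dropped
pairs <- toy[toy, on = .(time), allow.cartesian = TRUE][group != i.group]

# One slope per ordered pair; i.value (group2) is regressed on value (group1)
res <- pairs[, .(regression = coef(lm(i.value ~ value))[[2]]),
             by = .(group1 = group, group2 = i.group)]
setorder(res, group1, group2)
res[]
```

The direction matches the code above: 'value' belongs to group1 and 'i.value' to group2, so each row holds the slope of group2 regressed on group1.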

7 Comments

Thanks @Arun! It does handle cases such as a~b, but the other scenario of b~a is not produced. Is it possible for you to update the code to handle b~a scenario?
@Saurabh For that, we need expand.grid or CJ, and then remove the ones that are same
Okay, I will try to do that. Thanks!
@Saurabh I didn't use it earlier because you mentioned the data is really big with lots of groups
Thanks @Arun for the brilliant answer.