Speeding up data.frame operations instead of looping

Question

I have the following dataset in R:

dat <- data.frame(t = rep(seq(1, 5, 1),4), id = rep(c(rep("A",5), rep("B",5), rep("C",5), rep("D",5)), 1),
                  x = 1:20, y = 51:70, h = c(rep(1,10), rep(0,10) ) ) 
library(dplyr)
library(tidyr)  # provides gather(), used in the attempt further down
dat <- arrange(dat, t)

The dataset is a panel with t as the time variable and id as the subject id. I need to attach an additional column in which, for each subject, I compute the sum of x times y over the remaining subjects at time t and divide it by the standard deviation of x over the remaining subjects at time t. This new column should show a zero for the subjects with h == 0.

For example, for subject A at time t == 1, the operation is: (6 * 56 + 11 * 61 + 16 * 66) / sd(c(6, 11, 16)). A similar operation for subject B at time t == 1 is (1 * 51 + 11 * 61 + 16 * 66) / sd(c(1, 11, 16)). However, for subjects C and D, the new column would contain only a 0.
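As a quick check of the arithmetic above, the t == 1 values can be reproduced by hand in base R. This is only a sketch: the vectors are typed in from the dataset, and `loo` is a hypothetical helper name that does not appear in the original post.

```r
# Values at t == 1, in subject order A, B, C, D (copied from the dataset above)
x <- c(1, 6, 11, 16)
y <- c(51, 56, 61, 66)
h <- c(1, 1, 0, 0)

# Leave-one-out value for subject i: drop element i, then sum(x * y) / sd(x)
loo <- function(i) sum(x[-i] * y[-i]) / sd(x[-i])

# Multiplying by h zeroes out the subjects with h == 0 (C and D)
new <- h * vapply(seq_along(x), loo, numeric(1))
new
```

For subject A this gives 2063 / 5 = 412.6, since sd(c(6, 11, 16)) is exactly 5.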

What is the fastest way to do this without a loop? I believe the dplyr package is the fastest option, but I'm quite new to it and unsure how to use it here. In my attempt I first group by time and then gather the variables, but I receive a warning and several variables are dropped. I'm unsure how to select the variables for each group.

dat %>%
  group_by(t) %>%
  gather(key, value, -t)
# Warning message:
# attributes are not identical across measure variables;
# they will be dropped

CONDITIONING

How can I include a condition in the previous operation so that, in the following table, only subjects with cond == id enter the computation? For example, for the first row the result would be 0, because subjects B, C and D all have a cond value different from their id (cond is A for each of them). For row 6, instead, the operation is (2*52 + 12*62 + 17*67) / sd(c(2,12,17)).

dat <- data.frame(t = rep(seq(1, 5, 1),4), id = rep(c(rep("A",5), rep("B",5), rep("C",5), rep("D",5)), 1),
                  x = 1:20, y = 51:70, h = c(rep(1,10), rep(0,10) ) )
dat <- arrange(dat, t)
dat <- data.frame(dat, cond = c("B", "A", "A", "A", "A", "B", "C", "D", "A", "B", "D", "C", "A", "D", "C", "A", "A", "C", "C", "B") )
dat

#    t id  x  y h cond
# 1  1  A  1 51 1    B
# 2  1  B  6 56 1    A
# 3  1  C 11 61 0    A
# 4  1  D 16 66 0    A
# 5  2  A  2 52 1    A
# 6  2  B  7 57 1    B
# 7  2  C 12 62 0    C
# 8  2  D 17 67 0    D
# 9  3  A  3 53 1    A
# 10 3  B  8 58 1    B
# 11 3  C 13 63 0    D
# 12 3  D 18 68 0    C
# 13 4  A  4 54 1    A
# 14 4  B  9 59 1    D
# 15 4  C 14 64 0    C
# 16 4  D 19 69 0    A
# 17 5  A  5 55 1    A
# 18 5  B 10 60 1    C
# 19 5  C 15 65 0    C
# 20 5  D 20 70 0    B

A proposed solution

dat %>% 
 filter(id == cond) %>% 
 group_by(t) %>% 
 mutate(new = h * ((sum(x *y) - (x * y))/map_dbl(row_number(), ~ sd(x[-.x])))) %>% 
 bind_rows(dat %>% filter(id != cond))

works, but only partially: it creates NaN from multiplying 0 * Inf. Instead, I would like a 0 whenever the conditions do not apply or the standard deviation in the denominator is 0. Thank you so much!

akrun · Accepted Answer · 2018-06-15 02:50:24Z


After grouping by 't', create the 'new' column as follows. The numerator is the group sum of the products of 'x' and 'y' minus the current row's own product, which leaves the sum over the other subjects. The denominator is the sd of 'x' with the current row excluded, obtained by looping over the row index (row_number()) and dropping that element. Multiplying by 'h' then gives 0 wherever 'h' is 0.

library(tidyverse)
out <- dat %>% 
         group_by(t) %>% 
         mutate(new =  h * ((sum(x *y) - (x * y))/map_dbl(row_number(),
                                                     ~ sd(x[-.x]))))
head(out, 4)
# A tibble: 4 x 6
# Groups:   t [1]
#      t id        x     y     h   new
#  <dbl> <fct> <int> <int> <dbl> <dbl>
#1     1 A         1    51     1  413.
#2     1 B         6    56     1  233.
#3     1 C        11    61     0    0 
#4     1 D        16    66     0    0
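One way to sanity-check this result is to compare it against a plain loop that recomputes every value from the "other" rows at the same time point. The loop below is an illustrative reference, not part of the answer: the names `ref` and `others` are assumptions, and the sketch assumes dplyr and purrr are installed.

```r
library(dplyr)
library(purrr)

# Same data as in the question, written with each = for brevity
dat <- data.frame(t = rep(1:5, 4),
                  id = rep(c("A", "B", "C", "D"), each = 5),
                  x = 1:20, y = 51:70,
                  h = rep(c(1, 0), each = 10))
dat <- arrange(dat, t)

# Grouped version from the answer
out <- dat %>%
  group_by(t) %>%
  mutate(new = h * ((sum(x * y) - x * y) /
                      map_dbl(row_number(), ~ sd(x[-.x])))) %>%
  ungroup()

# Explicit loop: for row i, use all other rows with the same t
ref <- numeric(nrow(dat))
for (i in seq_len(nrow(dat))) {
  others <- dat$t == dat$t[i] & seq_len(nrow(dat)) != i
  ref[i] <- dat$h[i] * sum(dat$x[others] * dat$y[others]) / sd(dat$x[others])
}

# TRUE when both approaches agree up to floating-point tolerance
all.equal(out$new, ref)
```

The grouped version avoids rescanning the whole table for each row, which is where the loop spends most of its time on larger panels.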

edited Jun 15, 2018 at 2:50

answered Jun 15, 2018 at 2:37


5 Comments

Andrew Over a year ago

This works perfectly! Thank you! Would you also know how to extend the code so that the sum in the numerator and the sd in the denominator use only the individuals that meet a condition? E.g., assume the data is

dat <- data.frame(t = rep(seq(1, 5, 1),4), id = rep(c(rep("A",5), rep("B",5), rep("C",5), rep("D",5)), 1), x = 1:20, y = 51:70, h = c(rep(1,10), rep(0,10) ), cond = sample(c("A", "B"), 20, replace = T) )

and I get new only out of the subjects that meet id != cond.

Andrew Over a year ago

Sorry if that was confusing. I'm wondering how to modify the mutate line with additional conditioning. For example, computing the operation above (sum over the remaining subjects / standard deviation) using only data for subjects such that id == cond (given that cond takes values in either A or B).

akrun Over a year ago

@Andrew Maybe you meant

dat %>%
  filter(id == cond) %>%
  group_by(t) %>%
  mutate(new = h * ((sum(x * y) - (x * y)) / map_dbl(row_number(), ~ sd(x[-.x])))) %>%
  bind_rows(dat %>% filter(id != cond))

Andrew Over a year ago

Yes! This almost does it! It just creates NaNs instead of zeros. I added an example in the question text! Thank you so much, akrun!

Andrew Over a year ago

I think adding these two lines to your last code does the trick: replace_na(list(new = 0)) %>% arrange(t, id). Thank you so much!
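A possible variation on the replace_na() fix, offered only as a sketch: replace_na() fills missing values, so an Inf produced by a zero standard deviation would pass through unchanged. Testing with is.finite() catches NA, NaN and Inf in one step. The helper `loo_new` below is hypothetical (it is not from the thread), uses vapply() in place of map_dbl(), and keeps all rows in place, so no filter()/bind_rows() step is needed.

```r
library(dplyr)

dat <- data.frame(t = rep(1:5, 4),
                  id = rep(c("A", "B", "C", "D"), each = 5),
                  x = 1:20, y = 51:70,
                  h = rep(c(1, 0), each = 10),
                  stringsAsFactors = FALSE)
dat <- arrange(dat, t)
dat$cond <- c("B", "A", "A", "A", "A", "B", "C", "D", "A", "B",
              "D", "C", "A", "D", "C", "A", "A", "C", "C", "B")

# Hypothetical helper: leave-one-out ratio over the rows where keep is TRUE.
# Rows failing the condition, and any non-finite result, get 0.
loo_new <- function(x, y, h, keep) {
  idx <- which(keep)
  res <- numeric(length(x))
  res[idx] <- vapply(idx, function(i) {
    others <- setdiff(idx, i)
    val <- h[i] * sum(x[others] * y[others]) / sd(x[others])
    if (is.finite(val)) val else 0
  }, numeric(1))
  res
}

out <- dat %>%
  group_by(t) %>%
  mutate(new = loo_new(x, y, h, id == cond)) %>%
  ungroup()
```

With the example data, only rows 5 and 6 (subjects A and B at t == 2) end up non-zero; at every other time point fewer than three subjects satisfy id == cond, so the leave-one-out standard deviation is undefined and the value falls back to 0.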
