I have the following dataset in R:
dat <- data.frame(t = rep(seq(1, 5, 1),4), id = rep(c(rep("A",5), rep("B",5), rep("C",5), rep("D",5)), 1),
x = 1:20, y = 51:70, h = c(rep(1,10), rep(0,10) ) )
library(dplyr)
dat <- arrange(dat, t)
The dataset is a panel with t as the time variable and id as the subject id. I need to add a new column that, for each row, contains the sum of x times y over the remaining subjects at time t, divided by the standard deviation of the x values of the remaining subjects at time t. This new column should show a zero for the subjects with h == 0.
For example, for subject A at time t == 1, the operation is: (6 * 56 + 11 * 61 + 16 * 66) / sd(c(6, 11, 16)). The corresponding operation for subject B at time t == 1 is (1 * 51 + 11 * 61 + 16 * 66) / sd(c(1, 11, 16)). For subjects C and D, however, the new column would simply contain a 0.
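To make the target concrete, here is a minimal sketch of the leave-one-out computation described above. It is only one possible formulation, not necessarily the fastest; the helper name `loo_ratio` and the result name `res` are made up for illustration.

```r
library(dplyr)

# Same data as above, written more compactly
dat <- data.frame(
  t  = rep(1:5, 4),
  id = rep(c("A", "B", "C", "D"), each = 5),
  x  = 1:20,
  y  = 51:70,
  h  = rep(c(1, 0), each = 10)
)

# Hypothetical helper: for each position i, sum(x * y) over the other
# rows divided by sd(x) over the other rows
loo_ratio <- function(x, y) {
  vapply(seq_along(x),
         function(i) sum(x[-i] * y[-i]) / sd(x[-i]),
         numeric(1))
}

res <- dat %>%
  arrange(t) %>%
  group_by(t) %>%
  mutate(new = ifelse(h == 1, loo_ratio(x, y), 0)) %>%  # 0 where h == 0
  ungroup()

# Subject A at t == 1 should equal the worked example above
res$new[res$id == "A" & res$t == 1]
(6 * 56 + 11 * 61 + 16 * 66) / sd(c(6, 11, 16))
```

The helper still iterates internally through `vapply()`, so this mainly pins down the expected result rather than settling the performance question.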
What is the fastest way to do this without a loop? I believe the dplyr package is the fastest, but I'm quite new to it and unsure how to approach this. In my attempt, I first group by time and then gather the variables (with gather() from tidyr), but I receive a warning and several variables are dropped. I'm unsure how to select the variables for each group.
library(tidyr)  # gather() comes from tidyr
dat %>%
  group_by(t) %>%
  gather(key, value, -t)
# Warning message:
# attributes are not identical across measure variables;
# they will be dropped
CONDITIONING
How can I add a condition to the previous operation so that, in the following table, the operation is computed only when cond == id? For example, the first row would be 0, because the remaining subjects B, C and D all have a cond different from their id (their cond is A). For row 6, instead, the operation is (2*52 + 12*62 + 17*67) / sd(c(2,12,17)).
dat <- data.frame(t = rep(seq(1, 5, 1),4), id = rep(c(rep("A",5), rep("B",5), rep("C",5), rep("D",5)), 1),
x = 1:20, y = 51:70, h = c(rep(1,10), rep(0,10) ) )
dat <- arrange(dat, t)
dat <- data.frame(dat, cond = c("B", "A", "A", "A", "A", "B", "C", "D", "A", "B", "D", "C", "A", "D", "C", "A", "A", "C", "C", "B") )
dat
#    t id  x  y h cond
# 1  1  A  1 51 1    B
# 2  1  B  6 56 1    A
# 3  1  C 11 61 0    A
# 4  1  D 16 66 0    A
# 5  2  A  2 52 1    A
# 6  2  B  7 57 1    B
# 7  2  C 12 62 0    C
# 8  2  D 17 67 0    D
# 9  3  A  3 53 1    A
# 10 3  B  8 58 1    B
# 11 3  C 13 63 0    D
# 12 3  D 18 68 0    C
# 13 4  A  4 54 1    A
# 14 4  B  9 59 1    D
# 15 4  C 14 64 0    C
# 16 4  D 19 69 0    A
# 17 5  A  5 55 1    A
# 18 5  B 10 60 1    C
# 19 5  C 15 65 0    C
# 20 5  D 20 70 0    B
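The two worked examples (row 1 and row 6) can be checked with a small sketch. It assumes that the condition cond == id applies to the remaining subjects only, as in the row 1 example, and not to the row itself; if the row itself must also satisfy it, the result will differ. The helper name `loo_ratio_if` is hypothetical.

```r
library(dplyr)

dat <- data.frame(
  t  = rep(1:5, 4),
  id = rep(c("A", "B", "C", "D"), each = 5),
  x  = 1:20,
  y  = 51:70,
  h  = rep(c(1, 0), each = 10)
) %>%
  arrange(t) %>%
  mutate(cond = c("B", "A", "A", "A", "A", "B", "C", "D", "A", "B",
                  "D", "C", "A", "D", "C", "A", "A", "C", "C", "B"))

# Hypothetical helper: for position i, use only the OTHER rows with
# keep == TRUE. Returns 0 when fewer than two such rows remain
# (sd undefined) or when their sd is 0.
loo_ratio_if <- function(x, y, keep) {
  vapply(seq_along(x), function(i) {
    idx <- setdiff(which(keep), i)
    if (length(idx) < 2) return(0)
    s <- sd(x[idx])
    if (s == 0) 0 else sum(x[idx] * y[idx]) / s
  }, numeric(1))
}

res <- dat %>%
  group_by(t) %>%
  mutate(new = h * loo_ratio_if(x, y, id == cond)) %>%  # still 0 where h == 0
  ungroup()

res$new[1]  # 0: none of B, C, D has cond == id at t == 1
res$new[6]  # equals (2*52 + 12*62 + 17*67) / sd(c(2, 12, 17))
```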
A proposed solution
library(purrr)  # map_dbl() comes from purrr
dat %>%
  filter(id == cond) %>%
  group_by(t) %>%
  mutate(new = h * ((sum(x * y) - (x * y)) / map_dbl(row_number(), ~ sd(x[-.x])))) %>%
  bind_rows(dat %>% filter(id != cond))
This works, but only partially: it creates NaN from multiplying 0 * Inf. Instead, I would like to have a 0 whenever the conditions do not apply or the standard deviation in the denominator is 0. Thank you so much!
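For reference, a sketch of how the proposed solution could be guarded so that it returns 0 instead of NaN or NA. It keeps that solution's filtering (rows with id != cond are set aside and get 0) and assumes that any undefined or zero denominator should map to 0. Note that sd() returns NA for fewer than two values and 0 for constant values. The names `safe_div`, `loo_sd`, `matched` and `res` are made up for illustration.

```r
library(dplyr)

dat <- data.frame(
  t  = rep(1:5, 4),
  id = rep(c("A", "B", "C", "D"), each = 5),
  x  = 1:20,
  y  = 51:70,
  h  = rep(c(1, 0), each = 10)
) %>%
  arrange(t) %>%
  mutate(cond = c("B", "A", "A", "A", "A", "B", "C", "D", "A", "B",
                  "D", "C", "A", "D", "C", "A", "A", "C", "C", "B"))

# Hypothetical helper: divide, but return 0 when the denominator is
# 0, NA, NaN or infinite
safe_div <- function(num, den) {
  ifelse(is.finite(den) & den != 0, num / den, 0)
}

matched <- dat %>%
  filter(id == cond) %>%
  group_by(t) %>%
  mutate(
    # leave-one-out sd of x within the time group
    loo_sd = vapply(seq_along(x), function(i) sd(x[-i]), numeric(1)),
    new    = h * safe_div(sum(x * y) - x * y, loo_sd)
  ) %>%
  ungroup() %>%
  select(-loo_sd)

# Rows that fail the condition get an explicit 0
res <- bind_rows(matched,
                 dat %>% filter(id != cond) %>% mutate(new = 0)) %>%
  arrange(t, id)
```

Guarding the division before multiplying by h avoids the 0 * Inf case entirely, because `safe_div()` never returns a non-finite value for finite numerators.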