
I have the following dataset in R:

library(dplyr)

dat <- data.frame(t = rep(seq(1, 5, 1), 4),
                  id = rep(c(rep("A", 5), rep("B", 5), rep("C", 5), rep("D", 5)), 1),
                  x = 1:20, y = 51:70, h = c(rep(1, 10), rep(0, 10)))
dat <- arrange(dat, t)

The dataset is a panel with t as the time variable and id as the subject id. I need to add a new column in which, for each subject, I compute the sum of x times y over the remaining subjects at time t and divide it by the standard deviation of x over the remaining subjects at time t. This new column should be zero for the subjects with h == 0.

For example, for subject A at time t == 1, the operation is: (6 * 56 + 11 * 61 + 16 * 66) / sd(c(6, 11, 16)). The same operation for subject B at time t == 1 is (1 * 51 + 11 * 61 + 16 * 66) / sd(c(1, 11, 16)). For subjects C and D, however, the new column would contain only a 0.
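The arithmetic in this example can be checked directly in base R. The snippet below is only a sanity check of the two hand-written expressions; the variable names (x_rest_A, new_A, and so on) are made up for illustration.

```r
# Subject A at t == 1: the remaining subjects are B, C and D
x_rest_A <- c(6, 11, 16)
y_rest_A <- c(56, 61, 66)
new_A <- sum(x_rest_A * y_rest_A) / sd(x_rest_A)   # 2063 / 5
new_A
# [1] 412.6

# Subject B at t == 1: the remaining subjects are A, C and D
x_rest_B <- c(1, 11, 16)
y_rest_B <- c(51, 61, 66)
new_B <- sum(x_rest_B * y_rest_B) / sd(x_rest_B)   # 1778 / sd(c(1, 11, 16))
round(new_B, 1)
# [1] 232.8
```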

What is the fastest way to do this without a loop? I believe the dplyr package is the fastest, but I'm quite new to it and unsure how to use it here. In my attempt I first group by time and then gather the variables, but I receive a warning and several variables are dropped. I'm unsure how to select the variables for each group.

library(tidyr)  # gather() comes from tidyr

dat %>%
  group_by(t) %>%
  gather(key, value, -t)
# Warning message:
# attributes are not identical across measure variables;
# they will be dropped

CONDITIONING

How can I include a condition in the previous operation so that, in the following table, only subjects with cond == id enter the computation? For example, for the first row we would have 0, because subjects B, C and D all have a cond different from their id (their cond is A). For row 6, instead, the operation is (2*52 + 12*62 + 17*67) / sd(c(2,12,17)).

dat <- data.frame(t = rep(seq(1, 5, 1),4), id = rep(c(rep("A",5), rep("B",5), rep("C",5), rep("D",5)), 1),
                  x = 1:20, y = 51:70, h = c(rep(1,10), rep(0,10) ) )
dat <- arrange(dat, t)
dat <- data.frame(dat, cond = c("B", "A", "A", "A", "A", "B", "C", "D", "A", "B", "D", "C", "A", "D", "C", "A", "A", "C", "C", "B") )
dat

#    t id  x  y h cond
# 1  1  A  1 51 1    B
# 2  1  B  6 56 1    A
# 3  1  C 11 61 0    A
# 4  1  D 16 66 0    A
# 5  2  A  2 52 1    A
# 6  2  B  7 57 1    B
# 7  2  C 12 62 0    C
# 8  2  D 17 67 0    D
# 9  3  A  3 53 1    A
# 10 3  B  8 58 1    B
# 11 3  C 13 63 0    D
# 12 3  D 18 68 0    C
# 13 4  A  4 54 1    A
# 14 4  B  9 59 1    D
# 15 4  C 14 64 0    C
# 16 4  D 19 69 0    A
# 17 5  A  5 55 1    A
# 18 5  B 10 60 1    C
# 19 5  C 15 65 0    C
# 20 5  D 20 70 0    B

A proposed solution

dat %>% 
 filter(id == cond) %>% 
 group_by(t) %>% 
 mutate(new = h * ((sum(x *y) - (x * y))/map_dbl(row_number(), ~ sd(x[-.x])))) %>% 
 bind_rows(dat %>% filter(id != cond))

works only partially: it produces NaN (from multiplying 0 * Inf). Instead, I would like a 0 whenever the conditions do not apply or the standard deviation in the denominator is 0. Thank you so much!

1 Answer

After grouping by 't', create the 'new' column in three steps. The numerator is the group sum of the products of 'x' and 'y' minus the current row's own product, which leaves the sum over the remaining subjects. The denominator is the sd of 'x' with the current element dropped, obtained by looping over the row index (row_number()) and using it as a negative index. Finally, multiply by 'h' so that the result is 0 wherever 'h' is 0.

library(tidyverse)
out <- dat %>% 
         group_by(t) %>% 
         mutate(new =  h * ((sum(x *y) - (x * y))/map_dbl(row_number(),
                                                     ~ sd(x[-.x]))))
head(out, 4)
# A tibble: 4 x 6
# Groups:   t [1]
#      t id        x     y     h   new
#  <dbl> <fct> <int> <int> <dbl> <dbl>
#1     1 A         1    51     1  413.
#2     1 B         6    56     1  233.
#3     1 C        11    61     0    0 
#4     1 D        16    66     0    0 
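To see what the x[-.x] part of the answer does, here is a minimal base R sketch of the leave-one-out standard deviation on the x values at t == 1. It uses vapply() as a stand-in for map_dbl(), and the names loo_sd and single_left are made up for illustration.

```r
x <- c(1L, 6L, 11L, 16L)   # x for subjects A, B, C, D at t == 1

x[-1]                      # a negative index drops that element: 6 11 16

# Leave-one-out sd: for each position i, the sd of x without element i.
# vapply() is used here as a base R stand-in for purrr::map_dbl().
loo_sd <- vapply(seq_along(x), function(i) sd(x[-i]), numeric(1))
round(loo_sd, 2)
# [1] 5.00 7.64 7.64 5.00

# With only two rows in a group, a single value is left after dropping one,
# and sd() of a single value is NA.
single_left <- sd(c(3L, 8L)[-1])
is.na(single_left)
# [1] TRUE
```

The first two values match the denominators in the question's worked example, and the NA case is one way missing values can appear once a group shrinks to two rows.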

5 Comments

This works perfectly! Thank you! Would you also know how to extend the code so that I consider in the summation at the numerator and in the sd at the denominator only the individuals that meet a condition? e.g., assume the data is dat <- data.frame(t = rep(seq(1, 5, 1),4), id = rep(c(rep("A",5), rep("B",5), rep("C",5), rep("D",5)), 1), x = 1:20, y = 51:70, h = c(rep(1,10), rep(0,10) ), cond = sample(c("A", "B"), 20, replace = T) ) and I get new only out of the subjects that meet id != cond.
Sorry, if it was confusing. I'm wondering how to modify the mutate line with additional conditioning. For example, computing the operation above (sum of remaining subjects / standard deviation) using only data for subjects such that id == cond (given that cond takes values either in A or B).
@Andrew Maybe you meant dat %>% filter(id == cond) %>% group_by(t) %>% mutate(new = h * ((sum(x *y) - (x * y))/map_dbl(row_number(), ~ sd(x[-.x])))) %>% bind_rows(dat %>% filter(id != cond))
Yes! This almost does it! It just creates NaNs instead of zeros. I added an example in the text! Thank you so much Akrun!
I think adding these two lines to your last code does the trick: replace_na(list(new = 0)) %>% arrange(t, id). Thank you so much!
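Putting the thread together, the conditional version could be sketched as follows. This follows the filter(id == cond) approach from the comments, but instead of replace_na() it uses an explicit guard, on the assumption that a 0 is wanted both when the leave-one-out sd is missing and when it is exactly 0 (replace_na() alone would presumably leave an Inf in the latter case). The helper loo_ratio() is hypothetical and not part of dplyr or the original answer.

```r
library(dplyr)

dat <- data.frame(
  t  = rep(1:5, times = 4),
  id = rep(c("A", "B", "C", "D"), each = 5),
  x  = 1:20,
  y  = 51:70,
  h  = rep(c(1, 0), each = 10),
  stringsAsFactors = FALSE
) %>%
  arrange(t, id)
dat$cond <- c("B", "A", "A", "A", "A", "B", "C", "D", "A", "B",
              "D", "C", "A", "D", "C", "A", "A", "C", "C", "B")

# Hypothetical helper: sum of x * y over the other rows, divided by the
# sd of x over the other rows; returns 0 when that sd is NA or 0.
loo_ratio <- function(x, y) {
  num <- sum(x * y) - x * y
  den <- vapply(seq_along(x), function(i) sd(x[-i]), numeric(1))
  ok  <- !is.na(den) & den > 0
  ifelse(ok, num / den, 0)
}

# Rows that satisfy the condition get the computed value (times h)
matched <- dat %>%
  filter(id == cond) %>%
  group_by(t) %>%
  mutate(new = h * loo_ratio(x, y)) %>%
  ungroup()

# Rows that do not satisfy the condition get 0
out <- dat %>%
  filter(id != cond) %>%
  mutate(new = 0) %>%
  bind_rows(matched) %>%
  arrange(t, id)
```

With the example data, only t == 2 has more than two matching subjects, so row 6 (subject B at t == 2) reproduces the expression from the question, 1987 / sd(c(2, 12, 17)), and the groups with two matching rows fall back to 0.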
