Creating a new variable while using subsequent values in R

Question

I have the following data frame:

df1 <- data.frame(id = rep(1:3, each = 5), 
                  time = rep(1:5),
                  y = c(rep(1, 4), 0, 1, 0, 1, 1, 0, 0, 1, rep(0,3)))

df1
##    id time y
## 1   1    1 1
## 2   1    2 1
## 3   1    3 1
## 4   1    4 1
## 5   1    5 0
## 6   2    1 1
## 7   2    2 0
## 8   2    3 1
## 9   2    4 1
## 10  2    5 0
## 11  3    1 0
## 12  3    2 1
## 13  3    3 0
## 14  3    4 0
## 15  3    5 0

I'd like to create a new indicator variable that tells me, for each of the three ids, at what point y = 0 for all subsequent responses. In the example above, for ids 1 and 2 this occurs at the 5th time point, and for id 3 this occurs at the 3rd time point.

I'm getting tripped up on id 2, where y = 0 at time point 2 but then goes back to 1. I'd like the indicator variable to take subsequent time points into account.

Essentially, I'm looking for the following output:

df1
##    id time y new_col
## 1   1    1 1       0
## 2   1    2 1       0
## 3   1    3 1       0
## 4   1    4 1       0
## 5   1    5 0       1
## 6   2    1 1       0
## 7   2    2 0       0
## 8   2    3 1       0
## 9   2    4 1       0
## 10  2    5 0       1
## 11  3    1 0       0
## 12  3    2 1       0
## 13  3    3 0       1
## 14  3    4 0       1
## 15  3    5 0       1

The new_col variable indicates whether y = 0 at that time point and at all subsequent time points.

library(dplyr); df1 %>% group_by(id) %>% summarise(zero = match(0, y)). If you need a column, change summarise to mutate. It would be better if you showed the expected output as well.
– akrun, Commented Dec 1, 2017 at 13:58
Thanks @akrun. This isn't exactly what I'm going for, since for id 2 your solution doesn't account for the subsequent 1s at time points 3 and 4.
– afleishman, Commented Dec 1, 2017 at 14:05

talat · Accepted Answer · 2017-12-01 14:48:36Z

I would use a little helper function for that.

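library(dplyr)

# Flag every position after the last element that differs from `val`
foo <- function(x, val) {
  pos <- max(which(x != val)) + 1  # first index of the trailing run of `val`
  as.integer(seq_along(x) >= pos)
}

df1 %>% 
  group_by(id) %>% 
  mutate(indicator = foo(y, 0))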

# # A tibble: 15 x 4
# # Groups:   id [3]
#       id  time     y indicator
#    <int> <int> <dbl>     <int>
#  1     1     1     1         0
#  2     1     2     1         0
#  3     1     3     1         0
#  4     1     4     1         0
#  5     1     5     0         1
#  6     2     1     1         0
#  7     2     2     0         0
#  8     2     3     1         0
#  9     2     4     1         0
# 10     2     5     0         1
# 11     3     1     0         0
# 12     3     2     1         0
# 13     3     3     0         1
# 14     3     4     0         1
# 15     3     5     0         1
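
As a quick check of the helper's edge cases, the sketch below runs foo on bare vectors. foo_quiet is a hypothetical variant, not part of the answer: it assumes you want to avoid the warning that max() raises on an empty index vector when every value already equals val.

```r
# Helper from the answer, repeated so the snippet runs on its own
foo <- function(x, val) {
  pos <- max(which(x != val)) + 1
  as.integer(seq_along(x) >= pos)
}

foo(c(1, 0, 1, 1, 0), 0)  # last non-zero is at position 4, so only position 5 is flagged
foo(c(0, 1, 0, 1), 0)     # ends on a non-zero value, so nothing is flagged

# Hypothetical variant: prepending 0L keeps max() from seeing an empty
# vector when every element already equals `val`.
foo_quiet <- function(x, val) {
  pos <- max(c(0L, which(x != val))) + 1
  as.integer(seq_along(x) >= pos)
}

foo_quiet(c(0, 0, 0), 0)  # every position is flagged, with no warning
```

The original foo also flags every position in the all-zero case, but it gets there through max(integer(0)) returning -Inf with a warning.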

In case you want to consider NA-values in y, you can adjust foo to:

foo <- function(x, val) {
  # Treat NA like a non-matching value so it ends the trailing run
  pos <- max(which(x != val | is.na(x))) + 1
  as.integer(seq_along(x) >= pos)
}

That way, if there's an NA after the last y = 0, the indicator will remain 0.
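
For comparison, the same indicator can be sketched in base R without looking up the last non-matching position: reverse the vector, take a running product of the "equals val" flags, and reverse back. This is a different technique from the answer's. The name trailing_run is hypothetical, and treating NA as "not val" is an assumption chosen to mirror the adjusted foo.

```r
# Hypothetical helper (not from the answer): 1 where this value and every
# later value equal `val`, 0 otherwise.
trailing_run <- function(x, val) {
  is_val <- !is.na(x) & x == val          # NA counts as "not val" (assumption)
  as.integer(rev(cumprod(rev(is_val))))   # running product from the end
}

trailing_run(c(1, 0, 1, 1, 0), 0)         # id 2 from the question

# Applied per id with base R's ave(), using the question's data
df1 <- data.frame(id = rep(1:3, each = 5),
                  time = rep(1:5, times = 3),
                  y = c(rep(1, 4), 0, 1, 0, 1, 1, 0, 0, 1, rep(0, 3)))
df1$new_col <- ave(df1$y, df1$id, FUN = function(v) trailing_run(v, 0))
df1
```

The running product drops to 0 at the first non-matching value seen from the end and stays there, so the all-zero and no-trailing-zero cases need no special handling.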

akrun · Accepted Answer · 2017-12-01 14:59:59Z


Here is an option using data.table

library(data.table)
setDT(df1)[,  indicator := cumsum(.I %in% .I[which.max(rleid(y)*!y)]), id]
df1
#    id time y indicator
# 1:  1    1 1         0
# 2:  1    2 1         0
# 3:  1    3 1         0
# 4:  1    4 1         0
# 5:  1    5 0         1
# 6:  2    1 1         0
# 7:  2    2 0         0
# 8:  2    3 1         0
# 9:  2    4 1         0
#10:  2    5 0         1
#11:  3    1 0         0
#12:  3    2 1         0
#13:  3    3 0         1
#14:  3    4 0         1
#15:  3    5 0         1
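
The one-liner is dense, so here is a sketch that unpacks it on a single group's y vector. The intermediate names (run_id, zero_run, start) are made up for illustration, and seq_along(y) == start stands in for the .I %in% .I[...] test, which is an assumption that only holds when looking at one group in isolation.

```r
library(data.table)

y <- c(1, 0, 1, 1, 0)           # id 2 from the question

run_id   <- rleid(y)            # one id per run of equal values
zero_run <- run_id * !y         # keep the run id where y == 0, else 0
start    <- which.max(zero_run) # first position of the last run of zeros

# Flagging `start` and taking the running sum marks that position
# and everything after it.
cumsum(seq_along(y) == start)
```

Because which.max returns the first position of the maximum, the flag lands at the start of the last run of zeros rather than at its end.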

Based on the comments from @docendodiscimus, if 'y' does not end in 0 for every 'id', we can use:

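setDT(df1)[, indicator := {
  i1 <- rleid(y) * !y  # run id where y == 0, otherwise 0
  # If the group does not end in its last run of zeros, flag nothing
  if (i1[.N] != max(i1) && !is.na(i1[.N])) 0L
  else cumsum(.I %in% .I[which.max(i1)])
}, id]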

edited Dec 1, 2017 at 14:59

answered Dec 1, 2017 at 14:29


5 Comments

akrun Over a year ago

@docendodiscimus It is not clear about that condition from reading the OP's post. In your code, it is creating all 0s which I am not sure if that is what OP intended

akrun Over a year ago

@docendodiscimus I guess u changed the code. I was copy/pasting ur old code. Now, it is all 0s

akrun Over a year ago

@docendodiscimus Anyway, your code would also break, if the last value i.e. y[15] is NA i.e giving all 1s

akrun Over a year ago

@docendodiscimus I am talking about the current version

akrun Over a year ago

@docendodiscimus Sure, thanks for the constructive criticism. To be frank, I thought the OP's column always have 0s at the end. I will update this
