I have the following data frame:

df1 <- data.frame(id = rep(1:3, each = 5), 
                  time = rep(1:5),
                  y = c(rep(1, 4), 0, 1, 0, 1, 1, 0, 0, 1, rep(0,3)))

df1
##    id time y
## 1   1    1 1
## 2   1    2 1
## 3   1    3 1
## 4   1    4 1
## 5   1    5 0
## 6   2    1 1
## 7   2    2 0
## 8   2    3 1
## 9   2    4 1
## 10  2    5 0
## 11  3    1 0
## 12  3    2 1
## 13  3    3 0
## 14  3    4 0
## 15  3    5 0

I'd like to create a new indicator variable that tells me, for each of the three ids, at what point y = 0 for all subsequent responses. In the example above, for ids 1 and 2 this occurs at the 5th time point, and for id 3 this occurs at the 3rd time point.

I'm getting tripped up on id 2, where y = 0 at time point 2 but then goes back to 1. I'd like the indicator variable to take subsequent time points into account.

Essentially, I'm looking for the following output:

df1
##    id time y new_col
## 1   1    1 1       0
## 2   1    2 1       0
## 3   1    3 1       0
## 4   1    4 1       0
## 5   1    5 0       1
## 6   2    1 1       0
## 7   2    2 0       0
## 8   2    3 1       0
## 9   2    4 1       0
## 10  2    5 0       1
## 11  3    1 0       0
## 12  3    2 1       0
## 13  3    3 0       1
## 14  3    4 0       1
## 15  3    5 0       1

The new_col variable is indicating whether or not y = 0 at that time point and for all subsequent time points.

  • library(dplyr);df1 %>% group_by(id) %>% summarise(zero = match(0, y)) If you need a column, change summarise to mutate. It would be better if you show the expected output as well Commented Dec 1, 2017 at 13:58
  • What if y was 1 again in row 14? Commented Dec 1, 2017 at 14:02
  • Thanks @akrun. This isn't exactly what I'm going for, since for id 2, your solution doesn't account for the subsequent '1' at time points 3 and 4. Commented Dec 1, 2017 at 14:05
  • Could you please update with the expected column as well? Commented Dec 1, 2017 at 14:05

2 Answers

I would use a little helper function for that.

foo <- function(x, val) {
  pos <- max(which(x != val)) +1
  as.integer(seq_along(x) >= pos)
}

library(dplyr)

df1 %>% 
  group_by(id) %>% 
  mutate(indicator = foo(y, 0))

# # A tibble: 15 x 4
# # Groups:   id [3]
#     id  time     y indicator
#   <int> <int> <dbl>     <int>
# 1     1     1     1         0
# 2     1     2     1         0
# 3     1     3     1         0
# 4     1     4     1         0
# 5     1     5     0         1
# 6     2     1     1         0
# 7     2     2     0         0
# 8     2     3     1         0
# 9     2     4     1         0
# 10     2     5     0         1
# 11     3     1     0         0
# 12     3     2     1         0
# 13     3     3     0         1
# 14     3     4     0         1
# 15     3     5     0         1
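
To see why the helper works, it can help to trace it on a single group. The following is a minimal sketch using the y values for id 2 from the question; the names y2, last_nonzero, and pos are only for illustration and do not appear in the answer.

```r
# y values for id 2, taken from the question's data
y2 <- c(1, 0, 1, 1, 0)

# positions where y differs from 0
which(y2 != 0)                        # 1 3 4

# the last such position; everything after it is a trailing run of zeros
last_nonzero <- max(which(y2 != 0))   # 4
pos <- last_nonzero + 1               # 5

# flag every position from `pos` onward
indicator <- as.integer(seq_along(y2) >= pos)
indicator                             # 0 0 0 0 1
```

The zero at time point 2 is not flagged because it comes before the last non-zero value, which is the behaviour asked for in the question.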

In case you want to consider NA-values in y, you can adjust foo to:

foo <- function(x, val) {
  pos <- max(which(x != val | is.na(x))) +1
  as.integer(seq_along(x) >= pos)
}

That way, if there's an NA after the last y = 0, the indicator will remain 0 for the whole group.
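
One edge case worth noting: both versions of foo call max() on the result of which(), and if a group contains only zeros that result is empty, so max() returns -Inf with a warning. The indicator still comes out as all 1s, but the warning can be avoided. Below is a sketch of a guarded variant; foo_safe is a hypothetical name, not part of the answer.

```r
# Hypothetical variant of foo() that avoids calling max() on an empty vector
foo_safe <- function(x, val) {
  # positions that break the trailing run: a different value or an NA
  breaks <- which(x != val | is.na(x))
  # no breaks at all means the whole vector equals val
  pos <- if (length(breaks) == 0) 1L else max(breaks) + 1L
  as.integer(seq_along(x) >= pos)
}

foo_safe(c(0, 0, 0), 0)        # 1 1 1, with no warning
foo_safe(c(1, 0, NA), 0)       # 0 0 0, the trailing NA blocks the run
foo_safe(c(1, 0, 1, 1, 0), 0)  # 0 0 0 0 1
```

It can be dropped into the same group_by(id) %>% mutate(...) call in place of foo.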

Here is an option using data.table

library(data.table)
setDT(df1)[,  indicator := cumsum(.I %in% .I[which.max(rleid(y)*!y)]), id]
df1
#    id time y indicator
# 1:  1    1 1         0
# 2:  1    2 1         0
# 3:  1    3 1         0
# 4:  1    4 1         0
# 5:  1    5 0         1
# 6:  2    1 1         0
# 7:  2    2 0         0
# 8:  2    3 1         0
# 9:  2    4 1         0
#10:  2    5 0         1
#11:  3    1 0         0
#12:  3    2 1         0
#13:  3    3 0         1
#14:  3    4 0         1
#15:  3    5 0         1
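
The one-liner is dense, so here is a sketch that traces the same steps for a single group in base R. It assumes the y values for id 3 from the question and uses cumsum(c(TRUE, diff(y) != 0)) as a base R stand-in for data.table's rleid(); the names run_id, score, and start are illustrative only.

```r
# y values for id 3, taken from the question's data
y3 <- c(0, 1, 0, 0, 0)

# base R stand-in for data.table::rleid(): a new run starts wherever y changes
run_id <- cumsum(c(TRUE, diff(y3) != 0))   # 1 2 3 3 3

# zero out the runs where y is 1, keeping run ids only for runs of zeros
score <- run_id * !y3                      # 1 0 3 3 3

# which.max() returns the first position of the largest value,
# i.e. the start of the last run of zeros
start <- which.max(score)                  # 3

# cumsum over a single TRUE marks that position and everything after it
indicator <- cumsum(seq_along(y3) %in% start)
indicator                                  # 0 0 1 1 1
```

Note that this flags everything from the start of the last run of zeros onward, so it gives the intended result only when the group actually ends in zeros.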

Based on the comments from @docendodiscimus, if the values of 'y' are not always 0 at the end of each 'id', we can do:

setDT(df1)[, indicator := {
  i1 <- rleid(y) * !y
  # if the group does not end in a run of zeros, return 0 for the whole group
  if (i1[.N] != max(i1) & !is.na(i1[.N])) {
    0L
  } else {
    cumsum(.I %in% .I[which.max(i1)])
  }
}, id]

5 Comments

@docendodiscimus It is not clear about that condition from reading the OP's post. In your code, it is creating all 0s which I am not sure if that is what OP intended
@docendodiscimus I guess you changed the code. I was copy/pasting your old code. Now, it is all 0s
@docendodiscimus Anyway, your code would also break if the last value, i.e. y[15], is NA, i.e. it gives all 1s
@docendodiscimus I am talking about the current version
@docendodiscimus Sure, thanks for the constructive criticism. To be frank, I thought the OP's column always have 0s at the end. I will update this
