
I have searched SO, and although there are many Q&As about conditionally removing rows, none of them fit my problem.

I have a data.frame containing longitudinal measurements of variables x, y, etc., at various time points time, for several subjects id. Some subjects experience an event ev (coded 1 at the time of the event, otherwise 0). I would like to reduce the initial data.frame to:

  • 1) All rows for subjects that have not experienced an event (OK, that's easy), but also include
  • 2) For subjects that have experienced an event, all rows strictly prior to the event (that is, all rows with times less than the time of that individual's event).

so that,

testdf<-data.frame(id=c(rep("A",4),rep("B",4),rep("C",4) ),
                   x=c(NA, NA, 1,2, 3, NA, NA, 1, 2, NA,NA, 5), 
                   y=rev(c(NA, NA, 1,2, 3, NA, NA, 1, 2, NA,NA, 5)),
                   time=c(1,2,3,4,0.1,0.5,10,20,3,2,1,0.5),
                   ev=c(0,0,0,0,0,1,0,0,0,0,0,1))

would reduce to

   id  x  y time ev
1   A NA  5  1.0  0
2   A NA NA  2.0  0
3   A  1 NA  3.0  0
4   A  2  2  4.0  0
5   B  3  1  0.1  0
6   C  2  2  3.0  0
7   C NA  1  2.0  0
8   C NA NA  1.0  0
1 Comment

  • Note that condition 2 implies condition 1, if condition 2 is written as "all rows prior to an event". Commented Jan 26, 2013 at 15:11

4 Answers


Here's a solution with subset and ave:

subset(testdf, !ave(ev, id, FUN = cumsum))
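To see why this one-liner works, here is a sketch of the intermediate step, using the testdf from the question (only the id and ev columns matter for the filter):

```r
testdf <- data.frame(id = c(rep("A", 4), rep("B", 4), rep("C", 4)),
                     ev = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1))

# ave() applies cumsum within each id, so the result stays 0 until a
# subject's first event and becomes positive from the event row onward:
with(testdf, ave(ev, id, FUN = cumsum))
# [1] 0 0 0 0 0 1 1 1 0 0 0 1
```

subset() negates this vector (!0 is TRUE), so it keeps exactly the rows strictly before each subject's first event, and all rows for subjects with no event.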

A solution in base R:

> do.call(rbind, by(testdf, testdf$id, function(x) x[cumsum(x$ev) == 0,]))
     id  x  y time ev
A.1   A NA  5  1.0  0
A.2   A NA NA  2.0  0
A.3   A  1 NA  3.0  0
A.4   A  2  2  4.0  0
B     B  3  1  0.1  0
C.9   C  2  2  3.0  0
C.10  C NA  1  2.0  0
C.11  C NA NA  1.0  0

1 Comment

Or, testdf[with(testdf, ave(ev, id, FUN = cumsum)) == 0, ]

Here is an example using ddply from the plyr package:

> library(plyr)
> ddply(testdf, .(id), function(z) z[cumsum(z$ev) == 0, ])
  id  x  y time ev
1  A NA  5  1.0  0
2  A NA NA  2.0  0
3  A  1 NA  3.0  0
4  A  2  2  4.0  0
5  B  3  1  0.1  0
6  C  2  2  3.0  0
7  C NA  1  2.0  0
8  C NA NA  1.0  0


This solution using data.table works on your testdf. The idea is to use cumsum within each id to flag the rows at and after the first event, and keep only the rows where the cumulative sum is still 0.

require(data.table)
dt <- data.table(testdf, key=c("id"))
dt.out <- dt[, .SD[cumsum(ev) == 0], by=id]
> dt.out

#    id  x  y time ev
# 1:  A NA  5  1.0  0
# 2:  A NA NA  2.0  0
# 3:  A  1 NA  3.0  0
# 4:  A  2  2  4.0  0
# 5:  B  3  1  0.1  0
# 6:  C  2  2  3.0  0
# 7:  C NA  1  2.0  0
# 8:  C NA NA  1.0  0
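As a usage note, the same filter can also be written with data.table's .I (row-index) idiom, which computes the row numbers to keep per group and then subsets the table once instead of building a .SD copy for each group. This is an alternative sketch, not part of the answer above:

```r
require(data.table)

# testdf as defined in the question
testdf <- data.frame(id = c(rep("A", 4), rep("B", 4), rep("C", 4)),
                     x = c(NA, NA, 1, 2, 3, NA, NA, 1, 2, NA, NA, 5),
                     y = rev(c(NA, NA, 1, 2, 3, NA, NA, 1, 2, NA, NA, 5)),
                     time = c(1, 2, 3, 4, 0.1, 0.5, 10, 20, 3, 2, 1, 0.5),
                     ev = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1))

dt <- data.table(testdf, key = "id")

# .I holds the original row numbers; collect those whose within-group
# cumulative event count is still 0, then subset the table once:
idx <- dt[, .I[cumsum(ev) == 0], by = id]$V1
dt.out <- dt[idx]
```

The result is the same eight rows; on large tables the index-based subset tends to be cheaper than materializing .SD per group.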

