Say I have a data set in which sequences of length 1 are illegal, sequences of length 2 are legal, and sequences longer than 5 are illegal, although a longer sequence may be broken up into sequences of length <= 5.
library(data.table)
set.seed(1)
DT1 <- data.table(smp = 1, R=sample(0:1, 20000, replace=TRUE), Seq = 0L)
DT1[, smp:=1:length(smp)]
DT1[, Seq:=seq(.N), by=list(cumsum(c(0, abs(diff(R)))))]
This last line comes directly from: Creating a sequence in a data.table depending on a column
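For readers unfamiliar with the idiom, here is a minimal sketch of what that grouping expression does, on a small made-up vector (the table `toy` and the columns `run_a` and `run_b` are hypothetical names, not part of the data above). `cumsum(c(0, abs(diff(R))))` gives one id per run of identical `R` values, and data.table's `rleid()` should give the same grouping, just numbered from 1 instead of 0:

```r
library(data.table)

# Hypothetical toy data, not the 20000-row table from the question
toy <- data.table(R = c(1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L))

# Run id via the cumsum/diff idiom: increments whenever R changes
toy[, run_a := cumsum(c(0, abs(diff(R))))]

# Run id via rleid(): same runs, numbered from 1
toy[, run_b := rleid(R)]

# Position within each run; either run id gives the same grouping
toy[, Seq := seq_len(.N), by = run_b]

print(toy)
```

Grouping by either run id restarts the counter at 1 every time `R` flips, which is what the `Seq` column relies on.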
DT1[, fix_min:=ifelse((R==TRUE & Seq==1) | (R==FALSE), FALSE, TRUE)]
fixmin_idx2 <- which(DT1[, fix_min==TRUE])
DT1[fixmin_idx2 -1, fix_min:=TRUE]
Now my legal length-2 sequences are properly marked. Next, break up the sequences longer than 5.
DT1[R==1 & Seq==6, fix_min:=FALSE]
DT1[,Seq2:=seq(.N), by=list(cumsum(c(0, abs(diff(fix_min)))))]
DT1[R==1 & Seq2==6, fix_min:=FALSE]
fixSeq2_idx7 <- which(DT1[,fix_min==TRUE] & DT1[,Seq2==7])
fixSeq2_idx7
[1] 10203 13228
DT1[fixSeq2_idx7,]
     smp R Seq fix_min Seq2
1: 10203 1  13    TRUE    7
2: 13228 1  13    TRUE    7
DT1[fixSeq2_idx7 + 1,]
     smp R Seq fix_min Seq2
1: 10204 1  14    TRUE    8
2: 13229 0   1   FALSE    1
Now I want to test whether a Seq2==7 is followed by a Seq2==8, which would be a legal length-2 sequence. I have one 7 followed by an 8 and one not followed by an 8. And there I'm stuck. Everything I've tried either sets all fix_min to TRUE or alternates between TRUE and FALSE.
Any guidance greatly appreciated.
`ifelse((R==TRUE & Seq==1) | (R==FALSE), FALSE, TRUE)` should be just `!((R==1 & Seq==1) | R==0)`, which is the same as `R==1 & Seq>1`. Note that `R` is 0/1, not FALSE/TRUE. Elsewhere, I strongly suspect that you do not need so many parentheses. In `by=`, for example, you do not need to wrap a single vector in a `list()`.

`DT1[, if (.N > 1L) .SD[rep(seq_len(min(.N, 5L)), length.out=.N)], by=.(rleid(R), R)]`. It removes rows where `Seq` is just `1`, and if `Seq` is `1:9`, it changes it to `1:5, 1:4`. This is to be executed after your first block of code.

Instead of `.SD`, use `:=` and update `Seq` by checking for the appropriate conditions. I think the logic is quite straightforward to get to from the previous comment?
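One possible reading of the `:=` suggestion, as a rough sketch rather than a definitive answer. It assumes, following the question's `Seq==6` step, that every sixth row in a run of 1s acts as a break between chunks, and that a chunk left with a single row is illegal. The table name `DT` and the columns `is_break` and `chunk` are hypothetical names introduced only for this illustration:

```r
library(data.table)

set.seed(1)
DT <- data.table(R = sample(0:1, 20000, replace = TRUE))
DT[, smp := seq_len(.N)]

# Position within each run of identical R values
DT[, Seq := seq_len(.N), by = rleid(R)]

# Assumption: rows 6, 12, 18, ... of a run of 1s are breaks (marked illegal)
DT[, is_break := R == 1L & Seq %% 6L == 0L]

# A new chunk starts whenever R changes or a break row begins or ends
DT[, chunk := rleid(R, is_break)]

# Position within each chunk; never exceeds 5 inside a run of 1s
DT[, Seq2 := seq_len(.N), by = chunk]

# Legal rows: R is 1, the row is not a break, and its chunk has 2 or more rows
DT[, fix_min := R == 1L & !is_break & .N >= 2L, by = chunk]

# Run lengths of the legal stretches, for a quick sanity check
legal_runs <- with(rle(DT$fix_min), lengths[values])
print(table(legal_runs))
```

Because chunk size is checked with `.N` inside `by = chunk`, a leftover single row after a break (the `Seq2==7` case with no following 8) is marked `FALSE` without any separate lookahead step.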