
Say I have a data set where sequences of length 1 are illegal, sequences of length 2 (up to 5) are legal, and sequences longer than 5 are illegal, although longer sequences may be broken up into sequences of length <= 5.

library(data.table)
set.seed(1)
DT1 <- data.table(smp = 1, R=sample(0:1, 20000, rep=TRUE), Seq = 0L)
DT1[, smp:=1:length(smp)]
DT1[, Seq:=seq(.N), by=list(cumsum(c(0, abs(diff(R)))))]

This last line comes directly from: Creating a sequence in a data.table depending on a column
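To make that grouping expression concrete, here is a minimal sketch on a made-up vector (`x` and `toy` are hypothetical names, not part of the data above). It shows that `cumsum(c(0, abs(diff(R))))` numbers the runs from 0 and that, for 0/1 data, it agrees with data.table's `rleid()` shifted by one.

```r
library(data.table)

# Hypothetical toy vector, only for illustration
x <- c(1, 1, 0, 1, 1, 1, 0, 0)

# Run id as used above: increases by 1 whenever the 0/1 value changes
run_id_cumsum <- cumsum(c(0, abs(diff(x))))

# Run id from rleid(), which starts at 1 instead of 0
run_id_rleid <- rleid(x)

# Position of each row within its run
toy <- data.table(R = x)
toy[, Seq := seq_len(.N), by = rleid(R)]
print(toy)
```

Note that the `abs(diff())` form relies on the values being 0/1, so that every change has magnitude 1; `rleid()` does not need that assumption.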

DT1[, fix_min:=ifelse((R==TRUE & Seq==1) | (R==FALSE), FALSE, TRUE)]
fixmin_idx2 <- which(DT1[, fix_min==TRUE])
DT1[fixmin_idx2 -1, fix_min:=TRUE]
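The effect of these three lines is easier to see on a small case. The sketch below uses a made-up vector (`toy` and `idx` are hypothetical names) and writes the first condition as `R == 1 & Seq > 1`, which I believe is equivalent to the `ifelse()` call for 0/1 data.

```r
library(data.table)

# Hypothetical toy data: a lone 1, a run of two 1s, a run of three 1s
toy <- data.table(R = c(1, 0, 1, 1, 0, 1, 1, 1))
toy[, Seq := seq_len(.N), by = rleid(R)]

# Step 1: mark every 1 that is not the first element of its run
toy[, fix_min := R == 1 & Seq > 1]

# Step 2: also mark the row just before each marked row,
# which picks up the first element of every run of length >= 2
idx <- which(toy$fix_min)
toy[idx - 1, fix_min := TRUE]
print(toy)
```

The lone 1 in row 1 stays FALSE because no marked row follows it, and `idx - 1` never reaches 0 because row 1 always has `Seq == 1`.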

Now my length 2 legals are properly marked. Break up the >5s.

DT1[R==1 & Seq==6, fix_min:=FALSE]
DT1[,Seq2:=seq(.N), by=list(cumsum(c(0, abs(diff(fix_min)))))]
DT1[R==1 & Seq2==6, fix_min:=FALSE]
fixSeq2_idx7 <- which(DT1[,fix_min==TRUE] & DT1[,Seq2==7])
fixSeq2_idx7
[1] 10203 13228
DT1[fixSeq2_idx7,]
     smp R Seq fix_min Seq2
1: 10203 1  13    TRUE    7
2: 13228 1  13    TRUE    7
DT1[fixSeq2_idx7 + 1,]
     smp R Seq fix_min Seq2
1: 10204 1  14    TRUE    8
2: 13229 0   1   FALSE    1

And now to test whether a Seq2==7 is followed by a Seq2==8, which would be a legal sequence of length 2. I have one 7 followed by an 8 and one not followed by an 8. And there I'm stuck. Everything I've tried either sets all fix_min to TRUE or sets them to alternating TRUE and FALSE.

Any guidance greatly appreciated.

  • Minor fix: ifelse((R==TRUE & Seq==1) | (R==FALSE), FALSE, TRUE) should be just R==1 & Seq>1. Note that R is 0/1 not FALSE/TRUE. Elsewhere, I strongly suspect that you do not need so many parentheses. In by=, for example, you do not need to wrap a single vector in a list(). Commented Oct 29, 2015 at 14:11
  • Not sure, but does this give what you expect? DT1[, if (.N > 1L) .SD[rep(seq_len(min(.N, 5L)), length.out=.N)], by=.(rleid(R), R)]. It removes rows where Seq is just 1, and if 1:9, it changes it to 1:5, 1:4.. This is to be executed after your first block of code. Commented Oct 29, 2015 at 14:23
  • @Arun - Yes, except that I don't want to remove rows in the data at this point because the illegals represent another condition of interest in the data. Commented Oct 29, 2015 at 14:39
  • In that case, instead of .SD, use := and update Seq by checking for appropriate conditions. I think the logic is quite straightforward to get to from the previous comment? Commented Oct 29, 2015 at 14:41
  • @Arun - I'll work on it as you suggest, but think I'll do some head scratching as well. Commented Oct 29, 2015 at 14:46
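One possible reading of the `:=` suggestion in the comments, as a rough sketch only: keep every row and store the chunked position in a new column instead of subsetting `.SD`. The column name `Seq5` and the toy data are hypothetical, and this does not yet decide `fix_min`.

```r
library(data.table)

# Hypothetical toy data: a lone 1, a run of nine 1s, a run of two 1s
toy <- data.table(R = c(1, 0, rep(1, 9), 0, 1, 1))
toy[, Seq := seq_len(.N), by = rleid(R)]

# Within each run longer than 1, restart the counter after every 5 rows;
# single-row runs are kept in the table but get NA
toy[, Seq5 := if (.N > 1L) rep(seq_len(min(.N, 5L)), length.out = .N)
              else NA_integer_,
    by = rleid(R)]
print(toy)
```

Here the run of nine becomes 1:5 followed by 1:4, as described in the comment, and no rows are dropped. The same chunking is applied to runs of 0s longer than 1, so a real solution would still need to condition on `R`.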

1 Answer

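If I understand your question correctly, you want fix_min to be FALSE when R == 0, when a run of 1s has length 1, and at every sixth position of a longer run (Seq == 6, 12, ...), so that the run is broken into pieces of at most 5; a single 1 left over after such a break should be FALSE as well. Then the following should give you what you want: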

# recreating the data from your first code block
set.seed(1)
DT1 <- data.table(R=sample(0:1, 20000, rep=TRUE))[, smp := .I
    ][, Seq := seq(.N), by=rleid(R)
    ][, Seq2 := Seq[.N], by=rleid(R)]

# adding the needed 'fix_min' column
DT1[, fix_min := (R==1 & Seq[.N] > 1 & Seq%%6!=0), by=rleid(R)
    ][R==1 & Seq%%6==1 & Seq2%%6==1 & Seq==Seq2, fix_min := FALSE]

Explanation:

  • data.table(R=sample(0:1, 20000, rep=TRUE)) creates the base of the data.table
  • [, smp:=.I] creates an index and adds it to the data.table
  • by=rleid(R) identifies the sequences; to see what it does try: data.table(R=sample(0:1, 20000, rep=TRUE))[, seq.id:=rleid(R)]
  • [, Seq:=seq(.N), by=rleid(R)] creates an index for each sequence and adds it to the data.table; the sequences are identified by rleid(R)
  • [, Seq2 := Seq[.N], by=rleid(R)] creates a variable that contains a value indicating the length of the sequence
  • fix_min := (R==1 & Seq[.N] > 1 & Seq%%6!=0) creates a logical vector with TRUE values where R==1 & the length of the sequence is larger than one (Seq[.N] > 1) excluding the values where the sequence number is a multiple of 6 (Seq%%6!=0)
  • R==1 & Seq%%6==1 & Seq2%%6==1 & Seq==Seq2 filters the data.table as follows: only rows where R==1 & the sequence value is 7, 13, 19, etc (Seq%%6==1) & the length of the sequence is 7, 13, 19, etc and only selects the last row (Seq==Seq2) from the sequences that meet the other conditions. With fix_min := FALSE you set them to FALSE.
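As a sanity check, the same two steps can be run on a small hand-made vector and the resulting TRUE stretches of fix_min measured with `rle()`. The vector and the names `toy`, `runs` and `true_runs` are made up for illustration; this only checks the cases it contains (runs of length 1, 2, 7, 13 and 6) and is not a proof for every run length.

```r
library(data.table)

# Hypothetical toy data: runs of 1s with lengths 1, 2, 7, 13 and 6,
# separated by single 0s
toy <- data.table(R = c(1, 0, 1, 1, 0, rep(1, 7), 0, rep(1, 13), 0, rep(1, 6)))
toy[, Seq := seq(.N), by = rleid(R)]
toy[, Seq2 := Seq[.N], by = rleid(R)]

# Same two steps as in the answer above
toy[, fix_min := (R == 1 & Seq[.N] > 1 & Seq %% 6 != 0), by = rleid(R)
    ][R == 1 & Seq %% 6 == 1 & Seq2 %% 6 == 1 & Seq == Seq2, fix_min := FALSE]

# Lengths of the consecutive TRUE stretches in fix_min
runs <- rle(toy$fix_min)
true_runs <- runs$lengths[runs$values]
print(true_runs)
```

On this toy input every TRUE stretch has a length between 2 and 5, and the trailing singletons of the runs of length 7 and 13 end up FALSE.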

11 Comments

Well, no. If you look at DT1[19950:20000] I see a case, starting at 19989, that should be fix_min TRUE for 19989:19993, FALSE for 19994 and then TRUE for 19995:19997. Which is why, in my initial approach, I chose to index fix_min again as Seq2, rather than relying on Seq, though frankly I can say I was guessing.
@Chris See the new update. For the cases you mentioned it's now giving the correct result. Could you check if this is what you want?
@Jaap - Writing it out and visually inspecting, I find 21 cases of a trailing singleton 7 after a 6 (2103/4, 4834/5, 5703/4, 5802/3, 8468/9, 9275/6, 9956/7, 10493/4, 10822/3, 11835/6, 12618/9, 13055/6, 13353/4, 13551/2, 14308/9, 14423/4, 16389/90, 17449/50, 17834/5, 19803/4) where 6 was properly set to FALSE, and two other cases: 8869-8680, where a second run of 6 was allowed, and 13216-13228, where a second run of 7 was allowed, both after setting 6 to FALSE.
@Chris See the update. For that second run of 7 it still gives a TRUE for Seq==13 (which is not what you want, I guess). I'm trying to find a solution for that.
@Chris I think I found the correct solution. Can you check?