
Say I have the following data.table:

library(data.table)

DT <- data.table(R=sample(0:1, 10000, rep=TRUE), Seq=0)

Which returns something like:

       R Seq
    1: 1   0
    2: 1   0
    3: 0   0
    4: 0   0
    5: 1   0
   ---      
 9996: 1   0
 9997: 0   0
 9998: 0   0
 9999: 0   0
10000: 1   0

I want to generate a sequence (1, 2, 3, ..., n) that resets whenever R changes from the previous row. Think of it as counting the length of each streak of random numbers.

So the above would then look like:

       R Seq
    1: 1   1
    2: 1   2
    3: 0   1
    4: 0   2
    5: 1   1
   ---      
 9996: 1   5
 9997: 0   1
 9998: 0   2
 9999: 0   3
10000: 1   2

Thoughts?

  • You need to add set.seed here Commented Aug 20, 2014 at 22:54
  • Perhaps, although honestly it doesn't matter which set of random numbers gets generated; I just want to count the size of each streak, one by one Commented Aug 20, 2014 at 22:57

2 Answers


Here is an option:

set.seed(1)
DT <- data.table(R=sample(0:1, 10000, rep=TRUE), Seq=0L)
DT[, Seq:=seq(.N), by=list(cumsum(c(0, abs(diff(R)))))]
DT

We create a counter that increments every time your 0-1 variable changes, using cumsum(abs(diff(R))). Since diff returns a vector one element shorter than its input, the c(0, ...) prepends a zero so the counter is the same length as R. We then group by that counter with by. This produces:

       R Seq
    1: 0   1
    2: 0   2
    3: 1   1
    4: 1   2
    5: 0   1
   ---      
 9996: 1   1
 9997: 0   1
 9998: 1   1
 9999: 1   2
10000: 1   3

EDIT: Addressing request for clarification:

Let's look at the computation used in by, broken down into two new columns:

DT[, diff:=c(0, diff(R))]
DT[, cumsum:=cumsum(abs(diff))]
print(DT, topn=10)

Produces:

       R Seq diff cumsum
    1: 0   1    0      0
    2: 0   2    0      0
    3: 1   1    1      1
    4: 1   2    0      1
    5: 0   1   -1      2
    6: 1   1    1      3
    7: 1   2    0      3
    8: 1   3    0      3
    9: 1   4    0      3
   10: 0   1   -1      4
   ---                  
 9991: 1   2    0   5021
 9992: 1   3    0   5021
 9993: 1   4    0   5021
 9994: 1   5    0   5021
 9995: 0   1   -1   5022
 9996: 1   1    1   5023
 9997: 0   1   -1   5024
 9998: 1   1    1   5025
 9999: 1   2    0   5025
10000: 1   3    0   5025

You can see how the cumulative sum of the absolute differences increments by one each time R changes. We can then use that cumsum column to break the data.table into chunks and, for each chunk, generate a sequence with seq(.N) that counts up to the chunk's size (.N is exactly that: the number of rows in each by group).
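To trace the pieces in isolation, here is a minimal sketch on a six-element vector (same idea as above, just small enough to verify by hand):

```r
library(data.table)

# A small vector to trace the logic by hand
x <- c(1, 1, 0, 0, 0, 1)

# Group id: increments whenever x changes; the leading 0 keeps it length 6
grp <- cumsum(c(0, abs(diff(x))))
grp
# 0 0 1 1 1 2

# seq(.N) within each group gives the per-streak counter
dt <- data.table(R = x)
dt[, Seq := seq(.N), by = .(grp = cumsum(c(0, abs(diff(R)))))]
dt$Seq
# 1 2 1 2 3 1
```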


5 Comments

Ok, yup, that looks like it works. Can you explain what happened here?
Also, perhaps something like sequence(rle(.)$lengths) is faster?
Amazing, that's pretty clever
@alexis_laz - Looks like rle would actually slow things down considerably, but that's a neat answer regardless.
@thelatemail: Hm, indeed about rle; my initial assumption was that it avoided the per-group calculation
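For reference, the rle idea mentioned in the comments can be sketched in base R; rle(x)$lengths gives the length of each run, and sequence() expands each length n into 1:n (relative speed will depend on the data):

```r
# Base-R alternative using run-length encoding
x <- c(1, 1, 0, 0, 0, 1)
sequence(rle(x)$lengths)
# 1 2 1 2 3 1
```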

Old question, but just in case someone needs a faster and easier way:

DT[, Seq := rowid(rleid(R))]

Explanation:

  • rleid creates an index that increments every time a new run of consecutive values starts. So rleid(c('a','a','b','b','a','a')) returns 1 1 2 2 3 3
  • rowid creates, for each value, an index that increments every time that value is seen again (not necessarily consecutively). So rowid(c('a','a','b','b','a','a')) returns 1 2 1 2 3 4
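Putting the two together on a small vector (a minimal sketch of the one-liner above):

```r
library(data.table)

x <- c(1, 1, 0, 0, 0, 1)

rleid(x)         # 1 1 2 2 2 3  -- one id per run of consecutive values
rowid(rleid(x))  # 1 2 1 2 3 1  -- position within each run, i.e. the streak counter
```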

It takes only a fraction of a second even on 10 million rows.

