
Say I have the following data.table:

library(data.table)

DT <- data.table(R=sample(0:1, 10000, rep=TRUE), Seq=0)

Which returns something like:

       R Seq
    1: 1   0
    2: 1   0
    3: 0   0
    4: 0   0
    5: 1   0
   ---      
 9996: 1   0
 9997: 0   0
 9998: 0   0
 9999: 0   0
10000: 1   0

I want to generate a sequence (1, 2, 3, ..., n) that resets whenever R changes from the previous row. Think of it as counting the length of each streak of random numbers.

So the above would then look like:

       R Seq
    1: 1   1
    2: 1   2
    3: 0   1
    4: 0   2
    5: 1   1
   ---      
 9996: 1   5
 9997: 0   1
 9998: 0   2
 9999: 0   3
10000: 1   2

Thoughts?

  • You need to add set.seed here Commented Aug 20, 2014 at 22:54
  • Perhaps, although honestly it doesn't matter which set of random numbers gets generated; I just want to count the size of each streak, one by one Commented Aug 20, 2014 at 22:57

2 Answers


Here is an option:

set.seed(1)
DT <- data.table(R=sample(0:1, 10000, rep=TRUE), Seq=0L)
DT[, Seq:=seq(.N), by=list(cumsum(c(0, abs(diff(R)))))]
DT

We create a counter that increments every time your 0-1 variable changes, using cumsum(abs(diff(R))). Since diff returns a vector one element shorter than its input, the c(0, ...) prepends a zero so the counter is the same length as R. We then group by that counter with by. This produces:

       R Seq
    1: 0   1
    2: 0   2
    3: 1   1
    4: 1   2
    5: 0   1
   ---      
 9996: 1   1
 9997: 0   1
 9998: 1   1
 9999: 1   2
10000: 1   3

EDIT: Addressing request for clarification:

Let's look at the computation used in by, broken down into two new columns:

DT[, diff:=c(0, diff(R))]
DT[, cumsum:=cumsum(abs(diff))]
print(DT, topn=10)

Produces:

       R Seq diff cumsum
    1: 0   1    0      0
    2: 0   2    0      0
    3: 1   1    1      1
    4: 1   2    0      1
    5: 0   1   -1      2
    6: 1   1    1      3
    7: 1   2    0      3
    8: 1   3    0      3
    9: 1   4    0      3
   10: 0   1   -1      4
   ---                  
 9991: 1   2    0   5021
 9992: 1   3    0   5021
 9993: 1   4    0   5021
 9994: 1   5    0   5021
 9995: 0   1   -1   5022
 9996: 1   1    1   5023
 9997: 0   1   -1   5024
 9998: 1   1    1   5025
 9999: 1   2    0   5025
10000: 1   3    0   5025

You can see how the cumulative sum of the absolute differences increments by one each time R changes. We can then use that cumsum column to break the data.table into chunks and, for each chunk, generate a sequence with seq(.N) that counts up to the chunk's size (.N is exactly that: the number of rows in each by group).
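To trace the pieces in isolation, here is a minimal sketch on a six-element vector (same idea as above, just small enough to verify by hand):

```r
library(data.table)

# A small vector to trace the logic by hand
x <- c(1, 1, 0, 0, 0, 1)

# Group id: increments whenever x changes; the leading 0 keeps it length 6
grp <- cumsum(c(0, abs(diff(x))))
grp
# 0 0 1 1 1 2

# seq(.N) within each group gives the per-streak counter
dt <- data.table(R = x)
dt[, Seq := seq(.N), by = .(grp = cumsum(c(0, abs(diff(R)))))]
dt$Seq
# 1 2 1 2 3 1
```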


5 Comments

Ok, yup, that looks like it works. Can you explain what happened here?
Also, perhaps something like sequence(rle(.)$lengths) is faster?
Amazing, that's pretty clever
@alexis_laz - Looks like rle would actually slow things down considerably, but that's a neat answer regardless.
@thelatemail: Hm, indeed about rle; my initial assumption was that it avoided the per-group calculation
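For reference, the rle idea mentioned in the comments can be sketched in base R; rle(x)$lengths gives the length of each run, and sequence() expands each length n into 1:n (relative speed will depend on the data):

```r
# Base-R alternative using run-length encoding
x <- c(1, 1, 0, 0, 0, 1)
sequence(rle(x)$lengths)
# 1 2 1 2 3 1
```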

Old question, but just in case someone needs a faster and easier way:

DT[, Seq := rowid(rleid(R))]

Explanation:

  • rleid creates an index that increments every time a new run of consecutive values starts. So rleid(c('a','a','b','b','a','a')) returns 1 1 2 2 3 3
  • rowid creates, for each value, an index that increments every time that value is seen again (not necessarily consecutively). So rowid(c('a','a','b','b','a','a')) returns 1 2 1 2 3 4
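Putting the two together on a small vector (a minimal sketch of the one-liner above):

```r
library(data.table)

x <- c(1, 1, 0, 0, 0, 1)

rleid(x)         # 1 1 2 2 2 3  -- one id per run of consecutive values
rowid(rleid(x))  # 1 2 1 2 3 1  -- position within each run, i.e. the streak counter
```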

It takes only a fraction of a second even on 10 million rows.

