I want to count distinct values per id, following row order (time). For example, with:

library(data.table)
dt = data.table( id=c(1,1,1,2,2,2,3,3,3), hour=c(1,5,5,6,7,8,23,23,23), ip=c(1,1,45,2,2,2,3,1,1), target=c(1,0,0,1,1,1,1,1,0), day=c(1,1,1,1,1,1,3,2,1))

   id hour ip target day
1:  1    1  1      1   1
2:  1    5  1      0   1
3:  1    5 45      0   1
4:  2    6  2      1   1
5:  2    7  2      1   1
6:  2    8  2      1   1
7:  3   23  3      1   3
8:  3   23  1      1   2
9:  3   23  1      0   1

For each row, I want the number of distinct hours (and, likewise, distinct days) in which that id has already been seen, not counting the current row. That is, I want the following output:

   id hour ip target day  nb_active_hours_so_far
1:  1    1  1      1   1  0  (first occurrence of id when ordered by hour)
2:  1    5  1      0   1  1  (has been active in hour "1")
3:  1    5 45      0   1  2  (has been active in hour "1" and "5")
4:  2    6  2      1   1  0  (first occurrence)
5:  2    7  2      1   1  1  (has been active in hour "6")
6:  2    8  2      1   1  2  (has been active in hour "6" and "7")
7:  3   23  3      1   3  0  (first occurrence)
8:  3   23  1      1   2  1  (has been active in hour "23")
9:  3   23  1      0   1  1  (has been active in hour "23" only)

To get the total count of active hours I would do:

dt[, nb_active_hours := length(unique(hour)), by=id]

but I also want the running ("so far") version of this count, and I do not know how to compute it. Any help would be appreciated.

5 Comments
  • So you don't want 8 in id == 2? Commented Jun 29, 2015 at 10:31
  • The active part is not clear. Commented Jun 29, 2015 at 10:32
  • Sorry if it's unclear. I wish to count the number of unique 'hour' values seen so far (meaning for all rows before, not using current row). Commented Jun 29, 2015 at 10:39
  • @akrun By 'active', I just mean to count the number of unique values seen so far. This is a log of usage with 'hour' being the hour of usage. I want to be able to say: this id has been seen so far at 3 different hours, or 5 different days. Commented Jun 29, 2015 at 10:41
  • @David Arenburg I want to count only rows from the past, so not count the current row, hence I do not count hour=='8' there, but I would count it for the next time this user shows in the log. Commented Jun 29, 2015 at 10:43

3 Answers

This seems to work (though I haven't tested it on other cases):

dt[, nb_active_hours_so_far := cumsum(c(0:1, diff(hour[-.N]))>0), by = id]
#    id hour ip target day nb_active_hours_so_far
# 1:  1    1  1      1   1                      0
# 2:  1    5  1      0   1                      1
# 3:  1    5 45      0   1                      2
# 4:  2    6  2      1   1                      0
# 5:  2    7  2      1   1                      1
# 6:  2    8  2      1   1                      2
# 7:  3   23  3      1   3                      0
# 8:  3   23  1      1   2                      1
# 9:  3   23  1      0   1                      1
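
One caveat worth checking: for an id with a single row, c(0:1, diff(hour[-.N])) still has two elements, so the right-hand side no longer matches the group size. A minimal sketch of one way around that, assuming one-row ids can occur in the log (the id 4 row below is made up for illustration), is to truncate the result with seq_len(.N):

```r
library(data.table)

# Hypothetical data: id 4 appears only once
dt <- data.table(id = c(1, 1, 1, 4), hour = c(1, 5, 5, 9))

# Same expression as above, truncated to .N values so that a one-row
# group receives a single 0 instead of the two values c(0, 1)
dt[, nb_active_hours_so_far :=
     cumsum(c(0:1, diff(hour[-.N])) > 0)[seq_len(.N)],
   by = id]

print(dt)
```

Like the original expression, this still assumes hour is sorted in non-decreasing order within each id.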
2 Comments

@akrun that looks promising; I wonder why I didn't think of rleid. You should post it.
Thanks, I posted that as a separate solution. @Frank I took your shift version, hope you don't mind.
Yerk. I have this ugly solution:

library(data.table)
dt[, nb_active_hours_so_far := c(0, head(cumsum(c(1, diff(hour) > 0)), -1)), by = id][]

#   id hour ip target day nb_active_hours_so_far
#1:  1    1  1      1   1                      0
#2:  1    5  1      0   1                      1
#3:  1    5 45      0   1                      2
#4:  2    6  2      1   1                      0
#5:  2    7  2      1   1                      1
#6:  2    8  2      1   1                      2
#7:  3   23  3      1   3                      0
#8:  3   23  1      1   2                      1
#9:  3   23  1      0   1                      1
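
To see why this works, here is a rough step-by-step sketch of the same expression on the hours of a single id, using base R only. The intermediate variable names are mine, and it assumes the hours are already sorted in non-decreasing order:

```r
# Hours for one id, assumed sorted in non-decreasing order
hour <- c(1, 5, 5)

# TRUE wherever the hour increases relative to the previous row
is_new <- diff(hour) > 0

# Running count of distinct hours up to and including each row;
# the leading 1 accounts for the first row
seen_incl <- cumsum(c(1, is_new))

# Drop the last value and prepend 0 so the current row is not counted
seen_before <- c(0, head(seen_incl, -1))

print(seen_before)
```

For this id the result is 0, 1, 2, which matches rows 1 to 3 of the desired output.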

Or you could use the functions rleid and shift from the devel version of data.table, i.e. v1.9.5. Instructions to install the devel version are here. (Thanks to @Frank for the shift)

 library(data.table)
 dt[,nb_active_hours_so_far := shift(rleid(hour),fill=0L), id]
 #   id hour ip target day nb_active_hours_so_far
 #1:  1    1  1      1   1                      0
 #2:  1    5  1      0   1                      1
 #3:  1    5 45      0   1                      2
 #4:  2    6  2      1   1                      0
 #5:  2    7  2      1   1                      1
 #6:  2    8  2      1   1                      2
 #7:  3   23  3      1   3                      0
 #8:  3   23  1      1   2                      1
 #9:  3   23  1      0   1                      1
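
Note that rleid numbers runs of equal values, so it equals the distinct count only when repeated values sit next to each other (for 1, 2, 1 it gives 1, 2, 3). If that cannot be guaranteed, a possible variant, sketched here with a helper name I made up (n_distinct_before), is to build the count from duplicated() instead. The same helper then covers the day column from the question:

```r
library(data.table)

dt <- data.table(
  id   = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  hour = c(1, 5, 5, 6, 7, 8, 23, 23, 23),
  day  = c(1, 1, 1, 1, 1, 1, 3, 2, 1)
)

# Hypothetical helper: number of distinct values in the rows before each row
n_distinct_before <- function(x) {
  # cumsum(!duplicated(x)) counts distinct values up to and including each row;
  # dropping the last element and prepending 0 excludes the current row
  c(0L, head(cumsum(!duplicated(x)), -1L))
}

dt[, `:=`(
  nb_active_hours_so_far = n_distinct_before(hour),
  nb_active_days_so_far  = n_distinct_before(day)
), by = id]

print(dt)
```

This does not depend on the sort order of hour or day within an id, and it also returns a single 0 for an id that has only one row.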

1 Comment

@DavidArenburg I guess both would have the same performance.
