7

The data comes from another question I was playing around with:

dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
                 country=c(rep(1,4),rep(2,6)),
                 event=1:10, key="user")
#    user country event
#1:     3       1     1
#2:     3       1     2
#3:     3       1     3
#4:     3       1     4
#5:     3       2     5
#6:     4       2     6
#7:     4       2     7
#8:     4       2     8
#9:     4       2     9
#10:    4       2    10

And here's the surprising behavior:

dt[user == 3, as.data.frame(table(country))]
#  country Freq
#1       1    4
#2       2    1

dt[user == 4, as.data.frame(table(country))]
#  country Freq
#1       2    5

dt[, as.data.frame(table(country)), by = user]
#   user country Freq
#1:    3       1    4
#2:    3       2    1
#3:    4       1    5
#             ^^^ - why is this 1 instead of 2?!

Thanks mnel and Victor K. The natural follow-up is: shouldn't it be 2, i.e. is this a bug? I expected

dt[, blah, by = user]

to return the same result as

rbind(dt[user == 3, blah], dt[user == 4, blah])

Is that expectation incorrect?

5 Comments
  • Is country in as.data.frame(table(country)) a factor? If so, it is because the levels aren't the same in both groups. Commented Apr 24, 2013 at 20:44
  • @mnel, while you are right that it is due to as.data.frame coercing to factor, the expected behavior would be for the value to represent the label. I think this is probably the same thing going on as with rbindlist: stackoverflow.com/questions/15933846/… Commented Apr 24, 2013 at 21:02
  • @eddi, see update to my answer. Commented Apr 24, 2013 at 22:06
  • @eddi -- I think the issues with the naive use of factors within j when using by might be worth a FAQ entry. It isn't a bug; it is documented behaviour (see FAQ 2.3, though it really needs more explanation), and it is also consistent with naive use of factors in many cases. Commented Apr 25, 2013 at 9:30
  • @mnel so you're saying that I should not expect rbind and by above to return the same result for any blah and should instead expect their equality to be blah-dependent? Commented Apr 25, 2013 at 19:19

2 Answers

7

The idiomatic data.table approach is to use .N:

 dt[ , .N, by = list(user, country)]

This will be far quicker, and it keeps country in the same class as in the original instead of converting it to a factor.
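For reference, here is a minimal sketch of that call on the question's dt, assuming the data.table package is installed. The name counts is only illustrative and is not part of the original question.

```r
library(data.table)

# Rebuild the table from the question
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event = 1:10, key = "user")

# Count rows per (user, country) pair; .N is the number of rows in each group
counts <- dt[, .N, by = list(user, country)]
print(counts)

# country never passes through table(), so it is still a plain numeric column
class(counts$country)
```

Because no factor is created along the way, the per-group results are combined by value, and user 4 is reported with country 2.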

5

As mnel noted in comments, as.data.frame(table(...)) produces a data frame where the first variable is a factor. For user == 4, there is only one level in the factor, which is stored internally as 1.

What you want is factor levels, but what you get is how factors are stored internally (as integers, starting from 1). The following provides the expected result:

> dt[, lapply(as.data.frame(table(country)), as.character), by = user]
   user country Freq
1:    3       1    4
2:    3       2    1
3:    4       2    5
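To see the difference between levels and internal codes without data.table, here is a small base R sketch. The vectors country_user3 and country_user4 are made-up stand-ins for the country column within each user group; they do not appear in the question.

```r
# Hypothetical stand-ins for the country values of each user group
country_user3 <- c(1, 1, 1, 1, 2)
country_user4 <- c(2, 2, 2, 2, 2)

tab3 <- as.data.frame(table(country = country_user3))
tab4 <- as.data.frame(table(country = country_user4))

# The first column is a factor in both cases, but the level sets differ
levels(tab3$country)   # "1" "2"
levels(tab4$country)   # "2"

# The only level of tab4, "2", is stored as integer code 1
as.integer(tab4$country)     # 1
as.character(tab4$country)   # "2"
```

Converting to character before the groups are combined, as in the lapply call above, carries the labels along instead of the codes.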

Update. Regarding your second question: no, I think the data.table behaviour is correct. The same thing happens in plain R when you combine two factors with different levels:

> a <- factor(3:5)
> b <- factor(6:8)
> a
[1] 3 4 5
Levels: 3 4 5
> b
[1] 6 7 8
Levels: 6 7 8
> c(a,b)
[1] 1 2 3 1 2 3
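If the goal is to combine factors by label instead of by internal code, one common workaround is to go through character first. This is a minimal sketch reusing a and b from above; the name combined is illustrative. Note that the c(a, b) output shown above is from the R version current when this was written. As far as I know, R 4.1.0 and later merge the levels when c() is called on factors, so the output there would differ.

```r
a <- factor(3:5)
b <- factor(6:8)

# Convert each factor to its labels, then build one factor over all labels
combined <- factor(c(as.character(a), as.character(b)))

combined
levels(combined)
```

The character round trip does not depend on how c() treats factors, so it should give the same result on old and new R versions.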

3 Comments

  • As a note of interest, dt[, lapply(as.data.frame.table(country), as.character), by = user] gives an error.
  • But this probably has nothing to do with data.table: e.g. as.data.frame.table(dt$country) produces the same error.
  • @VictorK. Ok, do you then think rbind does the wrong thing?
