data.table and table unexpected behavior

Question

The data comes from another question I was playing around with:

library(data.table)

dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
                 country=c(rep(1,4),rep(2,6)),
                 event=1:10, key="user")
#    user country event
#1:     3       1     1
#2:     3       1     2
#3:     3       1     3
#4:     3       1     4
#5:     3       2     5
#6:     4       2     6
#7:     4       2     7
#8:     4       2     8
#9:     4       2     9
#10:    4       2    10

And here's the surprising behavior:

dt[user == 3, as.data.frame(table(country))]
#  country Freq
#1       1    4
#2       2    1

dt[user == 4, as.data.frame(table(country))]
#  country Freq
#1       2    5

dt[, as.data.frame(table(country)), by = user]
#   user country Freq
#1:    3       1    4
#2:    3       2    1
#3:    4       1    5
#             ^^^ - why is this 1 instead of 2?!

Thanks, mnel and Victor K. The natural follow-up: shouldn't it be 2, i.e. is this a bug? I expected

dt[, blah, by = user]

to return a result identical to

rbind(dt[user == 3, blah], dt[user == 4, blah])

Is that expectation incorrect?

Is country in as.data.frame(table(country)) a factor? If so, this happens because the levels aren't the same in both groups.
– mnel, Commented Apr 24, 2013 at 20:44
@mnel, while you are right that this is due to as.data.frame coercing to factor, the expected behavior would be for the value to represent the label. I think this is probably the same thing that is going on with rbindlist: stackoverflow.com/questions/15933846/…
– Ricardo Saporta, Commented Apr 24, 2013 at 21:02
@eddi, I think the issues with the naive use of factors within j when using by might be worth a FAQ entry. It isn't a bug; it is documented behaviour (see FAQ 2.3, though it really needs more explanation). It is also consistent with naive use of factors in many cases.
– mnel, Commented Apr 25, 2013 at 9:30
@mnel so you're saying that I should not expect rbind and by above to return the same result for any blah and should instead expect their equality to be blah-dependent?
– eddi, Commented Apr 25, 2013 at 19:19

mnel · Accepted Answer · 2013-04-24 21:47:43Z

7

The idiomatic data.table approach is to use .N:

dt[, .N, by = list(user, country)]

This will be far quicker, and it keeps country in the same class as in the original data.
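
For reference, here is a minimal sketch of what the .N approach returns on the data from the question. It assumes the data.table package is installed; the name counts is only for illustration.

```r
library(data.table)

# Rebuild the table from the question
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event = 1:10, key = "user")

# One row per (user, country) pair, with the group size in column N
counts <- dt[, .N, by = list(user, country)]
print(counts)

# country is never turned into a factor, so it keeps its numeric values
print(class(counts$country))
```

Because no factor is created along the way, the third row reports country 2 for user 4, which is the value the question expected.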

answered Apr 24, 2013 at 21:47

mnel


Victor K. · Accepted Answer · 2013-04-24 22:06:13Z

5

As mnel noted in the comments, as.data.frame(table(...)) produces a data frame whose first column is a factor. For user == 4, that factor has only one level ("2"), which is stored internally as the integer code 1.

What you want is factor levels, but what you get is how factors are stored internally (as integers, starting from 1). The following provides the expected result:

> dt[, lapply(as.data.frame(table(country)), as.character), by = user]
   user country Freq
1:    3       1    4
2:    3       2    1
3:    4       2    5
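
The difference between a factor's label and its internal code can be seen in base R alone. The sketch below mimics the user == 4 group with a plain vector; the names country and tab are only illustrative.

```r
# Same country values as the user == 4 group in the question
country <- c(2, 2, 2, 2, 2)

tab <- as.data.frame(table(country))

# The country column is a factor with the single level "2"
print(is.factor(tab$country))
print(levels(tab$country))

# Internal integer code of that level: this is the 1 seen in the by = user result
print(as.integer(tab$country))

# The label, and a way to recover the original number from it
print(as.character(tab$country))
print(as.numeric(as.character(tab$country)))
```

Converting through as.character before any numeric conversion is what makes the lapply(..., as.character) version above return the labels rather than the codes.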

Update. Regarding your second question: no, I think the data.table behaviour is correct. The same thing happens in plain R when you join two factors with different levels:

> a <- factor(3:5)
> b <- factor(6:8)
> a
[1] 3 4 5
Levels: 3 4 5
> b
[1] 6 7 8
Levels: 6 7 8
> c(a,b)
[1] 1 2 3 1 2 3
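
A caveat on the output above: as far as I am aware, from R 4.1.0 onward c() on two factors returns a factor with the combined levels, so the integer result shown here reflects older versions of R. The sketch below should not depend on the R version, because it separates the codes from the labels explicitly. It also shows that rbind on data frames merges the levels, which is what the rbind expectation in the question relies on. The names codes, labels and d are only illustrative.

```r
a <- factor(3:5)
b <- factor(6:8)

# Joining the internal codes loses the labels: 1 2 3 1 2 3
codes <- c(as.integer(a), as.integer(b))
print(codes)

# Joining the labels and rebuilding the factor keeps the values 3..8
labels <- factor(c(as.character(a), as.character(b)))
print(labels)

# rbind on data frames merges the factor levels rather than the codes
d <- rbind(data.frame(x = a), data.frame(x = b))
print(as.character(d$x))
print(levels(d$x))
```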

edited Apr 24, 2013 at 22:06

answered Apr 24, 2013 at 20:55

Victor K.


3 Comments

G. Grothendieck Over a year ago

As a note of interest dt[, lapply(as.data.frame.table(country), as.character), by = user] gives an error.

Victor K. Over a year ago

But this probably has nothing to do with data.table: e.g. as.data.frame.table(dt$country) produces the same error.

eddi Over a year ago

@VictorK. Ok, do you then think rbind does the wrong thing?
