1

When I have to add missing observations I usually use CJ in data.table after setting the keys.

In the following example, two columns are the combinations of each other, and therefore take the same values. I want to add the missing combinations. What is happening when I use "only" l instead of l[1:3]?

Data:

library(data.table)

l <- letters[1:3]

DT <- CJ(from = l, to = l)
DT <- DT[, .(year = 1994:1995), .(from, to)]
DT[, g := runif(.N)]

# drop some rows
DT <- DT[-c(2, 6)]

# set the keys
setkey(DT, from, to, year)


head(DT)
#    from to year         g
# 1:    a  a 1994 0.6436420
# 2:    a  b 1994 0.6974629
# 3:    a  b 1995 0.5686354
# 4:    a  c 1994 0.6436957
# 5:    b  a 1994 0.6219831
# 6:    b  a 1995 0.7191914

# unexpected result:
DT[CJ(l,l,1994:1995)]
#     from to year         g l
#  1:    a  a 1994 0.6436420 a
#  2:    a  a 1995 0.6436420 a
#  3:    a  a 1994 0.6436420 b
#  4:    a  a 1995 0.6436420 b
#  5:    a  a 1994 0.6436420 c
#  6:    a  a 1995        NA c
#  7:    b  b 1994 0.7617087 a
#  8:    b  b 1995 0.7617087 a
#  9:    b  b 1994 0.7617087 b
# 10:    b  b 1995 0.7617087 b
# 11:    b  b 1994 0.7617087 c
# 12:    b  b 1995 0.8479245 c
# 13:    c  c 1994 0.4891899 a
# 14:    c  c 1995 0.4891899 a
# 15:    c  c 1994 0.4891899 b
# 16:    c  c 1995 0.4891899 b
# 17:    c  c 1994 0.4891899 c
# 18:    c  c 1995 0.7520622 c

# what I expected:
DT[CJ(l[1:3],l[1:3],1994:1995)]

#     from to year          g
#  1:    a  a 1994 0.64364200
#  2:    a  a 1995         NA
#  3:    a  b 1994 0.69746294
#  4:    a  b 1995 0.56863539
#  5:    a  c 1994 0.64369566
#  6:    a  c 1995         NA
#  7:    b  a 1994 0.62198311
#  8:    b  a 1995 0.71919139
#  9:    b  b 1994 0.76170866
# 10:    b  b 1995 0.84792449
# 11:    b  c 1994 0.15793127
# 12:    b  c 1995 0.26623733
# 13:    c  a 1994 0.89921463
# 14:    c  a 1995 0.55417635
# 15:    c  b 1994 0.38938166
# 16:    c  b 1995 0.03778206
# 17:    c  c 1994 0.48918988
# 18:    c  c 1995 0.75206221
4
  • 3
    I think it comes from the default sorted=TRUE in CJ. Try DT[CJ(l,l,1994:1995, sorted=FALSE)] Commented Feb 3, 2021 at 23:25
  • 1
    hi @desval, FWIW, i have edited your post to use some simplifications to accomplish the same code, hopefully these tips are helpful Commented Feb 4, 2021 at 7:58
  • The more I look at it, the more I think this is buggy behavior. I filed this issue in case you are interested in a more stylized version: github.com/Rdatatable/data.table/issues/4891 Commented Feb 4, 2021 at 8:27
  • thanks a lot, and sorry for the sloppy example. Indeed, it is somewhat inconsistent that when using l[1:3] the behavior changes Commented Feb 4, 2021 at 8:31

1 Answer 1

2

I believe the issue is that CJ(l, l, 1994:1995) has duplicate names. This is hinted at by verbose=TRUE:

DT[CJ(l,l,1994:1995), verbose=TRUE]
# forder.c received a vector type 'character' length 3
# forder.c received a vector type 'character' length 3
# forder.c received a vector type 'integer' length 2
# i.l has same type (character) as x.from. No coercion needed.
# i.l has same type (character) as x.to. No coercion needed.
# i.V3 has same type (integer) as x.year. No coercion needed.
# on= matches existing key, using key
# Starting bmerge ...
# bmerge done in 0.000s elapsed (0.000s cpu) 
# Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 

This is in a gray area between being a bug or not... better behavior might be to error instead of proceed with potentially wrong results.

Anyway, you can get around this by naming the CJ arguments:

DT[CJ(from = l, to = l, year = 1994:1995)]
#     from to year          g
#  1:    a  a 1994 0.64364200
#  2:    a  a 1995         NA
#  3:    a  b 1994 0.69746294
#  4:    a  b 1995 0.56863539
#  5:    a  c 1994 0.64369566
#  6:    a  c 1995         NA
#  7:    b  a 1994 0.62198311
#  8:    b  a 1995 0.71919139
#  9:    b  b 1994 0.76170866
# 10:    b  b 1995 0.84792449
# 11:    b  c 1994 0.15793127
# 12:    b  c 1995 0.26623733
# 13:    c  a 1994 0.89921463
# 14:    c  a 1995 0.55417635
# 15:    c  b 1994 0.38938166
# 16:    c  b 1995 0.03778206
# 17:    c  c 1994 0.48918988
# 18:    c  c 1995 0.75206221

Note that we could also accomplish this without keys:

setkey(DT, NULL)
# for those more familiar with SQL syntax, this is a NATURAL JOIN;
#   it's equivalent to `on = c("from", "to", "year")`
DT[CJ(from = l, to = l, year = 1994:1995), on = .NATURAL]
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.