Binary search for integer64 in data.table

Question

I have a integer64 indexed data.table object:

library(data.table)
library(bit64)

some_data = as.integer64(c(1514772184120000026, 1514772184120000068, 1514772184120000042, 1514772184120000078,1514772184120000011, 1514772184120000043, 1514772184120000094, 1514772184120000085,
1514772184120000083, 1514772184120000017, 1514772184120000013, 1514772184120000060, 1514772184120000032, 1514772184120000059, 1514772184120000029))

#
n <- 10
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)

Then I have a bunch of other integer64 that I need to binary-search for in the indexes of my original data.table object (x):

search_values <- some_data[(n+1):length(some_data)]

If these where native integers I could use findInterval() to solve the problem:

values_index  <- findInterval(search_values, x$new_col)

but when the arguments to findInterval are integer64, I get:

Warning messages:
1: In as.double.integer64(vec) :
  integer precision lost while converting to double
2: In as.double.integer64(x) :
  integer precision lost while converting to double

and wrong indexes:

> values_index
[1] 10 10 10 10 10

e.g. it is not true that the entries of search_values are all larger than all entries of x$new_col.

Edit:

Desired output:

print(values_index)
9 10  6 10  1

Why?:

value_index has as many entries as search_values. For each entries of search_values, the corresponding entry in value_index gives the rank that entry of search_values would have if it where inserted inside x$new_col. So the first entry of value_index is 9 because the first entry of search_values (1514772184120000045) would have rank 9 among the entries of x$new_col.

can you edit your example with fixed now variable and show your desired output? — minem
– minem, Commented Aug 8, 2018 at 11:04

minem · Accepted Answer · 2018-08-09 11:45:11Z

1

Maybe you want something like this:

findInterval2 <- function(y, x) {
  toadd <- y[!(y %in% x$new_col)] # search_values that is not in data
  x2 <- copy(x)
  x2[, i := .I] # mark the original data set
  x2 <- rbindlist(list(x2, data.table(new_col = toadd)),
                  use.names = T, fill = T) # add missing search_values
  setkey(x2, new_col) # order
  x2[, index := cumsum(!is.na(i))]
  x2[match(y, new_col), index]
}
# x2 is:
#              a             new_col  i index
#  1: 0.56602278 1514772184120000011  1     1
#  2:         NA 1514772184120000013 NA     1
#  3: 0.29408237 1514772184120000017  2     2
#  4: 0.28532378 1514772184120000026  3     3
#  5:         NA 1514772184120000029 NA     3
#  6:         NA 1514772184120000032 NA     3
#  7: 0.66844754 1514772184120000042  4     4
#  8: 0.83008829 1514772184120000043  5     5
#  9:         NA 1514772184120000059 NA     5
# 10:         NA 1514772184120000060 NA     5
# 11: 0.76992760 1514772184120000068  6     6
# 12: 0.57049677 1514772184120000078  7     7
# 13: 0.14406169 1514772184120000083  8     8
# 14: 0.02044602 1514772184120000085  9     9
# 15: 0.68016024 1514772184120000094 10    10
findInterval2(search_values, x)
# [1] 1 5 3 5 3

If not, then maybe you could change the code as needed.

update

look at this integer example to see that this function gives the same result as base findInterval

now <- 10
n <- 10
n2 <- 10
some_data = as.integer(now + sample.int(n + n2, n + n2))
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)
search_values <- some_data[(n + 1):length(some_data)]

r1 <- findInterval2(search_values, x)
r2 <- findInterval(search_values, x$new_col)
all.equal(r1, r2)

edited Aug 9, 2018 at 11:45

answered Aug 9, 2018 at 9:52

minem

3,6502 gold badges19 silver badges31 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

user189035 Over a year ago

Thanks, but as you can see the output is not the expected one.

minem Over a year ago

@user189035 look at your example data then once more! there is not 1514772184120000045 value.

user189035 Over a year ago

to see the problem do the following changes in your test code: now<- as.integer64(1514772184120000026); some_data = as.integer64(now + sample.int(n + n2, n + n2)). In other words: the problem is when the data is in integer64 format. I'm sure findInterval/findInterval2 work with the native R data formats

minem Over a year ago

@ no, you do not understand what I am saying. Currently findInterval2 should work for integer64 and integer. Test it out. the example of integers is only to show that the function does what findInterval do. but maybe you all along did want some kind of different results?

user189035 Over a year ago

sorry. Please gimme a bit of time to evaluate your answer (I moved to a different sub-task: I will come back to this asap)

minem · Accepted Answer · 2018-08-08 13:42:54Z

1

If I get what you want, then a quick workaround could be:

toadd <- search_values[!(search_values %in% x$new_col)] # search_values that is not in data
x[, i := .I] # mark the original data set
x <- rbindlist(list(x, data.table(new_col = toadd)),
               use.names = T, fill = T) # add missing search_values
setkey(x, new_col) # order
x[, index := new_col %in% search_values] # mark where the values are
x[, index := cumsum(index)] # get indexes
x <- x[!is.na(i)] # remove added rows
x$index # should contain your desired output

answered Aug 8, 2018 at 13:42

minem

3,6502 gold badges19 silver badges31 bronze badges

1 Comment

user189035 Over a year ago

Sorry I had to run and could just see your comment now. Saddly x$index is not what I am after. I edited the question.

Collectives™ on Stack Overflow

Binary search for integer64 in data.table

Edit:

2 Answers 2

update

5 Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

Edit:

2 Answers 2

update

5 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related