0

I have a integer64 indexed data.table object:

library(data.table)
library(bit64)

some_data = as.integer64(c(1514772184120000026, 1514772184120000068, 1514772184120000042, 1514772184120000078,1514772184120000011, 1514772184120000043, 1514772184120000094, 1514772184120000085,
1514772184120000083, 1514772184120000017, 1514772184120000013, 1514772184120000060, 1514772184120000032, 1514772184120000059, 1514772184120000029))

#
n <- 10
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)

Then I have a bunch of other integer64 that I need to binary-search for in the indexes of my original data.table object (x):

search_values <- some_data[(n+1):length(some_data)]

If these where native integers I could use findInterval() to solve the problem:

values_index  <- findInterval(search_values, x$new_col)

but when the arguments to findInterval are integer64, I get:

Warning messages:
1: In as.double.integer64(vec) :
  integer precision lost while converting to double
2: In as.double.integer64(x) :
  integer precision lost while converting to double

and wrong indexes:

> values_index
[1] 10 10 10 10 10

e.g. it is not true that the entries of search_values are all larger than all entries of x$new_col.

Edit:

Desired output:

print(values_index)
9 10  6 10  1

Why?:

value_index has as many entries as search_values. For each entries of search_values, the corresponding entry in value_index gives the rank that entry of search_values would have if it where inserted inside x$new_col. So the first entry of value_index is 9 because the first entry of search_values (1514772184120000045) would have rank 9 among the entries of x$new_col.

3
  • 1
    can you edit your example with fixed now variable and show your desired output? Commented Aug 8, 2018 at 11:04
  • @minem: did my edits address your comments? Commented Aug 8, 2018 at 11:55
  • and what is your desired output? Commented Aug 8, 2018 at 12:13

2 Answers 2

1

Maybe you want something like this:

findInterval2 <- function(y, x) {
  toadd <- y[!(y %in% x$new_col)] # search_values that is not in data
  x2 <- copy(x)
  x2[, i := .I] # mark the original data set
  x2 <- rbindlist(list(x2, data.table(new_col = toadd)),
                  use.names = T, fill = T) # add missing search_values
  setkey(x2, new_col) # order
  x2[, index := cumsum(!is.na(i))]
  x2[match(y, new_col), index]
}
# x2 is:
#              a             new_col  i index
#  1: 0.56602278 1514772184120000011  1     1
#  2:         NA 1514772184120000013 NA     1
#  3: 0.29408237 1514772184120000017  2     2
#  4: 0.28532378 1514772184120000026  3     3
#  5:         NA 1514772184120000029 NA     3
#  6:         NA 1514772184120000032 NA     3
#  7: 0.66844754 1514772184120000042  4     4
#  8: 0.83008829 1514772184120000043  5     5
#  9:         NA 1514772184120000059 NA     5
# 10:         NA 1514772184120000060 NA     5
# 11: 0.76992760 1514772184120000068  6     6
# 12: 0.57049677 1514772184120000078  7     7
# 13: 0.14406169 1514772184120000083  8     8
# 14: 0.02044602 1514772184120000085  9     9
# 15: 0.68016024 1514772184120000094 10    10
findInterval2(search_values, x)
# [1] 1 5 3 5 3

If not, then maybe you could change the code as needed.

update

look at this integer example to see that this function gives the same result as base findInterval

now <- 10
n <- 10
n2 <- 10
some_data = as.integer(now + sample.int(n + n2, n + n2))
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)
search_values <- some_data[(n + 1):length(some_data)]

r1 <- findInterval2(search_values, x)
r2 <- findInterval(search_values, x$new_col)
all.equal(r1, r2)
Sign up to request clarification or add additional context in comments.

5 Comments

Thanks, but as you can see the output is not the expected one.
@user189035 look at your example data then once more! there is not 1514772184120000045 value.
to see the problem do the following changes in your test code: now<- as.integer64(1514772184120000026); some_data = as.integer64(now + sample.int(n + n2, n + n2)). In other words: the problem is when the data is in integer64 format. I'm sure findInterval/findInterval2 work with the native R data formats
@ no, you do not understand what I am saying. Currently findInterval2 should work for integer64 and integer. Test it out. the example of integers is only to show that the function does what findInterval do. but maybe you all along did want some kind of different results?
sorry. Please gimme a bit of time to evaluate your answer (I moved to a different sub-task: I will come back to this asap)
1

If I get what you want, then a quick workaround could be:

toadd <- search_values[!(search_values %in% x$new_col)] # search_values that is not in data
x[, i := .I] # mark the original data set
x <- rbindlist(list(x, data.table(new_col = toadd)),
               use.names = T, fill = T) # add missing search_values
setkey(x, new_col) # order
x[, index := new_col %in% search_values] # mark where the values are
x[, index := cumsum(index)] # get indexes
x <- x[!is.na(i)] # remove added rows
x$index # should contain your desired output

1 Comment

Sorry I had to run and could just see your comment now. Saddly x$index is not what I am after. I edited the question.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.