
How to select values from a data.table based on a vector of column indexes

I have an integer vector of the same length as the number of rows in a data.table:

set.seed(100)
col.indexes <- sample(c(1:4), 150, replace = TRUE)

How can I create a vector of values based on it, i.e. get the result of this for loop without the loop:

iris <- setDT(iris)
res <- c()
for(i in 1:150) {
  res[i] <- iris[i, .SD, .SDcols = col.indexes[i]]
}
res <- unlist(res)

This is loosely based on this question: How to subset the next column in R.

3 Answers


We can group by the sequence of rows and extract the value from each row's target column:

res <- iris[, col1 := col.indexes][, .SD[[col1[1]]], 1:nrow(iris)]$V1
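To see the mechanic: grouping by `1:nrow(...)` makes each row its own group, so inside a group `.SD[[col1[1]]]` picks the column whose number is stored in `col1` for that row. A minimal sketch on a toy table (column names are illustrative):

```r
library(data.table)

dt <- data.table(a = 1:3, b = 4:6, c = 7:9)
dt[, col1 := c(2L, 1L, 3L)]              # which column to pick in each row
# each group is one row; .SD[[col1[1]]] selects that row's target column
dt[, .SD[[col1[1]]], by = 1:nrow(dt)]$V1
# [1] 4 2 9
```

Note that `col1` itself becomes part of `.SD`, which is harmless here because the index values only reference the original data columns.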

Or, in base R, it can be done in a vectorized way with matrix indexing:

iris <- setDF(iris)
iris[1:4][cbind(seq_len(nrow(iris)), col.indexes)]
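The base R line works because indexing a data.frame with a two-column integer matrix treats each row of the matrix as a (row, column) pair and returns the corresponding elements as a vector. A minimal sketch on a toy data.frame (names are illustrative):

```r
# Matrix indexing: each row of the index matrix is a (row, column) pair
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
idx <- c(2L, 1L, 3L)  # column 2 from row 1, column 1 from row 2, column 3 from row 3
df[cbind(seq_len(nrow(df)), idx)]
# [1] 4 2 9
```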

1 Comment

For reference, see the FR 657 discussion on making the base R approach an option in data.table: github.com/Rdatatable/data.table/issues/657

Here's a complicated answer using melt and a join; using a data.frame is better suited for this task:

library(data.table)
dt <- as.data.table(iris)

dt[, ID := .I]
dt[, Species := NULL]

melt(dt, id.vars = 'ID'
     )[, variable := as.integer(variable)
       ][data.frame(col.indexes, ID = seq_len(150))
         , on = .(ID, variable = col.indexes)
         , value
         ]
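As a sanity check, the melt-and-join result lines up with direct per-row extraction. A self-contained sketch reproducing the question's setup (`col.indexes` as defined there):

```r
library(data.table)

set.seed(100)
col.indexes <- sample(1:4, 150, replace = TRUE)

dt <- as.data.table(iris)
dt[, ID := .I]
dt[, Species := NULL]

# melt to long form, then join on (row ID, column index) pairs
res_melt <- melt(dt, id.vars = 'ID'
  )[, variable := as.integer(variable)
    ][data.table(col.indexes, ID = seq_len(150))
      , on = .(ID, variable = col.indexes)
      , value]

# direct extraction for comparison
res_direct <- sapply(seq_len(150), function(i) iris[[col.indexes[i]]][i])
all.equal(res_melt, res_direct)
# [1] TRUE
```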

Here are benchmarks at the question's 150 rows; @akrun's base method does awesome:

# A tibble: 7 x 13
  expression           min   median `itr/sec` mem_alloc
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 akrun_dt          4.49ms   4.78ms     190.   107.84KB
2 akrun_base         122us  127.4us    7575.     8.44KB
3 cole_melt         3.99ms   4.24ms     233.   271.41KB
4 Pavo_diag         3.32ms   3.45ms     283.   449.44KB
5 OP_loop          83.08ms  84.03ms      11.9    4.86MB
6 OP_loop_dt_mod    1.32ms   1.36ms     712.    13.76KB
7 OP_loop_mat_mod  373.9us  389.2us    2472.    17.17KB

I also ran 1E5 rows per @Bulat's comment. @PavoDive's method errored at that size, so I excluded it.

# A tibble: 7 x 13
  expression           min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 akrun_dt           2.19s    2.19s    0.456     6.58MB    1.37      1     3
2 akrun_base        2.73ms   2.88ms   58.5       8.79MB    7.63     46     6
3 cole_melt        30.53ms  33.63ms   29.6      17.79MB    1.97     15     1
4 OP_loop            1.07m    1.07m    0.0156    3.16GB    0.810     1    52
5 OP_loop_df_mod  991.45ms 991.45ms    1.01       3.3MB    2.02      1     2
6 OP_loop_dt_mod     1.07s    1.07s    0.930      3.3MB    1.86      1     2
7 OP_loop_mat_mod 218.95ms 235.98ms    4.29      4.58MB    1.43      3     1

Then I upped it to 1E7 rows:

# A tibble: 2 x 13
  expression   min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 akrun_base 2.21s  2.21s     0.452   877.4MB    0.904     1     2      2.21s
2 cole_melt  4.88s  4.88s     0.205    1.71GB    0.410     1     2      4.88s

Complete code for benchmarks:

library(data.table)

set.seed(100)
ind <- 1E5

col.indexes <- sample(c(1:4), ind, replace = TRUE)
dt1 <- as.data.table(iris[sample(nrow(iris), ind, replace = TRUE), ])

bench::mark(
  akrun_dt = {
    dt <- copy(dt1)
    dt[, col1 := col.indexes][, .SD[[col1[1]]], 1:nrow(dt)]$V1
  }
  ,
  akrun_base = {
    DF <- copy(dt1)
    setDF(DF)
    DF[1:4][cbind(seq_len(nrow(DF)), col.indexes)]
  }
  ,
  cole_melt = {
    dt <- copy(dt1)
    dt[, ID := .I]
    dt[, Species := NULL]

    melt(dt, id.vars = 'ID'
    )[, variable := as.integer(variable)
      ][data.frame(col.indexes, ID = seq_len(ind))
        , on = .(ID, variable = col.indexes)
        , value
        ]
  }
  # ,Pavo_diag = {
  #   diag(as.matrix(dt1[, .SD, .SDcols = col.indexes]))
  # }
  ,
  OP_loop = {
    res <- c()

    for(i in seq_len(ind)) {
      res[i] <- dt1[i, .SD, .SDcols = col.indexes[i]]
    }
    unlist(res)
  }
  ,
  OP_loop_df_mod = {
    sapply(seq_len(ind), function(i) DF[[col.indexes[i]]][i])
  }
  ,
  OP_loop_dt_mod = {
    sapply(seq_len(ind), function(i) dt1[[col.indexes[i]]][i])
  }
  ,
  OP_loop_mat_mod = {
    mat <- as.matrix(DF[1:4])
    colnames(mat) <- NULL
    unlist(lapply(seq_len(ind), function(i) mat[i, col.indexes[i]]), use.names = FALSE)
  }
)

3 Comments

I think data.table benchmarks start to make sense at over 10^5 rows.
@Bulat see edit. Base R is still the way to go for this.
Thanks. It's interesting that there seems to be no native way to do this in data.table. I was expecting it to work similarly to base, but couldn't make it work.

I see another option:

res2 <- diag(as.matrix(iris[, .SD, .SDcols = col.indexes]))
all.equal(res2, res)

[1] TRUE

