Creating a vector from data.table row without using apply

Question

Let's say I want to create a column in a data.table, in which the value in each row is equal to the standard deviation of the values in three other cells in the same row. E.g., if I make

DT <- data.table(a = 1:4, b = c(5, 7, 9, 11), c = c(13, 16, 19, 22), d = c(25, 29, 33, 37))
DT
   a  b  c  d
1: 1  5 13 25
2: 2  7 16 29
3: 3  9 19 33
4: 4 11 22 37

and I'd like to add a column that contains the standard deviation of a, b, and d for each row, like this:

   a  b  c  d abdSD
1: 1  5 13 23 12.86
2: 2  7 16 27 14.36
3: 3  9 19 31 15.87
4: 4 11 22 35 17.39

I could of course write a for-loop or use an apply function to calculate this. Unfortunately, what I actually want to do needs to be applied to millions of rows, isn't as simple a function as calculating a standard deviation, and needs to finish within a fraction of a second, so I really need a vectorized solution. I want to write something like

DT[, abdSD := sd(c(a, b, d))]

but unfortunately that doesn't give the right answer. Is there any data.table syntax that can create a vector out of different values within the same row, and make that vector accessible to a function populating a new cell within that row? Any help would be greatly appreciated. @Arun

It's filed here github.com/Rdatatable/data.table/issues/1063 — Frank
– Frank, Commented Apr 4, 2018 at 3:45

chinsoon12 · Accepted Answer · 2018-04-04 09:27:04Z

2

Depending on the size of your data, you might want to convert the data into a long format, then calculate the result as follows:

complexFunc <- function(x) sd(x)

cols <- c("a", "b", "d")
rowres <- melt(DT[, rn:=.I], id.vars="rn", variable.factor=FALSE)[, 
    list(abdRes=complexFunc(value[variable %chin% cols])), by=.(rn)]
DT[rowres, on=.(rn)]

or if your complex function has 3 arguments, you can do something like

DT[, abdSD := mapply(complexFunc, a, b, d)]

answered Apr 4, 2018 at 9:27

chinsoon12

25.2k4 gold badges27 silver badges35 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

SunLisa · Accepted Answer · 2018-04-04 15:47:59Z

1

As @Frank mentioned, I could avoid adding a column by doing by=1:nrow(DT)

DT[, abdSD:=sd(c(a,b,d)),by=1:nrow(DT)]

output:

   a  b  c  d    abdSD
1: 1  5 13 25 12.85820
2: 2  7 16 29 14.36431
3: 3  9 19 33 15.87451
4: 4 11 22 37 17.38774

if you add a row_name column, it would be ultra easy

DT$row_id<-row.names(DT)

Simply by=row_id, would get you the result you want

DT[, abdSD:=sd(c(a,b,d)),by=row_id]

Result would have:

   a  b  c  d row_id    abdSD
1: 1  5 13 25      1 12.85820
2: 2  7 16 29      2 14.36431
3: 3  9 19 33      3 15.87451
4: 4 11 22 37      4 17.38774

If you want row_id removed, simply adding [,row_id:=NULL]

DT[, abdSD:=sd(c(a,b,d)),by=row_id][,row_id:=NULL]

This line would get everything you want

   a  b  c  d    abdSD
1: 1  5 13 25 12.85820
2: 2  7 16 29 14.36431
3: 3  9 19 33 15.87451
4: 4 11 22 37 17.38774

You just gotta do it by row.

data.frame does it by row on default, data.table does it by column on default I think. It's a bit tricky

Hope this helps

edited Apr 4, 2018 at 15:47

answered Apr 4, 2018 at 7:50

SunLisa

1345 bronze badges

2 Comments

Frank Over a year ago

Fyi, you can do by=1:nrow(DT) without making a new column, too.

SunLisa Over a year ago

Thank you Frank! I was wondering how to do it. I did by=nrow(DT) and it didn't work. Turns out 1:nrow(DT) would

Prem · Accepted Answer · 2018-04-04 05:57:24Z

0

I think you should try matrixStats package

library(matrixStats)

#sample data
dt <- data.table(a = 1:4, b = c(5, 7, 9, 11), c = c(13, 16, 19, 22), d = c(25, 29, 33, 37))

dt[, `:=`(abdSD = rowSds(as.matrix(.SD), na.rm=T)), .SDcols=c('a','b','d')]
dt

Output is:

   a  b  c  d    abdSD
1: 1  5 13 25 12.85820
2: 2  7 16 29 14.36431
3: 3  9 19 33 15.87451
4: 4 11 22 37 17.38774

answered Apr 4, 2018 at 5:57

Prem

12k1 gold badge21 silver badges34 bronze badges

Comments

Dave · Accepted Answer · 2018-04-04 07:15:55Z

Not an answer, but just trying to show the difference between using apply and the solution provided by Prem above :

I have blown up the sample data to 40,000 rows to show solid time differences :

library(matrixStats)

#sample data
dt <- data.table(a = 1:40000, b = rep(c(5, 7, 9, 11),10000), c = rep(c(13, 16, 19, 22),10000), d = rep(c(25, 29, 33, 37),10000))
df <- data.frame(a = 1:40000, b = rep(c(5, 7, 9, 11),10000), c = rep(c(13, 16, 19, 22),10000), d = rep(c(25, 29, 33, 37),10000))

t0 = Sys.time()
dt[, `:=`(abdSD = rowSds(as.matrix(.SD), na.rm=T)), .SDcols=c('a','b','d')]
print(paste("Time taken for data table operation = ",Sys.time() - t0))
# [1] "Time taken for data table operation =  0.117115020751953"


t0 = Sys.time()
df$abdSD <- apply(df[,c("a","b","d")],1, function(x){sd(x)})
print(paste("Time taken for apply opertaion = ",Sys.time() - t0))
# [1] "Time taken for apply opertaion =  2.93488311767578"

Using DT and matrixStats clearly wins the race

eddi · Accepted Answer · 2018-04-04 16:18:05Z

0

It's not hard to vectorize the sd for this situation:

vecSD = function(x) {
  n = ncol(x)
  sqrt((n/(n-1)) * (Reduce(`+`, x*x)/n - (Reduce(`+`, x)/n)^2))
}

DT[, vecSD(.SD), .SDcols = c('a', 'b', 'd')]
#[1] 12.85820 14.36431 15.87451 17.38774

answered Apr 4, 2018 at 16:18

eddi

49.5k6 gold badges109 silver badges157 bronze badges

Collectives™ on Stack Overflow

Creating a vector from data.table row without using apply

5 Answers 5

Comments

2 Comments

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

Comments

2 Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related