3

Let's say I want to create a column in a data.table, in which the value in each row is equal to the standard deviation of the values in three other cells in the same row. E.g., if I make

DT <- data.table(a = 1:4, b = c(5, 7, 9, 11), c = c(13, 16, 19, 22), d = c(25, 29, 33, 37))
DT
   a  b  c  d
1: 1  5 13 25
2: 2  7 16 29
3: 3  9 19 33
4: 4 11 22 37

and I'd like to add a column that contains the standard deviation of a, b, and d for each row, like this:

   a  b  c  d abdSD
1: 1  5 13 23 12.86
2: 2  7 16 27 14.36
3: 3  9 19 31 15.87
4: 4 11 22 35 17.39

I could of course write a for-loop or use an apply function to calculate this. Unfortunately, what I actually want to do needs to be applied to millions of rows, isn't as simple a function as calculating a standard deviation, and needs to finish within a fraction of a second, so I really need a vectorized solution. I want to write something like

DT[, abdSD := sd(c(a, b, d))]

but unfortunately that doesn't give the right answer. Is there any data.table syntax that can create a vector out of different values within the same row, and make that vector accessible to a function populating a new cell within that row? Any help would be greatly appreciated. @Arun

1

5 Answers 5

2

Depending on the size of your data, you might want to convert the data into a long format, then calculate the result as follows:

complexFunc <- function(x) sd(x)

cols <- c("a", "b", "d")
rowres <- melt(DT[, rn:=.I], id.vars="rn", variable.factor=FALSE)[, 
    list(abdRes=complexFunc(value[variable %chin% cols])), by=.(rn)]
DT[rowres, on=.(rn)]

or if your complex function has 3 arguments, you can do something like

DT[, abdSD := mapply(complexFunc, a, b, d)]
Sign up to request clarification or add additional context in comments.

Comments

1

As @Frank mentioned, I could avoid adding a column by doing by=1:nrow(DT)

DT[, abdSD:=sd(c(a,b,d)),by=1:nrow(DT)]

output:

   a  b  c  d    abdSD
1: 1  5 13 25 12.85820
2: 2  7 16 29 14.36431
3: 3  9 19 33 15.87451
4: 4 11 22 37 17.38774

if you add a row_name column, it would be ultra easy

DT$row_id<-row.names(DT)

Simply by=row_id, would get you the result you want

DT[, abdSD:=sd(c(a,b,d)),by=row_id]

Result would have:

   a  b  c  d row_id    abdSD
1: 1  5 13 25      1 12.85820
2: 2  7 16 29      2 14.36431
3: 3  9 19 33      3 15.87451
4: 4 11 22 37      4 17.38774

If you want row_id removed, simply adding [,row_id:=NULL]

DT[, abdSD:=sd(c(a,b,d)),by=row_id][,row_id:=NULL]

This line would get everything you want

   a  b  c  d    abdSD
1: 1  5 13 25 12.85820
2: 2  7 16 29 14.36431
3: 3  9 19 33 15.87451
4: 4 11 22 37 17.38774

You just gotta do it by row.

data.frame does it by row on default, data.table does it by column on default I think. It's a bit tricky

Hope this helps

2 Comments

Fyi, you can do by=1:nrow(DT) without making a new column, too.
Thank you Frank! I was wondering how to do it. I did by=nrow(DT) and it didn't work. Turns out 1:nrow(DT) would
0

I think you should try matrixStats package

library(matrixStats)

#sample data
dt <- data.table(a = 1:4, b = c(5, 7, 9, 11), c = c(13, 16, 19, 22), d = c(25, 29, 33, 37))

dt[, `:=`(abdSD = rowSds(as.matrix(.SD), na.rm=T)), .SDcols=c('a','b','d')]
dt

Output is:

   a  b  c  d    abdSD
1: 1  5 13 25 12.85820
2: 2  7 16 29 14.36431
3: 3  9 19 33 15.87451
4: 4 11 22 37 17.38774

Comments

0

Not an answer, but just trying to show the difference between using apply and the solution provided by Prem above :

I have blown up the sample data to 40,000 rows to show solid time differences :

library(matrixStats)

#sample data
dt <- data.table(a = 1:40000, b = rep(c(5, 7, 9, 11),10000), c = rep(c(13, 16, 19, 22),10000), d = rep(c(25, 29, 33, 37),10000))
df <- data.frame(a = 1:40000, b = rep(c(5, 7, 9, 11),10000), c = rep(c(13, 16, 19, 22),10000), d = rep(c(25, 29, 33, 37),10000))

t0 = Sys.time()
dt[, `:=`(abdSD = rowSds(as.matrix(.SD), na.rm=T)), .SDcols=c('a','b','d')]
print(paste("Time taken for data table operation = ",Sys.time() - t0))
# [1] "Time taken for data table operation =  0.117115020751953"


t0 = Sys.time()
df$abdSD <- apply(df[,c("a","b","d")],1, function(x){sd(x)})
print(paste("Time taken for apply opertaion = ",Sys.time() - t0))
# [1] "Time taken for apply opertaion =  2.93488311767578"

Using DT and matrixStats clearly wins the race

Comments

0

It's not hard to vectorize the sd for this situation:

vecSD = function(x) {
  n = ncol(x)
  sqrt((n/(n-1)) * (Reduce(`+`, x*x)/n - (Reduce(`+`, x)/n)^2))
}

DT[, vecSD(.SD), .SDcols = c('a', 'b', 'd')]
#[1] 12.85820 14.36431 15.87451 17.38774

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.