data.table execute function on groups of columns

Question

If I have the following data table

m = matrix(1:12, ncol=4)
colnames(m) = c('A1','A2','B1','B2')
d = data.table(m)

is it possible to execute a function on sets of columns?

For example, I would like the row-wise sum of A1 and A2, and the row-wise sum of B1 and B2.

The solution would preferably work with a 500k x 100 matrix.

That storage format is not very good for working with data in R, in my opinion. Better to go to long format: melt(d, meas=patterns("A","B"), value.name=c("A","B")). From there, how to sum should be obvious if you've gone through the data.table package vignettes.
– Frank, Commented Aug 30, 2016 at 15:02
The problem is that if I have 100 columns, each with 500k rows, it becomes quite tedious to melt it.
– user680111, Commented Aug 30, 2016 at 16:02
Well, things will be even more tedious if you don't melt it, I suspect. If you really need to do a lot of across-columns stuff, maybe you ought to stick to a matrix or array (see ?array for what that means in R), where you can use rowSums and similar.
– Frank, Commented Aug 30, 2016 at 16:36
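
For reference, the long-format route from the comments could look roughly like the sketch below. The id column and the names long and res are introduced here only for illustration, and the patterns are anchored with ^ as an assumption about how the real column names are structured.

```r
library(data.table)

m <- matrix(1:12, ncol = 4)
colnames(m) <- c("A1", "A2", "B1", "B2")
d <- data.table(m)

# Add a row id (illustrative name) so the original rows can be recovered
d[, id := .I]

# Long format: one value column per column group
long <- melt(d, id.vars = "id",
             measure.vars = patterns("^A", "^B"),
             value.name = c("A", "B"))

# Sum each group within every original row
res <- long[, .(A = sum(A), B = sum(B)), by = id]
```

With 100 columns, the patterns would need to be generated from the column names rather than typed by hand, so this is only a starting point.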

setempler · Accepted Answer · 2016-08-31 06:49:56Z

Solution

A trick would be to split the column names into groups.

Then you can use rowSums as Frank suggests (see comments on question):

# using your data example
library(data.table)
m <- matrix(1:12, ncol = 4)
colnames(m) <- c('A1', 'A2', 'B1', 'B2')
d <- data.table(m)

# 1) group columns by the first letter of their names
groups <- split(colnames(d), substr(colnames(d), 1, 1))

# 2) group-wise row sums: one output column per group
d[, lapply(groups, function(i) rowSums(d[, i, with = FALSE]))]

Result

This will return the data.table:

   A  B
1: 5 17
2: 7 19
3: 9 21

Explanation

split creates a list of column names for each group, defined by a factor (or something coercible to one).
substr(colnames(d), 1, 1) takes the first letter as the group id. Use a different approach for prefixes with a variable number of letters, e.g. sub("^([A-Za-z]+).*", "\\1", colnames(d)).
lapply is commonly used to apply functions over multiple columns in a data.table. Here it creates a list output, named after the groups, containing the rowSums. with = FALSE is needed so that the value of i (a character vector of column names) is used to select the respective columns from d.
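
As a minimal sketch of the variable-length case, assuming column names consist of a letter prefix followed by digits (the table d2 and its column names are made up for illustration):

```r
library(data.table)

# Hypothetical table whose column prefixes differ in length
d2 <- data.table(AB1 = 1:3, AB2 = 4:6, C1 = 7:9, C2 = 10:12, C3 = 13:15)

# Strip trailing digits so the remaining prefix serves as the group id
prefix <- sub("[0-9]+$", "", names(d2))
groups2 <- split(names(d2), prefix)

# Row sums per group, same pattern as in the answer above
res2 <- d2[, lapply(groups2, function(cols) rowSums(d2[, cols, with = FALSE]))]
```

Groups do not need to have the same number of columns; here AB has two and C has three.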

s_baldur · Accepted Answer · 2016-08-30 15:12:25Z

Definitely possible...

d[, `:=`(A = A1 + A2, B = B1 + B2)]
d
   A1 A2 B1 B2 A  B
1:  1  4  7 10 5 17
2:  2  5  8 11 7 19
3:  3  6  9 12 9 21

# Want to drop the old columns?
set(d, j = which(names(d) %in% c("A1", "B1", "A2", "B2")), value = NULL)
d
   A  B
1: 5 17
2: 7 19
3: 9 21

Whether it is desirable I shall not tell. Probably better to follow Frank's advice (see comments).

edited Aug 30, 2016 at 15:12

answered Aug 30, 2016 at 15:05


2 Comments

eddi Over a year ago

d[, .(A = A1 + A2, B = B1 + B2)]

desval Over a year ago

I guess the point was that there are 100 columns, and therefore typing in all the combinations is not an option.
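
One way to get the same result as this answer without typing each column is to loop over groups derived from the column names. This is a sketch under the assumption that the first letter identifies the group, as in the question's example; it is not part of the original answer.

```r
library(data.table)

m <- matrix(1:12, ncol = 4)
colnames(m) <- c("A1", "A2", "B1", "B2")
d <- data.table(m)

# Derive the groups from the column names (first letter as group id)
groups <- split(names(d), substr(names(d), 1, 1))

# Add one summed column per group by reference
for (g in names(groups)) {
  d[, (g) := Reduce(`+`, .SD), .SDcols = groups[[g]]]
}

# Drop the original columns
d[, (unlist(groups, use.names = FALSE)) := NULL]
```

Reduce with `+` adds the columns element-wise, so integer columns stay integer, whereas rowSums returns doubles.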
