
If I have the following data table

library(data.table)

m = matrix(1:12, ncol=4)
colnames(m) = c('A1','A2','B1','B2')
d = data.table(m)

is it possible to execute a function on sets of columns?

For example, the following would be the row-wise sums of A1 + A2 and of B1 + B2:

   A  B
1: 5 17
2: 7 19
3: 9 21

The solution should preferably also work with a 500k x 100 matrix.

  • That storage format is not very good for working with data in R, in my opinion. Better to go to long format: melt(d, meas=patterns("A","B"), value.name=c("A","B")) From there, how to sum should be obvious if you've gone through the data.table package vignettes. Commented Aug 30, 2016 at 15:02
  • The problem is that if I have 100 columns, each with 500k rows, it becomes quite tedious to melt it. Commented Aug 30, 2016 at 16:02
  • Well, things will be even more tedious if you don't melt it, I suspect. If you really need to do a lot of across-columns stuff, maybe you ought to stick to a matrix or array (see ?array for what that means in R), where you can use rowSums and similar. Commented Aug 30, 2016 at 16:36

2 Answers


Solution

A trick would be to split the columns into groups.

Then you can use rowSums as Frank suggests (see comments on question):

# using your data example
library(data.table)
m <- matrix(1:12, ncol = 4)
colnames(m) <- c('A1', 'A2', 'B1', 'B2')
d <- data.table(m)

# 1) group columns by the first letter of their name
groups <- split(colnames(d), substr(colnames(d), 1, 1))

# 2) group-wise row sums: one output column per group
d[, lapply(groups, function(i) rowSums(d[, i, with = FALSE]))]

Result

This will return the data.table:

   A  B
1: 5 17
2: 7 19
3: 9 21

Explanation

  • split creates a list of column names for each group, defined by a factor (or something coercible to one).
  • substr(colnames(d), 1, 1) takes the first letter as the group id. If the prefix can have a variable number of letters, use a different approach, e.g. sub("^([A-Z]+).*", "\\1", colnames(d)).
  • lapply is commonly used to apply functions over multiple columns in a data.table. Here we create a list output, named after the groups, containing the rowSums. with = FALSE is needed so that i is treated as a character vector of column names and the respective columns are selected from d.
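
As a rough sketch of the variable-length prefix case from the explanation above: the column names AB1, AB2, C1, C2 and the object d2 are made up for illustration, and the names are assumed to be a letter prefix followed by digits. Reduce with `+` is swapped in for rowSums here; it adds the columns one by one, so it skips the matrix conversion that rowSums does on a data.frame, which may matter at 500k x 100. Note that, unlike rowSums, it has no na.rm option.

```r
library(data.table)

# hypothetical data with multi-letter prefixes (not from the question)
d2 <- data.table(AB1 = 1:3, AB2 = 4:6, C1 = 7:9, C2 = 10:12)

# strip the trailing digits to get the group id: "AB1" -> "AB", "C2" -> "C"
prefix <- sub("[0-9]+$", "", names(d2))
groups <- split(names(d2), prefix)

# for each group, add its columns element-wise
res <- as.data.table(
  lapply(groups, function(cols) Reduce(`+`, d2[, cols, with = FALSE]))
)
res
```

The result has one column per prefix (AB and C here), in the order that split returns the groups.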

Definitely possible...

d[, `:=`(A = A1 + A2, B = B1 + B2)]
d
   A1 A2 B1 B2 A  B
1:  1  4  7 10 5 17
2:  2  5  8 11 7 19
3:  3  6  9 12 9 21

# Want to drop the old columns?
set(d, j = which(names(d) %in% c("A1", "B1", "A2", "B2")), value = NULL)
d
   A  B
1: 5 17
2: 7 19
3: 9 21

Whether it is desirable I shall not tell. Probably better to follow Frank's advice (see comments).
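
For reference, Frank's long-format suggestion might look roughly like the following. This is only a sketch: the id column and the names long and res are my own additions, and with many groups you would need one pattern per group.

```r
library(data.table)

m <- matrix(1:12, ncol = 4)
colnames(m) <- c("A1", "A2", "B1", "B2")
d <- as.data.table(m)

# row id (an assumption, not in the question) so rows can be regrouped after melting
d[, id := .I]

# one value column per pattern: all A* columns go to "A", all B* columns to "B"
long <- melt(d, id.vars = "id",
             measure.vars = patterns("^A", "^B"),
             value.name = c("A", "B"))

# sum within each original row
res <- long[, .(A = sum(A), B = sum(B)), keyby = id]
res
```

Once the data is in long format, other per-group summaries (mean, max, ...) only need a change to the last line.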

2 Comments

d[, .(A = A1 + A2, B = B1 + B2)]
I guess the point was that there are 100 columns, so typing in all the combinations is not an option.
