0

I have a data frame with many columns, some of which are measure variables. I would like to extract a bunch of summary statistics from the latter using data.table. My problem is the following: how to rename the aggregated columns according to the function that was used?

I want to have an aggregated data.table with column names like: c("measure1_mean", "measure1_sd", "measure2_mean", "measure2_sd", ...)

My code looks like this:

library(data.table)
library(stringr)

dt <- data.table(meas1=1:10,
                 meas2=seq(5,25, length.out = 10),
                 meas3=rnorm(10),
                 groupvar=rep(LETTERS[1:5], each=2))
measure_cols <- colnames(dt)[str_detect(colnames(dt), "^meas")]
dt_agg <- dt[, c(lapply(.SD, mean),
                 lapply(.SD, sd)),
               by=groupvar, .SDcols = measure_cols]

# Does not work because of duplicates in rep(measure_cols, 3)
agg_names <- c(measure_cols, paste(rep(c("mean", "sd"), each=length(measure_cols)), measure_cols, sep="_"))
setnames(dt_agg, rep(measure_cols,3), agg_names)

This chunk effectively extracts the statistics but returns columns with identical names. Therefore I cannot use something like setnames(dt, old, new) because duplicates exist in my 'old' vector.

I came across this post: Rename aggregated columns using data.table in R. But I do not like the accepted solution because it relies on column index, and not names, to rename the columns.

4
  • 3
    Please show a small reproducible example Commented Nov 28, 2018 at 9:29
  • 1
    I think these may get you going: Calculate multiple aggregations with lapply(.SD, …) and Apply multiple functions to multiple columns Commented Nov 28, 2018 at 9:41
  • as per first link given by @Henrik , dt[, as.list(unlist(lapply(.SD, function(meas) list(mean=mean(meas), sd=sd(meas))))), by=groupvar, .SDcols = measure_cols] gives you directly what you're after Commented Nov 28, 2018 at 10:00
  • @Henrik thank you for pointing me to relevant posts. I find the syntax rather counter-intuitive but it works like a charm! Commented Nov 28, 2018 at 14:38

1 Answer 1

0
library(data.table)

dt <- data.table(meas1=1:10,
                 meas2=seq(5,25, length.out = 10),
                 meas3=rnorm(10),
                 groupvar=rep(LETTERS[1:5], each=2))
measure_cols <- colnames(dt)[str_detect(colnames(dt), "^meas")]
dt_agg <- dt[, c(lapply(.SD, mean),
                 lapply(.SD, sd)),
             by=groupvar, .SDcols = measure_cols]

you can create a vector with names... use the each argument to paste the function,names behind the measure_cols.

function.names <- c("mean", "sd")
column.names <- paste0( measure_cols, "_", rep( function.names, each = length( measure_cols ) ) )
setnames( dt_agg, c("groupvar", column.names ))

#    groupvar meas1_mean meas2_mean meas3_mean  meas1_sd meas2_sd  meas3_sd
# 1:        A        1.5   6.111111  0.2346044 0.7071068 1.571348 1.6733804
# 2:        B        3.5  10.555556  0.5144621 0.7071068 1.571348 0.0894364
# 3:        C        5.5  15.000000 -0.5469839 0.7071068 1.571348 2.1689620
# 4:        D        7.5  19.444444 -0.3898213 0.7071068 1.571348 1.0007027
# 5:        E        9.5  23.888889  0.5569743 0.7071068 1.571348 1.4499413
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for your answer! However this still relies on the assumption that your columns are ordered this way, which I would specifically like to avoid.
@majpark you determine the order of columns in the lines where you calculate the mean, sd, etc... so they will always line up with a vector created with the same order of functions...

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.