dplyr – get certain summary statistics for multiple columns of a dataframe

Question

I want to create a summary statistics table for some summary functions for multiple variables. I've managed to do it using summarise and across, but I get a wide dataframe which is hard to read. Is there a better alternative (perhaps using purrr), or is there an easy way of reshaping the data?

Here is a reproducible example (the funs list contains additional functions I've created myself):

data <- as.data.frame(cbind(estimator1 = rnorm(3), 
                            estimator2 = runif(3)))
funs <- list(mean = mean, median = median)

If I use summarise and across I obtain:

estimator1_mean estimator1_median estimator2_mean estimator2_median
      0.9506083          1.138536       0.5789924         0.7598719
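The question does not show the call that produced this output. A plausible reconstruction, assuming dplyr 1.0 or later and R 4.1 or later (for the native pipe), might look like the sketch below. The seed is an addition for reproducibility, so the numbers will not match the ones above.

```r
library(dplyr)

set.seed(1)  # assumption: not in the original post, added so the result is repeatable
data <- data.frame(estimator1 = rnorm(3), estimator2 = runif(3))
funs <- list(mean = mean, median = median)

# across() applies every function in `funs` to every column and, by default,
# names each result "{.col}_{.fn}", which is what makes the output wide.
wide <- data |>
  summarise(across(everything(), funs))

names(wide)
```

Each variable/statistic pair becomes its own column, so the result is always a single row that grows wider with every added function.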

What I would like to obtain is:

         estimator1 estimator2
mean     0.9506083  0.5789924        
median   1.138536   0.7598719

You could do: tidyr::pivot_longer(df, everything(), names_sep = "_", names_to = c(".value", "metric")) after summarise/across.
– harre, Commented Mar 8, 2023 at 10:32

TarJae · Accepted Answer · 2023-03-08 17:15:49Z

2

base R approach:

Using sapply:

sapply(data, \(x) sapply(funs, \(f) f(x) )) nests two sapply() calls. The outer call iterates over the columns of data, passing each column in as x; the inner call iterates over funs and applies each function f to that column.

Both \(x) and \(f) are anonymous functions written with the backslash shorthand available since R 4.1.0: \(x) is equivalent to function(x), so \(f) f(x) is a one-argument function that calls its argument f on x.

With funs <- list(mean = mean, median = median), the call applies mean() and median() to each column of data and returns a matrix with one row per function and one column per variable:

sapply(data, \(x) sapply(funs, \(f) f(x) ))

       estimator1 estimator2
mean    0.3081365  0.4251447
median  0.2159416  0.3198206
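To make the nesting easier to follow, the same computation can be unrolled with ordinary named functions. This is only an illustrative sketch: the helper name apply_all and the seed are assumptions that do not appear in the answer.

```r
set.seed(42)  # assumption: added for reproducibility
data <- data.frame(estimator1 = rnorm(3), estimator2 = runif(3))
funs <- list(mean = mean, median = median)

# Inner step: given one column x, call every function in `funs` on it.
# function(f) f(x) is the long form of \(f) f(x).
apply_all <- function(x) {
  sapply(funs, function(f) f(x))
}

# Outer step: run the inner step once per column of `data`.
res_named <- sapply(data, apply_all)

# The compact version from the answer gives the same matrix.
res_lambda <- sapply(data, \(x) sapply(funs, \(f) f(x)))
stopifnot(identical(res_named, res_lambda))

res_named
```

The inner sapply() returns a named vector (one entry per function), and the outer sapply() binds those vectors together as columns, which is why the functions end up as row names.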

edited Mar 8, 2023 at 17:15

answered Mar 8, 2023 at 10:45


1 Comment

Giacomo Oliva Over a year ago

Thank you, this is also very clean. Could you please elaborate on the '\(x)' and the '\(f) f(x)'? I don't understand intuitively what is going on.

Julian · Accepted Answer · 2023-03-08 10:32:44Z

1

You can use pivot_longer() with the special ".value" entry in names_to. ".value" indicates that the corresponding component of the column name defines the name of the output column containing the cell values, overriding values_to entirely (see the pivot_longer() documentation). For example:

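  library(dplyr)

  data |>
    # one column per variable/statistic pair, e.g. estimator1_mean
    summarise(across(everything(), list(mean = mean, median = median, var = var))) |>
    # split each name at "_": the first part names the output column, the second fills "stats"
    tidyr::pivot_longer(cols = everything(), names_to = c(".value", "stats"), names_sep = "_")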

  stats  estimator1 estimator2
  <chr>       <dbl>      <dbl>
1 mean        0.221    0.448  
2 median      0.110    0.429  
3 var         0.770    0.00288
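One caveat with names_sep = "_" is that it assumes the original column names contain no underscores. A possible workaround, sketched here with hypothetical column names est_a and est_b and an added seed, is to use names_pattern with a regular expression that splits only at the last underscore:

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: added for reproducibility
# Hypothetical data whose column names already contain "_"
df <- data.frame(est_a = rnorm(5), est_b = runif(5))

long <- df |>
  summarise(across(everything(), list(mean = mean, median = median))) |>
  pivot_longer(
    cols = everything(),
    names_to = c(".value", "stats"),
    # first group: everything up to the last "_"; second group: the function name
    names_pattern = "^(.*)_([^_]+)$"
  )

long
```

This relies on the function names in the list passed to across() not containing underscores themselves.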

answered Mar 8, 2023 at 10:32
