21

How to create simple summary statistics using dplyr from multiple variables? Using the summarise_each function seems to be the way to go, however, when applying multiple functions to multiple columns, the result is a wide, hard-to-read data frame.

9 Answers

37

Use dplyr in combination with tidyr to reshape the end result.

library(dplyr)
library(tidyr)

df <- tbl_df(mtcars)

df.sum <- df %>%
  select(mpg, cyl, vs, am, gear, carb) %>% # select variables to summarise
  summarise_each(funs(min = min, 
                      q25 = quantile(., 0.25), 
                      median = median, 
                      q75 = quantile(., 0.75), 
                      max = max,
                      mean = mean, 
                      sd = sd))

# the result is a wide data frame
> dim(df.sum)
[1]  1 42

# reshape it using tidyr functions

df.stats.tidy <- df.sum %>% gather(stat, val) %>%
  separate(stat, into = c("var", "stat"), sep = "_") %>%
  spread(stat, val) %>%
  select(var, min, q25, median, q75, max, mean, sd) # reorder columns

> print(df.stats.tidy)

   var  min    q25 median  q75  max     mean        sd
1   am  0.0  0.000    0.0  1.0  1.0  0.40625 0.4989909
2 carb  1.0  2.000    2.0  4.0  8.0  2.81250 1.6152000
3  cyl  4.0  4.000    6.0  8.0  8.0  6.18750 1.7859216
4 gear  3.0  3.000    4.0  4.0  5.0  3.68750 0.7378041
5  mpg 10.4 15.425   19.2 22.8 33.9 20.09062 6.0269481
6   vs  0.0  0.000    0.0  1.0  1.0  0.43750 0.5040161
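
For newer dplyr releases, where summarise_each and funs are deprecated, the same table can be sketched with across() and pivot_longer(). This is a minimal sketch rather than the original author's code: the double-underscore separator in `.names` is an assumption chosen so that variable names containing a single underscore still split correctly (it would still fail on names that themselves contain or end in `__`).

```r
library(dplyr)
library(tidyr)

# Named list of summary functions; quantile needs a lambda to pass the probability
stats <- list(
  min    = min,
  q25    = ~ quantile(.x, 0.25),
  median = median,
  q75    = ~ quantile(.x, 0.75),
  max    = max,
  mean   = mean,
  sd     = sd
)

df.stats.tidy <- mtcars %>%
  summarise(across(
    c(mpg, cyl, vs, am, gear, carb),
    stats,
    .names = "{.col}__{.fn}"  # assumed separator: "__" between variable and statistic
  )) %>%
  pivot_longer(
    everything(),
    names_to  = c("var", ".value"),  # ".value" turns each statistic into its own column
    names_sep = "__"
  )

print(df.stats.tidy)
```

The result has one row per variable and one column per statistic, so no gather/separate/spread round trip is needed.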

4 Comments

I prefer this solution over the stargazer one because it returns a data.frame that I can then use further. Unfortunately, I'm running into issues because my variable names have underscores in them, which breaks the separate step of the pipe. I looked for a parameter in separate that would split on only the rightmost underscore, but had no luck. Any advice on how to generalize this so it does not choke on variables with underscores?
Using summarise_each now throws a warning that it's deprecated and summarise_all is the new function for this kind of use case. dplyr 0.7.8
@hannes101 Even summarise_all has been superseded by across. Check this
Not only is summarise_all deprecated, but also funs, hence it needs to look like summarise(across(varlist, list(min = min, q25 = ~quantile(., 0.25), median = median, q75 = ~quantile(., 0.75), max = max, mean = mean, sd = sd))) using the ~ notation. Nevertheless, @konrad's solution below is even better than this.
23

A potentially easy solution could be created with broom::tidy and purrr::map_df. broom::tidy summarises key objects from statistical output into a tibble. purrr::map_df applies a function to each element, in this case a column, and returns a tibble.

Example

library(tidyverse)
mtcars %>% 
    select(mpg, cyl, vs, am, gear, carb) %>% 
    map_df(.f = ~ broom::tidy(summary(.x)), .id = "variable")

Results

# A tibble: 6 x 7
# variable minimum    q1 median   mean    q3 maximum
# <chr>      <dbl> <dbl>  <dbl>  <dbl> <dbl>   <dbl>
# 1 mpg         10.4  15.4   19.2 20.1    22.8    33.9
# 2 cyl          4     4      6    6.19    8       8  
# 3 vs           0     0      0    0.438   1       1  
# 4 am           0     0      0    0.406   1       1  
# 5 gear         3     3      4    3.69    4       5  
# 6 carb         1     2      2    2.81    4       8  

5 Comments

Very nice solution.
Is it possible to extend your solution that it only includes mean, sd and n?
@Matthew You could do it a number of ways. Creating your own version of the summary function may look clean if you want to pack more transformations into that step. To apply only those functions, across will likely provide the cleanest solution. You would be looking to do something along the lines of summarise(across(starts_with("Sepal"), list(mean = mean, sd = sd))), reflecting your functions, as in the provided examples. dplyr's n doesn't take any arguments, so you will have to derive your count using a different method or wrap n to drop the argument.
Just in case this updated syntax helps someone (although I really like the summarytools::descr() example): ``` library( tidyverse ); df |> summarize( across( where( is.numeric ), .fns = list( min = min, q1 = ~ quantile( .x, 0.25 ), median = ~ median( .x ), mean = ~ mean( .x ), q3 = ~ quantile( .x, 0.75 ), max = max )) ) # Summary data frame ```
summary_stats <- df %>% group_by(col_x) %>% summarise(across(all_of(cols_list), list(mean = ~mean(.x, na.rm = TRUE), sd = ~sd(.x, na.rm = TRUE), median = ~median(.x, na.rm = TRUE), var = ~var(.x, na.rm = TRUE), max = ~max(.x, na.rm = TRUE), min = ~min(.x, na.rm = TRUE), iqr = ~IQR(.x, na.rm = TRUE), q25 = ~quantile(.x, 0.25, na.rm = TRUE), q75 = ~quantile(.x, 0.75, na.rm = TRUE)))) summary_stats <- as.data.frame(t(summary_stats))
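
The request in the comments for only mean, sd and n can be sketched by swapping summary() for a small custom function. The helper name `mean_sd_n` is hypothetical, and the count here is taken as the number of non-missing values, which is one possible reading of "n".

```r
library(dplyr)
library(purrr)

# Hypothetical helper: returns a one-row tibble with the three requested statistics
mean_sd_n <- function(x) {
  tibble(
    mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    n    = sum(!is.na(x))  # count of non-missing values
  )
}

res <- mtcars %>%
  select(mpg, cyl, vs, am, gear, carb) %>%
  map(mean_sd_n) %>%               # one small tibble per column
  bind_rows(.id = "variable")      # stack them, keeping the column name

print(res)
```

Using map() followed by bind_rows() gives the same shape as map_df() while also working on purrr versions where map_df() is superseded.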
19

I liked paljenczy's idea of just using dplyr/tidyr and getting the table into a data.frame/tibble before formatting it. But I ran into robustness issues: because it relies on parsing variable names, it choked on columns with underscores in their names. After trying to fix this within the dplyr framework, it seemed like it would always be somewhat fragile because it relied on string parsing.

So in the end I decided on using psych::describe() which is a function designed for exactly this thing. It doesn't do completely arbitrary functions, but pretty much anything one would realistically want to do. A full example duplicating the previous solutions is included below, combining psych::describe() with some tidyverse stuff to get the exact tibble we are looking for.

It is worth noting that this answer has been updated to reflect the changed behavior of as_tibble() with regards to how it handles rownames in data.frames:

library(psych)
library(tidyverse)

# Create an extended version with a bunch of stats 
d.summary.extended <- mtcars %>%
    select(mpg, cyl, vs, am, gear, carb) %>%
    psych::describe(quant=c(.25,.75)) %>%
    as_tibble(rownames="rowname")  %>%
    print()

<OUTPUT>
# A tibble: 6 x 16
  rowname  vars     n     mean        sd median    trimmed     mad   min   max range       skew  kurtosis         se  Q0.25 Q0.75
    <chr> <int> <dbl>    <dbl>     <dbl>  <dbl>      <dbl>   <dbl> <dbl> <dbl> <dbl>      <dbl>     <dbl>      <dbl>  <dbl> <dbl>
1     mpg     1    32 20.09062 6.0269481   19.2 19.6961538 5.41149  10.4  33.9  23.5  0.6106550 -0.372766 1.06542396 15.425  22.8
2     cyl     2    32  6.18750 1.7859216    6.0  6.2307692 2.96520   4.0   8.0   4.0 -0.1746119 -1.762120 0.31570933  4.000   8.0
3      vs     3    32  0.43750 0.5040161    0.0  0.4230769 0.00000   0.0   1.0   1.0  0.2402577 -2.001938 0.08909831  0.000   1.0
4      am     4    32  0.40625 0.4989909    0.0  0.3846154 0.00000   0.0   1.0   1.0  0.3640159 -1.924741 0.08820997  0.000   1.0
5    gear     5    32  3.68750 0.7378041    4.0  3.6153846 1.48260   3.0   5.0   2.0  0.5288545 -1.069751 0.13042656  3.000   4.0
6    carb     6    32  2.81250 1.6152000    2.0  2.6538462 1.48260   1.0   8.0   7.0  1.0508738  1.257043 0.28552971  2.000   4.0
</OUTPUT>

# Select stats for comparison with other solutions
d.summary <- d.summary.extended %>%
    select(var=rowname, min, q25=Q0.25, median, q75=Q0.75, max, mean, sd) %>%
    print()

<OUTPUT>
# A tibble: 6 x 8
    var   min    q25 median   q75   max     mean        sd
  <chr> <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>     <dbl>
1   mpg  10.4 15.425   19.2  22.8  33.9 20.09062 6.0269481
2   cyl   4.0  4.000    6.0   8.0   8.0  6.18750 1.7859216
3    vs   0.0  0.000    0.0   1.0   1.0  0.43750 0.5040161
4    am   0.0  0.000    0.0   1.0   1.0  0.40625 0.4989909
5  gear   3.0  3.000    4.0   4.0   5.0  3.68750 0.7378041
6  carb   1.0  2.000    2.0   4.0   8.0  2.81250 1.6152000    
</OUTPUT>

3 Comments

Thanks, your answer is great, although it does not show the variable names; it reports the column numbers instead.
@Amleto, I have updated the answer and it should now work again. The problem was that the behavior of as_tibble() was modified in a recent release of the tidyverse so now the default behavior seems to be to drop rownames. I've now specified in the example that the rowname should be included in the rowname variable in the resulting tibble (using as_tibble(rownames="rowname")).
This worked great for me. The "correct" answer almost worked for me but I ran into the same problem someone else did, having to do with variable names having underscores in them. This answer was much simpler in my use case.
14

If you want to create a summary table for publication (not for further calculations) you may want to look at the excellent stargazer package.

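library(stargazer)

df <- data.frame(mtcars)
cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
# Note: "median" is requested twice below, so the table repeats it;
# swap the second one for "mean" to report the mean instead.
stargazer(
    df[, cols], type = "text", 
    summary.stat = c("min", "p25", "median", "p75", "max", "median", "sd")
)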

================================================================
Statistic  Min   Pctl(25) Median Pctl(75)  Max   Median St. Dev.
----------------------------------------------------------------
mpg       10.400  15.430  19.200  22.800  33.900 19.200  6.027
cyl         4       4       6       8       8      6     1.786
vs          0       0       0       1       1      0     0.504
am          0       0       0       1       1      0     0.499
gear        3       3       4       4       5      4     0.738
carb        1       2       2       4       8      2     1.615
----------------------------------------------------------------

You can also change type to 'latex' or 'html', and save the table to a file by passing the file path in the 'out' argument.
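
As a minimal sketch of writing the table to disk, the call below switches to HTML output and passes a path through `out`. The file name and the use of tempdir() are assumptions for illustration only, and "mean" is used here in place of the repeated "median".

```r
library(stargazer)

cols <- c("mpg", "cyl", "vs", "am", "gear", "carb")

# Hypothetical output location; replace with a path of your own
out_file <- file.path(tempdir(), "mtcars_summary.html")

stargazer(
  mtcars[, cols],
  type = "html",
  summary.stat = c("min", "p25", "median", "p75", "max", "mean", "sd"),
  out = out_file  # the table is printed to the console and also written to this file
)
```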


4

There's a "new" package called skimr that has a function called skim() that gives wonderful output describing individual variables in a data.frame/tibble.

Try:

skimr::skim(mtcars)

and you'll get:

── Data Summary ────────────────────────
                           Values
Name                       mtcars
Number of rows             32    
Number of columns          11    
_______________________          
Column type frequency:           
  numeric                  11    
________________________         
Group variables            None  

── Variable type: numeric ───────────────────────────────────────────────────────────────────────────
   skim_variable n_missing complete_rate    mean      sd    p0    p25    p50    p75   p100 hist 
 1 mpg                   0             1  20.1     6.03  10.4   15.4   19.2   22.8   33.9  ▃▇▅▁▂
 2 cyl                   0             1   6.19    1.79   4      4      6      8      8    ▆▁▃▁▇
 3 disp                  0             1 231.    124.    71.1  121.   196.   326    472    ▇▃▃▃▂
 4 hp                    0             1 147.     68.6   52     96.5  123    180    335    ▇▇▆▃▁
 5 drat                  0             1   3.60    0.535  2.76   3.08   3.70   3.92   4.93 ▇▃▇▅▁
 6 wt                    0             1   3.22    0.978  1.51   2.58   3.32   3.61   5.42 ▃▃▇▁▂
 7 qsec                  0             1  17.8     1.79  14.5   16.9   17.7   18.9   22.9  ▃▇▇▂▁
 8 vs                    0             1   0.438   0.504  0      0      0      1      1    ▇▁▁▁▆
 9 am                    0             1   0.406   0.499  0      0      0      1      1    ▇▁▁▁▆
10 gear                  0             1   3.69    0.738  3      3      4      4      5    ▇▁▆▁▂
11 carb                  0             1   2.81    1.62   1      2      2      4      8    ▇▂▅▁▁

It is customizable and works well with pipes. See ?skimr::skim() and vignette("Using_skimr", package = "skimr").
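
A short sketch of that customization, assuming skimr version 2 or later: skim_with() builds a custom skimmer and sfl() names the extra statistics. The name `my_skim` and the choice of IQR as the added statistic are illustrative only.

```r
library(dplyr)
library(skimr)

# Custom skimmer: keeps the default numeric statistics and appends the IQR
my_skim <- skim_with(numeric = sfl(iqr = IQR))

res <- mtcars %>%
  select(mpg, cyl, vs, am, gear, carb) %>%
  my_skim()

print(res)
```

The returned object is still a data frame underneath, so its columns (prefixed with the variable type, such as numeric.iqr) can be used in further dplyr steps.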

1 Comment

I like skimr too.
2

Similar to the accepted answer, but tidied up a bit into a function:

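library(dplyr)
library(tidyr)

summarise_continuous <- function(d, cvars) {
  d %>%
    select(all_of(cvars)) %>%
    mutate(across(everything(), as.numeric)) %>%  # coerce the selected columns to numeric
    summarise(across(all_of(cvars), list(N = ~sum(!is.na(.x)), 
                                         mean = ~mean(.x, na.rm = TRUE), 
                                         sd = ~sd(.x, na.rm = TRUE), 
                                         median = ~median(.x, na.rm = TRUE),
                                         min = ~min(.x, na.rm = TRUE),
                                         max = ~max(.x, na.rm = TRUE)))) %>% 
    # the greedy first group splits each name at its last underscore,
    # so variable names that contain underscores are handled
    pivot_longer(everything(), 
                 names_to = c("variable",".value"),
                 names_pattern = "(.+)_(.+)") # %>%
    # knitr::kable()
    # uncomment these bits if you want a nicely formatted table in a .Rmd document
}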

summarise_continuous(mtcars, c("mpg", "cyl", "vs", "am", "gear", "carb"))


1

You can achieve the same result using data.table as well. You might consider using it if your table is big.

library(data.table)

dt <- data.table(mtcars)

cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
functions <- c('min', 'q25', 'median', 'q75', 'max', 'mean', 'sd')

dt.sum <- dt[
    , 
    lapply(
        .SD, 
        function(x) list(
                min(x), quantile(x, 0.25), median(x), 
                quantile(x, 0.75), max(x), mean(x), sd(x)
        )
    ),
    .SDcols = cols
]

dt.sum
     mpg   cyl     vs     am   gear  carb
1:  10.4     4      0      0      3     1
2: 15.43     4      0      0      3     2
3:  19.2     6      0      0      4     2
4:  22.8     8      1      1      4     4
5:  33.9     8      1      1      5     8
6: 20.09 6.188 0.4375 0.4062  3.688 2.812
7: 6.027 1.786  0.504  0.499 0.7378 1.615

# transpose and provide meaningful names
dt.sum.t <- as.data.table(t(dt.sum))[]
setnames(dt.sum.t, names(dt.sum.t), functions)
dt.sum.t[, var := cols]
setcolorder(dt.sum.t, c("var", functions))

dt.sum.t
    var  min   q25 median  q75  max   mean     sd
1:  mpg 10.4 15.43   19.2 22.8 33.9  20.09  6.027
2:  cyl    4     4      6    8    8  6.188  1.786
3:   vs    0     0      0    1    1 0.4375  0.504
4:   am    0     0      0    1    1 0.4062  0.499
5: gear    3     3      4    4    5  3.688 0.7378
6: carb    1     2      2    4    8  2.812  1.615
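
An alternative sketch that avoids the transpose step: melt the selected columns to long form and compute the statistics by variable. This swaps in a different technique from the answer above, and the names `dt.long`, `dt.stats`, `var` and `val` are chosen here for illustration.

```r
library(data.table)

dt <- as.data.table(mtcars)
cols <- c("mpg", "cyl", "vs", "am", "gear", "carb")

# Long format: one row per (variable, value) pair
dt.long <- melt(dt[, ..cols], measure.vars = cols,
                variable.name = "var", value.name = "val")

# One row per variable, one column per statistic
dt.stats <- dt.long[, .(
  min    = min(val),
  q25    = quantile(val, 0.25),
  median = median(val),
  q75    = quantile(val, 0.75),
  max    = max(val),
  mean   = mean(val),
  sd     = sd(val)
), by = var]

print(dt.stats)
```

Because the statistics are named inside `.()`, the columns keep their numeric type and no renaming or reordering is required afterwards.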


1

Or, if you want a one-line solution, you can combine dplyr's select with descr() from the package summarytools:

library(dplyr); library(summarytools)
data <- mtcars
data %>% select(mpg, cyl, vs, am, gear, carb) %>% descr()

which results in:

Descriptive Statistics  

                        am     carb      cyl     gear      mpg       vs

             Mean     0.41     2.81     6.19     3.69    20.09     0.44
          Std.Dev     0.50     1.62     1.79     0.74     6.03     0.50
              Min     0.00     1.00     4.00     3.00    10.40     0.00
               Q1     0.00     2.00     4.00     3.00    15.35     0.00
           Median     0.00     2.00     6.00     4.00    19.20     0.00
               Q3     1.00     4.00     8.00     4.00    22.80     1.00
              Max     1.00     8.00     8.00     5.00    33.90     1.00
              MAD     0.00     1.48     2.97     1.48     5.41     0.00
              IQR     1.00     2.00     4.00     1.00     7.38     1.00
               CV     1.23     0.57     0.29     0.20     0.30     1.15
         Skewness     0.36     1.05    -0.17     0.53     0.61     0.24
      SE.Skewness     0.41     0.41     0.41     0.41     0.41     0.41
         Kurtosis    -1.92     1.26    -1.76    -1.07    -0.37    -2.00
          N.Valid    32.00    32.00    32.00    32.00    32.00    32.00
        Pct.Valid   100.00   100.00   100.00   100.00   100.00   100.00


0

I'm seeing this many years later. Some of the functions used above are deprecated in newer versions of dplyr, so you will have to use different ones.

A simple alternative could be to pick out the variables and summarise them like this:

describe(reframe(df, mpg, cyl, vs, am, gear, carb))

1 Comment

Hi @Evil_Lynn, I also ran into this outdated-package issue in many of the answers. Could you kindly say which packages provide the functions describe and reframe? I tried with dplyr alone and they don't work. Thanks
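
One plausible reading of that one-liner, offered as an assumption since the answer does not name its packages: describe() is taken to be psych::describe() (as in the earlier answer) and reframe() is taken to be dplyr::reframe(), which was added in dplyr 1.1.0. The data frame `df` is replaced by mtcars here so the sketch is self-contained.

```r
library(dplyr)  # reframe() requires dplyr >= 1.1.0
library(psych)  # assumed source of describe()

# reframe() with bare column names keeps just those columns;
# psych::describe() then returns one row of statistics per column
res <- psych::describe(dplyr::reframe(mtcars, mpg, cyl, vs, am, gear, carb))

print(res)
```

With older dplyr versions, select(mtcars, mpg, cyl, vs, am, gear, carb) can stand in for the reframe() call.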
