dplyr - summary table for multiple variables

Question

How can I create simple summary statistics for multiple variables using dplyr? The summarise_each function seems to be the way to go; however, when applying multiple functions to multiple columns, the result is a wide, hard-to-read data frame.

paljenczy · Accepted Answer · 2016-01-04 15:51:29Z

37

Use dplyr in combination with tidyr to reshape the end result.

library(dplyr)
library(tidyr)

df <- tbl_df(mtcars)

df.sum <- df %>%
  select(mpg, cyl, vs, am, gear, carb) %>% # select variables to summarise
  summarise_each(funs(min = min, 
                      q25 = quantile(., 0.25), 
                      median = median, 
                      q75 = quantile(., 0.75), 
                      max = max,
                      mean = mean, 
                      sd = sd))

# the result is a wide data frame
> dim(df.sum)
[1]  1 42

# reshape it using tidyr functions

df.stats.tidy <- df.sum %>% gather(stat, val) %>%
  separate(stat, into = c("var", "stat"), sep = "_") %>%
  spread(stat, val) %>%
  select(var, min, q25, median, q75, max, mean, sd) # reorder columns

> print(df.stats.tidy)

   var  min    q25 median  q75  max     mean        sd
1   am  0.0  0.000    0.0  1.0  1.0  0.40625 0.4989909
2 carb  1.0  2.000    2.0  4.0  8.0  2.81250 1.6152000
3  cyl  4.0  4.000    6.0  8.0  8.0  6.18750 1.7859216
4 gear  3.0  3.000    4.0  4.0  5.0  3.68750 0.7378041
5  mpg 10.4 15.425   19.2 22.8 33.9 20.09062 6.0269481
6   vs  0.0  0.000    0.0  1.0  1.0  0.43750 0.5040161

edited Jan 4, 2016 at 15:51

answered Jan 4, 2016 at 15:37

paljenczy


4 Comments

Magnus Over a year ago

I prefer this solution over the stargazer one because it returns a data.frame that I can then use further. Unfortunately, I'm running into issues because my variable names have underscores in them, which breaks the separate step of the pipe. I looked to see whether separate had a parameter to use only the rightmost underscore, but had no luck. Any advice on how to generalize this so it does not choke on variables with underscores?
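
One possible way around the underscore problem raised in this comment is to split each combined name at its last underscore only. A minimal base-R sketch, assuming the statistic names themselves contain no underscore; the combined names below are made up for illustration:

```r
# Hypothetical combined names, as produced when the summarised columns
# have underscores in their own names.
combined <- c("engine_size_min", "engine_size_q25", "fuel_use_mean", "mpg_sd")

# Drop the last underscore and everything after it: the variable name.
var <- sub("_[^_]+$", "", combined)

# Drop everything up to and including the last underscore: the statistic.
stat <- sub("^.*_", "", combined)

data.frame(var, stat)
```

The same idea can probably be handed to separate as a regular expression, for example sep = "_(?=[^_]+$)", assuming the regex engine in use supports lookahead.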

hannes101 Over a year ago

Using summarise_each now throws a warning that it's deprecated and summarise_all is the new function for this kind of use case. dplyr 0.7.8

user9026 Over a year ago

@hannes101 Even summarise_all has been superseded by across. Check this

zerocool Over a year ago

Not only is summarise_all deprecated, but so is funs, hence it needs to look like

summarise(across(varlist, list(min = min, q25 = ~quantile(., 0.25), median = median, q75 = ~quantile(., 0.75), max = max, mean = mean, sd = sd)))

using the ~ notation. Nevertheless, @Konrad's solution below is even better than this.
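
Putting these comments together, here is a sketch of the same summary written with across(), with pivot_longer() standing in for the gather/separate/spread steps. The "__" separator passed to .names is an arbitrary choice made for this example so that the split does not depend on single underscores in column names.

```r
library(dplyr)
library(tidyr)

# Named list of summary functions; the names become column suffixes.
stats <- list(
  min    = min,
  q25    = ~ quantile(.x, 0.25),
  median = median,
  q75    = ~ quantile(.x, 0.75),
  max    = max,
  mean   = mean,
  sd     = sd
)

df.stats.tidy <- mtcars %>%
  summarise(across(c(mpg, cyl, vs, am, gear, carb), stats,
                   .names = "{.col}__{.fn}")) %>%
  # Split "mpg__min" into a variable name and a statistic column.
  pivot_longer(everything(),
               names_to = c("var", ".value"),
               names_sep = "__")

df.stats.tidy
```

This should give one row per variable and one column per statistic, without a separate reshaping step for each statistic.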

Konrad · Answer · 2019-08-19 16:04:05Z

23

A potentially easy solution can be created with broom::tidy and purrr::map_df. broom::tidy summarises key objects from statistical output into a tibble. purrr::map_df applies a function to each element (in this case, a column) and returns a tibble.

Example

library(tidyverse)
mtcars %>% 
    select(mpg, cyl, vs, am, gear, carb) %>% 
    map_df(.f = ~ broom::tidy(summary(.x)), .id = "variable")

Results

# A tibble: 6 x 7
# variable minimum    q1 median   mean    q3 maximum
# <chr>      <dbl> <dbl>  <dbl>  <dbl> <dbl>   <dbl>
# 1 mpg         10.4  15.4   19.2 20.1    22.8    33.9
# 2 cyl          4     4      6    6.19    8       8  
# 3 vs           0     0      0    0.438   1       1  
# 4 am           0     0      0    0.406   1       1  
# 5 gear         3     3      4    3.69    4       5  
# 6 carb         1     2      2    2.81    4       8

edited Aug 19, 2019 at 16:04

answered Aug 19, 2019 at 11:50

Konrad


5 Comments

paljenczy Over a year ago

Very nice solution.

Matthew Over a year ago

Is it possible to extend your solution that it only includes mean, sd and n?

Konrad Over a year ago

@Matthew You could do it a number of ways. Creating your own version of the summary function may look clean if you want to pack more transformations into that step. To apply only those functions, across will likely provide the cleanest solution. You would be looking to do something along the lines of summarise(across(starts_with("Sepal"), list(mean = mean, sd = sd))), adjusted to your functions, as in the provided examples. dplyr's n doesn't take any arguments, so you will have to derive your count using a different method or wrap n to drop the argument.
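
A sketch of the across() approach described in this comment, using iris because the comment refers to Sepal columns. Counting non-missing values with sum(!is.na(.x)) is one possible stand-in for n(), chosen here as an assumption about what "n" should mean:

```r
library(dplyr)

res <- iris %>%
  summarise(across(starts_with("Sepal"),
                   list(mean = mean,
                        sd   = sd,
                        # count of non-missing values in the column
                        n    = ~ sum(!is.na(.x)))))

res
```

The result has one column per variable and statistic, named like Sepal.Length_mean.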

Michael Over a year ago

Just in case this updated syntax helps someone (although I really like the summarytools::descr() example):

library(tidyverse)

# Summary data frame
df |>
  summarize(across(where(is.numeric),
                   .fns = list(min = min,
                               q1 = ~ quantile(.x, 0.25),
                               median = ~ median(.x),
                               mean = ~ mean(.x),
                               q3 = ~ quantile(.x, 0.75),
                               max = max)))

Sophia Cardoso Oct 3 at 16:29

summary_stats <- df %>%
  group_by(col_x) %>%
  summarise(across(all_of(cols_list),
                   list(mean = ~mean(.x, na.rm = TRUE),
                        sd = ~sd(.x, na.rm = TRUE),
                        median = ~median(.x, na.rm = TRUE),
                        var = ~var(.x, na.rm = TRUE),
                        max = ~max(.x, na.rm = TRUE),
                        min = ~min(.x, na.rm = TRUE),
                        iqr = ~IQR(.x, na.rm = TRUE),
                        q25 = ~quantile(.x, 0.25, na.rm = TRUE),
                        q75 = ~quantile(.x, 0.75, na.rm = TRUE))))

summary_stats <- as.data.frame(t(summary_stats))

Magnus · Answer · 2019-08-19 11:28:46Z

19

I liked paljenczy's idea of just using dplyr/tidyr and getting the table into a data.frame/tibble before formatting it. But I ran into robustness issues: because it relies on parsing variable names, it choked on columns with underscores in their names. After trying to fix this within the dplyr framework, it seemed like it would always be somewhat fragile because it relied on string parsing.

So in the end I decided on using psych::describe() which is a function designed for exactly this thing. It doesn't do completely arbitrary functions, but pretty much anything one would realistically want to do. A full example duplicating the previous solutions is included below, combining psych::describe() with some tidyverse stuff to get the exact tibble we are looking for.

It is worth noting that this answer has been updated to reflect the changed behavior of as_tibble() with regards to how it handles rownames in data.frames:

library(psych)
library(tidyverse)

# Create an extended version with a bunch of stats 
d.summary.extended <- mtcars %>%
    select(mpg, cyl, vs, am, gear, carb) %>%
    psych::describe(quant=c(.25,.75)) %>%
    as_tibble(rownames="rowname")  %>%
    print()

<OUTPUT>
# A tibble: 6 x 16
  rowname  vars     n     mean        sd median    trimmed     mad   min   max range       skew  kurtosis         se  Q0.25 Q0.75
    <chr> <int> <dbl>    <dbl>     <dbl>  <dbl>      <dbl>   <dbl> <dbl> <dbl> <dbl>      <dbl>     <dbl>      <dbl>  <dbl> <dbl>
1     mpg     1    32 20.09062 6.0269481   19.2 19.6961538 5.41149  10.4  33.9  23.5  0.6106550 -0.372766 1.06542396 15.425  22.8
2     cyl     2    32  6.18750 1.7859216    6.0  6.2307692 2.96520   4.0   8.0   4.0 -0.1746119 -1.762120 0.31570933  4.000   8.0
3      vs     3    32  0.43750 0.5040161    0.0  0.4230769 0.00000   0.0   1.0   1.0  0.2402577 -2.001938 0.08909831  0.000   1.0
4      am     4    32  0.40625 0.4989909    0.0  0.3846154 0.00000   0.0   1.0   1.0  0.3640159 -1.924741 0.08820997  0.000   1.0
5    gear     5    32  3.68750 0.7378041    4.0  3.6153846 1.48260   3.0   5.0   2.0  0.5288545 -1.069751 0.13042656  3.000   4.0
6    carb     6    32  2.81250 1.6152000    2.0  2.6538462 1.48260   1.0   8.0   7.0  1.0508738  1.257043 0.28552971  2.000   4.0
</OUTPUT>

# Select stats for comparison with other solutions
d.summary <- d.summary.extended %>%
    select(var=rowname, min, q25=Q0.25, median, q75=Q0.75, max, mean, sd) %>%
    print()

<OUTPUT>
# A tibble: 6 x 8
    var   min    q25 median   q75   max     mean        sd
  <chr> <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>     <dbl>
1   mpg  10.4 15.425   19.2  22.8  33.9 20.09062 6.0269481
2   cyl   4.0  4.000    6.0   8.0   8.0  6.18750 1.7859216
3    vs   0.0  0.000    0.0   1.0   1.0  0.43750 0.5040161
4    am   0.0  0.000    0.0   1.0   1.0  0.40625 0.4989909
5  gear   3.0  3.000    4.0   4.0   5.0  3.68750 0.7378041
6  carb   1.0  2.000    2.0   4.0   8.0  2.81250 1.6152000    
</OUTPUT>

edited Aug 19, 2019 at 11:28

answered Oct 17, 2017 at 9:31

Magnus


3 Comments

Amleto Over a year ago

Thanks, your answer is great, although it does not show the variable names; it reports the column numbers instead.

Magnus Over a year ago

@Amleto, I have updated the answer and it should now work again. The problem was that the behavior of as_tibble() was modified in a recent release of the tidyverse so now the default behavior seems to be to drop rownames. I've now specified in the example that the rowname should be included in the rowname variable in the resulting tibble (using as_tibble(rownames="rowname")).

Patrick Williams Over a year ago

This worked great for me. The "correct" answer almost worked for me but I ran into the same problem someone else did, having to do with variable names having underscores in them. This answer was much simpler in my use case.

janosdivenyi · Answer · 2016-01-30 09:24:16Z

If you want to create a summary table for publication (not for further calculations) you may want to look at the excellent stargazer package.

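library(stargazer)

df <- data.frame(mtcars)
cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
# Note: "median" is listed twice below, which is why it appears twice in the
# output; replace the second one with "mean" to report the mean instead.
stargazer(
    df[, cols], type = "text", 
    summary.stat = c("min", "p25", "median", "p75", "max", "median", "sd")
)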

================================================================
Statistic  Min   Pctl(25) Median Pctl(75)  Max   Median St. Dev.
----------------------------------------------------------------
mpg       10.400  15.430  19.200  22.800  33.900 19.200  6.027
cyl         4       4       6       8       8      6     1.786
vs          0       0       0       1       1      0     0.504
am          0       0       0       1       1      0     0.499
gear        3       3       4       4       5      4     0.738
carb        1       2       2       4       8      2     1.615
----------------------------------------------------------------

You can also set type to 'latex' or 'html', and save the result to a file by passing a file name to the 'out' argument.
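
A short sketch of writing the table to a file. The output file name is arbitrary, and this version lists "mean" where the code above repeats "median", which is an assumption about what was intended:

```r
library(stargazer)

cols <- c("mpg", "cyl", "vs", "am", "gear", "carb")

# Arbitrary output location chosen for this example.
out_file <- file.path(tempdir(), "summary_table.html")

# type = "html" produces an HTML table; out = writes it to the given file.
stargazer(
    mtcars[, cols], type = "html",
    summary.stat = c("min", "p25", "median", "p75", "max", "mean", "sd"),
    out = out_file
)
```

The table is still printed to the console as well as being written to out_file.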

qwr · Answer · 2024-05-20 23:05:24Z

There's a "new" package called skimr that has a function called skim() that gives wonderful output describing individual variables in a data.frame/tibble.

Try:

skimr::skim(mtcars)

and you'll get:

── Data Summary ────────────────────────
                           Values
Name                       mtcars
Number of rows             32    
Number of columns          11    
_______________________          
Column type frequency:           
  numeric                  11    
________________________         
Group variables            None  

── Variable type: numeric ───────────────────────────────────────────────────────────────────────────
   skim_variable n_missing complete_rate    mean      sd    p0    p25    p50    p75   p100 hist 
 1 mpg                   0             1  20.1     6.03  10.4   15.4   19.2   22.8   33.9  ▃▇▅▁▂
 2 cyl                   0             1   6.19    1.79   4      4      6      8      8    ▆▁▃▁▇
 3 disp                  0             1 231.    124.    71.1  121.   196.   326    472    ▇▃▃▃▂
 4 hp                    0             1 147.     68.6   52     96.5  123    180    335    ▇▇▆▃▁
 5 drat                  0             1   3.60    0.535  2.76   3.08   3.70   3.92   4.93 ▇▃▇▅▁
 6 wt                    0             1   3.22    0.978  1.51   2.58   3.32   3.61   5.42 ▃▃▇▁▂
 7 qsec                  0             1  17.8     1.79  14.5   16.9   17.7   18.9   22.9  ▃▇▇▂▁
 8 vs                    0             1   0.438   0.504  0      0      0      1      1    ▇▁▁▁▆
 9 am                    0             1   0.406   0.499  0      0      0      1      1    ▇▁▁▁▆
10 gear                  0             1   3.69    0.738  3      3      4      4      5    ▇▁▆▁▂
11 carb                  0             1   2.81    1.62   1      2      2      4      8    ▇▂▅▁▁

It is customizable and works well with pipes. See ?skimr::skim and vignette("Using_skimr", package = "skimr").
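
As a rough illustration of that customization, the sketch below restricts skim() to a few columns and builds a custom skimmer with skim_with() and sfl(). The choice of extra statistics (IQR and MAD) is only an example:

```r
library(skimr)

# Summarise only selected columns.
skim(mtcars, mpg, cyl, vs, am, gear, carb)

# Build a custom skimmer; sfl() lists the statistics for a column type.
my_skim <- skim_with(
  numeric = sfl(iqr = IQR, mad = mad),
  append = TRUE  # keep the default numeric statistics as well
)

custom <- my_skim(mtcars, mpg, cyl)
custom
```

Because the result is a data frame underneath, it can be passed on to further dplyr steps if needed.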

Brian D · Answer · 2021-12-06 21:43:47Z

Similar to the accepted answer, but tidied up a bit into a function:

library(dplyr)
library(tidyr)

summarise_continuous = function(d, cvars) {
  d %>%
    select(all_of(cvars)) %>%
    mutate(across(everything(), as.numeric)) %>%  # coerce every selected column to numeric
    summarise(across(all_of(cvars), list(N = ~sum(!is.na(.)), 
                                         mean = ~mean(., na.rm=TRUE), 
                                         sd = ~sd(., na.rm=TRUE), 
                                         median = ~median(., na.rm=TRUE),
                                         min = ~min(., na.rm=TRUE),
                                         max = ~max(., na.rm=TRUE)))) %>% 
    # the greedy first group splits each name at its last underscore
    pivot_longer(everything(), 
                 names_to = c("variable",".value"),
                 names_pattern = "(.+)_(.+)") # %>%
    # knitr::kable()
    # uncomment these bits if you want a nicely formatted table in a .Rmd document
}

summarise_continuous(mtcars, c("mpg", "cyl", "vs", "am", "gear", "carb"))

janosdivenyi · Answer · 2016-01-20 14:53:25Z

You can achieve the same result using data.table as well. You might consider using it if your table is big.

library(data.table)

dt <- data.table(mtcars)

cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
functions <- c('min', 'q25', 'median', 'q75', 'max', 'mean', 'sd')

dt.sum <- dt[
    , 
    lapply(
        .SD, 
        function(x) list(
                min(x), quantile(x, 0.25), median(x), 
                quantile(x, 0.75), max(x), mean(x), sd(x)
        )
    ),
    .SDcols = cols
]

dt.sum
     mpg   cyl     vs     am   gear  carb
1:  10.4     4      0      0      3     1
2: 15.43     4      0      0      3     2
3:  19.2     6      0      0      4     2
4:  22.8     8      1      1      4     4
5:  33.9     8      1      1      5     8
6: 20.09 6.188 0.4375 0.4062  3.688 2.812
7: 6.027 1.786  0.504  0.499 0.7378 1.615

# transpose and provide meaningful names
dt.sum.t <- as.data.table(t(dt.sum))[]
setnames(dt.sum.t, names(dt.sum.t), functions)
dt.sum.t[, var := cols]
setcolorder(dt.sum.t, c("var", functions))

dt.sum.t
    var  min   q25 median  q75  max   mean     sd
1:  mpg 10.4 15.43   19.2 22.8 33.9  20.09  6.027
2:  cyl    4     4      6    8    8  6.188  1.786
3:   vs    0     0      0    1    1 0.4375  0.504
4:   am    0     0      0    1    1 0.4062  0.499
5: gear    3     3      4    4    5  3.688 0.7378
6: carb    1     2      2    4    8  2.812  1.615
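
As a variation on the code above, the renaming steps can be avoided by building a named numeric matrix first and converting it afterwards. This is a sketch rather than a drop-in replacement; the object names stats and dt.sum.alt are introduced here for illustration:

```r
library(data.table)

cols <- c("mpg", "cyl", "vs", "am", "gear", "carb")

# One named vector of statistics per column. sapply() binds them into a
# matrix with statistics as rows, so t() puts the variables on the rows.
# unname() stops quantile() from appending "25%"/"75%" to the names.
stats <- t(sapply(mtcars[cols], function(x) c(
    min = min(x), q25 = unname(quantile(x, 0.25)), median = median(x),
    q75 = unname(quantile(x, 0.75)), max = max(x), mean = mean(x), sd = sd(x)
)))

# keep.rownames = "var" turns the row names into a column called "var".
dt.sum.alt <- as.data.table(stats, keep.rownames = "var")
dt.sum.alt
```

The columns here are plain numeric vectors rather than list columns, which may be easier to work with downstream.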

dr_E · Answer · 2023-10-19 10:15:13Z

Or, if you want a one-line solution, you can combine dplyr's select with descr() from the package summarytools:

library(dplyr); library(summarytools)
data <- mtcars
data %>% select(mpg, cyl, vs, am, gear, carb) %>% descr()

which results in:

Descriptive Statistics  

                        am     carb      cyl     gear      mpg       vs

             Mean     0.41     2.81     6.19     3.69    20.09     0.44
          Std.Dev     0.50     1.62     1.79     0.74     6.03     0.50
              Min     0.00     1.00     4.00     3.00    10.40     0.00
               Q1     0.00     2.00     4.00     3.00    15.35     0.00
           Median     0.00     2.00     6.00     4.00    19.20     0.00
               Q3     1.00     4.00     8.00     4.00    22.80     1.00
              Max     1.00     8.00     8.00     5.00    33.90     1.00
              MAD     0.00     1.48     2.97     1.48     5.41     0.00
              IQR     1.00     2.00     4.00     1.00     7.38     1.00
               CV     1.23     0.57     0.29     0.20     0.30     1.15
         Skewness     0.36     1.05    -0.17     0.53     0.61     0.24
      SE.Skewness     0.41     0.41     0.41     0.41     0.41     0.41
         Kurtosis    -1.92     1.26    -1.76    -1.07    -0.37    -2.00
          N.Valid    32.00    32.00    32.00    32.00    32.00    32.00
        Pct.Valid   100.00   100.00   100.00   100.00   100.00   100.00

Evil_Lynn · Answer · 2023-12-06 18:05:29Z

0

I'm seeing this many years later. Some of the functions are deprecated in newer versions of dplyr, so you will have to use different ones.

A simple alternative could be to create variables and arrange it like this:

describe(reframe(df, mpg, cyl, vs, am, gear, carb))

answered Dec 6, 2023 at 18:05

Evil_Lynn

1

1 Comment

Francis van Oordt Jan 21 at 14:29

Hi @Evil_Lynn, I also bumped into this outdated-package issue in many of the answers. Could you kindly say which packages provide the functions describe and reframe? I tried with dplyr and they don't work. Thanks
