How can I create simple summary statistics for multiple variables using dplyr? The summarise_each function seems to be the way to go; however, when applying multiple functions to multiple columns, the result is a wide, hard-to-read data frame.
9 Answers
Use dplyr in combination with tidyr to reshape the end result.
library(dplyr)
library(tidyr)
df <- tbl_df(mtcars)
df.sum <- df %>%
  select(mpg, cyl, vs, am, gear, carb) %>% # select variables to summarise
  summarise_each(funs(min = min,
                      q25 = quantile(., 0.25),
                      median = median,
                      q75 = quantile(., 0.75),
                      max = max,
                      mean = mean,
                      sd = sd))
# the result is a wide data frame
> dim(df.sum)
[1] 1 42
# reshape it using tidyr functions
df.stats.tidy <- df.sum %>% gather(stat, val) %>%
  separate(stat, into = c("var", "stat"), sep = "_") %>%
  spread(stat, val) %>%
  select(var, min, q25, median, q75, max, mean, sd) # reorder columns
> print(df.stats.tidy)
   var  min    q25 median  q75  max     mean        sd
1   am  0.0  0.000    0.0  1.0  1.0  0.40625 0.4989909
2 carb  1.0  2.000    2.0  4.0  8.0  2.81250 1.6152000
3  cyl  4.0  4.000    6.0  8.0  8.0  6.18750 1.7859216
4 gear  3.0  3.000    4.0  4.0  5.0  3.68750 0.7378041
5  mpg 10.4 15.425   19.2 22.8 33.9 20.09062 6.0269481
6   vs  0.0  0.000    0.0  1.0  1.0  0.43750 0.5040161
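On recent dplyr and tidyr versions, roughly the same table can be sketched with across() and pivot_longer() instead of summarise_each(), funs(), gather() and spread(). This is a minimal sketch, not the original author's code: the names vars and stats_tidy and the double-underscore separator are assumptions chosen for illustration. The separator keeps the split unambiguous as long as no column name itself contains "__".

```r
library(dplyr)
library(tidyr)

# Variables to summarise (same selection as in the answer above)
vars <- c("mpg", "cyl", "vs", "am", "gear", "carb")

stats_tidy <- mtcars %>%
  summarise(across(
    all_of(vars),
    list(
      min    = min,
      q25    = ~ quantile(.x, 0.25),
      median = median,
      q75    = ~ quantile(.x, 0.75),
      max    = max,
      mean   = mean,
      sd     = sd
    ),
    # Name the results "<column>__<statistic>", e.g. "mpg__mean"
    .names = "{.col}__{.fn}"
  )) %>%
  # Split the names back into a variable column and one column per statistic
  pivot_longer(
    everything(),
    names_to  = c("var", ".value"),
    names_sep = "__"
  )

print(stats_tidy)
```

The result has one row per variable and one column per statistic, so no final reordering step is needed.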
4 Comments
summarise_each now throws a warning that it's deprecated; summarise_all is the new function for this kind of use case (dplyr 0.7.8).
summarise_all has since been superseded as well, and funs is deprecated, so the call needs to look like summarise(across(varlist, list(min = min, q25 = ~quantile(., 0.25), median = median, q75 = ~quantile(., 0.75), max = max, mean = mean, sd = sd))), using the ~ notation. Nevertheless, @konrad's solution below is even better than this.
A potentially easy solution can be created with broom::tidy and purrr::map_df. broom::tidy summarises key objects from statistical output into a tibble. purrr::map_df applies a function to each element (in this case, a column) and returns a tibble.
Example
library(tidyverse)
mtcars %>%
  select(mpg, cyl, vs, am, gear, carb) %>%
  map_df(.f = ~ broom::tidy(summary(.x)), .id = "variable")
Results
# A tibble: 6 x 7
# variable minimum q1 median mean q3 maximum
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 mpg 10.4 15.4 19.2 20.1 22.8 33.9
# 2 cyl 4 4 6 6.19 8 8
# 3 vs 0 0 0 0.438 1 1
# 4 am 0 0 0 0.406 1 1
# 5 gear 3 3 4 3.69 4 5
# 6 carb 1 2 2 2.81 4 8
5 Comments
summary function may look clean if you want to pack more transformations into that step. If you only want to apply those functions, across will likely provide the cleanest solution. You would be looking to do something along the lines of summarise(across(starts_with("Sepal"), list(mean = mean, sd = sd))), adapted to your functions, as in the provided examples. dplyr's n doesn't take any arguments, so you will have to derive your count using a different method or wrap n to drop the argument.
I liked paljenczy's idea of just using dplyr/tidyr and getting the table into a data.frame/tibble before formatting it. But I ran into robustness issues: because it relies on parsing variable names, it choked on columns with underscores in their names. After trying to fix this within the dplyr framework, it seemed like it would always be somewhat fragile because it relied on string parsing.
So in the end I decided on using psych::describe() which is a function designed for exactly this thing. It doesn't do completely arbitrary functions, but pretty much anything one would realistically want to do. A full example duplicating the previous solutions is included below, combining psych::describe() with some tidyverse stuff to get the exact tibble we are looking for.
It is worth noting that this answer has been updated to reflect the changed behavior of as_tibble() with regards to how it handles rownames in data.frames:
library(psych)
library(tidyverse)
# Create an extended version with a bunch of stats
d.summary.extended <- mtcars %>%
  select(mpg, cyl, vs, am, gear, carb) %>%
  psych::describe(quant = c(.25, .75)) %>%
  as_tibble(rownames = "rowname") %>%
  print()
<OUTPUT>
# A tibble: 6 x 16
rowname vars n mean sd median trimmed mad min max range skew kurtosis se Q0.25 Q0.75
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mpg 1 32 20.09062 6.0269481 19.2 19.6961538 5.41149 10.4 33.9 23.5 0.6106550 -0.372766 1.06542396 15.425 22.8
2 cyl 2 32 6.18750 1.7859216 6.0 6.2307692 2.96520 4.0 8.0 4.0 -0.1746119 -1.762120 0.31570933 4.000 8.0
3 vs 3 32 0.43750 0.5040161 0.0 0.4230769 0.00000 0.0 1.0 1.0 0.2402577 -2.001938 0.08909831 0.000 1.0
4 am 4 32 0.40625 0.4989909 0.0 0.3846154 0.00000 0.0 1.0 1.0 0.3640159 -1.924741 0.08820997 0.000 1.0
5 gear 5 32 3.68750 0.7378041 4.0 3.6153846 1.48260 3.0 5.0 2.0 0.5288545 -1.069751 0.13042656 3.000 4.0
6 carb 6 32 2.81250 1.6152000 2.0 2.6538462 1.48260 1.0 8.0 7.0 1.0508738 1.257043 0.28552971 2.000 4.0
</OUTPUT>
# Select stats for comparison with other solutions
d.summary <- d.summary.extended %>%
  select(var = rowname, min, q25 = Q0.25, median, q75 = Q0.75, max, mean, sd) %>%
  print()
<OUTPUT>
# A tibble: 6 x 8
var min q25 median q75 max mean sd
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mpg 10.4 15.425 19.2 22.8 33.9 20.09062 6.0269481
2 cyl 4.0 4.000 6.0 8.0 8.0 6.18750 1.7859216
3 vs 0.0 0.000 0.0 1.0 1.0 0.43750 0.5040161
4 am 0.0 0.000 0.0 1.0 1.0 0.40625 0.4989909
5 gear 3.0 3.000 4.0 4.0 5.0 3.68750 0.7378041
6 carb 1.0 2.000 2.0 4.0 8.0 2.81250 1.6152000
</OUTPUT>
3 Comments
var names, it reports the column numbers.
as_tibble() was modified in a recent release of the tidyverse, so the default behavior now seems to be to drop row names. I've now specified in the example that the row names should be stored in the rowname variable of the resulting tibble (using as_tibble(rownames="rowname")).
If you want to create a summary table for publication (not for further calculations), you may want to look at the excellent stargazer package.
library(stargazer)
df <- data.frame(mtcars)
cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
stargazer(
  df[, cols], type = "text",
  summary.stat = c("min", "p25", "median", "p75", "max", "mean", "sd")
)
================================================================
Statistic  Min   Pctl(25) Median Pctl(75)  Max    Mean  St. Dev.
----------------------------------------------------------------
mpg       10.400  15.430  19.200  22.800  33.900 20.091  6.027
cyl         4       4       6       8       8     6.188  1.786
vs          0       0       0       1       1     0.438  0.504
am          0       0       0       1       1     0.406  0.499
gear        3       3       4       4       5     3.688  0.738
carb        1       2       2       4       8     2.812  1.615
----------------------------------------------------------------
You can also set type to 'latex' or 'html', and save the result to a file by passing a file name via the 'out' argument.
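As a small illustration of the file output, the sketch below writes an HTML version of the summary table. The file name and the use of tempdir() are assumptions made for this example only; in practice you would point out at a path in your own project.

```r
library(stargazer)

cols <- c("mpg", "cyl", "vs", "am", "gear", "carb")

# Hypothetical output location, chosen only for this example
out_file <- file.path(tempdir(), "mtcars_summary.html")

# type = "html" switches the markup; out = ... also writes the table to a file
stargazer(
  mtcars[, cols],
  type = "html",
  summary.stat = c("min", "p25", "median", "p75", "max", "mean", "sd"),
  out = out_file
)
```

Note that stargazer still prints the generated markup to the console in addition to writing the file.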
Comments
There's a "new" package called skimr that has a function called skim(), which gives wonderful output describing individual variables in a data.frame/tibble.
Try:
skimr::skim(mtcars)
and you'll get:
── Data Summary ────────────────────────
Values
Name mtcars
Number of rows 32
Number of columns 11
_______________________
Column type frequency:
numeric 11
________________________
Group variables None
── Variable type: numeric ───────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 mpg 0 1 20.1 6.03 10.4 15.4 19.2 22.8 33.9 ▃▇▅▁▂
2 cyl 0 1 6.19 1.79 4 4 6 8 8 ▆▁▃▁▇
3 disp 0 1 231. 124. 71.1 121. 196. 326 472 ▇▃▃▃▂
4 hp 0 1 147. 68.6 52 96.5 123 180 335 ▇▇▆▃▁
5 drat 0 1 3.60 0.535 2.76 3.08 3.70 3.92 4.93 ▇▃▇▅▁
6 wt 0 1 3.22 0.978 1.51 2.58 3.32 3.61 5.42 ▃▃▇▁▂
7 qsec 0 1 17.8 1.79 14.5 16.9 17.7 18.9 22.9 ▃▇▇▂▁
8 vs 0 1 0.438 0.504 0 0 0 1 1 ▇▁▁▁▆
9 am 0 1 0.406 0.499 0 0 0 1 1 ▇▁▁▁▆
10 gear 0 1 3.69 0.738 3 3 4 4 5 ▇▁▆▁▂
11 carb 0 1 2.81 1.62 1 2 2 4 8 ▇▂▅▁▁
It is customizable and works well with pipes, etc.
See ?skimr::skim
and vignette("Using_skimr", package = "skimr")
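To give an idea of what the customization can look like, here is a minimal sketch assuming skimr version 2 or later, where skim_with() and sfl() are available. The name my_skim and the choice of IQR as an extra statistic are assumptions for illustration.

```r
library(dplyr)
library(skimr)

# Build a custom skimmer that appends the interquartile range
# to the default statistics for numeric columns
my_skim <- skim_with(numeric = sfl(iqr = IQR), append = TRUE)

res <- mtcars %>%
  select(mpg, cyl, vs, am, gear, carb) %>%
  my_skim()

print(res)
```

Because the result is a data frame underneath, it can be passed on to further dplyr steps like any other tibble.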
1 Comment
Similar to the accepted answer, but tidied up a bit into a function:
library(dplyr)
library(tidyr)
summarise_continuous <- function(d, cvars) {
  d %>%
    select(all_of(cvars)) %>%
    mutate(across(everything(), as.numeric)) %>%
    summarise(across(all_of(cvars), list(N = ~sum(!is.na(.)),
                                         mean = ~mean(., na.rm = TRUE),
                                         sd = ~sd(., na.rm = TRUE),
                                         median = ~median(., na.rm = TRUE),
                                         min = ~min(., na.rm = TRUE),
                                         max = ~max(., na.rm = TRUE)))) %>%
    # split "<variable>_<stat>" names: one row per variable, one column per stat
    pivot_longer(everything(),
                 names_to = c("variable", ".value"),
                 names_pattern = "(.+)_(.+)") # %>%
    # knitr::kable()
  # uncomment these bits if you want a nicely formatted table in a .Rmd document
}
summarise_continuous(mtcars, c("mpg", "cyl", "vs", "am", "gear", "carb"))
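One point worth illustrating is how the greedy pattern "(.+)_(.+)" behaves when column names themselves contain underscores. The sketch below uses a small made-up tibble (the column names miles_per_gallon and gross_hp are hypothetical renames of mtcars columns). Since the first group is greedy, the split happens at the last underscore, which works as long as the statistic names contain no underscore.

```r
library(dplyr)
library(tidyr)

# Hypothetical data whose column names contain underscores
d <- tibble(
  miles_per_gallon = mtcars$mpg,
  gross_hp         = mtcars$hp
)

res <- d %>%
  summarise(across(everything(), list(mean = mean, sd = sd))) %>%
  # Columns are now e.g. "miles_per_gallon_mean"; the greedy first group
  # captures everything up to the last underscore
  pivot_longer(
    everything(),
    names_to = c("variable", ".value"),
    names_pattern = "(.+)_(.+)"
  )

print(res)
```

A statistic named something like "std_dev" would break this split, so short underscore-free statistic names are the safer choice.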
Comments
You can achieve the same result using data.table as well. You might consider using it if your table is big.
library(data.table)
dt <- data.table(mtcars)
cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
functions <- c('min', 'q25', 'median', 'q75', 'max', 'mean', 'sd')
dt.sum <- dt[
  ,
  lapply(
    .SD,
    function(x) list(
      min(x), quantile(x, 0.25), median(x),
      quantile(x, 0.75), max(x), mean(x), sd(x)
    )
  ),
  .SDcols = cols
]
dt.sum
mpg cyl vs am gear carb
1: 10.4 4 0 0 3 1
2: 15.43 4 0 0 3 2
3: 19.2 6 0 0 4 2
4: 22.8 8 1 1 4 4
5: 33.9 8 1 1 5 8
6: 20.09 6.188 0.4375 0.4062 3.688 2.812
7: 6.027 1.786 0.504 0.499 0.7378 1.615
# transpose and provide meaningful names
dt.sum.t <- as.data.table(t(dt.sum))[]
setnames(dt.sum.t, names(dt.sum.t), functions)
dt.sum.t[, var := cols]
setcolorder(dt.sum.t, c("var", functions))
dt.sum.t
var min q25 median q75 max mean sd
1: mpg 10.4 15.43 19.2 22.8 33.9 20.09 6.027
2: cyl 4 4 6 8 8 6.188 1.786
3: vs 0 0 0 1 1 0.4375 0.504
4: am 0 0 0 1 1 0.4062 0.499
5: gear 3 3 4 4 5 3.688 0.7378
6: carb 1 2 2 4 8 2.812 1.615
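A possible alternative that avoids the transpose step is to melt the table to long format first and then aggregate by variable. This is a sketch under the assumption that all selected columns are numeric; the names long and dt.sum.long are made up for this example.

```r
library(data.table)

dt <- as.data.table(mtcars)
cols <- c("mpg", "cyl", "vs", "am", "gear", "carb")

# Long format: one row per (observation, variable) pair
long <- melt(dt, measure.vars = cols, variable.name = "var", value.name = "value")

# One row per variable, one plain numeric column per statistic
dt.sum.long <- long[, .(
  min    = min(value),
  q25    = quantile(value, 0.25),
  median = median(value),
  q75    = quantile(value, 0.75),
  max    = max(value),
  mean   = mean(value),
  sd     = sd(value)
), by = var]

print(dt.sum.long)
```

Unlike the transposed version, the statistic columns here are ordinary numeric vectors rather than list columns, which makes further calculations easier.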
Comments
Or, if you want a one-line solution, you can combine dplyr's select with descr() from the package summarytools:
library(dplyr); library(summarytools)
data <- mtcars
data %>% select(mpg, cyl, vs, am, gear, carb) %>% descr()
which results in:
Descriptive Statistics
am carb cyl gear mpg vs
Mean 0.41 2.81 6.19 3.69 20.09 0.44
Std.Dev 0.50 1.62 1.79 0.74 6.03 0.50
Min 0.00 1.00 4.00 3.00 10.40 0.00
Q1 0.00 2.00 4.00 3.00 15.35 0.00
Median 0.00 2.00 6.00 4.00 19.20 0.00
Q3 1.00 4.00 8.00 4.00 22.80 1.00
Max 1.00 8.00 8.00 5.00 33.90 1.00
MAD 0.00 1.48 2.97 1.48 5.41 0.00
IQR 1.00 2.00 4.00 1.00 7.38 1.00
CV 1.23 0.57 0.29 0.20 0.30 1.15
Skewness 0.36 1.05 -0.17 0.53 0.61 0.24
SE.Skewness 0.41 0.41 0.41 0.41 0.41 0.41
Kurtosis -1.92 1.26 -1.76 -1.07 -0.37 -2.00
N.Valid 32.00 32.00 32.00 32.00 32.00 32.00
Pct.Valid 100.00 100.00 100.00 100.00 100.00 100.00
Comments
I'm seeing this many years later. Some of the functions are deprecated in newer versions of dplyr, so you will have to use different ones.
A simple alternative is to pick out the variables and describe them like this:
describe(reframe(df, mpg, cyl, vs, am, gear, carb))