dplyr - Multiple summary functions

Question

I am trying to calculate multiple stats for a dataframe.

I tried dplyr's summarise_each. However, the results are returned in a flat, single-row with the function's name added as a suffix.

Is there a direct way - using dplyr or base r - where I can get the results in a data frame, with the columns as the data frame's columns and the rows as the summary functions?

library(dplyr)

df = data.frame(A = sample(1:100, 20), 
                B = sample(110:200, 20), 
                C = sample(c(0,1), 20, replace = T))

df %>% summarise_each(funs(min, max)) 
# A_min B_min C_min A_max B_max C_max
# 1    13   117     0    98   188     1

# Desired format
summary(df)
# A               B               C       
# Min.   :13.00   Min.   :117.0   Min.   :0.00  
# 1st Qu.:34.75   1st Qu.:134.2   1st Qu.:0.00  
# Median :45.00   Median :148.0   Median :1.00  
# Mean   :52.35   Mean   :149.9   Mean   :0.65  
# 3rd Qu.:62.50   3rd Qu.:168.8   3rd Qu.:1.00  
# Max.   :98.00   Max.   :188.0   Max.   :1.00

@agenis Thank you! Yes, that's close. I was just wondering if there is a simpler way. — Deena
– Deena, Commented Nov 2, 2016 at 8:36
summarise returns one row per group, so without grouping it returns one row. — Axeman
– Axeman, Commented Nov 2, 2016 at 8:50

Axeman · Accepted Answer · 2016-11-02 08:57:01Z

10

How about:

library(tidyr)
gather(df) %>% group_by(key) %>% summarise_all(funs(min, max))

# A tibble: 3 × 3
    key   min   max
  <chr> <dbl> <dbl>
1     A     2    92
2     B   111   194
3     C     0     1

answered Nov 2, 2016 at 8:57

Axeman

35.7k8 gold badges87 silver badges101 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Jaap · Accepted Answer · 2018-06-07 10:48:12Z

Why not just simply use sapply with summary?

sapply(df, summary)

gives:

            A     B    C
Min.     1.00 112.0 0.00
1st Qu. 23.75 134.5 0.00
Median  57.00 148.5 1.00
Mean    50.15 149.9 0.55
3rd Qu. 77.50 167.2 1.00
Max.    94.00 191.0 1.00

To get a dataframe back, just wrap the sapply call in data.frame(): data.frame(sapply(df, summary)). If you want to keep summary statisticnames in a column, you can extract them with rownames(df) with df$rn <- rownames(df) or use the keep.rownames-parameter from data.table:

library(data.table)
dt <- data.table(sapply(df, summary), keep.rownames = TRUE)

which gives:

> dt
        rn     A     B   C
1:    Min. 11.00 113.0 0.0
2: 1st Qu. 21.50 126.8 0.0
3:  Median 55.00 138.0 0.5
4:    Mean 53.65 145.2 0.5
5: 3rd Qu. 83.25 160.5 1.0
6:    Max. 98.00 193.0 1.0

agenis - code en bois · Accepted Answer · 2016-11-02 09:45:19Z

4

With the data.frame you proposed, and using the library purrr

library(purrr)
out <- df %>% map(~summary(.)) %>% rbind.data.frame
row.names(out) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
####             A     B   C
#### Min.     7.00 110.0 0.0
#### 1st Qu. 36.75 132.5 0.0
#### Median  53.50 143.5 0.5
#### Mean    55.45 151.8 0.5
#### 3rd Qu. 82.00 167.0 1.0
#### Max.    99.00 199.0 1.0

There you go. let me just mention that this code works for input data.frame with 100% numeric variables only. It will return an error if there is for instance, a character/factor variable, since the output of summary is quite different.

edited Nov 2, 2016 at 9:45

answered Nov 2, 2016 at 9:00

agenis - code en bois

8,4177 gold badges60 silver badges110 bronze badges

Comments

cderv · Accepted Answer · 2016-11-02 08:51:58Z

It is not the only way to go but you could reshape your data.frame as you want using dplyr and tidyr. (and stringr or other to trim character.)

library(dplyr)

df = data.frame(A = sample(1:100, 20), 
                B = sample(110:200, 20), 
                C = sample(c(0,1), 20, replace = T))

as_data_frame(summary(df)) %>%
  # some blank character could be trim
  mutate(Var2 = stringr::str_trim(Var2)) %>% 
  # you don't need Var1
  select(-Var1) %>%
  # Get the type of summary and the value
  tidyr::separate(n, c("Type", "value"), sep = ":") %>%
  # Convert value to numeric
  mutate(value = as.numeric(value)) %>%
  # reshape as you wish
  tidyr::spread(Var2, value, drop = T)
#> # A tibble: 6 x 4
#>      Type     A     B     C
#> *   <chr> <dbl> <dbl> <dbl>
#> 1 1st Qu. 36.25 122.2  1.00
#> 2 3rd Qu. 77.25 164.5  1.00
#> 3 Max.    95.00 193.0  1.00
#> 4 Mean    57.30 144.6  0.85
#> 5 Median  63.00 143.5  1.00
#> 6 Min.     8.00 111.0  0.00

pe-perry · Accepted Answer · 2016-11-02 09:08:35Z

An approach without using tidyr nor dplyr:

df <- data.frame(A = sample(1:100, 20), 
                 B = sample(110:200, 20), 
                 C = sample(c(0,1), 20, replace = T))
df %>% 
    lapply(summary) %>% 
    do.call("rbind", .)

Output:

  Min. 1st Qu. Median   Mean 3rd Qu. Max.
A    9    32.5   50.5  49.65   70.25   84
B  116   137.2  162.5 157.70  178.20  196
C    0     0.0    0.0   0.45    1.00    1

If you want to do it with dplyr, try:

df %>% 
    gather(attribute, value) %>% 
    group_by(attribute) %>% 
    do(as.data.frame(as.list(summary(.$value))))

Output:

Source: local data frame [3 x 7]
Groups: attribute [3]

  attribute  Min. X1st.Qu. Median   Mean X3rd.Qu.  Max.
      <chr> <dbl>    <dbl>  <dbl>  <dbl>    <dbl> <dbl>
1         A     9     32.5   50.5  49.65    70.25    84
2         B   116    137.2  162.5 157.70   178.20   196
3         C     0      0.0    0.0   0.45     1.00     1

Deena · Accepted Answer · 2016-11-02 09:23:02Z

Thank you all very much for your help! After some cherry-picking, I used the following method.

# Dataframe 
df = data.frame(A = sample(1:100, 20), 
                B = sample(110:200, 20), 
                C = sample(c(0,1), 20, replace = T))

# Add summary functions to a list
summaryFns = list(
  NA.n  = function(x) sum(is.na(x)),
  NA.percent = function(x) sum(is.na(x))/length(x),
  unique.n = function(x) ifelse(sum(is.na(x)) > 0, length(unique(x)) - 1, length(unique(x))),
  min = function(x) min(x, na.rm=TRUE),
  max = function(x) max(x, na.rm=TRUE))


# Summarise data frame with each function 
# Using dplyr:
library(dplyr)
sapply(summaryFns, function(fn){df %>% summarise_all(fn)})
#   NA.n NA.percent unique.n min max
# A 0    0          20       1   98 
# B 0    0          20       114 200
# C 0    0          2        0   1  

# Using base-r:
sapply(summaryFns, function(fn){sapply(df, fn)})
#     NA.n NA.percent unique.n min max
# A    0          0       20   1  98
# B    0          0       20 114 200
# C    0          0        2   0   1

I think that is the most straight forward and flexible method.
Further comments, modifications and suggestions are appreciated.

Collectives™ on Stack Overflow

dplyr - Multiple summary functions

6 Answers 6

Comments

Comments

Comments

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

Comments

Comments

Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related