6

I am trying to calculate multiple stats for a dataframe.

I tried dplyr's summarise_each. However, the results are returned in a flat, single-row with the function's name added as a suffix.

Is there a direct way - using dplyr or base r - where I can get the results in a data frame, with the columns as the data frame's columns and the rows as the summary functions?

library(dplyr)

df = data.frame(A = sample(1:100, 20), 
                B = sample(110:200, 20), 
                C = sample(c(0,1), 20, replace = T))

df %>% summarise_each(funs(min, max)) 
# A_min B_min C_min A_max B_max C_max
# 1    13   117     0    98   188     1

# Desired format
summary(df)
# A               B               C       
# Min.   :13.00   Min.   :117.0   Min.   :0.00  
# 1st Qu.:34.75   1st Qu.:134.2   1st Qu.:0.00  
# Median :45.00   Median :148.0   Median :1.00  
# Mean   :52.35   Mean   :149.9   Mean   :0.65  
# 3rd Qu.:62.50   3rd Qu.:168.8   3rd Qu.:1.00  
# Max.   :98.00   Max.   :188.0   Max.   :1.00  
4
  • 1
    close to this one? stackoverflow.com/q/38892219/3871924 Commented Nov 2, 2016 at 8:33
  • @agenis Thank you! Yes, that's close. I was just wondering if there is a simpler way. Commented Nov 2, 2016 at 8:36
  • summarise returns one row per group, so without grouping it returns one row. Commented Nov 2, 2016 at 8:50
  • Maybe data.frame(apply(df,2,summary)) Commented Nov 2, 2016 at 8:55

6 Answers 6

10

How about:

library(tidyr)
gather(df) %>% group_by(key) %>% summarise_all(funs(min, max))
# A tibble: 3 × 3
    key   min   max
  <chr> <dbl> <dbl>
1     A     2    92
2     B   111   194
3     C     0     1
Sign up to request clarification or add additional context in comments.

Comments

10

Why not just simply use sapply with summary?

sapply(df, summary)

gives:

            A     B    C
Min.     1.00 112.0 0.00
1st Qu. 23.75 134.5 0.00
Median  57.00 148.5 1.00
Mean    50.15 149.9 0.55
3rd Qu. 77.50 167.2 1.00
Max.    94.00 191.0 1.00

To get a dataframe back, just wrap the sapply call in data.frame(): data.frame(sapply(df, summary)). If you want to keep summary statisticnames in a column, you can extract them with rownames(df) with df$rn <- rownames(df) or use the keep.rownames-parameter from data.table:

library(data.table)
dt <- data.table(sapply(df, summary), keep.rownames = TRUE)

which gives:

> dt
        rn     A     B   C
1:    Min. 11.00 113.0 0.0
2: 1st Qu. 21.50 126.8 0.0
3:  Median 55.00 138.0 0.5
4:    Mean 53.65 145.2 0.5
5: 3rd Qu. 83.25 160.5 1.0
6:    Max. 98.00 193.0 1.0

Comments

4

With the data.frame you proposed, and using the library purrr

library(purrr)
out <- df %>% map(~summary(.)) %>% rbind.data.frame
row.names(out) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
####             A     B   C
#### Min.     7.00 110.0 0.0
#### 1st Qu. 36.75 132.5 0.0
#### Median  53.50 143.5 0.5
#### Mean    55.45 151.8 0.5
#### 3rd Qu. 82.00 167.0 1.0
#### Max.    99.00 199.0 1.0

There you go. let me just mention that this code works for input data.frame with 100% numeric variables only. It will return an error if there is for instance, a character/factor variable, since the output of summary is quite different.

Comments

2

It is not the only way to go but you could reshape your data.frame as you want using dplyr and tidyr. (and stringr or other to trim character.)

library(dplyr)

df = data.frame(A = sample(1:100, 20), 
                B = sample(110:200, 20), 
                C = sample(c(0,1), 20, replace = T))

as_data_frame(summary(df)) %>%
  # some blank character could be trim
  mutate(Var2 = stringr::str_trim(Var2)) %>% 
  # you don't need Var1
  select(-Var1) %>%
  # Get the type of summary and the value
  tidyr::separate(n, c("Type", "value"), sep = ":") %>%
  # Convert value to numeric
  mutate(value = as.numeric(value)) %>%
  # reshape as you wish
  tidyr::spread(Var2, value, drop = T)
#> # A tibble: 6 x 4
#>      Type     A     B     C
#> *   <chr> <dbl> <dbl> <dbl>
#> 1 1st Qu. 36.25 122.2  1.00
#> 2 3rd Qu. 77.25 164.5  1.00
#> 3 Max.    95.00 193.0  1.00
#> 4 Mean    57.30 144.6  0.85
#> 5 Median  63.00 143.5  1.00
#> 6 Min.     8.00 111.0  0.00

Comments

1

An approach without using tidyr nor dplyr:

df <- data.frame(A = sample(1:100, 20), 
                 B = sample(110:200, 20), 
                 C = sample(c(0,1), 20, replace = T))
df %>% 
    lapply(summary) %>% 
    do.call("rbind", .)

Output:

  Min. 1st Qu. Median   Mean 3rd Qu. Max.
A    9    32.5   50.5  49.65   70.25   84
B  116   137.2  162.5 157.70  178.20  196
C    0     0.0    0.0   0.45    1.00    1

If you want to do it with dplyr, try:

df %>% 
    gather(attribute, value) %>% 
    group_by(attribute) %>% 
    do(as.data.frame(as.list(summary(.$value))))

Output:

Source: local data frame [3 x 7]
Groups: attribute [3]

  attribute  Min. X1st.Qu. Median   Mean X3rd.Qu.  Max.
      <chr> <dbl>    <dbl>  <dbl>  <dbl>    <dbl> <dbl>
1         A     9     32.5   50.5  49.65    70.25    84
2         B   116    137.2  162.5 157.70   178.20   196
3         C     0      0.0    0.0   0.45     1.00     1

Comments

0

Thank you all very much for your help! After some cherry-picking, I used the following method.

# Dataframe 
df = data.frame(A = sample(1:100, 20), 
                B = sample(110:200, 20), 
                C = sample(c(0,1), 20, replace = T))

# Add summary functions to a list
summaryFns = list(
  NA.n  = function(x) sum(is.na(x)),
  NA.percent = function(x) sum(is.na(x))/length(x),
  unique.n = function(x) ifelse(sum(is.na(x)) > 0, length(unique(x)) - 1, length(unique(x))),
  min = function(x) min(x, na.rm=TRUE),
  max = function(x) max(x, na.rm=TRUE))


# Summarise data frame with each function 
# Using dplyr:
library(dplyr)
sapply(summaryFns, function(fn){df %>% summarise_all(fn)})
#   NA.n NA.percent unique.n min max
# A 0    0          20       1   98 
# B 0    0          20       114 200
# C 0    0          2        0   1  

# Using base-r:
sapply(summaryFns, function(fn){sapply(df, fn)})
#     NA.n NA.percent unique.n min max
# A    0          0       20   1  98
# B    0          0       20 114 200
# C    0          0        2   0   1

I think that is the most straight forward and flexible method.
Further comments, modifications and suggestions are appreciated.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.