0

Lets call summary(my_data):

      year         quarter         employed         newhires       separations      jobscreated     jobsdestroyed  
 Min.   :1990   Min.   :1.000   Min.   :  6976   Min.   :  2321   Min.   :  1922   Min.   :  1091   Min.   :  520  
 1st Qu.:2000   1st Qu.:2.000   1st Qu.: 28049   1st Qu.: 16858   1st Qu.: 13912   1st Qu.:  6595   1st Qu.: 3862  
 Median :2003   Median :3.000   Median : 64836   Median : 39188   Median : 32018   Median : 14148   Median : 7727  
 Mean   :2003   Mean   :2.509   Mean   : 94468   Mean   : 59336   Mean   : 48973   Mean   : 22036   Mean   :11843  
 3rd Qu.:2007   3rd Qu.:4.000   3rd Qu.:121905   3rd Qu.: 75960   3rd Qu.: 61976   3rd Qu.: 26829   3rd Qu.:14993  
 Max.   :2010   Max.   :4.000   Max.   :571419   Max.   :448423   Max.   :391454   Max.   :166022   Max.   :80338  
                                                 NA's   :49                        NA's   :49       NA's   :49     

I want to convert this output into a data.table formatted as follows, where all entries (omitted in this depiction) are the raw values of min, 1st quartile. etc. :

            year         quarter         employed         newhires       separations      jobscreated     jobsdestroyed  
 Min.      
 1st Qu.   
 Median    
 Mean     
 3rd Qu.   
 Max.            
 NA's          

The following almost achieves this result, except for the fact that Min. , 1st Qu. , Median , Mean , 3rd Qu. , Max. , and NA's carry over into each entry. I want purely the raw numbers.

data.frame(unclass(summary(my_data)), check.names = FALSE, stringsAsFactors = FALSE)
             year         quarter         employed         newhires      separations      jobscreated   jobsdestroyed
X   Min.   :1990   Min.   :1.000   Min.   :  6976   Min.   :  2321   Min.   :  1922   Min.   :  1091   Min.   :  520  
X.1 1st Qu.:2000   1st Qu.:2.000   1st Qu.: 28049   1st Qu.: 16858   1st Qu.: 13912   1st Qu.:  6595   1st Qu.: 3862  
X.2 Median :2003   Median :3.000   Median : 64836   Median : 39188   Median : 32018   Median : 14148   Median : 7727  
X.3 Mean   :2003   Mean   :2.509   Mean   : 94468   Mean   : 59336   Mean   : 48973   Mean   : 22036   Mean   :11843  
X.4 3rd Qu.:2007   3rd Qu.:4.000   3rd Qu.:121905   3rd Qu.: 75960   3rd Qu.: 61976   3rd Qu.: 26829   3rd Qu.:14993  
X.5 Max.   :2010   Max.   :4.000   Max.   :571419   Max.   :448423   Max.   :391454   Max.   :166022   Max.   :80338  
X.6           <NA>            <NA>             <NA>     NA's   :49               <NA>     NA's   :49      NA's   :49  

Potential solutions include (1) deriving the table directly from summary(), or (2) using the output above and finding a way to remove Min. , 1st Qu. , Median , Mean , 3rd Qu. , Max. , and NA labels from reach entry and instead list them as column names. Your help is much appreicated!

7
  • 1
    Does this answer your question? Convert summary to data.frame Commented Nov 30, 2021 at 19:12
  • 1
    Keep reading the answer. do.call(cbind, lapply(mydf, summary)) works fine. At least with mtcars dataset. However I can't tell if it is ok with NaN values. Commented Nov 30, 2021 at 19:17
  • 1
    If you provide an example of your data I can try it. Commented Nov 30, 2021 at 19:18
  • 1
    @RobertoT, this code (yours and mine) does not work well with NA/NaN, since summary uses table(..., useNA="ifany") hard-coded ... so unless all columns have at least one, that will always fail. An alternative is to use fixed_summary <- function(object, ...) { o <- summary(c(object, NA), ...); o["NA's"] <- o["NA's"] - 1L; o; } and then as.data.frame(sapply(mtcars, fixed_summary)) (tested with mtcars[2,2] <- NA; mtcars[3,2] <- NaN). Commented Nov 30, 2021 at 20:10
  • 1
    Good to know! I wasn't sure. @r2evans Commented Nov 30, 2021 at 20:11

3 Answers 3

1

An adaptation of the dupe-link code, since that does not work as cleanly with incomplete (NA/NaN) data (e.g., newhires):

mtcars[2,2] <- NA
mtcars[3,2] <- NaN

as.data.frame(sapply(mtcars, summary))
# Error in dimnames(x) <- dnx : 'dimnames' applied to non-array

This fails because summary hard-codes (bleh) table(..., useNA="ifany"), which means that some columns might return length 6, some length 7, which defeats most casual attempts to conform into a data.frame-like structure.

One way around this (short of rewriting summary from scratch to fix that bug) is to add one NA to all vectors and subtract it from the result; that forces all summaries to include that field, and once subtracted it should represent the data and be rectangular-enough for as.data.frame:

fixed_summary <- function(object, ...) {
  o <- summary(c(object, NA), ...)
  o["NA's"] <- o["NA's"] - 1L
  o
}

ret <- as.data.frame(sapply(mtcars, fixed_summary))
ret
#              mpg      cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb
# Min.    10.40000 4.000000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000
# 1st Qu. 15.42500 4.000000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000
# Median  19.20000 6.000000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000
# Mean    20.09062 6.266667 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
# 3rd Qu. 22.80000 8.000000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000
# Max.    33.90000 8.000000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000
# NA's     0.00000 2.000000   0.0000   0.0000 0.000000 0.00000  0.00000 0.0000 0.00000 0.0000 0.0000

And, per your concern, the "Min." (etc) labels do not carry-over for each column: they are row-names only.

Many R tools do not guarantee preserving row-names; in fact some go out of their way to wipe them. My preference is to not rely on row-names, instead bringing them in as an explicit column. This is mostly subjective, partially defensive programming, and certainly not a requirement.

ret$rownames <- rownames(ret)
rownames(ret) <- NULL
ret
#        mpg      cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb rownames
# 1 10.40000 4.000000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000     Min.
# 2 15.42500 4.000000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000  1st Qu.
# 3 19.20000 6.000000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000   Median
# 4 20.09062 6.266667 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125     Mean
# 5 22.80000 8.000000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000  3rd Qu.
# 6 33.90000 8.000000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000     Max.
# 7  0.00000 2.000000   0.0000   0.0000 0.000000 0.00000  0.00000 0.0000 0.00000 0.0000 0.0000     NA's

(The order of columns is completely malleable.)

Sign up to request clarification or add additional context in comments.

Comments

0
# adjust summary(.) 
# returns summary of numeric (including factor) columns of a data frame 
# stats_along='row', put summary stats on the rows and variables along the columns
my_summ <- function(df, stats_along='row') {

  df_nonchar = df[, !sapply(df, typeof) %in% "character"]
  summ = data.frame(summary(df_nonchar), row.names = NULL)

  # test for empty columns:
  # # usually the 1st column is empty as a result of coercing an obj of
  # class(summary obj) "table" to data.frame.
  empty = sapply(summ, function(x) all(x == ""))
  summ = summ[, !empty]
  summ = setNames(summ, c("var_name", "stats"))
  summ = summ[which(!is.na(summ$stats)), ]

  # just in case if there are multiple :'s, we need to split only at the first match
  summ$stats = sub(":", "-;-", summ$stats)
  summ = data.frame(summ[1], do.call(rbind, strsplit(summ$stats, "-;-")))
  names(summ)[-1] = c("stats", "value")
  summ$var_name = trimws(summ$var_name) # rm white spaces

  # pivot into wide form, using 'stats' column as a key.
  stats_along = match.arg(stats_along, c('row', 'col'))
  if (stats_along == 'row') {
    idvar = "stats"
    timevar = "var_name"
  } else if (stats_along == 'col') {
    idvar = "var_name"
    timevar = "stats"
  }

  summ = reshape(
    summ,
    direction = "wide",
    idvar = idvar,
    timevar = timevar,
    v.names = "value",
    sep = "_"

  )

  var_nms = sub("(value_)(.+)", "\\2", names(summ)[-1])
  names(summ)[-1] = var_nms
  rownames(summ) = NULL


  # remove white spaces from cells
  summ[] = lapply(summ, function(x) gsub("\\s+$", "", x))

  # when vars in the dataset contain NAs, we may have two additional columns in
  # summary call
  nas = "NA's" %in% colnames(summ)
  if (any(nas)) {
    names(summ)[names(summ) == "NA's"] = "missing"
  }
  summ
}
my_summ(mtcars)

    stats   mpg   cyl  disp    hp  drat    wt  qsec     vs     am  gear  carb
1    Min. 10.40 4.000  71.1  52.0 2.760 1.513 14.50 0.0000 0.0000 3.000 1.000
2 1st Qu. 15.43 4.000 120.8  96.5 3.080 2.581 16.89 0.0000 0.0000 3.000 2.000
3  Median 19.20 6.000 196.3 123.0 3.695 3.325 17.71 0.0000 0.0000 4.000 2.000
4    Mean 20.09 6.188 230.7 146.7 3.597 3.217 17.85 0.4375 0.4062 3.688 2.812
5 3rd Qu. 22.80 8.000 326.0 180.0 3.920 3.610 18.90 1.0000 1.0000 4.000 4.000
6    Max. 33.90 8.000 472.0 335.0 4.930 5.424 22.90 1.0000 1.0000 5.000 8.000

If vars along the row and stats along the column axes are prefered, then,

my_summ(mtcars, 'col')

   var_name Min.    1st Qu. Median  Mean    3rd Qu. Max.   
1       mpg   10.40   15.43   19.20   20.09   22.80   33.90
2       cyl   4.000   4.000   6.000   6.188   8.000   8.000
3      disp    71.1   120.8   196.3   230.7   326.0   472.0
4        hp    52.0    96.5   123.0   146.7   180.0   335.0
5      drat   2.760   3.080   3.695   3.597   3.920   4.930
6        wt   1.513   2.581   3.325   3.217   3.610   5.424
7      qsec   14.50   16.89   17.71   17.85   18.90   22.90
8        vs  0.0000  0.0000  0.0000  0.4375  1.0000  1.0000
9        am  0.0000  0.0000  0.0000  0.4062  1.0000  1.0000
10     gear   3.000   3.000   4.000   3.688   4.000   5.000
11     carb   1.000   2.000   2.000   2.812   4.000   8.000


  • Note: my_summ(.) |> as.data.table() if you need a data table instead.

Comments

0

Another option is to create your own summary function that adds a sixth element if it is needed:

 ownSummary = function(x) {
    x = summary(x)
    if(length(x)==6) x[7] = 0
    x
  } 

An then you can run this.

data.table(sapply(mtcars, ownSummary), keep.rownames = T)

        rn      mpg      cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb
1:    Min. 10.40000 4.000000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000
2: 1st Qu. 15.42500 4.000000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000
3:  Median 19.20000 6.000000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000
4:    Mean 20.09062 6.266667 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
5: 3rd Qu. 22.80000 8.000000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000
6:    Max. 33.90000 8.000000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000
7:          0.00000 2.000000   0.0000   0.0000 0.000000 0.00000  0.00000 0.0000 0.00000 0.0000 0.0000

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.