Calculating summary statistics for a dataframe automatically and creating new table

Question

I have the following dataframe:

col1 <- c("avi","chi","chi","bov","fox","bov","fox","avi","bov",
          "chi","avi","chi","chi","bov","bov","fox","avi","bov","chi")
col2 <- c("low","med","high","high","low","low","med","med","med","high",
          "low","low","high","high","med","med","low","low","med")
col3 <- c(0,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0)

test_data <- cbind(col1, col2, col3)
test_data <- as.data.frame(test_data)

And I want to end up with something like this table (values are random):

Species  Pop.density  %Resistance  CI_low  CI_high   Total samples
avi      low          2.0          1.2     2.2       30
avi      med          0            0       0.5       20
avi      high         3.5          2.9     4.2       10
chi      low          0.5          0.3     0.7       20
chi      med          2.0          1.9     2.1       150
chi      high         6.5          6.2     6.6       175

The % resistance column is based on the col3 above, where 1 = resistant, and 0 = nonresistant. I have tried the following:

library(dplyr)
test_data<-test_data %>%
  count(col1,col2,col3) %>%
  group_by(col1, col2) %>%
  mutate(perc_res = prop.table(n)*100)

I tried this, and it seem to almost do the trick, as i get the percentage of total 1s and 0s in col3, for each value in col1 and 2, however the total samples is wrong since I am counting all three columns, when the correct count would be for only col1 and 2.

For the confidence interval i would use the following:

binom.test(resistant samples,total samples)$conf.int*100

However I am not sure how to implement it together with the rest. Is there a simple and quick way to do this?

I'd suggest using the group_by and then the summarise function. — Jul
– Jul, Commented Sep 15, 2017 at 14:59
Use data.frame(col1, col2, col3), not cbind, which forces every column to string here. — Frank
– Frank, Commented Sep 15, 2017 at 15:25
Your example data does not have an ("avi", "high") pair. Would you want that row to appear anyways (with NAs and zero samples count)? — Frank
– Frank, Commented Sep 15, 2017 at 15:36

Frank · Accepted Answer · 2017-09-15 16:11:49Z

I'd do...

library(data.table)
setDT(DT)

DT[, { 
  bt <- binom.test(sum(resists), .N)$conf.int*100
  .(res_rate = mean(resists)*100, res_lo = bt[1], res_hi = bt[2], n = .N)
}, keyby=.(species, popdens)]

    species popdens  res_rate    res_lo    res_hi n
 1:     avi     low   0.00000  0.000000  70.75982 3
 2:     avi     med   0.00000  0.000000  97.50000 1
 3:     bov     low 100.00000 15.811388 100.00000 2
 4:     bov     med  50.00000  1.257912  98.74209 2
 5:     bov    high 100.00000 15.811388 100.00000 2
 6:     chi     low   0.00000  0.000000  97.50000 1
 7:     chi     med  50.00000  1.257912  98.74209 2
 8:     chi    high  66.66667  9.429932  99.15962 3
 9:     fox     low   0.00000  0.000000  97.50000 1
10:     fox     med  50.00000  1.257912  98.74209 2

To include all levels (combinations of species and population density)...

DT[CJ(species = species, popdens = popdens, unique = TRUE), on=.(species, popdens), {
  bt <- 
    if (.N > 0L) binom.test(sum(resists), .N)$conf.int*100 
    else NA_real_
  .(res_rate = mean(resists)*100, res_lo = bt[1], res_hi = bt[2], n = .N)    
}, by=.EACHI]

    species popdens  res_rate    res_lo    res_hi n
 1:     avi     low   0.00000  0.000000  70.75982 3
 2:     avi     med   0.00000  0.000000  97.50000 1
 3:     avi    high        NA        NA        NA 0
 4:     bov     low 100.00000 15.811388 100.00000 2
 5:     bov     med  50.00000  1.257912  98.74209 2
 6:     bov    high 100.00000 15.811388 100.00000 2
 7:     chi     low   0.00000  0.000000  97.50000 1
 8:     chi     med  50.00000  1.257912  98.74209 2
 9:     chi    high  66.66667  9.429932  99.15962 3
10:     fox     low   0.00000  0.000000  97.50000 1
11:     fox     med  50.00000  1.257912  98.74209 2
12:     fox    high        NA        NA        NA 0

How it works

The syntax is DT[i, j, by=] where ...

i determines a subset of rows, sometimes with helper arguments, on= or roll=.
by= determines groups within the subsetted table, switched to keyby= if sorting.
j is code acting on each group.

j should evaluate to a list, with .() being a shortcut to list(). See ?data.table for details.

Data used

(renamed columns, reformatted binary variable back to 0/1 or false/true, set levels of population density in the right order):

DT = data.frame(
  species = col1, 
  popdens = factor(col2, levels=c("low", "med", "high")), 
  resists = col3
)

score 3 · Accepted Answer · 2017-09-15 15:26:51Z

3

This ought to do it.

library(tidyverse)
library(broom)

test_data %>%
  mutate(col3 = ifelse(col3 == 0, "NonResistant", "Resistant")) %>%
  count(col1, col2, col3) %>%
  spread(col3, n, fill = 0) %>%
  mutate(PercentResistant = Resistant / (NonResistant + Resistant)) %>%
  mutate(test = map2(Resistant, NonResistant, ~ binom.test(.x, .x + .y) %>% tidy())) %>%
  unnest() %>%
  transmute(Species = col1, Pop.density = col2, PercentResistant, CI_low = conf.low * 100, CI_high = conf.high * 100, TotalSamples = Resistant + NonResistant)

Mutate the 0/1 resistance column so that it's got readable values.
Count the values in each bucket.
Spread col3/n into two columns, Resistant/NonResistant, and put the counts (n) into those columns. Now each row has everything it needs to do your test.
Calculate the Percent Resistance
Perform your test for each bucket and put the results in a nested frame called test.
Unnest the test data frame so you can work with the test results.
Clean up, give everything nice names.

Results

edited Sep 15, 2017 at 15:26

answered Sep 15, 2017 at 15:02

user6212

7 Comments

pogibas Over a year ago

Great solution! Can I ask where conf.low comes from as after unnest() I see only estimate and statistic?

user6212 Over a year ago

@PoGibas: conf.low comes from the tidy(), which is then unnested. If you see estimate it should be there. Wild guess, your window isn't that wide and there's something like "...with X more variables" below your results?

pogibas Over a year ago

aaa, it's tbl_df, %>% data.frame() shows it. Can't get used to tibble

Haakonkas Over a year ago

This is brilliant! Which part is from the "Broom" package? is it the unnest/nest and transmute?

user6212 Over a year ago

@Haakonkas: broom turns models into data frames with the tidy() method.

|

Collectives™ on Stack Overflow

Calculating summary statistics for a dataframe automatically and creating new table

2 Answers 2

Comments

7 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

7 Comments

Your Answer

Sign up or log in

Post as a guest

Related