Differences between manually and ggplot-calculated boxplot statistics / outlier elimination

Question

Note: Question edited to simplify the data.

I have a few datasets combined in a dataframe, which I want to eliminate outliers from. When trying different ways to calculate upper and lower thresholds I came upon discrepacies between the results of ggplot-boxplots and manual calculation. I would like to (1) understand the discrepancies and (2) find a convenient way to eliminate outliers from multiple similar datasets via dplyr.

Example data (taken from the boxplot.stats help page):

library(tidyverse)
require(stats)
x <- c(1:100, 1000)
(b1 <- boxplot.stats(x))

This returns ymin = 1 and max = 100 and 1000 as an obvious outlier.

My initial approach can be reproduced with this data, too... to have it complete, here is my original dataset too (however discussion is based on the simpler one).

# small dataset
values_A30 <- c(0.2079762,0.2605029,0.3054334,0.304067,0.8696487,0.3470931,0.2560001,0.3096838,0.2887556,0.3741472,0.2178375,0.2234682,0.2628923,0.2745458,0.5438208,0.7068278,0.6492924,1.100175,0.2740491,0.2849299,0.7562737,0.3009749,0.2598575,0.3460925,0.3265929,0.4208336,0.4353992,1.132036,0.3856708,0.1978752,0.3676808,0.4196799,0.4486595,0.3282394,0.3725664,0.385373,0.3680049,0.7875058,0.8098903,0.741165,1.260887,0.3521471,0.3883195,1.17124,0.3225514,0.3492051)

bpdata <- data.frame(Var1 = "A", SSW = 30, Value = values_A30)

Usually I start up with ggplot boxplots to get a visual impression.

# test plot
ggplot(as.data.frame(x),
       aes(x = NA, y = x)) +
  geom_boxplot()

#extract stats
gg_stats <- ggplot_build(
  ggplot(as.data.frame(x),
         aes(x = NA, y = x)) +
    geom_boxplot()
)$data[[1]] |>
  select(1:9)

# manual stats
man_stats <- as.data.frame(x) |>
  summarise(
    Qu1 = quantile(x, 0.25),
    Qu3 = quantile(x, 0.75),
    IQR = IQR(x)
  ) |>
  mutate(ymin = Qu1 - (1.5 * IQR),
         ymax = Qu3 + (1.5 * IQR))

Both, boxplot.stats and ggplot_build, lead to the same results. The manual calculation, however, leads to different results, ymin = -49 and ymax = 151. The values for the quartiles and IQR are correct and the rest of the calculation seems not to be too complicated, so: What am I doing wrong?

Insights / Answer

I worked more on the problem described and found the answer with the hints below and also on [this page] (Box-plot R calculating outliers).

To sum it up:

The initially calculated limits (Q1 - 1.5 IQR and Q3 + 1.5 IQR) are before outlier elimination, while the ymin/ymax output of the boxplot-stats (ggplot or R core boxplot function) are based on the actual data points and after outlier elimination. That was what confused me.
There is further complexity possible based on the underlying algorithm (type =) in the quantile and IQR functions.

Thanks again for help @Michiel Duvekot

Michiel Duvekot · Accepted Answer · 2024-06-19 17:44:09Z

0

It's perhaps easier to see what's happening if you use a very simple range of values

Value <- 0:100
Qu1  <- Value  |> quantile(0.25)
Qu3  <- Value  |>  quantile(0.75)
IQR <- Value  |>  IQR()

min <- Qu1 - (1.5 * IQR)
max <- Qu3 + (1.5 * IQR)

cat(min, max)

gives

-50 150

if you want to know which values are considered outliers, you can use

(bpdata %>% filter(Var1 == "A", SSW == 30) %>% pull(Value)  %>% boxplot.stats())$out

If you want to calculate the min and max yourself, you can do what boxplt.stats does:

y <- bpdata %>% filter(Var1 == "A", SSW == 30) %>% pull(Value)

stats <- stats::fivenum(y, na.rm = TRUE)
iqr <- diff(stats[c(2, 4)])
coef <- 1.5
min <- (stats[2] - coef * iqr)
max <- (stats[4] + coef * iqr)

edited Jun 19, 2024 at 17:44

answered Jun 19, 2024 at 16:16

Michiel Duvekot

2,3912 gold badges12 silver badges14 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Sulfatide Over a year ago

Thank you, Michiel. I have edited the example data to make it more comprehensible.

Collectives™ on Stack Overflow

Differences between manually and ggplot-calculated boxplot statistics / outlier elimination

Insights / Answer

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

Insights / Answer

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related