0

Note: Question edited to simplify the data.

I have a few datasets combined in a dataframe, which I want to eliminate outliers from. When trying different ways to calculate upper and lower thresholds I came upon discrepacies between the results of ggplot-boxplots and manual calculation. I would like to (1) understand the discrepancies and (2) find a convenient way to eliminate outliers from multiple similar datasets via dplyr.

Example data (taken from the boxplot.stats help page):

library(tidyverse)
require(stats)
x <- c(1:100, 1000)
(b1 <- boxplot.stats(x))

This returns ymin = 1 and max = 100 and 1000 as an obvious outlier.

My initial approach can be reproduced with this data, too... to have it complete, here is my original dataset too (however discussion is based on the simpler one).

# small dataset
values_A30 <- c(0.2079762,0.2605029,0.3054334,0.304067,0.8696487,0.3470931,0.2560001,0.3096838,0.2887556,0.3741472,0.2178375,0.2234682,0.2628923,0.2745458,0.5438208,0.7068278,0.6492924,1.100175,0.2740491,0.2849299,0.7562737,0.3009749,0.2598575,0.3460925,0.3265929,0.4208336,0.4353992,1.132036,0.3856708,0.1978752,0.3676808,0.4196799,0.4486595,0.3282394,0.3725664,0.385373,0.3680049,0.7875058,0.8098903,0.741165,1.260887,0.3521471,0.3883195,1.17124,0.3225514,0.3492051)

bpdata <- data.frame(Var1 = "A", SSW = 30, Value = values_A30)

Usually I start up with ggplot boxplots to get a visual impression.

# test plot
ggplot(as.data.frame(x),
       aes(x = NA, y = x)) +
  geom_boxplot()

#extract stats
gg_stats <- ggplot_build(
  ggplot(as.data.frame(x),
         aes(x = NA, y = x)) +
    geom_boxplot()
)$data[[1]] |>
  select(1:9)

# manual stats
man_stats <- as.data.frame(x) |>
  summarise(
    Qu1 = quantile(x, 0.25),
    Qu3 = quantile(x, 0.75),
    IQR = IQR(x)
  ) |>
  mutate(ymin = Qu1 - (1.5 * IQR),
         ymax = Qu3 + (1.5 * IQR))

Both, boxplot.stats and ggplot_build, lead to the same results. The manual calculation, however, leads to different results, ymin = -49 and ymax = 151. The values for the quartiles and IQR are correct and the rest of the calculation seems not to be too complicated, so: What am I doing wrong?

Insights / Answer

I worked more on the problem described and found the answer with the hints below and also on [this page] (Box-plot R calculating outliers).

To sum it up:

  • The initially calculated limits (Q1 - 1.5 IQR and Q3 + 1.5 IQR) are before outlier elimination, while the ymin/ymax output of the boxplot-stats (ggplot or R core boxplot function) are based on the actual data points and after outlier elimination. That was what confused me.
  • There is further complexity possible based on the underlying algorithm (type =) in the quantile and IQR functions.

Thanks again for help @Michiel Duvekot

1 Answer 1

0

It's perhaps easier to see what's happening if you use a very simple range of values

Value <- 0:100
Qu1  <- Value  |> quantile(0.25)
Qu3  <- Value  |>  quantile(0.75)
IQR <- Value  |>  IQR()

min <- Qu1 - (1.5 * IQR)
max <- Qu3 + (1.5 * IQR)

cat(min, max)

gives

-50 150

if you want to know which values are considered outliers, you can use

(bpdata %>% filter(Var1 == "A", SSW == 30) %>% pull(Value)  %>% boxplot.stats())$out

If you want to calculate the min and max yourself, you can do what boxplt.stats does:

y <- bpdata %>% filter(Var1 == "A", SSW == 30) %>% pull(Value)

stats <- stats::fivenum(y, na.rm = TRUE)
iqr <- diff(stats[c(2, 4)])
coef <- 1.5
min <- (stats[2] - coef * iqr)
max <- (stats[4] + coef * iqr)
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you, Michiel. I have edited the example data to make it more comprehensible.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.