Extreme outlier in real data

Question

I'm looking at the amount of carbon in seven forest pools. For dead trees left on the landscape across many locations and over several harvest retention (logging) treatments, there is an extreme value that I happen to know is real.

The data are somewhat zero-inflated (10 of 190 observations are zero) and right-skewed.

Min. = 0.0000
1st Qu. = 0.1733
Median = 0.6664
Mean = 7.0793
3rd Qu. = 3.2283
Max. = 468.9519
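As a rough check on how much that single maximum drives the mean, the summary above can be reworked by hand. This is only a sketch: it assumes the mean was computed over all 190 observations, and the variable names are illustrative.

```python
# Values copied from the summary above; n = 190 is assumed to be the
# number of observations behind the reported mean.
n = 190
mean_all = 7.0793
max_value = 468.9519

# Total carbon across plots implied by the mean.
total = n * mean_all

# Mean of the remaining 189 observations once the maximum is set aside.
mean_without_max = (total - max_value) / (n - 1)

# Fraction of the summed carbon contributed by the single largest plot.
share_of_total = max_value / total

print(f"Mean without the maximum: {mean_without_max:.2f}")
print(f"Share of total from the maximum: {share_of_total:.1%}")
```

Under that assumption, one plot accounts for roughly a third of the summed carbon, and the mean drops from about 7.1 to about 4.6 without it.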

Histogram with outlier:

Histogram of data without outlier:

A massive coastal old-growth snag results in a plot having 469 Mg C ha⁻¹ in the dead trees when the next most C-rich measurement is 83 Mg C ha⁻¹. This is a real tree in my actual plot, but it completely skews the estimates of my GLMMs away from meaningful inference about the rest of the data. It's random that this tree wound up in this particular treatment, as plots were randomly assigned treatments. It is not random that such a tree is at this location, because it is our southernmost and most humid research forest.

How do you handle a totally real but seriously destructive outlier?

And what are you trying to achieve from your modelling? The most important question :)
– Alex J, commented yesterday

jginestet · Accepted Answer · 2025-11-28 18:11:45Z

If you were to read some of my past answers and comments about outliers and outlier removal, you would see that I am very critical of people who remove so-called outliers without much thought.

So first let me commend you for at least having some scruples.

And second, maybe surprisingly coming from me, I see nothing wrong with you simply ignoring said "anomaly". As long as you clearly disclose this (as you did in the question), and note that you know it is "true data" (not an error) but so exceptional that it biases your model, then simply ignore it.
And you may also want to ignore the datapoints at 83 and 61 (only 1 observation in these ranges of your data).
And if you then clearly state that your model is only valid in the range of 0 to ~40 Mg C ha⁻¹ (it really would be pushing it to claim up to 50, as you have essentially no observations in that range), then there is no issue. That is the range your model can claim to be valid in, and you are simply excluding datapoints beyond that range.
And trying to use "robust models" or other such tools is fool's gold; you have basically no observations beyond ~40, so claiming that you are modelling beyond this value is pointless.
Now, if you had observed a lot of values between ~40 and ~400, you could make a broader claim, but then you would not have an outlier, would you? You would have an extreme value, and it would be fair to account for it in your (broader) model.
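The "restrict the range and disclose it" approach described above could be sketched as follows. The carbon values are made up for illustration (only 61, 83, and ~469 echo numbers mentioned in this thread), and `split_by_range` is a hypothetical helper, not part of any library.

```python
def split_by_range(values, upper):
    """Return (kept, excluded), where kept holds the values in [0, upper]."""
    kept = [v for v in values if 0 <= v <= upper]
    excluded = [v for v in values if not (0 <= v <= upper)]
    return kept, excluded


# Hypothetical dead-tree carbon values in Mg C/ha, for illustration only.
carbon = [0.0, 0.17, 0.67, 3.2, 12.5, 38.0, 61.0, 83.0, 468.95]

# Upper limit of the range the model will claim to be valid in (assumed ~40).
upper_limit = 40.0
kept, excluded = split_by_range(carbon, upper_limit)

# Report both parts, so the exclusion is disclosed rather than hidden.
print(f"Model fitted on {len(kept)} plots within 0 to {upper_limit} Mg C/ha")
print(f"Real but excluded values (outside the claimed range): {excluded}")
```

Keeping the excluded values in a separate, reported list makes it clear that they are real measurements outside the model's stated range, not errors that were silently dropped.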

Peter Flom · Accepted Answer · 2025-11-28 00:00:59Z

One possibility is to use multilevel quantile regression and look at the median (or other quantiles); see, e.g., this thread, which is somewhat old but seems on point. A paper that may be helpful is Galarza, Lachos, & Bandyopadhyay, "Quantile regression in linear mixed models".
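To see why modelling the median is less sensitive to one huge value, here is a minimal sketch of the idea. It is not a multilevel quantile regression, only the underlying intuition: the median is the value that minimizes the check (pinball) loss at level 0.5, and it does not move when the largest value grows. The data are made up and `pinball_loss` is a hypothetical helper.

```python
import statistics


def pinball_loss(q, values, tau=0.5):
    """Average check (pinball) loss of candidate quantile q at level tau."""
    return sum(
        (tau if v >= q else tau - 1) * (v - q) for v in values
    ) / len(values)


# Hypothetical carbon values (Mg C/ha); the two lists differ only in the maximum.
with_giant_snag = [0.0, 0.17, 0.67, 3.2, 12.5, 38.0, 468.95]
with_modest_max = [0.0, 0.17, 0.67, 3.2, 12.5, 38.0, 83.0]

# The mean follows the extreme value; the median does not move.
mean_giant = statistics.mean(with_giant_snag)
mean_modest = statistics.mean(with_modest_max)
median_giant = statistics.median(with_giant_snag)
median_modest = statistics.median(with_modest_max)
print(f"means:   {mean_giant:.2f} vs {mean_modest:.2f}")
print(f"medians: {median_giant} vs {median_modest}")

# At tau = 0.5 the median gives a loss no larger than the mean does.
loss_at_median = pinball_loss(median_giant, with_giant_snag)
loss_at_mean = pinball_loss(mean_giant, with_giant_snag)
print(f"pinball loss at median: {loss_at_median:.2f}, at mean: {loss_at_mean:.2f}")
```

A quantile regression applies the same loss to a model with predictors (and, in the multilevel case, random effects), so the magnitude of the single extreme plot has limited pull on the fitted median.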

Another possibility is robust linear models, see e.g. the documentation for robustlmm and the works cited there.
