In a dataset of marathon times, one entry says 14 minutes. Either someone discovered teleportation, or it's a typo. Outliers: the data points that don't belong — or do they?
An outlier is a value far away from the rest of the data. It might be an error, or it might be the most interesting point in the set.
Fraud detection, sensor faults, scientific discoveries, data cleaning — half of real-world data work is deciding which weird points to fix, drop, or chase down.
Common ways to flag outliers
- 1.5 × IQR rule — beyond Q1 − 1.5·IQR or Q3 + 1.5·IQR.
- Z-score — more than ±3 standard deviations from the mean.
- Visual — points isolated on a box plot or scatter plot.
- Domain sense — a value that's physically impossible.
Data: 4, 5, 6, 7, 8, 9, 50. The mean jumps from ~6.5 to ~12.7 because of the 50. What does that tell you?
When is an outlier the whole point?
The hole in the ozone layer was first dismissed as an outlier by automated software. Scientists who *investigated* it instead of deleting it found a real, alarming phenomenon.
Don't delete outliers automatically. They might be the discovery, the fraud, or the broken sensor you needed to catch. Investigate first; remove only with a documented reason.
Always report your results with and without outliers. If the conclusion changes, the outliers deserve a paragraph of their own.
- Outlier = a value far from the rest of the data.
- Flag with 1.5×IQR, z-scores, plots, or domain knowledge.
- Investigate before deleting — it could be error or discovery.