Outliers

Data points way outside the pack — notice them, investigate, decide.

In a dataset of marathon times, one entry says 14 minutes. Either someone discovered teleportation, or it's a typo. Outliers: the data points that don't belong — or do they?

An outlier is a value far away from the rest of the data. It might be an error, or it might be the most interesting point in the set.

Where you'll meet this

Fraud detection, sensor faults, scientific discoveries, data cleaning — half of real-world data work is deciding which weird points to fix, drop, or chase down.

data science, fraud, science

Edit the data set

median = 14.50
Q1 = 13.25, Q3 = 16, IQR = 2.75
outlier: 42
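To see why 42 is flagged, you can work out the 1.5 × IQR fences (the rule described in the next section) from the summary values above. This is a minimal Python sketch that assumes only the quartiles shown; the underlying data set isn't listed here, and the variable names are just for illustration.

```python
# Summary values taken from the example above.
q1, q3 = 13.25, 16.0
iqr = q3 - q1                      # 2.75

# Fences for the 1.5 x IQR rule.
lower_fence = q1 - 1.5 * iqr       # 9.125
upper_fence = q3 + 1.5 * iqr       # 20.125

candidate = 42
is_outlier = candidate < lower_fence or candidate > upper_fence
print(lower_fence, upper_fence, is_outlier)
```

Anything below 9.125 or above 20.125 would be flagged, so 42 sits well past the upper fence.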

Common ways to flag outliers

1.5 × IQR rule: values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR.
Z-score: values more than 3 standard deviations from the mean, in either direction.
Visual: points isolated on a box plot or scatter plot.
Domain sense: a value that's physically impossible.
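The first two rules are easy to try in code. Here is a rough sketch using Python's standard `statistics` module, applied to the sample 4, 5, 6, 7, 8, 9, 50 from the exercise further down. The function names `iqr_outliers` and `zscore_outliers` are made up for this example, and the quartiles use the module's default convention; other tools compute quartiles slightly differently, so the fences can shift a little.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default quartile convention
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / sd > threshold]

data = [4, 5, 6, 7, 8, 9, 50]
iqr_flags = iqr_outliers(data)
z_flags = zscore_outliers(data)
print("IQR rule flags:", iqr_flags)
print("Z-score flags: ", z_flags)
```

On this sample the IQR rule flags 50, but the z-score rule flags nothing. The 50 inflates both the mean and the standard deviation, and with only seven values no point can sit more than about 2.27 sample standard deviations from the mean. That is one reason to treat the ±3 cutoff with care on small data sets.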

Your turn

Data: 4, 5, 6, 7, 8, 9, 50. The mean jumps from ~6.5 to ~12.7 because of the 50. What does that tell you?
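One way to explore the question is to check the numbers yourself and compare the mean with the median. A small sketch using Python's standard `statistics` module (the variable names are just for illustration):

```python
import statistics

without_50 = [4, 5, 6, 7, 8, 9]
with_50 = without_50 + [50]

mean_without = statistics.mean(without_50)      # 39 / 6
mean_with = statistics.mean(with_50)            # 89 / 7
median_without = statistics.median(without_50)
median_with = statistics.median(with_50)

print(f"mean:   {mean_without:.1f} -> {mean_with:.1f}")
print(f"median: {median_without} -> {median_with}")
```

The mean nearly doubles, while the median only moves from 6.5 to 7.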


When is an outlier the whole point?

In a widely told account, unusually low ozone readings over Antarctica were at first flagged as suspect by automated data processing. Scientists who investigated the readings instead of discarding them confirmed a real, alarming phenomenon: the hole in the ozone layer.

Watch out

Don't delete outliers automatically. They might be the discovery, the fraud, or the broken sensor you needed to catch. Investigate first; remove only with a documented reason.

Always report your results with and without outliers. If the conclusion changes, the outliers deserve a paragraph of their own.
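A with-and-without report can be as simple as computing the same summary twice. The sketch below is one possible shape, not a standard recipe: `summarize` and `with_and_without` are hypothetical helper names, and the flagged values are assumed to come from whatever rule and investigation you have already documented.

```python
import statistics

def summarize(values):
    """Basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

def with_and_without(values, flagged):
    """Summarize the full data and the data with flagged values removed."""
    flagged = set(flagged)
    # Note: this drops every occurrence of a flagged value.
    kept = [v for v in values if v not in flagged]
    return {"all": summarize(values), "without_flagged": summarize(kept)}

report = with_and_without([4, 5, 6, 7, 8, 9, 50], flagged=[50])
for label, stats in report.items():
    print(label, {name: round(value, 2) for name, value in stats.items()})
```

If the two rows lead to different conclusions, that difference is what the write-up needs to explain.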

Recap

Outlier = a value far from the rest of the data.
Flag with 1.5×IQR, z-scores, plots, or domain knowledge.
Investigate before deleting — it could be error or discovery.

Related

Box plots, Quartiles, Median, Standard deviation
