# Definition

An outlier is a data point or observation whose value is quite different from the others in the dataset being analyzed.^{1}

Identifying outliers and using appropriate techniques to take them into account is an important part of the analysis. But unfortunately 😔

There is no absolute agreement among statisticians about how to define outliers […]^{1}

So what can be done? Fortunately 😏

Various rules of thumb have been developed to make the identification of outliers more consistent.^{1}

One common definition uses the concept of **interquartile range (IQR)**.

# IQR

The interquartile range [IQR] is the range of the middle 50% of the values in a data set, which is calculated as the difference between the 75th [upper quartile Q3] and 25th percentile [lower quartile Q1] values.^{1}

So how can the IQR be used to identify outliers and filter them out?

# Using IQR to find outliers

[…] mild outliers are those lower than the 25th quartile [Q1] minus 1.5 x IQR or greater than the 75th quartile [Q3] plus 1.5 x IQR.^{1}
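The rule quoted above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `iqr_fences`, its parameter `k` and the sample values are assumptions made for this example, and `np.percentile` is used with its default linear interpolation.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k * IQR and Q3 + k * IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample: seven ordinary values and one suspicious one
values = np.array([2, 4, 5, 6, 7, 8, 9, 30])

lower, upper = iqr_fences(values)
# A value is flagged when it falls strictly outside the fences
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```

With this sample, only the value 30 falls outside the fences.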

As for the rationale:

Cases this extreme are expected in about 1 in 150 observations in normally distributed data.^{1}

Another example of how common the **1.5 factor** is: it is generally the default value in box plot implementations, such as the one in matplotlib, the main Python plotting library.

whis: float, sequence, or string (default = 1.5) As a float, determines the reach of the whiskers to the beyond the first and third quartiles. In other words, where IQR is the interquartile range (Q3-Q1), the upper whisker will extend to last datum less than Q3 + whis x IQR).[^2]

One last remark: this 1.5 factor can be replaced by a higher value to flag only more extreme cases.

3 x IQR […] are expected about once per 425 000 observations in a normally distributed data.^{1}
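The two frequencies quoted for normally distributed data can be checked with the standard library alone. The sketch below assumes a standard normal distribution (the result does not depend on the mean or standard deviation) and counts both tails; the function name `one_in_n` is made up for this example.

```python
from statistics import NormalDist

def one_in_n(k):
    """Expected number of observations per value lying beyond
    Q1 - k * IQR or Q3 + k * IQR in normally distributed data."""
    std_normal = NormalDist()          # mean 0, standard deviation 1
    q3 = std_normal.inv_cdf(0.75)      # upper quartile
    q1 = -q3                           # lower quartile, by symmetry
    iqr = q3 - q1
    upper_fence = q3 + k * iqr
    # Probability of falling outside either fence (two symmetric tails)
    p_outside = 2 * std_normal.cdf(-upper_fence)
    return 1 / p_outside

for k in (1.5, 3):
    print(f"k = {k}: about 1 in {one_in_n(k):,.0f} observations")
```

The results should land close to the rounded figures quoted from the book, roughly 1 in 150 for the 1.5 factor and 1 in 425 000 for the factor of 3.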

# In Python

Great, but how can this be used with Python and pandas to filter values in a dataset? Here is a simple solution taken from a fairly popular answer I wrote on Stack Overflow.

## 1. Producing some test data

```
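import pandas as pd
import numpy as np
# Jupyter magic to render plots inline; remove it outside a notebook
%matplotlib inline

# Some test data
np.random.seed(33454)
df = pd.concat(
    [
        # 20 regular values drawn uniformly from [0, 100)
        pd.DataFrame({'nb': np.random.randint(0, 100, 20)}),
        # Adding 2 outliers drawn from [100, 200)
        pd.DataFrame({'nb': np.random.randint(100, 200, 2)}),
    ],
    # Resetting the index (DataFrame.append was removed in pandas 2.0)
    ignore_index=True,
)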
```

## 2. Computing IQR

```
Q1 = df['nb'].quantile(0.25)
Q3 = df['nb'].quantile(0.75)
IQR = Q3 - Q1
```

## 3. Filtering data

It makes use of the pandas `query` method for clarity.

```
# Keep the values between Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
filtered = df.query('(@Q1 - 1.5 * @IQR) <= nb <= (@Q3 + 1.5 * @IQR)')
```
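As an alternative to `query`, the same filter can be written with a boolean mask, which also makes it easy to inspect the rows that were removed and to try a higher factor such as 3. This is a sketch, not the method from the Stack Overflow answer: the helper `split_outliers`, its parameter `k` and the sample data are assumptions made for illustration.

```python
import pandas as pd

def split_outliers(df, column, k=1.5):
    """Split df into (kept, removed) rows using the k * IQR rule on one column."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # Series.between is inclusive on both ends by default
    inside = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[inside], df[~inside]

# Hypothetical data, fixed so that the result is reproducible
data = pd.DataFrame({'nb': [12, 35, 41, 47, 52, 58, 63, 90, 150, 400]})

kept, removed = split_outliers(data, 'nb')              # default factor of 1.5
kept_3, removed_3 = split_outliers(data, 'nb', k=3)     # stricter factor of 3
print(removed['nb'].tolist(), removed_3['nb'].tolist())
```

With this sample, the 1.5 factor removes 150 and 400, while the factor of 3 removes only 400.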

## 4. Plotting the result to check the difference

`df.join(filtered, rsuffix='_filtered').boxplot()`

*Note: SciPy provides an implementation of the IQR computation, `scipy.stats.iqr`.*
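A minimal sketch of that SciPy function, assuming SciPy is installed and using made-up sample values; with its default settings it should agree with the difference between the 75th and 25th percentiles computed by NumPy.

```python
import numpy as np
from scipy.stats import iqr

# Hypothetical sample values
values = np.array([2, 4, 5, 6, 7, 8, 9, 30])

# IQR computed by hand from the two quartiles
q1, q3 = np.percentile(values, [25, 75])
manual_iqr = q3 - q1

# IQR computed directly by SciPy
scipy_iqr = iqr(values)
print(manual_iqr, scipy_iqr)
```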

# Conclusion

Outlier identification based on the IQR is a **useful, simple and generally accepted technique**, so it can be used at least as a first tool during exploratory analysis.

1. Sarah Boslaugh, *Statistics in a Nutshell* (O’Reilly, 2012)