Outliers

included in data

2017-08-19

Definition

An outlier is a data point or observation whose value is quite different from the others in the dataset being analyzed.¹

It is an important part of the analysis to identify outliers and to use appropriate techniques to take them into account. But unfortunately 😔

There is no absolute agreement among statisticians about how to define outliers […]¹

So what can be done? Fortunately 😏

Various rules of thumb have been developed to make the identification of outliers more consistent.¹

One common definition uses the concept of interquartile range (IQR).

IQR

The interquartile range [IQR] is the range of the middle 50% of the values in a data set, which is calculated as the difference between the 75th [upper quartile Q3] and 25th percentile [lower quartile Q1] values.¹

So how can the IQR be used to identify outliers and filter them out?

Using IQR to find outliers

[…] mild outliers are those lower than the 25th quartile [Q1] minus 1.5 x IQR or greater than the 75th quartile [Q3] plus 1.5 x IQR. ¹
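As a small worked illustration of this rule, the sketch below applies it to a made-up list of values with NumPy. The data and variable names are assumptions chosen for the example, and the quartiles rely on NumPy's default linear interpolation, which is only one of several conventions.

```python
import numpy as np

# Made-up sample: nine ordinary values and one suspiciously large one
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1, q3 = np.percentile(values, [25, 75])  # lower and upper quartiles
iqr = q3 - q1

# Fences for mild outliers: 1.5 x IQR beyond each quartile
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, q3, iqr)               # 3.25 7.75 4.5
print(lower_fence, upper_fence)  # -3.5 14.5
print(outliers)                  # [50]
```

Only the value 50 falls outside the fences, so it is the only point flagged as a mild outlier.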

As for the rationale:

Cases this extreme are expected in about 1 in 150 observations in normally distributed data.¹

Another sign of how common the 1.5 factor is: it is generally the default value in box plot implementations, as in matplotlib, the main Python plotting library.

whis : float, sequence, or string (default = 1.5) As a float, determines the reach of the whiskers to the beyond the first and third quartiles. In other words, where IQR is the interquartile range (Q3-Q1), the upper whisker will extend to last datum less than Q3 + whis x IQR).[^2]
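To see the effect of this parameter without drawing anything, one option is matplotlib's cbook.boxplot_stats helper, which computes the numbers a box plot would display. The sketch below uses made-up data and two illustrative whis values (1.5 and 10); it is a quick check, not a substitute for looking at the plot.

```python
import numpy as np
from matplotlib import cbook

# Made-up sample: nine ordinary values and one suspiciously large one
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# boxplot_stats returns one dict of statistics per dataset
default_stats = cbook.boxplot_stats(values, whis=1.5)[0]
wide_stats = cbook.boxplot_stats(values, whis=10)[0]

# 'whishi' is the end of the upper whisker, 'fliers' the points beyond the whiskers
print(default_stats['whishi'], default_stats['fliers'])
print(wide_stats['whishi'], wide_stats['fliers'])
```

With the default factor the upper whisker stops at the last value inside the fence and 50 is reported as a flier; with the much larger factor the whisker reaches 50 and no flier remains.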

Finally, the 1.5 factor can be replaced by a higher value to flag only more extreme outliers.

3 x IQR […] are expected about once per 425 000 observations in a normally distributed data.¹
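The frequencies quoted for the 1.5 and 3 factors can be roughly checked against a standard normal distribution using only the Python standard library. This is a sketch under the assumption of a perfectly normal population; the function name outlier_frequency is made up for the illustration.

```python
from statistics import NormalDist


def outlier_frequency(factor):
    """Return N such that about 1 in N standard-normal values lies
    outside the range Q1 - factor * IQR .. Q3 + factor * IQR."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    q1 = std_normal.inv_cdf(0.25)
    q3 = std_normal.inv_cdf(0.75)
    iqr = q3 - q1
    lower_fence = q1 - factor * iqr
    # By symmetry, both tails have the same probability
    tail_probability = 2 * std_normal.cdf(lower_fence)
    return 1 / tail_probability


print(round(outlier_frequency(1.5)))
print(round(outlier_frequency(3)))
```

The two printed values should land close to the "about 1 in 150" and "about once per 425 000" figures quoted from the book.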

In Python

Great, but how can this be used in Python with pandas to filter values in a dataset? Here is a simple solution taken from a fairly popular answer I wrote on Stack Overflow.

1. Producing some test data

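import numpy as np
import pandas as pd
# In a Jupyter notebook, uncomment the next line to display plots inline
# %matplotlib inline

# Some test data
np.random.seed(33454)
df = pd.concat(
    [
        # 20 integers drawn uniformly from [0, 100)
        pd.DataFrame({'nb': np.random.randint(0, 100, 20)}),
        # Adding some outliers: 2 integers drawn from [100, 200)
        pd.DataFrame({'nb': np.random.randint(100, 200, 2)}),
    ],
    # Resetting the index (DataFrame.append was removed in pandas 2.0)
    ignore_index=True,
)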

2. Computing IQR

Q1 = df['nb'].quantile(0.25)
Q3 = df['nb'].quantile(0.75)
IQR = Q3 - Q1

3. Filtering data

The filter uses the pandas query method for readability.

# Keep values between Q1 - 1.5 * IQR and Q3 + 1.5 * IQR (bounds included)
filtered = df.query('(@Q1 - 1.5 * @IQR) <= nb <= (@Q3 + 1.5 * @IQR)')

4. Plotting the result to check the difference

df.join(filtered, rsuffix='_filtered').boxplot()
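Steps 2 and 3 can also be wrapped in a small reusable function. The sketch below is one possible way to do it; the name filter_iqr, its parameters, and the sample data are assumptions made for the example, and it uses Series.between instead of query to apply the same inclusive bounds.

```python
import pandas as pd


def filter_iqr(frame, column, factor=1.5):
    """Keep rows whose `column` value lies within
    [Q1 - factor * IQR, Q3 + factor * IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    # between() includes both bounds by default, like the <= comparisons
    in_range = frame[column].between(q1 - factor * iqr, q3 + factor * iqr)
    return frame[in_range]


# Made-up data: the value 50 stands far from the rest
sample = pd.DataFrame({'nb': [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]})
kept = filter_iqr(sample, 'nb')
print(len(sample) - len(kept))  # number of rows dropped
```

Passing a larger factor, for example filter_iqr(sample, 'nb', factor=3), keeps more of the extreme values.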

Note: SciPy provides an implementation of the IQR computation, scipy.stats.iqr.
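As a quick illustration, the sketch below compares scipy.stats.iqr with the manual quartile difference from step 2 on made-up data. It assumes the default settings of both functions, which use the 25th and 75th percentiles with linear interpolation.

```python
import numpy as np
from scipy.stats import iqr

# Made-up sample, for illustration only
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Manual computation: difference between the upper and lower quartiles
q1, q3 = np.percentile(values, [25, 75])
manual_iqr = q3 - q1

# Same quantity in a single call
scipy_iqr = iqr(values)
print(manual_iqr, scipy_iqr)
```

Both approaches should give the same number here, so the SciPy function can replace the two quantile calls when only the IQR itself is needed.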

Conclusion

Outlier identification based on the IQR is a useful technique that is simple and generally accepted, so it can be used at least as a first tool during exploratory analysis.

¹ Sarah Boslaugh, Statistics in a Nutshell (O’Reilly, 2012)