Intuitively, an outlier can be seen as a value that is much smaller or larger than most of the other values in a data set. More formally, an outlier is an observation that lies at an abnormal distance from the other values in a random sample taken from a population.
There is some subjectivity in this definition, as “abnormal distance” can be different for different people and for different problems. Therefore, before identifying outliers, it is necessary to define what a “normal distance” is.
There are several ways to identify an outlier, ranging from plain observation to more systematic techniques such as scatter plots, box plots, Grubbs’ test (which assumes that the data are approximately normally distributed), or the Mahalanobis distance.
Note that an outlier is different from noise. An outlier is a data point that differs from the majority and can be compared with the other values, whereas noise is random error.
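As an illustration of the box plot approach, the sketch below applies the convention commonly used to draw box plot whiskers: a value is flagged when it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. The function name iqr_outliers, the multiplier k=1.5, and the sample readings are assumptions made for this example; the threshold is a convention rather than a rule, and different quartile methods can give slightly different fences.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the fences [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the usual box plot convention; it is an assumption here,
    not a universal definition of "abnormal distance".
    """
    # Quartiles with linear interpolation between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep the original order of the flagged values.
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample: seven similar readings and one very large one.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(readings))  # → [95]
```

A larger k makes the rule more tolerant, which is one concrete way of choosing what counts as a “normal distance” for a given problem.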
There are several ways to classify outliers. By the number of variables involved, an outlier can be univariate (one variable) or multivariate (two or more variables). By its relation to the rest of the data, an outlier can be global (a data point that differs from the rest of the data set as a whole), contextual (a data point that is unusual only within a specific context), or collective (a subset of data points that, taken together, differs from the rest of the data).
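To make the univariate versus multivariate distinction concrete, the following sketch computes the Mahalanobis distance of each point from the sample mean for two variables. The function name mahalanobis_2d and the sample points are hypothetical, and the code assumes exactly two variables and a non-singular covariance matrix. In the sample, the last point has ordinary x and y values when each variable is viewed alone, but it breaks the strong relationship between them, so it is only visible as a multivariate outlier.

```python
import statistics


def mahalanobis_2d(points):
    """Return the Mahalanobis distance of each (x, y) point from the sample mean.

    Assumes two variables and a non-singular sample covariance matrix.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(points)
    # Entries of the sample covariance matrix S = [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] S^-1 [dx dy]^T, with S^-1 written out for the 2x2 case.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(d2 ** 0.5)
    return distances


# Hypothetical data: y roughly follows x, except for the last point (2, 7).
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1),
          (6, 5.9), (7, 7.2), (8, 7.8), (2, 7.0)]
distances = mahalanobis_2d(points)
farthest = max(range(len(points)), key=lambda i: distances[i])
print(points[farthest])  # → (2, 7.0)
```

How large a distance must be before a point is called an outlier remains a judgment call that depends on the problem and on the assumed distribution of the data.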
The identification of outliers is an important step in data analysis. For example, an outlier may indicate a wrong value due to an experimental problem or incorrect coding. But an outlier is not always a negative thing: it may indicate something unexpected that can be confirmed by further studies, or it may reveal errors in our model.
Although expert knowledge may not be necessary to identify some outliers, interpreting them may require it. That is why LogicPlum provides not only a platform for modeling but also expert advice on the matter. Our philosophy is “working together” to attain the best results.
For those wanting to expand their knowledge, a good reference book is:
Hawkins, D.M. (1980). Identification of Outliers. London: Chapman and Hall.
© 2021 LogicPlum, Inc. All rights reserved.