Outliers are data points that differ significantly from the other observations in a dataset. They can occur by chance in any distribution, but they can also arise from measurement variability or experimental error, signal novel behavior or structure in the data, or indicate that the population has a heavy-tailed distribution. Common causes include changes in system behavior, fraudulent behavior, human error, instrument error, and natural deviations in populations.
There is no rigid mathematical definition of what constitutes an outlier, and determining whether or not an observation is an outlier is ultimately a subjective exercise. However, there are various methods of outlier detection, some of which are treated as synonymous with novelty detection. Some methods are graphical, such as normal probability plots, while others are model-based.
Here are some common ways to find outliers in a dataset:
- Box plots: A box plot is a useful graphical display for describing the behavior of the data in the middle as well as at the ends of the distribution. The box plot uses the median and the lower and upper quartiles (defined as the 25th and 75th percentiles). If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the interquartile range, or IQR. The inner fences lie at Q1 - 1.5(IQR) and Q3 + 1.5(IQR), and the outer fences at Q1 - 3(IQR) and Q3 + 3(IQR). A point beyond an inner fence on either side is considered a mild outlier, while a point beyond an outer fence is considered an extreme outlier.
- Sorting: Sorting the values from low to high and checking for values that are much higher or lower than the rest of the data points can help identify outliers.
- Calculating the interquartile range: The interquartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1). An observation is considered an outlier if it is less than Q1 - 1.5(IQR) or greater than Q3 + 1.5(IQR).
- Using subject-area knowledge: Finding outliers depends on subject-area knowledge and an understanding of the data collection process. In-depth knowledge of all the variables being analyzed can help identify potential outliers.
- Plotting the data: Plotting the data on a number line as a dot plot makes values that sit far from the rest easy to spot.
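The IQR rule and the box plot fences can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation. It assumes the common convention of inner fences at 1.5 times the IQR and outer fences at 3 times the IQR beyond the quartiles. The function names `iqr_fences` and `classify_outliers` and the sample data are hypothetical, chosen only for this example. Quartile conventions differ between tools, so the exact fence values may vary slightly depending on the method used.

```python
from statistics import quantiles


def iqr_fences(values, inner_k=1.5, outer_k=3.0):
    """Return (inner_fences, outer_fences) as (low, high) pairs.

    inner_k and outer_k are the assumed IQR multipliers (1.5 and 3).
    """
    # method="inclusive" treats the data as the full population; other
    # quartile conventions give slightly different cut points.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - inner_k * iqr, q3 + inner_k * iqr)
    outer = (q1 - outer_k * iqr, q3 + outer_k * iqr)
    return inner, outer


def classify_outliers(values):
    """Split values into mild and extreme outliers using the fences."""
    inner, outer = iqr_fences(values)
    mild, extreme = [], []
    for v in sorted(values):  # sorting also makes the tails easy to inspect
        if v < outer[0] or v > outer[1]:
            extreme.append(v)
        elif v < inner[0] or v > inner[1]:
            mild.append(v)
    return mild, extreme


# Hypothetical sample data for illustration only.
data = [10, 12, 12, 13, 14, 15, 15, 16, 18, 25, 60]
mild, extreme = classify_outliers(data)
print("mild outliers:", mild)
print("extreme outliers:", extreme)
```

A value flagged this way is only a candidate: whether it is an error or a genuine observation still depends on knowledge of how the data were collected.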
It's important to note that outliers can have a large impact on statistical analyses and, when they stem from inaccurate measurements, can skew the results of hypothesis tests. However, outliers can also hold useful information and give helpful insights into the data being studied.