Anomaly Detection in Machine Learning
Tons of information beginning from this wiki starting point:
In a given data set, there will be ‘behaviors’ or ‘incidents’ that are significantly outside of the expected ranges or classifications within the set. Over larger sets of data, especially ones that have not been cleaned or classified, this is even more common. Detecting these anomalous incidents is a major part of the current landscape of Machine Learning.
The techniques for carrying this practice out are varied. Methods based on statistics use statistical models and flag deviations. Classification-based techniques use ML Models to recognize ‘normality’ vs ‘abnormality’. You could also call that Outlier Identification.
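As a minimal sketch of the statistics-based approach, the snippet below flags values whose z-score (distance from the mean, measured in standard deviations) exceeds a cutoff. The function name `zscore_outliers`, the sample readings, and the cutoff of 2.5 are all assumptions made for illustration, not a recommended setup.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.5):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # Every value is identical, so nothing can deviate.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one obviously odd value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 9.7, 10.1, 10.0, 9.9, 10.2, 25.0]
outliers = zscore_outliers(readings)
print(outliers)
```

One caveat with this approach is that the outlier itself inflates the mean and standard deviation, so on small samples a cutoff that is too strict can miss it.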
Regardless of the method, it’s vital to be able to identify when bits of the training data, or data coming in from the field, are not standard. Identifying anomalous patterns or incidents helps improve the prediction success rate of a given model.
Types of Anomalies
Understanding what type of result can be anomalous is a prerequisite for being able to identify anomalies. It’s not quite as simple as seeing obvious and exaggerated outliers plotted into a graph. Let’s look at types of anomalies to understand what it is that needs to be identified.
from RAD (post by Dr. Yosef Yehuda (Yossi) Kuttner, Ph.D.)
There are three main types of anomalies:
Point anomalies. A data point that differs remarkably from the rest of the data points in the dataset considered.
Contextual anomalies. The anomaly of a data point is related to the location, time or other contextual attributes of other points in the dataset. For example, it is reasonable to assume that water consumption in households is much higher at night during a weekend, compared to midday in a workday.
Collective anomalies. When a group of points are treated as an anomaly on the whole, but specific data points in that group are considered normal. Collective anomalies can only be detected in datasets where the data is related in some way, i.e., sequential, spatial or graph data.
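To make the contextual case concrete, here is a rough sketch that scores each value only against other values sharing its context. The context labels, the water-usage numbers, and the cutoff of 2.0 are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev

def contextual_outliers(records, threshold=2.0):
    """Flag (context, value) records whose value is unusual for their own context."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, value in records:
        group = by_context[context]
        if len(group) < 3:
            continue  # too few samples to judge this context
        mu, sigma = mean(group), stdev(group)
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical water usage (liters) tagged with the context it was measured in.
night = [50, 52, 48, 51, 49, 50, 51, 49, 52, 48, 20]
midday = [20, 21, 19, 22, 18, 20, 21, 19, 22, 18]
records = [("weekend_night", v) for v in night] + [("workday_midday", v) for v in midday]

flagged = contextual_outliers(records)
print(flagged)
```

A reading of 20 shows up in both contexts here. It is ordinary for a workday midday but unusual for a weekend night, so only the weekend-night occurrence is flagged.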
So what I was able to glean from the post by RAD is that there are different categories of anomalies, and these categorizations are important to understand because ‘norms’ always exist within some kind of scope, and not all data will be anomalous in all contexts.
Certain anomalies jump right out because they are remarkably different from every other sample, but that doesn’t mean every anomaly will behave like this. It’s completely possible for groupings of unexpected measurements to occur. The entire group can be classified as anomalous, OR, given enough points with those values, that group can grow to become an expectation and have its own set of outliers around its “borders”.
Detection Techniques
I won’t make an exhaustive list here. I’ll just cover a few methods, then summarize what those methods do abstractly afterward.
K-Means Clustering
K-means clustering is an unsupervised machine-learning technique that divides a dataset into k distinct clusters. During this process, you find the average, or mean, of the points within each cluster. That average is the cluster’s “centroid”, and every point can be compared against the centroid of the cluster it belongs to.
If a point sits unexpectedly far from its centroid, it can be treated as anomalous.
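The distance-to-centroid idea above can be sketched with scikit-learn’s `KMeans`. The synthetic data, the choice of two clusters, and the “mean plus three standard deviations” cutoff are assumptions for illustration; a real dataset would need its own choice of k and threshold.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two tight clusters plus one stray point between them.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
cluster_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
stray = np.array([[5.0, 5.0]])
X = np.vstack([cluster_a, cluster_b, stray])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() returns the distance from each point to every centroid;
# the smallest one is the distance to the point's own (nearest) centroid.
distances = model.transform(X).min(axis=1)

# Assumed cutoff: anything far beyond the typical distance is flagged.
cutoff = distances.mean() + 3 * distances.std()
anomalies = X[distances > cutoff]
print(anomalies)
```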
K-Nearest Neighbors
Similar to clustering, this method also uses distance to decide how ‘anomalous’ a point is. That said, this time the distance is not measured from a cluster centroid. Instead, each point is scored by its average distance to its k ‘nearest neighbors’: a point whose closest neighbors are still far away is a likely anomaly.
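Here is a minimal sketch of scoring points by their distance to nearby neighbors, using scikit-learn’s `NearestNeighbors`. The helper name `knn_scores`, the value k=5, and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_scores(X, k=5):
    """Average distance from each point to its k nearest neighbors."""
    # Ask for k + 1 neighbors because each point's closest match is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return distances[:, 1:].mean(axis=1)

# Synthetic data: 100 points around the origin plus one far-away point.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), [[8.0, 8.0]]])

scores = knn_scores(X, k=5)
most_anomalous = int(np.argmax(scores))
print(most_anomalous, X[most_anomalous])
```

A higher score means the point is more isolated. Where to draw the line between ‘normal’ and ‘anomalous’ scores is left open here, since it depends on the data.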
Support Vector Machines (SVM)
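This one is a classifier-based method. Under the hood, it looks for the boundary (the ‘classifying line’, or a hyperplane in higher dimensions) that separates the classes with the widest possible margin, then checks which side of that boundary a point lands on. Standard SVMs are a supervised technique, so they need labeled examples. There is also a one-class variant that learns a boundary around unlabeled ‘normal’ data, and that is the form usually applied to anomaly detection. According to Wikipedia, SVMs are highly reliable for predictions, so they are viable for anomaly detection as well.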
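For anomaly detection specifically, scikit-learn offers a one-class variant, `OneClassSVM`, which is fit on unlabeled data assumed to be mostly normal. The sketch below is a rough illustration of that variant rather than a standard two-class SVM; the synthetic data and the `nu=0.05` setting (roughly, the fraction of training points allowed to fall outside the boundary) are assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic 'normal' data centered on the origin.
rng = np.random.default_rng(2)
normal_data = rng.normal(0.0, 1.0, size=(200, 2))

# Learn a boundary around the bulk of the training data.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal_data)

# One point near the center of the data, one far outside it.
new_points = np.array([[0.1, -0.2], [6.0, 6.0]])

# predict() returns +1 for points inside the boundary and -1 for points outside.
labels = model.predict(new_points)
print(labels)
```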
The Abstract
The same sorts of tools that are used to make predictions are also used to identify anomalies. I’m not sure of the precise setup of how this would work in code, as I’m just learning the theory at the moment and am not putting this into practice in lock-step with what topics I’m reading about.
I can definitely say that all 3 of the methods above are used as both techniques of prediction, as well as anomaly detection. For example, if you’ve applied a k-means clustering technique, the same model that will allow you to know if a point falls within a specific cluster will also reveal how far away from the centroid that point is.
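As a small illustration of that dual use, the sketch below asks one fitted scikit-learn `KMeans` model for both a cluster prediction and a distance to that cluster’s centroid for new incoming points. The training data and the two incoming points are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic training data with two well-separated clusters.
rng = np.random.default_rng(3)
train = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(50, 2)),
    rng.normal([10.0, 10.0], 0.5, size=(50, 2)),
])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train)

# Two hypothetical incoming points: one typical, one well outside its cluster.
incoming = np.array([[0.2, 0.1], [10.0, 14.0]])

# Prediction: which cluster does each point belong to?
cluster_ids = model.predict(incoming)

# Anomaly signal: how far is each point from its assigned centroid?
all_distances = model.transform(incoming)
dist_to_own_centroid = all_distances[np.arange(len(incoming)), cluster_ids]

for point, cid, dist in zip(incoming, cluster_ids, dist_to_own_centroid):
    print(point, "cluster", cid, "distance", round(float(dist), 2))
```

Both points get a cluster assignment, but the second one sits much farther from its centroid than the first, which is the signal an anomaly check would look at.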
Conclusion
The art of detecting anomalies in a dataset is a crucial aspect of machine learning, lending itself to improving model prediction and classification accuracy. Understanding the nature and type of anomalies serves as a key first step in the process.
K-means clustering, K-nearest neighbors, and support vector machines can each be used to make predictions about a given data set and to find anomalous data, by establishing what standard data looks like and what lies outside of that expectation.