Outlier Detection with K-Means Clustering
This step first partitions the data points into \(k\) clusters by applying the K-Means Clustering algorithm. Then, a distance ratio
\[\frac{dist(Y_o, c_o)}{\bar{c}_o}\]
for each data object is calculated, where \(Y_o\) is the data object, \(c_{o}\) is the center of the cluster to which \(Y_o\) belongs, and \(\bar{c}_o\) is the average distance between the data objects in that cluster and the center \(c_{o}\). The larger the ratio, the farther the data object lies from its cluster center relative to the other members of the same cluster. Finally, if the calculated ratio is above a pre-defined threshold, the data object is identified as an outlier.
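As a small worked illustration of the ratio, assume \(C_o\) denotes the set of data objects in the cluster of \(Y_o\) (this symbol is introduced here for illustration only). The average distance can then be written as
\[\bar{c}_o = \frac{1}{|C_o|} \sum_{Y \in C_o} dist(Y, c_o)\]
For a hypothetical cluster of four objects whose distances to the center are 1, 1, 1 and 5, this gives
\[\bar{c}_o = \frac{1 + 1 + 1 + 5}{4} = 2, \qquad \frac{5}{2} = 2.5, \qquad \frac{1}{2} = 0.5\]
With an assumed threshold of 2, only the object at distance 5 (ratio 2.5) would be marked as an outlier; the other three (ratio 0.5) would not.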
Input Parameters
- Data samples abstracted in an \(n\)-dimensional feature space
- Specified number of clusters
- A pre-defined threshold for the distance ratios
Output Parameters
- Original data with outliers marked
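The step described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind this page: the function names `kmeans` and `mark_outliers`, the `seed` and `n_iter` parameters, the random initialization, and the use of Euclidean distance are all assumptions made for the example. It uses only the standard library (Python 3.8 or later for `math.dist`).

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm (illustrative). Returns (centers, labels)."""
    rng = random.Random(seed)
    # Assumption: initialize the centers with k randomly chosen data points.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return centers, labels


def mark_outliers(points, k, threshold, seed=0):
    """Return a list of (point, is_outlier) pairs based on the distance ratio."""
    centers, labels = kmeans(points, k, seed=seed)
    # Distance of every point to the center of its own cluster.
    dists = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]
    # Average distance to the center, computed per cluster.
    mean_dist = []
    for j in range(k):
        d = [x for x, lab in zip(dists, labels) if lab == j]
        mean_dist.append(sum(d) / len(d) if d else 0.0)
    marked = []
    for p, d, lab in zip(points, dists, labels):
        # Assumption: if the average distance is zero (e.g. a one-point
        # cluster), the ratio is treated as 0 and the point is not flagged.
        ratio = d / mean_dist[lab] if mean_dist[lab] > 0 else 0.0
        marked.append((p, ratio > threshold))
    return marked


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
    for point, is_outlier in mark_outliers(data, k=1, threshold=2.0):
        print(point, "outlier" if is_outlier else "ok")
```

The three arguments of `mark_outliers` correspond to the three input parameters listed above, and the returned pairs correspond to the original data with outliers marked. Because the result of K-Means depends on the initial centers, different seeds may yield different clusters, and therefore different outliers, whenever \(k > 1\).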
Workflow
Algorithm
References
- J. Han, M. Kamber and J. Pei, Data Mining - Concepts and Techniques, 3rd ed., Amsterdam: Morgan Kaufmann Publishers, 2012.