Outlier Detection with K-Means Clustering
This step first partitions the data points into k clusters using the K-Means clustering algorithm. Then, for each data object, a distance ratio
$$\frac{\mathrm{dist}(Y_o, c_o)}{\bar{c}_o}$$
is calculated, where $Y_o$ is the data object, $c_o$ is the center of the cluster to which $Y_o$ belongs, and $\bar{c}_o$ is the average distance between the data objects in that cluster and the center $c_o$. The larger the ratio, the farther the data object lies from its cluster center relative to the other members of the cluster. Finally, if the ratio exceeds a pre-defined threshold, the data object is identified as an outlier.
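The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind this step: it assumes Euclidean distance for dist, uses plain Lloyd iterations with random initial centers, and treats the ratio as 0 when the average distance of a cluster is 0. The names `kmeans`, `mark_outliers`, `iterations`, and `seed` are hypothetical.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm on a list of tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k random data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    return centers, labels


def mark_outliers(points, k, threshold, seed=0):
    """Return one boolean per point: True if its distance ratio exceeds threshold."""
    centers, labels = kmeans(points, k, seed=seed)
    # Distance of every point to the center of its own cluster.
    dists = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]
    # Average member-to-center distance for each cluster.
    mean_dist = []
    for j in range(k):
        cluster = [d for d, lab in zip(dists, labels) if lab == j]
        mean_dist.append(sum(cluster) / len(cluster) if cluster else 0.0)
    flags = []
    for d, lab in zip(dists, labels):
        # Assumption: a cluster whose members all sit on the center has ratio 0.
        ratio = d / mean_dist[lab] if mean_dist[lab] > 0 else 0.0
        flags.append(ratio > threshold)
    return flags


if __name__ == "__main__":
    # Two tight groups plus one point far away from both.
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
            (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0),
            (4.0, 20.0)]
    for point, is_outlier in zip(data, mark_outliers(data, k=2, threshold=2.0)):
        print(point, "outlier" if is_outlier else "")
```

Because the ratio is normalized by the average distance within each cluster, the same threshold can be applied to tight and loose clusters alike. The result still depends on the chosen k and on the initial centers, so in practice a library K-Means implementation with multiple restarts may be preferable to this sketch.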
Input Parameters
- Data samples represented in an n-dimensional feature space
- Specified number of clusters
- A pre-defined threshold for the distance ratios
Output Parameters
- Original data with outliers marked
Workflow
Algorithm
References
- J. Han, M. Kamber and J. Pei, Data Mining - Concepts and Techniques, 3rd ed., Amsterdam: Morgan Kaufmann Publishers, 2012.