Outlier Detection with K-Means Clustering

This step first partitions the data points into k clusters using the K-Means Clustering algorithm. Then a distance ratio

$$\frac{\mathrm{dist}(Y_o, c_o)}{\bar{c}_o}$$

is calculated for each data object, where $Y_o$ is the data object, $c_o$ is the center of the cluster to which $Y_o$ belongs, and $\bar{c}_o$ is the average distance between the data objects in that cluster and the center $c_o$. The larger the ratio, the farther the data object lies from its cluster center relative to the other members of that cluster. Finally, if the calculated ratio is above a pre-defined threshold, the data object is identified as an outlier.
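The procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used by this step: the function names `kmeans` and `mark_outliers` and the parameters `n_iter` and `seed` are hypothetical, Euclidean distance is assumed for dist (the description does not name a metric), initial centers are assumed to be drawn at random from the data, and a cluster whose average distance is zero is assumed to contain no outliers.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's K-Means (illustrative); returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random initial centers
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # assignments no longer change anything
            break
        centers = new_centers
    return centers, labels


def mark_outliers(points, k, threshold, seed=0):
    """Return (ratios, flags): the distance ratio and outlier flag per point."""
    points = [tuple(p) for p in points]
    centers, labels = kmeans(points, k, seed=seed)
    # Distance of every point to the center of its own cluster.
    dists = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]
    ratios = []
    for d, lab in zip(dists, labels):
        cluster_dists = [x for x, other in zip(dists, labels) if other == lab]
        mean_dist = sum(cluster_dists) / len(cluster_dists)
        # Assumption: a zero average distance means no point can be an outlier.
        ratios.append(d / mean_dist if mean_dist > 0 else 0.0)
    flags = [r > threshold for r in ratios]
    return ratios, flags


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11), (15, 15)]
    ratios, flags = mark_outliers(data, k=2, threshold=2.0)
    for point, ratio, flag in zip(data, ratios, flags):
        print(point, round(ratio, 2), "outlier" if flag else "")
```

In this toy data set the point (15, 15) ends up in the same cluster as the four points around (10, 10), and its ratio of about 2.4 is the only one above the threshold of 2.0. The threshold value here is arbitrary; a suitable value depends on the data.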

Input Parameters

  1. Data samples abstracted in an n-dimensional feature space
  2. Specified number of clusters
  3. A pre-defined threshold for the distance ratios

Output Parameters

  1. Original data with outliers marked

Workflow

[Figure: workflow diagram (workflow34.svg)]

Algorithm

K-Means Clustering

References

  • J. Han, M. Kamber and J. Pei, Data Mining - Concepts and Techniques, 3rd ed., Amsterdam: Morgan Kaufmann Publishers, 2012.