Outlier Detection with K-Means Clustering

This step first partitions the data points into \(k\) clusters by applying the K-Means Clustering algorithm. Then, a distance ratio

\[\frac{dist(Y_o, c_o)}{\bar{c}_o}\]

for each data object is calculated, where \(Y_o\) is the data object, \(c_{o}\) is the center of the cluster to which \(Y_o\) belongs, and \(\bar{c}_o\) is the average distance between the data objects in that cluster and the center \(c_{o}\). The larger the ratio, the farther the data object lies from its cluster center relative to the other members of that cluster. Finally, if the calculated ratio is above a pre-defined threshold, the data object is identified as an outlier.
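The ratio and the threshold test can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation behind this step: the function names `distance_ratios` and `mark_outliers`, the sample data and the threshold value 1.5 are assumptions made for the example, and the cluster labels and centers are taken as given (for instance from an earlier K-Means run). Returning a ratio of 0.0 for a cluster whose average distance is zero is likewise an assumption.

```python
import math


def distance_ratios(points, labels, centers):
    """Return dist(Y_o, c_o) / (mean distance to c_o) for every point."""
    # Distance from each point to the center of its own cluster.
    dists = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]

    # Average point-to-center distance, computed per cluster.
    totals = [0.0] * len(centers)
    counts = [0] * len(centers)
    for d, lab in zip(dists, labels):
        totals[lab] += d
        counts[lab] += 1
    means = [t / c if c else 0.0 for t, c in zip(totals, counts)]

    # Assumption: if every member coincides with the center (mean 0),
    # report 0.0 instead of dividing by zero.
    return [d / means[lab] if means[lab] > 0 else 0.0
            for d, lab in zip(dists, labels)]


def mark_outliers(points, labels, centers, threshold):
    """Flag each point whose distance ratio exceeds the threshold."""
    ratios = distance_ratios(points, labels, centers)
    return [r > threshold for r in ratios]


# Hypothetical data: one cluster whose centroid is the origin.
points = [(1, 0), (-1, 0), (0, 1), (0, -1), (4, 0), (-4, 0)]
labels = [0, 0, 0, 0, 0, 0]
centers = [(0.0, 0.0)]

ratios = distance_ratios(points, labels, centers)
# ratios == [0.5, 0.5, 0.5, 0.5, 2.0, 2.0]  (mean distance is 12 / 6 = 2)
flags = mark_outliers(points, labels, centers, threshold=1.5)
# flags == [False, False, False, False, True, True]
```

Because the ratio is normalized by the average distance within each cluster, the same threshold can be applied to compact and to spread-out clusters alike.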

Input Parameters

  1. Data samples abstracted in an \(n\)-dimensional feature space
  2. Specified number of clusters
  3. A pre-defined threshold for the distance ratios

Output Parameters

  1. Original data with outliers marked

Workflow

[Figure: workflow diagram (workflow34.svg)]

Algorithm

K-Means Clustering
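For orientation, the clustering step can be sketched as a minimal version of Lloyd's algorithm, the standard iterative form of K-Means. The page does not state which K-Means variant or initialization is used, so the details here are assumptions: the function name `kmeans`, the `max_iter` and `seed` parameters, the initialization from randomly sampled data points, and the handling of empty clusters are all illustrative choices.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Minimal Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k data points sampled without replacement.
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers would not move
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
```

The returned labels and centers are the quantities needed to compute the distance ratio for each data object. Which group receives label 0 depends on the random initialization.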

References

  • J. Han, M. Kamber and J. Pei, Data Mining - Concepts and Techniques, 3rd ed., Amsterdam: Morgan Kaufmann Publishers, 2012.