Outlier Detection with K-Means Clustering

This step first partitions the data points into $k$ clusters using the K-Means clustering algorithm. Then, for each data object, the distance ratio

$\frac{dist(Y_o, c_o)}{\bar{c}_o}$

is calculated, where $Y_o$ is the data object, $c_o$ is the center of the cluster to which $Y_o$ belongs, and $\bar{c}_o$ is the average distance between the data objects in that cluster and the center $c_o$. The larger the ratio, the farther the data object lies from its cluster center relative to the other members of that cluster. Finally, a data object is identified as an outlier if its ratio exceeds a pre-defined threshold.
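To make the ratio concrete, the average distance can be written out explicitly. The notation $C_o$ for the set of data objects in the cluster of $Y_o$ is introduced here for illustration only and is not part of the original description; the sketch also assumes that $Y_o$ itself is counted in the average:

$\bar{c}_o = \frac{1}{|C_o|} \sum_{Y \in C_o} dist(Y, c_o)$

For example, if a cluster holds four data objects at distances 1, 1, 2 and 8 from its center, then $\bar{c}_o = 12/4 = 3$ and the ratios are $1/3$, $1/3$, $2/3$ and $8/3 \approx 2.67$. With a threshold of 2, only the last data object would be marked as an outlier.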

Input Parameters

Data samples abstracted in an n-dimensional feature space
Specified number of clusters
A pre-defined threshold for the distance ratios

Output Parameters

Original data with outliers marked
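The inputs and output listed above can be tied together in a minimal, self-contained sketch. It is not the implementation behind this step: the function names `kmeans` and `mark_outliers` are hypothetical, the clustering is a plain Lloyd's iteration written with the Python standard library only, and Euclidean distance is assumed for $dist$ because the description does not name a metric.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm (illustrative); returns (centers, labels)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k randomly chosen data points.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
    return centers, labels


def mark_outliers(points, k, threshold, seed=0):
    """Return one (point, cluster, ratio, is_outlier) tuple per point."""
    centers, labels = kmeans(points, k, seed=seed)
    # Distance of every point to the center of its own cluster.
    dists = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]
    # Average distance to the center, computed separately per cluster.
    mean_dist = []
    for j in range(k):
        d = [x for x, lab in zip(dists, labels) if lab == j]
        mean_dist.append(sum(d) / len(d) if d else 0.0)
    marked = []
    for p, lab, d in zip(points, labels, dists):
        avg = mean_dist[lab]
        # A cluster whose points all coincide has avg == 0; use ratio 0 there.
        ratio = d / avg if avg > 0 else 0.0
        marked.append((p, lab, ratio, ratio > threshold))
    return marked


# Hypothetical data: two tight groups plus one distant point.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
        (8.0, 8.0), (9.0, 8.0), (8.0, 9.0), (9.0, 9.0),
        (20.0, 0.0)]
for point, cluster, ratio, is_outlier in mark_outliers(data, k=2, threshold=2.0):
    print(point, cluster, round(ratio, 2), "outlier" if is_outlier else "")
```

Because K-Means depends on its initial centers, the clusters, and therefore the ratios, can change with the `seed` value. The threshold is data dependent and would normally be tuned rather than fixed at the value used here.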

Workflow

Algorithm

K-Means Clustering

References

J. Han, M. Kamber and J. Pei, Data Mining - Concepts and Techniques, 3rd ed., Amsterdam: Morgan Kaufmann Publishers, 2012.
