# Outlier Detection with K-Means Clustering

This step first partitions the data points into $$k$$ clusters by applying the K-Means Clustering algorithm. Then, a distance ratio

$\frac{dist(Y_o, c_o)}{\bar{c}_o}$

is calculated for each data object, where $$Y_o$$ is the data object, $$c_{o}$$ is the center of the cluster to which $$Y_o$$ belongs, and $$\bar{c}_o$$ is the average distance between the center $$c_{o}$$ and all data objects in that cluster. The larger the ratio, the farther the data object lies from its cluster center relative to the other members of that cluster. Finally, if the ratio exceeds the pre-defined threshold, the data object is identified as an outlier.
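As a purely illustrative calculation (the numbers are hypothetical and not taken from the reference), consider a one-dimensional cluster with the members $$-1, -1, -1, -1, 4$$. Its center is $$c_o = 0$$, so the distances to the center are $$1, 1, 1, 1, 4$$ and their average is

$\bar{c}_o = \frac{1 + 1 + 1 + 1 + 4}{5} = 1.6$

The member at $$4$$ then has the ratio

$\frac{dist(Y_o, c_o)}{\bar{c}_o} = \frac{4}{1.6} = 2.5$

while each of the other members has the ratio $$\frac{1}{1.6} = 0.625$$. With an assumed threshold of $$2$$, only the member at $$4$$ would be identified as an outlier.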

Input Parameters

1. Data samples abstracted in an n-dimensional feature space
2. Specified number of clusters
3. A pre-defined threshold for the distance ratios

Output Parameters

1. Original data with outliers marked
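The inputs and the output listed above can be tied together in a short Python sketch. It is only a minimal illustration under several assumptions that the description does not state: the distance is taken to be Euclidean, the initial centers are simply the first `k` samples (practical implementations usually use random restarts or k-means++), and the function names `kmeans` and `mark_outliers`, the sample data, and the threshold `2.0` are hypothetical.

```python
import math


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm on a list of equal-length tuples.

    Assumption: the first k points serve as initial centers, which keeps
    the sketch deterministic but is not a robust initialization.
    """
    centers = list(points[:k])
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


def mark_outliers(points, k, threshold):
    """Return one (point, cluster, ratio, is_outlier) tuple per input point."""
    centers, labels = kmeans(points, k)
    # Distance of every point to the center of its own cluster.
    dists = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]
    # Average distance to the center, computed per cluster.
    mean_dist = []
    for j in range(k):
        in_cluster = [d for d, lab in zip(dists, labels) if lab == j]
        mean_dist.append(sum(in_cluster) / len(in_cluster) if in_cluster else 0.0)
    result = []
    for p, lab, d in zip(points, labels, dists):
        avg = mean_dist[lab]
        # Guard against clusters whose members all coincide with the center.
        ratio = d / avg if avg > 0 else 0.0
        result.append((p, lab, ratio, ratio > threshold))
    return result


# Hypothetical data: two compact groups plus one point far from its group.
points = [
    (0, 0), (10, 10), (0, 1), (1, 0), (1, 1),
    (10, 11), (11, 10), (11, 11), (14, 14),
]
marked = mark_outliers(points, k=2, threshold=2.0)
print([p for p, _, _, is_outlier in marked if is_outlier])  # [(14, 14)]
```

The return value keeps every original sample and attaches the cluster index, the distance ratio, and the outlier flag, which corresponds to "original data with outliers marked". Because the ratio is normalized by each cluster's own average distance, a single threshold can be applied to clusters of different spread.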

Workflow

Algorithm

K-Means Clustering

References

• J. Han, M. Kamber and J. Pei, Data Mining - Concepts and Techniques, 3rd ed., Amsterdam: Morgan Kaufmann Publishers, 2012.