Clustering high-dimensional data

Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional data spaces are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and in the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the vocabulary.
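To illustrate why text data becomes high-dimensional, the following minimal sketch builds word-frequency vectors for a few toy documents. The function name `term_frequency_vectors`, the whitespace tokenization, and the sample sentences are assumptions made for this example only; real text pipelines typically use more careful tokenization.

```python
from collections import Counter


def term_frequency_vectors(documents):
    """Return (vocabulary, vectors), with one count per vocabulary word."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the set of distinct words across all documents.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One dimension per vocabulary word, zero if the word is absent.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical toy corpus.
docs = [
    "clusters in high dimensional data",
    "data with many dimensions",
    "text documents as word vectors",
]
vocabulary, vectors = term_frequency_vectors(docs)
print(len(vocabulary), "dimensions per document vector")
```

Even three short sentences yield more than a dozen dimensions, and most entries of each vector are zero; a realistic vocabulary makes the vectors far longer and sparser.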

Four problems need to be overcome for clustering in high-dimensional data:

Recent research indicates that the discrimination problems only occur when there is a high number of irrelevant dimensions, and that shared-nearest-neighbor approaches can improve results.
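As a rough illustration of the shared-nearest-neighbor idea, the sketch below scores two points by how many of their k nearest neighbors they have in common, instead of using their direct distance. This is one simplified formulation chosen for the example, not necessarily the one used in the research referred to above; the function names, the choice of Euclidean distance, and the sample points are assumptions.

```python
import math


def k_nearest(points, index, k):
    """Indices of the k points closest to points[index] (Euclidean distance)."""
    distances = [
        (math.dist(points[index], other), j)
        for j, other in enumerate(points)
        if j != index
    ]
    distances.sort()
    return {j for _, j in distances[:k]}


def snn_similarity(points, i, j, k):
    """Number of neighbors shared by the k-nearest-neighbor sets of i and j."""
    return len(k_nearest(points, i, k) & k_nearest(points, j, k))


# Hypothetical data: two well-separated groups of three points each.
points = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
    (5.0, 5.0, 5.0), (5.1, 5.0, 5.0), (5.0, 5.1, 5.0),
]
same_group = snn_similarity(points, 0, 1, k=2)
different_groups = snn_similarity(points, 0, 3, k=2)
```

Points from the same group share a neighbor, while points from different groups share none, so the score depends on the ranking of neighbors and not on the absolute distance values.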

Approaches to clustering in axis-parallel or arbitrarily oriented affine subspaces differ in how they interpret the overall goal, which is to find clusters in data with high dimensionality. A fundamentally different approach is to find clusters based on patterns in the data matrix, often referred to as biclustering, a technique frequently used in bioinformatics.

The image on the right shows a mere two-dimensional space where a number of clusters can be identified. In the one-dimensional subspaces, the clusters $c_{a}$ (in subspace $\{x\}$) and $c_{b}$, $c_{c}$, $c_{d}$ (in subspace $\{y\}$) can be found. $c_{c}$ cannot be considered a cluster in a two-dimensional (sub-)space, since it is too sparsely distributed along the $x$ axis. In two dimensions, the two clusters $c_{ab}$ and $c_{ad}$ can be identified.
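The situation of $c_{c}$, dense in the subspace $\{y\}$ but too sparse along $x$ to form a two-dimensional cluster, can be reproduced with a small grid-based density check. This is only a sketch under stated assumptions: the function `dense_cells`, the cell width, the density threshold, and the sample points are hypothetical and are not taken from the figure.

```python
from collections import Counter


def dense_cells(points, dims, width, min_points):
    """Grid cells in the subspace spanned by dims holding >= min_points points."""
    counts = Counter(
        # Project each point onto the chosen dimensions and bin it.
        tuple(int(p[d] // width) for d in dims)
        for p in points
    )
    return {cell for cell, n in counts.items() if n >= min_points}


# Hypothetical data: y values are tightly grouped, x values are spread out.
points = [(0.5, 2.1), (3.5, 2.2), (6.5, 2.3), (9.5, 2.4)]

# Dimension index 1 is y; indices (0, 1) span the full two-dimensional space.
dense_in_y = dense_cells(points, dims=(1,), width=1.0, min_points=4)
dense_in_xy = dense_cells(points, dims=(0, 1), width=1.0, min_points=4)
```

With these values, all four points fall into a single cell when projected onto $y$, but each occupies its own cell in the full space, so the group counts as dense only in the one-dimensional subspace.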

...
Wikipedia