Refining Clusters in High-dimensional Text Data | Center for Big Data Analytics

Abstract: The k-means algorithm with cosine similarity, also known as the spherical k-means algorithm, is a popular method for clustering document collections. However, spherical k-means can often yield qualitatively poor results, especially for small clusters, say 25-30 documents per cluster, where it tends to get stuck at a local maximum far away from the optimal. In this paper, we present the first-variation principle that refines a given clustering by incrementally moving data points between clusters, thus achieving a higher objective function value. Combining first-variation with spherical k-means yields a powerful ping-pong strategy that often qualitatively improves k-means clustering. We present several experimental results to show that our proposed method works well in clustering high-dimensional and sparse text data.

Topics:
Data Clustering

Download: pdf

Citation

Refining Clusters in High-dimensional Text Data (pdf, software)
I. Dhillon, Y. Guan, J. Kogan.
In SIAM International Conference on Data Mining (SDM), April 2002.

Bibtex:
@inproceedings{dhillon2002refiningc, author = "Inderjit S. Dhillon AND Yuqiang Guan AND J. Kogan", title = "Refining Clusters in High-dimensional Text Data", booktitle = "SIAM International Conference on Data Mining (SDM)", year = "2002", month = "apr", abstract = "The k-means algorithm with cosine similarity, also known as the spherical k-means algorithm, is a popular method for clustering document collections. However, spherical k-means can often yield qualitatively poor results, especially for small clusters, say 25-30 documents per cluster, where it tends to get stuck at a local maximum far away from the optimal. In this paper, we present the first-variation principle that refines a given clustering by incrementally moving data points between clusters, thus achieving a higher objective function value. Combining first-variation with spherical k-means yields a powerful ping-pong strategy that often qualitatively improves k-means clustering. We present several experimental results to show that our proposed method works well in clustering high-dimensional and sparse text data." }