Information Theoretic Clustering of Sparse Co-Occurrence Data

Abstract: A novel approach to clustering co-occurrence data poses it as an optimization problem in information theory which minimizes the resulting loss in mutual information. A divisive clustering algorithm that monotonically reduces this loss function was recently proposed. In this paper we show that sparse high-dimensional data presents special challenges which can result in the algorithm getting stuck at poor local minima. We propose two solutions to this problem: (a) a “prior” to overcome infinite relative entropy values as in the supervised Naive Bayes algorithm, and (b) local search to escape local minima. Finally, we combine these solutions to get a robust algorithm that is computationally efficient. We present experimental results to show that the proposed method is effective in clustering document collections and outperforms previous information-theoretic clustering approaches.
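The "infinite relative entropy" problem mentioned in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: the function names, the toy distributions, and the uniform-mixing prior with weight `alpha` are all assumptions chosen for illustration. It only shows why a sparse document whose word is absent from a cluster distribution yields an infinite KL divergence, and how a smoothing prior of some kind makes the value finite.

```python
import math


def kl_divergence(p, q):
    """Return KL(p || q) in bits; infinite if q is zero where p is positive."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log2(pi / qi)
    return total


def smooth(q, alpha):
    """Mix q with a uniform distribution (hypothetical prior; alpha is assumed)."""
    m = len(q)
    return [(qi + alpha / m) / (1.0 + alpha) for qi in q]


# Toy word distributions over a 4-word vocabulary (made-up values).
doc = [0.5, 0.5, 0.0, 0.0]      # sparse document distribution
cluster = [0.0, 0.6, 0.4, 0.0]  # cluster distribution missing word 0

kl_raw = kl_divergence(doc, cluster)
kl_smoothed = kl_divergence(doc, smooth(cluster, alpha=0.1))

print("without prior:", kl_raw)
print("with prior:   ", kl_smoothed)
```

With the raw cluster distribution the document can never be assigned to that cluster, since its divergence is infinite; after smoothing, the divergence is finite and the assignment becomes possible. The prior actually used in the paper may differ from this uniform mixture.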

Topics:
Data Clustering

Download: pdf

Citation

Information Theoretic Clustering of Sparse Co-Occurrence Data (pdf, software)
I. Dhillon, Y. Guan.
In IEEE International Conference on Data Mining (ICDM), pp. 517-520, November 2003.

Bibtex:
@inproceedings{dhillon2003informatio,
  author = "Inderjit S. Dhillon AND Yuqiang Guan",
  title = "Information Theoretic Clustering of Sparse Co-Occurrence Data",
  booktitle = "IEEE International Conference on Data Mining (ICDM)",
  pages = "517--520",
  year = "2003",
  month = "nov",
  abstract = "A novel approach to clustering co-occurrence data poses it as an optimization problem in information theory which minimizes the resulting loss in mutual information. A divisive clustering algorithm that monotonically reduces this loss function was recently proposed. In this paper we show that sparse high-dimensional data presents special challenges which can result in the algorithm getting stuck at poor local minima. We propose two solutions to this problem: (a) a “prior” to overcome infinite relative entropy values as in the supervised Naive Bayes algorithm, and (b) local search to escape local minima. Finally, we combine these solutions to get a robust algorithm that is computationally efficient. We present experimental results to show that the proposed method is effective in clustering document collections and outperforms previous information-theoretic clustering approaches."
}

Center for Big Data Analytics
