Information-Theoretic Co-clustering | Center for Big Data Analytics

Abstract: Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory — the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. We present an innovative co-clustering algorithm that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages. Using the practical example of simultaneous word-document clustering, we demonstrate that our algorithm works well in practice, especially in the presence of sparsity and high-dimensionality.

Topics:
Co-Clustering

Download: pdf

Citation

Information-Theoretic Co-clustering (pdf, software)
I. Dhillon, S. Mallela, D. Modha.
In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 89-98, August 2003.

Bibtex:
@inproceedings{dhillon2003informatio, author = "Inderjit S. Dhillon AND Subramanyam Mallela AND Dharmendra S. Modha", title = "Information-Theoretic Co-clustering", booktitle = "ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)", page = "89–98", year = "2003", month = "aug", abstract = "Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory — the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. We present an innovative co-clustering algorithm that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages. Using the practical example of simultaneous word-document clustering, we demonstrate that our algorithm works well in practice, especially in the presence of sparsity and high-dimensionality." }