Abstract: Most clustering algorithms produce a single clustering for a given dataset even when the data can be clustered
naturally in multiple ways. In this paper, we address the difficult problem of uncovering disparate clusterings from the data in
a totally unsupervised manner. We propose two new approaches for this problem. In the first approach, we aim to find good
clusterings of the data that are also decorrelated with one another. To this end, we give a new and tractable characterization of
decorrelation between clusterings, and present an objective function to capture it. We provide an iterative “decorrelated” k-means
type algorithm to minimize this objective function. In the second approach, we model the data as a sum of mixtures and associate
each mixture with a clustering. This approach leads us to the problem of learning a convolution of mixture distributions. Though
the latter problem can be formulated as one of factorial learning [8,13,16], the existing formulations and methods do not perform
well on many real high-dimensional datasets. We propose a new regularized factorial-learning framework that is more suitable
for capturing the notion of disparate clusterings in modern, high-dimensional datasets. Furthermore, we provide kernelized
version of both of our algorithms. The resulting algorithms do well in uncovering multiple clusterings, and are much improved
over existing methods. We evaluate our methods on two real-world datasets—a music dataset from the text-mining domain, and
a portrait dataset from the computer-vision domain. Our methods achieve a substantially higher accuracy than existing factorial
learning as well as traditional clustering algorithms.
- Topics:
- Data Clustering
Download: pdf
Citation
- Simultaneous Unsupervised Learning of Disparate Clusterings (pdf, software)
P. Jain, R. Meka, I. Dhillon.
Statistical Analysis and Data Mining 1(3), pp. 195-210, November 2008.
Bibtex: