Abstract: Most clustering algorithms produce a single clustering for
a given data set even when the data can be clustered naturally
in multiple ways. In this paper, we address the difficult
problem of uncovering disparate clusterings from the data
in a totally unsupervised manner. We propose two new approaches
for this problem. In the first approach we aim to
find good clusterings of the data that are also decorrelated
with one another. To this end, we give a new and tractable
characterization of decorrelation between clusterings, and
present an objective function to capture it. We provide an
iterative “decorrelated” k-means type algorithm to minimize
this objective function. In the second approach, we model
the data as a sum of mixtures and associate each mixture
with a clustering. This approach leads us to the problem of
learning a convolution of mixture distributions. Though the
latter problem can be formulated as one of factorial learning
[8, 13, 16], the existing formulations and methods do not
perform well on many real high-dimensional data sets. We
propose a new regularized factorial learning framework that
is more suitable for capturing the notion of disparate clusterings
in modern, high-dimensional data sets. The resulting
algorithm does well in uncovering multiple clusterings, and
is much improved over existing methods. We evaluate our
methods on two real-world data sets – a music data set from
the text mining domain, and a portrait data set from the computer
vision domain. Our methods achieve a substantially
higher accuracy than existing factorial learning as well as
traditional clustering algorithms.
- Topics:
- Data Clustering
Download: pdf
Citation
- Simultaneous Unsupervised Learning of Disparate Clusterings (pdf, software)
P. Jain, R. Meka, I. Dhillon.
In IEEE International Conference on Data Mining (ICDM), pp. 858-869, April 2008.
Bibtex: