Hunting for Coherent Co-clusters in High Dimensional and Noisy Datasets

Meghana Deodhar, Hyuk Cho, Gunjan Gupta, Joydeep Ghosh, Inderjit Dhillon

Abstract: Clustering problems often involve datasets where only a part of the data is relevant to the problem, e.g., in microarray data analysis only a subset of the genes show cohesive expressions within a subset of the conditions/features. The existence of a large number of non-informative data points and features makes it challenging to hunt for coherent and meaningful clusters in such datasets. Additionally, since clusters could exist in different subspaces of the feature space, a co-clustering algorithm that simultaneously clusters objects and features is often more suitable than one that is restricted to traditional “one-sided” clustering. We propose Robust Overlapping Co-clustering (ROCC), a scalable and versatile framework that addresses the problem of efficiently mining dense, arbitrarily positioned, possibly overlapping co-clusters from large, noisy datasets. ROCC has several desirable properties that make it well suited to a number of real-life applications. Through extensive experimentation we show that our approach is significantly more accurate in identifying biologically meaningful co-clusters in microarray data than several other prominent approaches that have been applied to this task. We also point out other interesting applications of the proposed framework in solving difficult clustering problems.


Citation

  • Hunting for Coherent Co-clusters in High Dimensional and Noisy Datasets
    M. Deodhar, H. Cho, G. Gupta, J. Ghosh, I. Dhillon.
    In IEEE International Conference on Data Mining (ICDM), December 2008.
