Abstract: Clustering is one of the most fundamental tasks in data mining. To analyze complex real-world data emerging in many data-centric applications, the problem of non-exhaustive, overlapping clustering has been studied where the goal is to find overlapping clusters and also detect outliers simultaneously. We propose a novel convex semidefinite program (SDP) as a relaxation of the non-exhaustive, overlapping clustering problem. Although the SDP formulation enjoys attractive theoretical properties with respect to global optimization, it is computationally intractable for large problem sizes. As an alternative, we optimize a low-rank factorization of the solution. The resulting problem is non-convex, but has a smaller number of solution variables. We construct an optimization solver using an augmented Lagrangian methodology that enables us to deal with problems with tens of thousands of data points. The new solver provides more accurate and reliable answers than other approaches. By exploiting the connection between graph clustering objective functions and a kernel k-means objective, our new low-rank solver can also compute overlapping communities of social networks with state-of-the-art accuracy.
- Non-exhaustive, Overlapping Clustering via Low-Rank Semidefinite Programming (pdf, slides, software)
Y. Hou, J. Whang, D. Gleich, I. Dhillon.
In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 427–436, August 2015. (Oral)