Overlapping Community Detection in Massive Social Networks

Project Summary

Community detection is an important task in network analysis. A community (also referred to as a cluster) is a set of cohesive vertices that have more connections inside the set than outside. In many social and information networks, these communities naturally overlap. For instance, in a social network, each vertex in a graph corresponds to an individual who usually participates in multiple communities. In this project, we study the problem of overlapping community detection.

Project Description

1. Overlapping Community Detection Using Seed Set Expansion

We propose an efficient overlapping community detection algorithm using a seed set expansion approach. The key idea of our algorithm is to find good seeds, and then greedily expand these seeds based on a community metric. Within this seed expansion method, we investigate the problem of how to determine good seed nodes in a graph.

sse_pic

Overview: Our overlapping community detection algorithm involves neighborhood-based seeding and local expansion of a set of carefully chosen seeds. We name our algorithm NISE by abbreviating our main idea, Neighborhood-Inflated Seed Expansion.

Filtering phase removes regions of the graph that are trivially separable from the rest of the graph.
Seeding phase finds good seeds in the filtered graph.
Expansion phase expands the seed sets using a personalized PageRank clustering scheme.
Propagation phase further expands the communities to the regions that were removed in the filtering phase.

Two effective Seeding Strategies:

Graclus Centers: Centers of graph partitions
Spread Hubs: Independent set of high-degree vertice

Experimental Results (conductance vs. graph coverage): lower curve indicates better communities. NISE (colored lines) outperforms other state-of-the-art methods. Also, our new seeding strategies (blue and red lines) are better than existing strategies, and are thus effective in finding good overlapping communities in real-world networks.

2. Non-exhaustive, Overlapping k-means

The empirical success of the seed expansion based overlapping community detection motivates us to investigate the problem of non-exhaustive, overlapping clustering in a more fundamental and principled way. The result of our investigations is a novel improvement to the traditional k-means clustering objective, with easy-to-understand parameters that capture the degrees of overlap and non-exhaustiveness. By studying the objective, we are able to obtain a simple iterative algorithm which we call NEO-K-Means (Non-Exhaustive, Overlapping K-Means).

NEO-K-Means can correctly detect the outliers and find natural overlapping clustering structure in data clustering. Green points indicate overlap between the clusters, and black points indicate outliers.

karate

Overlapping Community Detection using NEO-K-Means: The traditional normalized cut-based graph clustering objective can be extended to the non-exhaustive, overlapping graph clustering setting, and this extended graph clustering objective is equivalent to the weighted kernel NEO-K-Means objective. This equivalence enables us to apply our NEO-K-Means algorithm to the overlapping community detection problem. On Zachary’s Karate Club network, NEO-K-Means is able to reveal the natural underlying overlapping structure of the network.

Experimental Results (average normalized cut): Lower normalized cut indicates better clustering. NEO-K-Means (red bars) achieves the lowest normalized cut on all the datasets.

3. Low-Rank Semidefinite Programming for NEO-K-Means

Summary: The iterative NEO-K-Means algorithm is fast, but suffers from the classic problem that iterative algorithms for k-means fall into local minimizers given poor initialization. For a more accurate solution, we study a convex relaxation of the NEO-K-Means objective, and also formulate a low-rank SDP for NEO-K-Means. We implement the scalable LRSDP algorithm, and also propose a series of initialization and rounding strategies that accelerate the convergence of our optimization procedures.

Experimental Results on Overlapping Community Detection: We compute AUC of conductance-vs-graph coverage (lower AUC indicates better communities). LRSDP produces the best quality communities in terms of AUC score.

Main Publications

Non-exhaustive, Overlapping Clustering via Low-Rank Semidefinite Programming (pdf, slides, software)
Y. Hou, J. Whang, D. Gleich, I. Dhillon.
In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 427–436, August 2015. (Oral)

Non-exhaustive, Overlapping k-means (pdf, software)
J. Whang, I. Dhillon, D. Gleich.
In SIAM International Conference on Data Mining (SDM), pp. 936–944, May 2015.

Overlapping Community Detection Using Seed Set Expansion (pdf, slides, software)
J. Whang, D. Gleich, I. Dhillon.
In ACM Conference on Information and Knowledge Management (CIKM), pp. 2099-2108, October 2013. (Oral)

Related Publications

Fast Multiplier Methods to Optimize Non-exhaustive, Overlapping Clustering (pdf, software)
Y. Hou, J. Whang, D. Gleich, I. Dhillon.
In SIAM International Conference on Data Mining (SDM), pp. 297–305, May 2016. (Oral)

Overlapping Community Detection Using Neighborhood-Inflated Seed Expansion (pdf, software)
J. Whang, D. Gleich, I. Dhillon.
IEEE Transactions on Knowledge and Data Engineering (TKDE) 28(5), pp. 1272–1284, 2016.

Scalable Data-driven PageRank: Algorithms, System Issues, and Lessons Learned (pdf, software)
J. Whang, A. Lenharth, I. Dhillon, K. Pingali.
In International European Conference on Parallel and Distributed Computing (Euro-Par), pp. 438–450, August 2015. (Oral)

Stochastic Blockmodel with Cluster Overlap, Relevance Selection, and Similarity-Based Smoothing (pdf, slides, software)
J. Whang, P. Rai, I. Dhillon.
In IEEE International Conference on Data Mining (ICDM), pp. 817 – 826, December 2013. (Oral)

Scalable and Memory-Efficient Clustering of Large-Scale Social Networks (pdf, slides, software)
J. Whang, X. Sui, I. Dhillon.
In IEEE International Conference on Data Mining (ICDM), pp. 705-714, December 2012. (Oral)

Parallel Clustered Low-rank Approximation of Graphs and Its Application to Link Prediction (pdf, software)
X. Sui, T. Lee, J. Whang, B. Savas, S. Jain, K. Pingali, I. Dhillon.
International Workshop on Languages and Compilers for Parallel Computing (LCPC), pp. 76-95, October 2012. (Oral)

Scalable Clustering of Signed Networks using Balance Normalized Cut (pdf, slides, software)
K. Chiang, J. Whang, I. Dhillon.
In ACM Conference on Information and Knowledge Management (CIKM), pp. 615-624, October 2012. (Oral)