Overlapping Community Detection in Massive Social Networks

Overlapping Community Detection in Massive Social Networks

Primary researcher(s): Joyce Whang, Inderjit Dhillon

Project Summary

Community detection is an important task in network analysis. A community (also referred to as a cluster) is a set of cohesive vertices that have more connections inside the set than outside. In many social and information networks, these communities naturally overlap. For instance, in a social network, each vertex in a graph corresponds to an individual who usually participates in multiple communities. In this project, we study the problem of overlapping community detection.

Project Description

1. Overlapping Community Detection Using Seed Set Expansion

We propose an efficient overlapping community detection algorithm using a seed set expansion approach. The key idea of our algorithm is to find good seeds, and then greedily expand these seeds based on a community metric. Within this seed expansion method, we investigate the problem of how to determine good seed nodes in a graph.

sse_pic

Overview: Our overlapping community detection algorithm involves neighborhood-based seeding and local expansion of a set of carefully chosen seeds. We name our algorithm NISE by abbreviating our main idea, Neighborhood-Inflated Seed Expansion.

  • Filtering phase removes regions of the graph that are trivially separable from the rest of the graph.
  • Seeding phase finds good seeds in the filtered graph.
  • Expansion phase expands the seed sets using a personalized PageRank clustering scheme.
  • Propagation phase further expands the communities to the regions that were removed in the filtering phase.

 Two effective Seeding Strategies:

  • Graclus Centers: Centers of graph partitions
  • Spread Hubs: Independent set of high-degree vertice

cond_cov

Experimental Results (conductance vs. graph coverage): lower curve indicates better communities. NISE (colored lines) outperforms other state-of-the-art methods. Also, our new seeding strategies (blue and red lines) are better than existing strategies, and are thus effective in finding good overlapping communities in real-world networks.

2. Non-exhaustive, Overlapping k-means

The empirical success of the seed expansion based overlapping community detection motivates us to investigate the problem of non-exhaustive, overlapping clustering in a more fundamental and principled way. The result of our investigations is a novel improvement to the traditional k-means clustering objective, with easy-to-understand parameters that capture the degrees of overlap and non-exhaustiveness. By studying the objective, we are able to obtain a simple iterative algorithm which we call NEO-K-Means (Non-Exhaustive, Overlapping K-Means).

neoNEO-K-Means can correctly detect the outliers and find natural overlapping clustering structure in data clustering. Green points indicate overlap between the clusters, and black points indicate outliers.

karate

Overlapping Community Detection using NEO-K-Means: The traditional normalized cut-based graph clustering objective can be extended to the non-exhaustive, overlapping graph clustering setting, and this extended graph clustering objective is equivalent to the weighted kernel NEO-K-Means objective. This equivalence enables us to apply our NEO-K-Means algorithm to the overlapping community detection problem. On Zachary’s Karate Club network, NEO-K-Means is able to reveal the natural underlying overlapping structure of the network.

ncut_plots

Experimental Results (average normalized cut): Lower normalized cut indicates better clustering. NEO-K-Means (red bars) achieves the lowest normalized cut on all the datasets.

3. Low-Rank Semidefinite Programming for NEO-K-Means

Summary: The iterative NEO-K-Means algorithm is fast, but suffers from the classic problem that iterative algorithms for k-means fall into local minimizers given poor initialization. For a more accurate solution, we study a convex relaxation of the NEO-K-Means objective, and also formulate a low-rank SDP for NEO-K-Means. We implement the scalable LRSDP algorithm, and also propose a series of initialization and rounding strategies that accelerate the convergence of our optimization procedures.

o1

Experimental Results on Overlapping Community Detection: We compute AUC of conductance-vs-graph coverage (lower AUC indicates better communities). LRSDP produces the best quality communities in terms of AUC score.

table

Main Publications