# Overlapping Community Detection in Massive Social Networks

### Project Summary

Community detection is an important task in network analysis. A community (also referred to as a cluster) is a set of cohesive vertices that have more connections inside the set than outside. In many social and information networks, these communities naturally overlap. For instance, in a social network, each vertex in a graph corresponds to an individual who usually participates in multiple communities. In this project, we study the problem of overlapping community detection.### Project Description

**1. Overlapping Community Detection Using Seed Set Expansion**

We propose an efficient overlapping community detection algorithm using a seed set expansion approach. The key idea of our algorithm is to find good seeds, and then greedily expand these seeds based on a community metric. Within this seed expansion method, we investigate the problem of how to determine good seed nodes in a graph.

* Overview:* Our overlapping community detection algorithm involves neighborhood-based seeding and local expansion of a set of carefully chosen seeds. We name our algorithm NISE by abbreviating our main idea, Neighborhood-Inflated Seed Expansion.

- Filtering phase removes regions of the graph that are trivially separable from the rest of the graph.
- Seeding phase finds good seeds in the filtered graph.
- Expansion phase expands the seed sets using a personalized PageRank clustering scheme.
- Propagation phase further expands the communities to the regions that were removed in the filtering phase.

** Two effective Seeding**** Strategies:**

- Graclus Centers: Centers of graph partitions
- Spread Hubs: Independent set of high-degree vertice

**Experimental Results (conductance vs. graph coverage):**** **lower curve indicates better communities. NISE (colored lines) outperforms other state-of-the-art methods. Also, our new seeding strategies (blue and red lines) are better than existing strategies, and are thus effective in finding good overlapping communities in real-world networks.

**2. Non-exhaustive, Overlapping k-means**

The empirical success of the seed expansion based overlapping community detection motivates us to investigate the problem of non-exhaustive, overlapping clustering in a more fundamental and principled way. The result of our investigations is a novel improvement to the traditional k-means clustering objective, with easy-to-understand parameters that capture the degrees of overlap and non-exhaustiveness. By studying the objective, we are able to obtain a simple iterative algorithm which we call NEO-K-Means (Non-Exhaustive, Overlapping K-Means).

* NEO-K-Means* can correctly detect the outliers and find natural overlapping clustering structure in data clustering. Green points indicate overlap between the clusters, and black points indicate outliers.

* Overlapping Community Detection using NEO-K-Means:* The traditional normalized cut-based graph clustering objective can be extended to the non-exhaustive, overlapping graph clustering setting, and this extended graph clustering objective is equivalent to the weighted kernel NEO-K-Means objective. This equivalence enables us to apply our NEO-K-Means algorithm to the overlapping community detection problem. On Zachary’s Karate Club network, NEO-K-Means is able to reveal the natural underlying overlapping structure of the network.

* Experimental Results (average normalized cut): *Lower normalized cut indicates better clustering. NEO-K-Means (red bars) achieves the lowest normalized cut on all the datasets.

**3. Low-Rank Semidefinite Programming for NEO-K-Means
**

* Summary:* The iterative NEO-K-Means algorithm is fast, but suffers from the classic problem that iterative algorithms for k-means fall into local minimizers given poor initialization. For a more accurate solution, we study a convex relaxation of the NEO-K-Means objective, and also formulate a low-rank SDP for NEO-K-Means. We implement the scalable LRSDP algorithm, and also propose a series of initialization and rounding strategies that accelerate the convergence of our optimization procedures.

* Experimental Results on Overlapping Community Detection:* We compute AUC of conductance-vs-graph coverage (lower AUC indicates better communities). LRSDP produces the best quality communities in terms of AUC score.

### Main Publications

- Non-exhaustive, Overlapping Clustering via Low-Rank Semidefinite Programming (pdf, slides, software)

Y. Hou, J. Whang, D. Gleich, I. Dhillon.

In*ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)*, pp. 427–436, August 2015. (**Oral**) - Non-exhaustive, Overlapping k-means (pdf, software)

J. Whang, I. Dhillon, D. Gleich.

In*SIAM International Conference on Data Mining (SDM)*, pp. 936–944, May 2015. - Overlapping Community Detection Using Seed Set Expansion (pdf, slides, software)

J. Whang, D. Gleich, I. Dhillon.

In*ACM Conference on Information and Knowledge Management (CIKM)*, pp. 2099-2108, October 2013. (**Oral**) - Fast Multiplier Methods to Optimize Non-exhaustive, Overlapping Clustering (pdf, software)

Y. Hou, J. Whang, D. Gleich, I. Dhillon.

In*SIAM International Conference on Data Mining (SDM)*, pp. 297–305, May 2016. (**Oral**) - Overlapping Community Detection Using Neighborhood-Inflated Seed Expansion (pdf, software)

J. Whang, D. Gleich, I. Dhillon.*IEEE Transactions on Knowledge and Data Engineering (TKDE)*28(5), pp. 1272–1284, 2016. - Scalable Data-driven PageRank: Algorithms, System Issues, and Lessons Learned (pdf, software)

J. Whang, A. Lenharth, I. Dhillon, K. Pingali.

In*International European Conference on Parallel and Distributed Computing (Euro-Par)*, pp. 438–450, August 2015. (**Oral**) - Stochastic Blockmodel with Cluster Overlap, Relevance Selection, and Similarity-Based Smoothing (pdf, slides, software)

J. Whang, P. Rai, I. Dhillon.

In*IEEE International Conference on Data Mining (ICDM)*, pp. 817 – 826, December 2013. (**Oral**) - Scalable and Memory-Efficient Clustering of Large-Scale Social Networks (pdf, slides, software)

J. Whang, X. Sui, I. Dhillon.

In*IEEE International Conference on Data Mining (ICDM)*, pp. 705-714, December 2012. (**Oral**) - Parallel Clustered Low-rank Approximation of Graphs and Its Application to Link Prediction (pdf, software)

X. Sui, T. Lee, J. Whang, B. Savas, S. Jain, K. Pingali, I. Dhillon.*International Workshop on Languages and Compilers for Parallel Computing (LCPC)*, pp. 76-95, October 2012. (**Oral**) - Scalable Clustering of Signed Networks using Balance Normalized Cut (pdf, slides, software)

K. Chiang, J. Whang, I. Dhillon.

In*ACM Conference on Information and Knowledge Management (CIKM)*, pp. 615-624, October 2012. (**Oral**)