A distributed community detection algorithm for large scale networks under stochastic block models

Community detection for large scale networks is of great importance in modern data analysis. In this work, we develop a distributed spectral clustering algorithm to handle this task. Specifically, we distribute a certain number of pilot network nodes on the master server and the others on worker ser...

Full description

Saved in:
Bibliographic Details
Published in:Computational statistics & data analysis Vol. 187; p. 107794
Main Authors: Wu, Shihao, Li, Zhe, Zhu, Xuening
Format: Journal Article
Language:English
Published: Elsevier B.V 01.11.2023
Subjects:
ISSN:0167-9473
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Community detection for large scale networks is of great importance in modern data analysis. In this work, we develop a distributed spectral clustering algorithm to handle this task. Specifically, we distribute a certain number of pilot network nodes on the master server and the others on worker servers. A spectral clustering algorithm is first conducted on the master to select pseudo centers. Next, the indexes of the pseudo centers are broadcasted to workers to complete the distributed community detection task using an SVD (singular value decomposition) type algorithm. The proposed distributed algorithm has three advantages. First, the communication cost is low, since only the indexes of pseudo centers are communicated. Second, no further iterative algorithm is needed on workers while a “one-shot” computation suffices. Third, both the computational complexity and the storage requirements are much lower compared to using the whole adjacency matrix. We develop a Python package DCD (The Python package is provided in https://github.com/Ikerlz/dcd.) to implement the distributed algorithm on a Spark system and establish theoretical properties with respect to the estimation accuracy and mis-clustering rates under the stochastic block model. Experiments on a variety of synthetic and empirical datasets are carried out to further illustrate the advantages of the methodology.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
ISSN:0167-9473
DOI:10.1016/j.csda.2023.107794