A Sampling-Based Density Peaks Clustering Algorithm for Large-Scale Data

Bibliographic details
Published in: Pattern Recognition, Vol. 136, p. 109238
Main authors: Ding, Shifei; Li, Chao; Xu, Xiao; Ding, Ling; Zhang, Jian; Guo, Lili; Shi, Tianhao
Format: Journal Article
Language: English
Published: Elsevier Ltd, 1 April 2023
ISSN:0031-3203, 1873-5142
Description
Summary:
• An improved triangle-inequality-based search strategy is proposed.
• An approximate local density calculation of representatives is proposed.
• Experiments show that our algorithm costs far less time than DPC and other recently proposed state-of-the-art algorithms.

With the rapid development of information technology, massive amounts of data are being generated. How to discover useful information to support decision-making has become a focus of scholarly research. Clustering is regarded as one of the main means of dealing with large-scale data. Density peaks clustering (DPC) is an effective density-based clustering algorithm that is widely applied in numerous fields because of its satisfactory performance. However, the computational complexity of DPC is O(N^2), which makes it poorly suited to large-scale data. To solve this issue, a sampling-based density peaks clustering algorithm for large-scale data (SDPC) is proposed. First, a sampling method is used to reduce the number of distance calculations. Second, approximate representatives are identified by an improved triangle-inequality (TI) search strategy, which further accelerates the clustering process. The approximate representatives are then clustered by DPC. Finally, each remaining point is allocated to the same cluster as its nearest representative. Experimental results on both synthetic and real-world datasets show that SDPC is more efficient than DPC, while its clustering performance remains at the same level as that of DPC.
DOI:10.1016/j.patcog.2022.109238
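
The pipeline outlined in the summary (sample, cluster the representatives with DPC, then assign every remaining point to its nearest representative) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' SDPC: it assumes uniform random sampling, a cutoff-kernel local density, and a user-supplied number of clusters, and it uses a brute-force nearest-representative search in place of the improved TI search strategy and approximate density calculation, which this record does not detail. The names dpc, sample_then_assign, d_c, n_samples and k are hypothetical.

```python
import math
import random


def dpc(points, d_c, k):
    """Plain O(n^2) density peaks clustering; returns one label per point."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of neighbours strictly closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < d_c) for i in range(n)]
    # Visit points from densest to sparsest (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    pos = {i: rank for rank, i in enumerate(order)}
    delta = [0.0] * n   # distance to the nearest point earlier in the order
    parent = [-1] * n   # index of that point
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(d[i])  # densest point: use its largest distance
            continue
        j = min(order[:rank], key=lambda j: d[i][j])
        delta[i], parent[i] = d[i][j], j
    # Assumption: centres are the k points with the largest rho * delta.
    centers = sorted(range(n), key=lambda i: (-rho[i] * delta[i], pos[i]))[:k]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    # Every other point inherits the label of its denser nearest neighbour.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


def sample_then_assign(points, n_samples, d_c, k, seed=0):
    """Cluster a random sample with DPC, then label all points by nearest sample."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(points)), n_samples)
    reps = [points[i] for i in idx]
    rep_labels = dpc(reps, d_c, k)
    labels = []
    for p in points:
        # Brute-force search; the paper uses a TI-based strategy here instead.
        nearest = min(range(len(reps)), key=lambda r: math.dist(p, reps[r]))
        labels.append(rep_labels[nearest])
    return labels


if __name__ == "__main__":
    rng = random.Random(42)
    blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(100)]
    blob_b = [(rng.gauss(100, 0.5), rng.gauss(100, 0.5)) for _ in range(100)]
    labels = sample_then_assign(blob_a + blob_b, n_samples=40, d_c=1.0, k=2)
    print(labels[:5], labels[-5:])
```

With m sampled representatives out of N points, this sketch performs on the order of m^2 distance computations for the DPC step plus N*m for the assignment step, instead of the N^2 needed to run DPC on the full dataset. The quality of the result depends on how well the sample covers each cluster.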