Robust Categorical Data Clustering Guided by Multi-Granular Competitive Learning

Bibliographic details
Published in: Proceedings of the International Conference on Distributed Computing Systems, pp. 288-299
Main authors: Cai, Shenghong; Zhang, Yiqun; Luo, Xiaopeng; Cheung, Yiu-Ming; Jia, Hong; Liu, Peng
Format: Conference paper
Language: English
Publication details: IEEE, 23 July 2024
ISSN:2575-8411
Abstract Data sets composed of categorical features are very common in big data analysis tasks. Since categorical features usually take a limited number of qualitative values, the nested granular cluster effect is prevalent in the implicit discrete distance space of categorical data. That is, data objects frequently overlap in space or subspace to form small compact clusters, and similar small clusters often form larger clusters. However, because categorical values are qualitative, the distance space cannot be as well defined as it is under the Euclidean distance, which poses great challenges for the cluster analysis of categorical data. In view of this, we design a Multi-Granular Competitive Penalization Learning (MGCPL) algorithm that allows potential clusters to interactively tune themselves and converge in stages with different numbers of naturally compact clusters. To leverage MGCPL, we also propose a Cluster Aggregation strategy based on MGCPL Encoding (CAME), which first encodes the data objects according to the learned multi-granular distributions and then performs final clustering on the embeddings. The proposed MGCPL-guided Categorical Data Clustering (MCDC) approach proves competent in automatically exploring the nested distribution of multi-granular clusters and highly robust on categorical data sets from various domains. Benefiting from its linear time complexity, MCDC scales to large data sets and is promising for pre-partitioning data sets or compute nodes to boost distributed computing. Extensive experiments with statistical evidence demonstrate its superiority over state-of-the-art counterparts on various real public data sets.
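The overlap effect described in the abstract can be illustrated with a minimal sketch. It is not the MGCPL or CAME procedure from the paper. It assumes a plain Hamming distance as a stand-in for a categorical distance, and the toy objects and the names `hamming`, `point_counts`, and `distance_counts` are hypothetical. The sketch only shows that categorical objects land on a small number of discrete points and distance values, which is why small compact groups arise so easily.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of features on which two categorical objects differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy data: each object has three categorical features.
objects = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "square"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
]

# Identical objects collapse onto the same point of the discrete space.
point_counts = Counter(objects)

# With three features, pairwise distances can only be 0, 1, 2, or 3.
distance_counts = Counter(hamming(a, b) for a, b in combinations(objects, 2))

for distance in sorted(distance_counts):
    print(distance, distance_counts[distance])
```

In this toy set the two "red, small" variants sit at distance 1 from each other and at distance 2 or 3 from the "blue, large" variants, so tight groups and a coarser two-way split are both visible in the same distance space.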
Author Liu, Peng
Cheung, Yiu-Ming
Zhang, Yiqun
Jia, Hong
Cai, Shenghong
Luo, Xiaopeng
Author_xml – sequence: 1
  givenname: Shenghong
  surname: Cai
  fullname: Cai, Shenghong
  email: 3121005074@mail2.gdut.edu.cn
  organization: Guangdong University of Technology,Guangzhou,China
– sequence: 2
  givenname: Yiqun
  surname: Zhang
  fullname: Zhang, Yiqun
  email: yqzhang@gdut.edu.cn
  organization: Guangdong University of Technology,Guangzhou,China
– sequence: 3
  givenname: Xiaopeng
  surname: Luo
  fullname: Luo, Xiaopeng
  email: gordonlok@foxmail.com
  organization: Guangzhou Huali College,Guangzhou,China
– sequence: 4
  givenname: Yiu-Ming
  surname: Cheung
  fullname: Cheung, Yiu-Ming
  email: ymc@comp.hkbu.edu.hk
  organization: Hong Kong Baptist University,Hong Kong SAR,China
– sequence: 5
  givenname: Hong
  surname: Jia
  fullname: Jia, Hong
  email: hongjia1102@szu.edu.cn
  organization: Guangdong Provincial Key Laboratory of Intelligent Information Processing,Shenzhen,China
– sequence: 6
  givenname: Peng
  surname: Liu
  fullname: Liu, Peng
  email: liupeng@gdut.edu.cn
  organization: Guangdong University of Technology,Guangzhou,China
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/ICDCS60910.2024.00035
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 9798350386059
EISSN 2575-8411
EndPage 299
ExternalDocumentID 10631083
Genre orig-research
GrantInformation_xml – fundername: National Natural Science Foundation of China (NSFC)
  grantid: 62102097,62374047,62174038
  funderid: 10.13039/501100001809
IEDL.DBID RIE
ISICitedReferencesCount 10
IngestDate Wed Aug 27 02:32:38 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 12
ParticipantIDs ieee_primary_10631083
PublicationCentury 2000
PublicationDate 2024-July-23
PublicationDateYYYYMMDD 2024-07-23
PublicationDate_xml – month: 07
  year: 2024
  text: 2024-July-23
  day: 23
PublicationDecade 2020
PublicationTitle Proceedings of the International Conference on Distributed Computing Systems
PublicationTitleAbbrev ICDCS
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0005863
Score 2.3263688
SourceID ieee
SourceType Publisher
StartPage 288
SubjectTerms Big Data
Boosting
categorical feature
Cluster analysis
cluster granularity
Clustering algorithms
Competitive learning
Distributed databases
Encoding
Euclidean distance
number of clusters
Title Robust Categorical Data Clustering Guided by Multi-Granular Competitive Learning
URI https://ieeexplore.ieee.org/document/10631083