Speaker Clustering by Co-Optimizing Deep Representation Learning and Cluster Estimation

Bibliographic Details
Published in: IEEE Transactions on Multimedia, Vol. 23, pp. 3377-3387
Main Authors: Li, Yanxiong; Wang, Wucheng; Liu, Mingle; Jiang, Zhongjie; He, Qianhua
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2021
ISSN: 1520-9210, 1941-0077
Abstract: Speaker clustering is the task of merging speech segments uttered by the same speaker into a single cluster, and it is an effective tool for easing the management of massive amounts of audio documents. In this paper, we present a method for co-optimizing the two main steps of speaker clustering, namely feature learning and cluster estimation. In our method, the deep representation feature is learned by a deep convolutional autoencoder network (DCAN), while cluster estimation is performed by a softmax layer combined with the DCAN. We devise an integrated loss function that simultaneously minimizes the reconstruction loss (for deep representation learning) and the clustering loss (for cluster estimation). Many state-of-the-art audio features and clustering methods are evaluated on experimental datasets selected from two publicly available speech corpora (AISHELL-2 and VoxCeleb1). The results show that the proposed method outperforms other speaker clustering methods in terms of normalized mutual information (NMI) and clustering accuracy (CA). Additionally, the proposed deep representation feature outperforms other features widely used in previous works, in terms of both NMI and CA.
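The abstract reports results as NMI and CA without defining them. The sketch below shows one common way these two scores are computed. It is not taken from the paper: the function names `nmi` and `clustering_accuracy` are hypothetical, NMI is assumed to be normalized by the arithmetic mean of the two label entropies (other normalizations exist), and CA is assumed to be the fraction of segments labeled correctly under the best one-to-one mapping between clusters and speakers.

```python
import math
from collections import Counter
from itertools import permutations


def nmi(true_labels, pred_labels):
    """NMI, normalized by the mean of the two entropies (an assumed convention)."""
    n = len(true_labels)
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    joint_counts = Counter(zip(true_labels, pred_labels))

    # Mutual information between the speaker labels and the cluster labels.
    mutual_info = sum(
        (n_tp / n) * math.log(n * n_tp / (true_counts[t] * pred_counts[p]))
        for (t, p), n_tp in joint_counts.items()
    )

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mean_entropy = (entropy(true_counts) + entropy(pred_counts)) / 2
    # If both partitions are a single cluster, treat them as identical.
    return 1.0 if mean_entropy == 0 else mutual_info / mean_entropy


def clustering_accuracy(true_labels, pred_labels):
    """CA under the best one-to-one cluster-to-speaker mapping.

    Brute force over permutations, so only suitable for a small
    number of speakers; a Hungarian solver would be used at scale.
    """
    speakers = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    # Pad the shorter side with None so every cluster gets a partner.
    size = max(len(speakers), len(clusters))
    speakers += [None] * (size - len(speakers))
    clusters += [None] * (size - len(clusters))
    joint_counts = Counter(zip(true_labels, pred_labels))
    best = max(
        sum(joint_counts[(s, c)] for s, c in zip(perm, clusters))
        for perm in permutations(speakers)
    )
    return best / len(true_labels)


if __name__ == "__main__":
    # Toy example: four segments, two speakers, one segment misclustered.
    speakers = ["A", "A", "B", "B"]
    clusters = [0, 0, 0, 1]
    print(nmi(speakers, clusters))
    print(clustering_accuracy(speakers, clusters))  # 0.75
```

Both scores are invariant to how cluster indices are numbered, which is why they suit clustering evaluation where predicted cluster IDs carry no inherent meaning.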
Author Details:
Li, Yanxiong (ORCID: 0000-0003-4362-1125; eeyxli@scut.edu.cn)
Wang, Wucheng (347710323@qq.com)
Liu, Mingle (1324903969@qq.com)
Jiang, Zhongjie (870782975@qq.com)
He, Qianhua (eeqhhe@scut.edu.cn)
Affiliation (all authors): School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021
DOI 10.1109/TMM.2020.3024667
Discipline Engineering
Computer Science
Funding: National Natural Science Foundation of China, grant 61771200
Subject Terms: audio document analysis; Clustering; Clustering algorithms; Clustering methods; Decoding; deep convolutional autoencoder network; deep representation; Estimation; Feature extraction; Learning; Neural networks; Representations; Speaker clustering; Speaker recognition; Speech
URI https://ieeexplore.ieee.org/document/9201020