Speaker Clustering by Co-Optimizing Deep Representation Learning and Cluster Estimation
Speaker clustering is the task of merging speech segments uttered by the same speaker into a single cluster, and it is an effective tool for easing the management of massive amounts of audio documents. In this paper, we present a method for co-optimizing the two main steps of speaker clustering, namely...
| Published in: | IEEE Transactions on Multimedia, Vol. 23, pp. 3377-3387 |
|---|---|
| Main Authors: | Yanxiong Li, Wucheng Wang, Mingle Liu, Zhongjie Jiang, Qianhua He |
| Format: | Journal Article |
| Language: | English |
| Published: | Piscataway: IEEE, 2021. The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| Subjects: | |
| ISSN: | 1520-9210, 1941-0077 |
| Abstract | Speaker clustering is the task of merging speech segments uttered by the same speaker into a single cluster, and it is an effective tool for easing the management of massive amounts of audio documents. In this paper, we present a method for co-optimizing the two main steps of speaker clustering, namely feature learning and cluster estimation. In our method, the deep representation feature is learned by a deep convolutional autoencoder network (DCAN), while cluster estimation is performed by a softmax layer combined with the DCAN. We devise an integrated loss function that simultaneously minimizes the reconstruction loss (for deep representation learning) and the clustering loss (for cluster estimation). Many state-of-the-art audio features and clustering methods are evaluated on experimental datasets selected from two publicly available speech corpora (AISHELL-2 and VoxCeleb1). The results show that the proposed method outperforms other speaker clustering methods in terms of normalized mutual information (NMI) and clustering accuracy (CA). Additionally, the proposed deep representation feature outperforms other features widely used in previous works in terms of both NMI and CA. |
|---|---|
| Author | Li, Yanxiong; Wang, Wucheng; Liu, Mingle; Jiang, Zhongjie; He, Qianhua |
| Author details | 1. Yanxiong Li (ORCID: 0000-0003-4362-1125; eeyxli@scut.edu.cn); 2. Wucheng Wang (347710323@qq.com); 3. Mingle Liu (1324903969@qq.com); 4. Zhongjie Jiang (870782975@qq.com); 5. Qianhua He (eeqhhe@scut.edu.cn). All authors: School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China |
| CODEN | ITMUF8 |
| ContentType | Journal Article |
| Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021 |
| DOI | 10.1109/TMM.2020.3024667 |
| Discipline | Engineering Computer Science |
| EISSN | 1941-0077 |
| EndPage | 3387 |
| Genre | orig-research |
| GrantInformation | National Natural Science Foundation of China, grant 61771200 (funder ID: 10.13039/501100001809) |
| ISICitedReferencesCount | 12 |
| ISSN | 1520-9210 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
| ORCID | 0000-0003-4362-1125 |
| PageCount | 11 |
| PublicationCentury | 2000 |
| PublicationDate | 2021-01-01 |
| PublicationDecade | 2020 |
| PublicationPlace | Piscataway |
| PublicationTitle | IEEE transactions on multimedia |
| PublicationTitleAbbrev | TMM |
| PublicationYear | 2021 |
| Publisher | IEEE; The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| StartPage | 3377 |
| SubjectTerms | audio document analysis; Clustering; Clustering algorithms; Clustering methods; Decoding; deep convolutional autoencoder network; deep representation; Estimation; Feature extraction; Learning; Neural networks; Representations; Speaker clustering; Speaker recognition; Speech |
| URI | https://ieeexplore.ieee.org/document/9201020 https://www.proquest.com/docview/2575979435 |
| Volume | 23 |
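
The abstract describes an integrated loss that jointly minimizes a reconstruction loss and a clustering loss, with cluster estimation performed by a softmax layer. The record does not give the exact formulation, so the following is only a minimal NumPy sketch under stated assumptions: mean squared error for reconstruction, a KL-divergence clustering term against a sharpened target distribution (a common choice in deep clustering, not confirmed as the authors' choice), and a hypothetical trade-off parameter `weight`. All function names are illustrative.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax: soft cluster assignments for each segment."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def reconstruction_loss(x, x_hat):
    """Mean squared error between input features and their reconstruction (assumed form)."""
    return float(np.mean((x - x_hat) ** 2))


def target_distribution(q):
    """Sharpened targets that emphasize confident assignments (assumption, not from the record)."""
    weighted = q ** 2 / q.sum(axis=0, keepdims=True)
    return weighted / weighted.sum(axis=1, keepdims=True)


def clustering_loss(q, eps=1e-12):
    """Mean KL(p || q) between sharpened targets p and soft assignments q (assumed form)."""
    p = target_distribution(q)
    return float(np.mean(np.sum(p * np.log((p + eps) / (q + eps)), axis=1)))


def integrated_loss(x, x_hat, logits, weight=0.1):
    """Reconstruction loss plus a weighted clustering loss; `weight` is hypothetical."""
    q = softmax(logits)
    return reconstruction_loss(x, x_hat) + weight * clustering_loss(q)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 40))                     # 8 segments, 40-dim features (made-up sizes)
    x_hat = x + 0.1 * rng.normal(size=x.shape)       # stand-in for the autoencoder output
    logits = rng.normal(size=(8, 3))                 # stand-in for the softmax layer input, 3 clusters
    print(integrated_loss(x, x_hat, logits))
```

In an actual training setup, `x_hat` and `logits` would come from the decoder and the softmax branch of the network, and the combined value would be minimized by gradient descent so that both terms shape the learned representation.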