Efficient time-domain speech separation using short encoded sequence network
The key to single-channel time-domain speech separation lies in encoding the mixed speech into latent feature representations with an encoder and obtaining accurate estimates of the target speaker masks with the separation network. Although advanced separation networks help separate the target...
| Published in: | Speech communication Vol. 166; p. 103150 |
|---|---|
| Main Authors: | Liu, Debang; Zhang, Tianqi; Christensen, Mads Græsbøll; Ma, Baoze; Deng, Pan |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V, 01.01.2025 |
| Subjects: | |
| ISSN: | 0167-6393 |
| Online Access: | Get full text |
| Abstract | The key to single-channel time-domain speech separation lies in encoding the mixed speech into latent feature representations with an encoder and obtaining accurate estimates of the target speaker masks with the separation network. Although advanced separation networks help separate the target speech, owing to the limitations of the time-domain encoder–decoder framework these separation models commonly improve separation performance by setting a small encoder convolution kernel size, which increases the length of the encoded sequence and in turn raises the model's computational complexity and training cost. Therefore, in this paper we propose an efficient time-domain speech separation model using a short-sequence encoder–decoder framework (ESEDNet). In this model, we construct a novel encoder–decoder framework that accommodates short encoded sequences: the encoder consists of multiple convolution and downsampling operations that reduce the length of the high-resolution sequence, while the decoder uses the encoded features to reconstruct the fine-detailed speech sequence of the target speaker. Because the encoder's output sequence is shorter, ESEDNet, combined with our proposed multi-temporal-resolution Transformer separation network (MTRFormer), can efficiently obtain separation masks for the short encoded feature sequence. Experiments show that, compared with previous state-of-the-art (SOTA) methods, ESEDNet is more efficient in terms of computational complexity, training speed and GPU memory usage, while maintaining competitive separation performance.
•We introduce an encoder–decoder framework that ensures a short encoded sequence while achieving excellent separation performance. •We design a separation network that combines with the encoder–decoder network to achieve effective target source separation. •ESEDNet has a smaller model size and lower training cost, and is easy to extend to other networks. |
|---|---|
| ArticleNumber | 103150 |
| Author | Liu, Debang; Zhang, Tianqi; Christensen, Mads Græsbøll; Ma, Baoze; Deng, Pan |
| Author_xml | 1. Debang Liu (ORCID: 0000-0002-7411-9683; debangliu@163.com), School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China. 2. Tianqi Zhang (zhangtq@cqupt.edu.cn), School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China. 3. Mads Græsbøll Christensen (mgc@create.aau.dk), Audio Analysis Lab, CREATE, Aalborg University, 9000 Aalborg, Denmark. 4. Baoze Ma (mabz@cqupt.edu.cn), School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China. 5. Pan Deng (d200101004@stu.cqupt.edu.cn), School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China. |
| CitedBy_id | crossref_primary_10_1007_s00034_025_03204_8 |
| ContentType | Journal Article |
| Copyright | 2024 |
| DOI | 10.1016/j.specom.2024.103150 |
| DatabaseName | CrossRef |
| DatabaseTitle | CrossRef |
| Discipline | Languages & Literatures; Social Welfare & Social Work; Psychology |
| ExternalDocumentID | 10_1016_j_specom_2024_103150 S0167639324001213 |
| GrantInformation_xml | National Natural Science Foundation of China (grants 61671095, 61702065, 61701067, 61771085, 62201113; http://dx.doi.org/10.13039/501100001809); Natural Science Foundation of Chongqing, China (grant cstc2021jcyj-msxmX0836; http://dx.doi.org/10.13039/501100005230) |
| ISICitedReferencesCount | 3 |
| ISSN | 0167-6393 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Speech separation; Computational complexity; Multi-temporal resolution Transformer; Short sequence encoder–decoder framework |
| Language | English |
| LinkModel | OpenURL |
| ORCID | 0000-0002-7411-9683 |
| PublicationCentury | 2000 |
| PublicationDate | January 2025 |
| PublicationDateYYYYMMDD | 2025-01-01 |
| PublicationDecade | 2020 |
| PublicationTitle | Speech communication |
| PublicationYear | 2025 |
| Publisher | Elsevier B.V |
| StartPage | 103150 |
| SubjectTerms | Computational complexity; Multi-temporal resolution Transformer; Short sequence encoder–decoder framework; Speech separation |
| Title | Efficient time-domain speech separation using short encoded sequence network |
| URI | https://dx.doi.org/10.1016/j.specom.2024.103150 |
| Volume | 166 |
| Authors | Liu, Debang; Zhang, Tianqi; Christensen, Mads Græsbøll; Ma, Baoze |