Efficient time-domain speech separation using short encoded sequence network


Bibliographic Details
Published in: Speech Communication, Vol. 166, Art. 103150
Main Authors: Liu, Debang, Zhang, Tianqi, Christensen, Mads Græsbøll, Ma, Baoze, Deng, Pan
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.01.2025
Subjects:
ISSN:0167-6393
Abstract The key to single-channel time-domain speech separation lies in encoding the mixed speech into latent feature representations with an encoder and obtaining accurate estimates of the target speaker masks with the separation network. Although advanced separation networks help separate the target speech, the time-domain encoder–decoder framework limits them: such separation models commonly improve separation performance by giving the encoder a small convolution kernel, which lengthens the encoded sequence and thereby increases the model's computational complexity and training cost. Therefore, in this paper, we propose an efficient time-domain speech separation model using a short-sequence encoder–decoder framework (ESEDNet). In this model, we construct a novel encoder–decoder framework that accommodates short encoded sequences: the encoder consists of multiple convolution and downsampling operations that reduce the length of the high-resolution sequence, while the decoder uses the encoded features to reconstruct the fine-grained speech sequence of the target speaker. Since the encoder's output sequence is shorter, ESEDNet, combined with our proposed multi-temporal-resolution Transformer separation network (MTRFormer), can efficiently obtain separation masks for the short encoded feature sequence. Experiments show that, compared with previous state-of-the-art (SOTA) methods, ESEDNet is more efficient in computational complexity, training speed, and GPU memory usage, while maintaining competitive separation performance. •We introduce an encoder–decoder framework that ensures a short encoded sequence while achieving excellent separation performance. •We design a separation network that combines with the encoder–decoder network to achieve effective target-source separation. •ESEDNet has a smaller model size and lower training cost, and is easy to extend to other networks.
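The abstract's efficiency argument rests on encoded-sequence length: a small encoder kernel/stride yields a long latent sequence, and self-attention cost in the separator grows quadratically with that length, whereas stacked convolution-plus-downsampling stages shorten it. A minimal sketch of this length arithmetic, using illustrative kernel/stride values that are assumptions for demonstration and not the paper's actual configuration:

```python
def encoded_length(n_samples, kernel, stride):
    # Output length of an unpadded 1-D convolution.
    return (n_samples - kernel) // stride + 1

T = 32000  # 4 s of audio at 8 kHz

# Conventional time-domain encoder: one conv with a small kernel/stride
# produces a long encoded sequence.
long_seq = encoded_length(T, 16, 8)

# Short-sequence encoder: a stack of conv + downsampling stages
# (hypothetical strides, chosen only to illustrate the shrinkage).
n = T
for kernel, stride in [(16, 8), (4, 2), (4, 2), (4, 2)]:
    n = encoded_length(n, kernel, stride)
short_seq = n

# Self-attention cost scales with the square of sequence length, so the
# shorter encoded sequence cuts the separator's dominant cost.
ratio = (long_seq ** 2) / (short_seq ** 2)
print(long_seq, short_seq, round(ratio))
```

With these illustrative values the encoded sequence drops from 3999 frames to 498, roughly a 64x reduction in quadratic attention cost, which is the trade-off the paper's encoder–decoder framework is built around.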
ArticleNumber 103150
Author Liu, Debang
Zhang, Tianqi
Christensen, Mads Græsbøll
Ma, Baoze
Deng, Pan
Author_xml – sequence: 1
  givenname: Debang
  orcidid: 0000-0002-7411-9683
  surname: Liu
  fullname: Liu, Debang
  email: debangliu@163.com
  organization: School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
– sequence: 2
  givenname: Tianqi
  surname: Zhang
  fullname: Zhang, Tianqi
  email: zhangtq@cqupt.edu.cn
  organization: School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
– sequence: 3
  givenname: Mads Græsbøll
  surname: Christensen
  fullname: Christensen, Mads Græsbøll
  email: mgc@create.aau.dk
  organization: Audio Analysis Lab, CREATE, Aalborg University, 9000 Aalborg, Denmark
– sequence: 4
  givenname: Baoze
  surname: Ma
  fullname: Ma, Baoze
  email: mabz@cqupt.edu.cn
  organization: School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
– sequence: 5
  givenname: Pan
  surname: Deng
  fullname: Deng, Pan
  email: d200101004@stu.cqupt.edu.cn
  organization: School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
CitedBy_id crossref_primary_10_1007_s00034_025_03204_8
ContentType Journal Article
Copyright 2024
Copyright_xml – notice: 2024
DBID AAYXX
CITATION
DOI 10.1016/j.specom.2024.103150
DatabaseName CrossRef
DatabaseTitle CrossRef
DatabaseTitleList
DeliveryMethod fulltext_linktorsrc
Discipline Languages & Literatures
Social Welfare & Social Work
Psychology
ExternalDocumentID 10_1016_j_specom_2024_103150
S0167639324001213
GrantInformation_xml – fundername: National Natural Science Foundation of China
  grantid: 61671095; 61702065; 61701067; 61771085; 62201113
  funderid: http://dx.doi.org/10.13039/501100001809
– fundername: Natural Science Foundation of Chongqing, China
  grantid: cstc2021jcyj-msxmX0836
  funderid: http://dx.doi.org/10.13039/501100005230
ISICitedReferencesCount 3
ISSN 0167-6393
IsPeerReviewed true
IsScholarly true
Keywords Speech separation
Computational complexity
Multi-temporal resolution Transformer
Short sequence encoder–decoder framework
Language English
LinkModel OpenURL
ORCID 0000-0002-7411-9683
ParticipantIDs crossref_primary_10_1016_j_specom_2024_103150
crossref_citationtrail_10_1016_j_specom_2024_103150
elsevier_sciencedirect_doi_10_1016_j_specom_2024_103150
PublicationCentury 2000
PublicationDate January 2025
PublicationDateYYYYMMDD 2025-01-01
PublicationDate_xml – month: 01
  year: 2025
  text: January 2025
PublicationDecade 2020
PublicationTitle Speech communication
PublicationYear 2025
Publisher Elsevier B.V
Publisher_xml – name: Elsevier B.V
References Chen, Luo, Mesgarani (b4) 2017
Subakan, Cem, Ravanelli, Mirco, Cornell, Samuele, Bronzi, Mirko, Zhong, Jianyuan, 2020. Attention Is All You Need In Speech Separation. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. ICASSP, pp. 21–25.
Ronneberger, Fischer, Brox (b40) 2015
Kingma, Ba (b20) 2014
Kolbæk, Yu, Tan, Jensen (b21) 2017; 25
Stoller, Ewert, Dixon (b42) 2018
Vincent, Gribonval, Févotte (b49) 2006; 14
He, Kaiming, Zhang, X., Ren, Shaoqing, Sun, Jian, 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In: 2015 IEEE International Conference on Computer Vision. ICCV, pp. 1026–1034.
Luo, Chen, Mesgarani (b28) 2018; 26
Tzinis, Efthymios, Wang, Zhepei, Smaragdis, Paris, 2020. Sudo RM -RF: Efficient Networks for Universal Audio Source Separation. In: 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing. MLSP, pp. 1–6.
Subakan, Cem, Ravanelli, Mirco, Cornell, Samuele, Bronzi, Mirko, Zhong, Jianyuan, 2021. Attention Is All You Need In Speech Separation. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. ICASSP, pp. 21–25.
Panayotov, Chen, Povey, Khudanpur (b33) 2015
Lu, Duan, Zhang (b27) 2019; 27
Lea, Vidal, Reiter, Hager (b24) 2016
Richter, Welker, Lemercier, Lay, Gerkmann (b38) 2023
Agarap (b2) 2018
Wang, Brown (b50) 2008; 19
Wang, Chen (b51) 2018; 26
Hu, Wang (b15) 2004; 15
Yu, Dong, Kolbæk, Morten, Tan, Z., Jensen, Jesper Højvang, 2017. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. ICASSP, pp. 241–245.
Ephrat, Mosseri, Lang, Dekel, Wilson, Hassidim, Freeman, Rubinstein (b10) 2018
Luo, Mesgarani (b30) 2019; 27
Lu, Duan, Zhang (b26) 2018; PP
Subakan, Ravanelli, Cornell, Lepoutre, Grondin (b45) 2022
Bai, Kolter, Koltun (b3) 2018
Huang, Gao, Liu, Zhuang, Van Der Maaten, Laurens, Weinberger, Kilian Q, 2017. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4700–4708.
Le Roux, Ono, Sagayama (b22) 2008
Huang, Watanabe, Yang, García, Khudanpur (b17) 2022
Lea, Colin, Flynn, Michael D, Vidal, Rene, Reiter, Austin, Hager, Gregory D, 2017. Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 156–165.
Luo, Yi, Chen, Zhuo, Yoshioka, Takuya, 2020. Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. ICASSP, pp. 46–50.
Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, Adam (b13) 2017
Tao, Ruijie, Pan, Zexu, Das, Rohan Kumar, Qian, Xinyuan, Shou, Mike Zheng, Li, Haizhou, 2021. Is someone speaking? exploring long-term temporal features for audio-visual active speaker detection. In: Proceedings of the 29th ACM International Conference on Multimedia. pp. 3927–3935.
Ioffe, Szegedy (b18) 2015
Chollet, François, 2017. Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1251–1258.
Hershey, Chen, Le Roux, Watanabe (b12) 2016
Scheibler, Ji, Chung, Byun, Choe, Choi (b41) 2023
Martel, Héctor, Richter, Julius, Li, Kai, Hu, Xiaolin, Gerkmann, Timo, 2023. Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model. In: Proc. INTERSPEECH 2023. pp. 1673–1677.
Qin, Zhang, Huang, Dehghan, Zaiane, Jagersand (b36) 2020; 106
Hu, Li, Zhang, Luo, Lemercier, Gerkmann (b14) 2021; 34
Pariente, Manuel, Cornell, Samuele, Cosentino, Joris, Sivasankaran, Sunit, Tzinis, Efthymios, Heitkaemper, Jens, Olvera, Michel, Stöter, Fabian-Robert, Hu, Mathieu, Martín-Doñas, Juan M., Ditter, David, Frank, Ariel, Deleforge, Antoine, Vincent, Emmanuel, 2020a. Asteroid: the PyTorch-based audio source separation toolkit for researchers. In: Proc. Interspeech.
Chen, Mao, Liu (b5) 2020
Ravanelli, Parcollet, Plantinga, Rouhe, Cornell, Lugosch, Subakan, Dawalatabad, Heba, Zhong, Chou, Yeh, Fu, Liao, Rastorgueva, Grondin, Aris, Na, Gao, Mori, Bengio (b37) 2021
Cherry (b6) 1953; 25
Pariente, Cornell, Deleforge, Vincent (b35) 2020
Lo, Chen-Chou, Fu, Szu-Wei, Huang, Wen-Chin, Wang, Xin, Yamagishi, Junichi, Tsao, Yu, Wang, Hsin-Min, 2019. MOSNet: Deep Learning based Objective Assessment for Voice Conversion. In: Proc. Interspeech 2019.
Wu, Xu, Zhang, Chen, Yu, Xie, Yu (b54) 2019
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin (b48) 2017; 30
Wang, Le Roux, Hershey (b52) 2018
Afouras, Chung, Zisserman (b1) 2018
Cosentino, Pariente, Cornell, Deleforge, Vincent (b9) 2020
Isik, Roux, Chen, Watanabe, Hershey (b19) 2016
Mohammadiha, Smaragdis, Leijon (b32) 2013; 21
Cooke, Barker, Cunningham, Shao (b8) 2006; 120
Rix, Antony W., Beerends, John G., Hollier, Mike, Hekstra, Andries P., 2001. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In: 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221). Vol. 2, pp. 749–752.
Wang, Roux, Wang, Hershey (b53) 2018
Hu (10.1016/j.specom.2024.103150_b15) 2004; 15
Wang (10.1016/j.specom.2024.103150_b51) 2018; 26
Scheibler (10.1016/j.specom.2024.103150_b41) 2023
10.1016/j.specom.2024.103150_b43
Ioffe (10.1016/j.specom.2024.103150_b18) 2015
Cooke (10.1016/j.specom.2024.103150_b8) 2006; 120
10.1016/j.specom.2024.103150_b47
10.1016/j.specom.2024.103150_b46
10.1016/j.specom.2024.103150_b44
Howard (10.1016/j.specom.2024.103150_b13) 2017
Pariente (10.1016/j.specom.2024.103150_b35) 2020
Vaswani (10.1016/j.specom.2024.103150_b48) 2017; 30
Kolbæk (10.1016/j.specom.2024.103150_b21) 2017; 25
Subakan (10.1016/j.specom.2024.103150_b45) 2022
Wang (10.1016/j.specom.2024.103150_b53) 2018
10.1016/j.specom.2024.103150_b39
Mohammadiha (10.1016/j.specom.2024.103150_b32) 2013; 21
Cosentino (10.1016/j.specom.2024.103150_b9) 2020
Hu (10.1016/j.specom.2024.103150_b14) 2021; 34
Ephrat (10.1016/j.specom.2024.103150_b10) 2018
Isik (10.1016/j.specom.2024.103150_b19) 2016
10.1016/j.specom.2024.103150_b31
Afouras (10.1016/j.specom.2024.103150_b1) 2018
10.1016/j.specom.2024.103150_b34
Hershey (10.1016/j.specom.2024.103150_b12) 2016
10.1016/j.specom.2024.103150_b7
10.1016/j.specom.2024.103150_b29
Lu (10.1016/j.specom.2024.103150_b26) 2018; PP
Agarap (10.1016/j.specom.2024.103150_b2) 2018
Cherry (10.1016/j.specom.2024.103150_b6) 1953; 25
Wang (10.1016/j.specom.2024.103150_b52) 2018
10.1016/j.specom.2024.103150_b25
Vincent (10.1016/j.specom.2024.103150_b49) 2006; 14
10.1016/j.specom.2024.103150_b23
Chen (10.1016/j.specom.2024.103150_b4) 2017
Richter (10.1016/j.specom.2024.103150_b38) 2023
Wu (10.1016/j.specom.2024.103150_b54) 2019
Huang (10.1016/j.specom.2024.103150_b17) 2022
10.1016/j.specom.2024.103150_b16
Kingma (10.1016/j.specom.2024.103150_b20) 2014
Stoller (10.1016/j.specom.2024.103150_b42) 2018
Panayotov (10.1016/j.specom.2024.103150_b33) 2015
Ravanelli (10.1016/j.specom.2024.103150_b37) 2021
Wang (10.1016/j.specom.2024.103150_b50) 2008; 19
Bai (10.1016/j.specom.2024.103150_b3) 2018
Chen (10.1016/j.specom.2024.103150_b5) 2020
Lu (10.1016/j.specom.2024.103150_b27) 2019; 27
10.1016/j.specom.2024.103150_b11
10.1016/j.specom.2024.103150_b55
Luo (10.1016/j.specom.2024.103150_b30) 2019; 27
Le Roux (10.1016/j.specom.2024.103150_b22) 2008
Lea (10.1016/j.specom.2024.103150_b24) 2016
Luo (10.1016/j.specom.2024.103150_b28) 2018; 26
Qin (10.1016/j.specom.2024.103150_b36) 2020; 106
Ronneberger (10.1016/j.specom.2024.103150_b40) 2015
References_xml – year: 2018
  ident: b3
  article-title: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling
– start-page: 47
  year: 2016
  end-page: 54
  ident: b24
  article-title: Temporal convolutional networks: A unified approach to action segmentation
  publication-title: European Conference on Computer Vision
– year: 2017
  ident: b13
  article-title: Mobilenets: Efficient convolutional neural networks for mobile vision applications
– reference: Lea, Colin, Flynn, Michael D, Vidal, Rene, Reiter, Austin, Hager, Gregory D, 2017. Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 156–165.
– volume: 26
  start-page: 787
  year: 2018
  end-page: 796
  ident: b28
  article-title: Speaker-independent speech separation with deep attractor network
  publication-title: IEEE/ACM Trans. Audio Speech Lang. Process.
– start-page: 5206
  year: 2015
  end-page: 5210
  ident: b33
  article-title: Librispeech: an asr corpus based on public domain audio books
  publication-title: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing
– reference: Huang, Gao, Liu, Zhuang, Van Der Maaten, Laurens, Weinberger, Kilian Q, 2017. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4700–4708.
– volume: 30
  year: 2017
  ident: b48
  article-title: Attention is all you need
  publication-title: Adv. Neural Inf. Process. Syst.
– reference: Subakan, Cem, Ravanelli, Mirco, Cornell, Samuele, Bronzi, Mirko, Zhong, Jianyuan, 2020. Attention Is All You Need In Speech Separation. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. ICASSP, pp. 21–25.
– volume: 25
  start-page: 1901
  year: 2017
  end-page: 1913
  ident: b21
  article-title: Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks
  publication-title: IEEE/ACM Trans. Audio Speech Lang. Process.
– volume: 14
  start-page: 1462
  year: 2006
  end-page: 1469
  ident: b49
  article-title: Performance measurement in blind audio source separation
  publication-title: IEEE Trans. Audio Speech Lang. Process.
– volume: 26
  start-page: 1702
  year: 2018
  end-page: 1726
  ident: b51
  article-title: Supervised speech separation based on deep learning: An overview
  publication-title: IEEE/ACM Trans. Audio Speech Lang. Process.
– volume: 15
  start-page: 1135
  year: 2004
  end-page: 1150
  ident: b15
  article-title: Monaural speech segregation based on pitch tracking and amplitude modulation
  publication-title: IEEE Trans. Neural Netw.
– start-page: 448
  year: 2015
  end-page: 456
  ident: b18
  article-title: Batch normalization: Accelerating deep network training by reducing internal covariate shift
  publication-title: International Conference on Machine Learning
– volume: 19
  start-page: 199
  year: 2008
  ident: b50
  article-title: Computational auditory scene analysis: Principles, algorithms, and applications
  publication-title: IEEE Trans. Neural Netw.
– reference: Yu, Dong, Kolbæk, Morten, Tan, Z., Jensen, Jesper Højvang, 2017. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. ICASSP, pp. 241–245.
– start-page: 6837
  year: 2022
  end-page: 6841
  ident: b17
  article-title: Investigating self-supervised learning for speech enhancement and separation
  publication-title: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing
– reference: Tao, Ruijie, Pan, Zexu, Das, Rohan Kumar, Qian, Xinyuan, Shou, Mike Zheng, Li, Haizhou, 2021. Is someone speaking? exploring long-term temporal features for audio-visual active speaker detection. In: Proceedings of the 29th ACM International Conference on Multimedia. pp. 3927–3935.
– reference: Martel, Héctor, Richter, Julius, Li, Kai, Hu, Xiaolin, Gerkmann, Timo, 2023. Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model. In: Proc. INTERSPEECH 2023. pp. 1673–1677.
– year: 2021
  ident: b37
  article-title: SpeechBrain: A general-purpose speech toolkit
– year: 2022
  ident: b45
  article-title: Resource-efficient separation transformer
– year: 2016
  ident: b19
  article-title: Single-channel multi-speaker separation using deep clustering
– start-page: 667
  year: 2019
  end-page: 673
  ident: b54
  article-title: Time domain audio visual speech separation
  publication-title: 2019 IEEE Automatic Speech Recognition and Understanding Workshop
– volume: 27
  start-page: 1256
  year: 2019
  end-page: 1266
  ident: b30
  article-title: Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation
  publication-title: IEEE/ACM Trans. Audio Speech Lang. Process.
– year: 2018
  ident: b10
  article-title: Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation
– reference: Rix, Antony W., Beerends, John G., Hollier, Mike, Hekstra, Andries P., 2001. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In: 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221). Vol. 2, pp. 749–752.
– volume: 27
  start-page: 1697
  year: 2019
  end-page: 1712
  ident: b27
  article-title: Audio–visual deep clustering for speech separation
  publication-title: IEEE/ACM Trans. Audio Speech Lang. Process.
– year: 2018
  ident: b42
  article-title: Wave-u-net: A multi-scale neural network for end-to-end audio source separation
– year: 2018
  ident: b53
  article-title: End-to-end speech separation with unfolded iterative phase reconstruction
– volume: 34
  start-page: 22509
  year: 2021
  end-page: 22522
  ident: b14
  article-title: Speech separation using an asynchronous fully recurrent convolutional neural network
  publication-title: Adv. Neural Inf. Process. Syst. (NeurIPS)
– year: 2018
  ident: b1
  article-title: LRS3-TED: a large-scale dataset for visual speech recognition
– volume: 21
  start-page: 2140
  year: 2013
  end-page: 2151
  ident: b32
  article-title: Supervised and unsupervised speech enhancement using nonnegative matrix factorization
  publication-title: IEEE Trans. Audio Speech Lang. Process.
– start-page: 686
  year: 2018
  end-page: 690
  ident: b52
  article-title: Alternative objective functions for deep clustering
  publication-title: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing
– start-page: 23
  year: 2008
  end-page: 28
  ident: b22
  article-title: Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction
  publication-title: Interspeech
– start-page: 1
  year: 2023
  end-page: 5
  ident: b41
  article-title: Diffusion-based generative speech source separation
  publication-title: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing
– reference: Luo, Yi, Chen, Zhuo, Yoshioka, Takuya, 2020. Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. ICASSP, pp. 46–50.
– reference: Lo, Chen-Chou, Fu, Szu-Wei, Huang, Wen-Chin, Wang, Xin, Yamagishi, Junichi, Tsao, Yu, Wang, Hsin-Min, 2019. MOSNet: Deep Learning based Objective Assessment for Voice Conversion. In: Proc. Interspeech 2019.
– start-page: 6364
  year: 2020
  end-page: 6368
  ident: b35
  article-title: Filterbank design for end-to-end speech separation
  publication-title: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing
– reference: Subakan, Cem, Ravanelli, Mirco, Cornell, Samuele, Bronzi, Mirko, Zhong, Jianyuan, 2021. Attention Is All You Need In Speech Separation. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. ICASSP, pp. 21–25.
– start-page: 2642
  year: 2020
  end-page: 2646
  ident: b5
  article-title: Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation
  publication-title: Interspeech
– volume: 120
  start-page: 2421
  year: 2006
  end-page: 2424
  ident: b8
  article-title: An audio-visual corpus for speech perception and automatic speech recognition
  publication-title: J. Acoust. Soc. Am.
– volume: PP
  start-page: 1
  year: 2018
  ident: b26
  article-title: Listen and look : Audio-visual matching assisted speech source separation
  publication-title: IEEE Signal Process. Lett.
– year: 2018
  ident: b2
  article-title: Deep learning using rectified linear units (ReLU)
– reference: He, Kaiming, Zhang, X., Ren, Shaoqing, Sun, Jian, 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In: 2015 IEEE International Conference on Computer Vision. ICCV, pp. 1026–1034.
– start-page: 234
  year: 2015
  end-page: 241
  ident: b40
  article-title: U-net: Convolutional networks for biomedical image segmentation
  publication-title: International Conference on Medical Image Computing and Computer-Assisted Intervention
– volume: 25
  start-page: 975
  year: 1953
  end-page: 979
  ident: b6
  article-title: Some experiments on the recognition of speech, with one and with two ears
  publication-title: J. Acoust. Soc. Am.
– volume: 106
  year: 2020
  ident: b36
  article-title: U2-Net: Going deeper with nested U-structure for salient object detection
  publication-title: Pattern Recognit.
– year: 2020
  ident: b9
  article-title: Librimix: An open-source dataset for generalizable speech separation
– year: 2014
  ident: b20
  article-title: Adam: A method for stochastic optimization
– reference: Tzinis, Efthymios, Wang, Zhepei, Smaragdis, Paris, 2020. Sudo RM -RF: Efficient Networks for Universal Audio Source Separation. In: 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing. MLSP, pp. 1–6.
– start-page: 246
  year: 2017
  end-page: 250
  ident: b4
  article-title: Deep attractor network for single-microphone speaker separation
  publication-title: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing
– year: 2016
  ident: b12
  article-title: Deep clustering: Discriminative embeddings for segmentation and separation
– reference: Pariente, Manuel, Cornell, Samuele, Cosentino, Joris, Sivasankaran, Sunit, Tzinis, Efthymios, Heitkaemper, Jens, Olvera, Michel, Stöter, Fabian-Robert, Hu, Mathieu, Martín-Doñas, Juan M., Ditter, David, Frank, Ariel, Deleforge, Antoine, Vincent, Emmanuel, 2020a. Asteroid: the PyTorch-based audio source separation toolkit for researchers. In: Proc. Interspeech.
– year: 2023
  ident: b38
  article-title: Speech enhancement and dereverberation with diffusion-based generative models
  publication-title: IEEE/ACM Trans. Audio Speech Lang. Process.
– reference: Chollet, François, 2017. Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1251–1258.
– ident: 10.1016/j.specom.2024.103150_b29
  doi: 10.1109/ICASSP40776.2020.9054266
– year: 2018
  ident: 10.1016/j.specom.2024.103150_b42
– year: 2018
  ident: 10.1016/j.specom.2024.103150_b3
– volume: 21
  start-page: 2140
  issue: 10
  year: 2013
  ident: 10.1016/j.specom.2024.103150_b32
  article-title: Supervised and unsupervised speech enhancement using nonnegative matrix factorization
  publication-title: IEEE Trans. Audio Speech Lang. Process.
  doi: 10.1109/TASL.2013.2270369
– volume: 25
  start-page: 1901
  issue: 10
  year: 2017
  ident: 10.1016/j.specom.2024.103150_b21
  article-title: Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks
  publication-title: IEEE/ACM Trans. Audio Speech Lang. Process.
  doi: 10.1109/TASLP.2017.2726762
– start-page: 234
  year: 2015
  ident: 10.1016/j.specom.2024.103150_b40
  article-title: U-net: Convolutional networks for biomedical image segmentation
– start-page: 1
  year: 2023
  ident: 10.1016/j.specom.2024.103150_b41
  article-title: Diffusion-based generative speech source separation
– year: 2018
  ident: 10.1016/j.specom.2024.103150_b53
– start-page: 23
  year: 2008
  ident: 10.1016/j.specom.2024.103150_b22
  article-title: Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction
– ident: 10.1016/j.specom.2024.103150_b47
  doi: 10.1109/MLSP49062.2020.9231900
– ident: 10.1016/j.specom.2024.103150_b39
  doi: 10.1109/ICASSP.2001.941023
– start-page: 6837
  year: 2022
  ident: 10.1016/j.specom.2024.103150_b17
  article-title: Investigating self-supervised learning for speech enhancement and separation
– volume: 15
  start-page: 1135
  issue: 5
  year: 2004
  ident: 10.1016/j.specom.2024.103150_b15
  article-title: Monaural speech segregation based on pitch tracking and amplitude modulation
  publication-title: IEEE Trans. Neural Netw.
  doi: 10.1109/TNN.2004.832812
– volume: 30
  year: 2017
  ident: 10.1016/j.specom.2024.103150_b48
  article-title: Attention is all you need
  publication-title: Adv. Neural Inf. Process. Syst.
– year: 2016
  ident: 10.1016/j.specom.2024.103150_b12
– volume: 34
  start-page: 22509
  year: 2021
  ident: 10.1016/j.specom.2024.103150_b14
  article-title: Speech separation using an asynchronous fully recurrent convolutional neural network
  publication-title: Adv. Neural Inf. Process. Syst. (NeurIPS)
– ident: 10.1016/j.specom.2024.103150_b34
  doi: 10.21437/Interspeech.2020-1673
– volume: 106
  year: 2020
  ident: 10.1016/j.specom.2024.103150_b36
  article-title: U2-Net: Going deeper with nested U-structure for salient object detection
  publication-title: Pattern Recognit.
  doi: 10.1016/j.patcog.2020.107404
– volume: 19
  start-page: 199
  year: 2008
  ident: 10.1016/j.specom.2024.103150_b50
  article-title: Computational auditory scene analysis: Principles, algorithms, and applications
  publication-title: IEEE Trans. Neural Netw.
  doi: 10.1109/TNN.2007.913988
– year: 2020
  ident: 10.1016/j.specom.2024.103150_b9
– ident: 10.1016/j.specom.2024.103150_b11
  doi: 10.1109/ICCV.2015.123
– year: 2018
  ident: 10.1016/j.specom.2024.103150_b2
– ident: 10.1016/j.specom.2024.103150_b16
  doi: 10.1109/CVPR.2017.243
– year: 2021
  ident: 10.1016/j.specom.2024.103150_b37
– start-page: 47
  year: 2016
  ident: 10.1016/j.specom.2024.103150_b24
  article-title: Temporal convolutional networks: A unified approach to action segmentation
– ident: 10.1016/j.specom.2024.103150_b31
  doi: 10.21437/Interspeech.2023-1753
– year: 2023
  ident: 10.1016/j.specom.2024.103150_b38
  article-title: Speech enhancement and dereverberation with diffusion-based generative models
  publication-title: IEEE/ACM Trans. Audio Speech Lang. Process.
  doi: 10.1109/TASLP.2023.3285241
– year: 2016
  ident: 10.1016/j.specom.2024.103150_b19
– start-page: 667
  year: 2019
  ident: 10.1016/j.specom.2024.103150_b54
  article-title: Time domain audio visual speech separation
– year: 2017
  ident: 10.1016/j.specom.2024.103150_b13
– ident: 10.1016/j.specom.2024.103150_b55
  doi: 10.1109/ICASSP.2017.7952154
– volume: 120
  start-page: 2421
  issue: 5
  year: 2006
  ident: 10.1016/j.specom.2024.103150_b8
  article-title: An audio-visual corpus for speech perception and automatic speech recognition
  publication-title: J. Acoust. Soc. Am.
  doi: 10.1121/1.2229005
– volume: PP
  start-page: 1
  issue: 8
  year: 2018
  ident: 10.1016/j.specom.2024.103150_b26
  article-title: Listen and look: Audio-visual matching assisted speech source separation
  publication-title: IEEE Signal Process. Lett.
– ident: 10.1016/j.specom.2024.103150_b43
  doi: 10.1109/ICASSP39728.2021.9413901
– ident: 10.1016/j.specom.2024.103150_b44
  doi: 10.1109/ICASSP39728.2021.9413901
– year: 2018
  ident: 10.1016/j.specom.2024.103150_b10
– ident: 10.1016/j.specom.2024.103150_b7
  doi: 10.1109/CVPR.2017.195
– volume: 14
  start-page: 1462
  year: 2006
  ident: 10.1016/j.specom.2024.103150_b49
  article-title: Performance measurement in blind audio source separation
  publication-title: IEEE Trans. Audio Speech Lang. Process.
  doi: 10.1109/TSA.2005.858005
– start-page: 448
  year: 2015
  ident: 10.1016/j.specom.2024.103150_b18
  article-title: Batch normalization: Accelerating deep network training by reducing internal covariate shift
– volume: 26
  start-page: 1702
  issue: 10
  year: 2018
  ident: 10.1016/j.specom.2024.103150_b51
  article-title: Supervised speech separation based on deep learning: An overview
  publication-title: IEEE/ACM Trans. Audio Speech Lang. Process.
  doi: 10.1109/TASLP.2018.2842159
– volume: 25
  start-page: 975
  issue: 5
  year: 1953
  ident: 10.1016/j.specom.2024.103150_b6
  article-title: Some experiments on the recognition of speech, with one and with two ears
  publication-title: J. Acoust. Soc. Am.
  doi: 10.1121/1.1907229
– start-page: 246
  year: 2017
  ident: 10.1016/j.specom.2024.103150_b4
  article-title: Deep attractor network for single-microphone speaker separation
– ident: 10.1016/j.specom.2024.103150_b23
  doi: 10.1109/CVPR.2017.113
– volume: 26
  start-page: 787
  issue: 4
  year: 2018
  ident: 10.1016/j.specom.2024.103150_b28
  article-title: Speaker-independent speech separation with deep attractor network
  publication-title: IEEE/ACM Trans. Audio Speech Lang. Process.
  doi: 10.1109/TASLP.2018.2795749
– volume: 27
  start-page: 1256
  issue: 8
  year: 2019
  ident: 10.1016/j.specom.2024.103150_b30
  article-title: Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation
  publication-title: IEEE/ACM Trans. Audio Speech Lang. Process.
  doi: 10.1109/TASLP.2019.2915167
– start-page: 2642
  year: 2020
  ident: 10.1016/j.specom.2024.103150_b5
  article-title: Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation
  publication-title: Interspeech
– volume: 27
  start-page: 1697
  issue: 11
  year: 2019
  ident: 10.1016/j.specom.2024.103150_b27
  article-title: Audio–visual deep clustering for speech separation
  publication-title: IEEE/ACM Trans. Audio Speech Lang. Process.
  doi: 10.1109/TASLP.2019.2928140
– ident: 10.1016/j.specom.2024.103150_b25
  doi: 10.21437/Interspeech.2019-2003
– ident: 10.1016/j.specom.2024.103150_b46
  doi: 10.1145/3474085.3475587
– start-page: 5206
  year: 2015
  ident: 10.1016/j.specom.2024.103150_b33
  article-title: Librispeech: An ASR corpus based on public domain audio books
– year: 2022
  ident: 10.1016/j.specom.2024.103150_b45
– start-page: 6364
  year: 2020
  ident: 10.1016/j.specom.2024.103150_b35
  article-title: Filterbank design for end-to-end speech separation
– year: 2014
  ident: 10.1016/j.specom.2024.103150_b20
– start-page: 686
  year: 2018
  ident: 10.1016/j.specom.2024.103150_b52
  article-title: Alternative objective functions for deep clustering
– year: 2018
  ident: 10.1016/j.specom.2024.103150_b1
Snippet The key to single-channel time-domain speech separation lies in encoding the mixed speech into latent feature representations by encoder and obtaining the...
SourceID crossref
elsevier
SourceType Enrichment Source
Index Database
Publisher
StartPage 103150
SubjectTerms Computational complexity
Multi-temporal resolution Transformer
Short sequence encoder–decoder framework
Speech separation
Title Efficient time-domain speech separation using short encoded sequence network
URI https://dx.doi.org/10.1016/j.specom.2024.103150
Volume 166