Hybrid Transformers for Music Source Separation
| Published in: | Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998), pp. 1-5 |
|---|---|
| Main authors: | Rouard, Simon; Massa, Francisco; Defossez, Alexandre |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 4 June 2023 |
| ISSN: | 2379-190X |
| Abstract | A natural question arising in Music Source Separation (MSS) is whether long-range contextual information is useful, or whether local acoustic features are sufficient. In other fields, attention-based Transformers [1] have shown their ability to integrate information over long sequences. In this work, we introduce Hybrid Transformer Demucs (HT Demucs), a hybrid temporal/spectral bi-U-Net based on Hybrid Demucs [2], where the innermost layers are replaced by a cross-domain Transformer Encoder, using self-attention within one domain and cross-attention across domains. While it performs poorly when trained only on MUSDB [3], we show that it outperforms Hybrid Demucs (trained on the same data) by 0.45 dB of SDR when using 800 extra training songs. Using sparse attention kernels to extend its receptive field, and per-source fine-tuning, we achieve state-of-the-art results on MUSDB with extra training data, with 9.20 dB of SDR. |
|---|---|
| Author | Rouard, Simon; Massa, Francisco; Defossez, Alexandre |
| Author_xml | 1. Rouard, Simon (Meta AI); 2. Massa, Francisco (Meta AI); 3. Defossez, Alexandre (Meta AI) |
| ContentType | Conference Proceeding |
| DOI | 10.1109/ICASSP49357.2023.10096956 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan (POP) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP) 1998-present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Engineering |
| EISBN | 1728163277; 9781728163277 |
| EISSN | 2379-190X |
| EndPage | 5 |
| ExternalDocumentID | 10096956 |
| Genre | orig-research |
| IngestDate | Wed Aug 27 02:23:37 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| LinkModel | DirectLink |
| PageCount | 5 |
| ParticipantIDs | ieee_primary_10096956 |
| PublicationCentury | 2000 |
| PublicationDate | 2023-June-4 |
| PublicationDateYYYYMMDD | 2023-06-04 |
| PublicationDate_xml | – month: 06 year: 2023 text: 2023-June-4 day: 04 |
| PublicationDecade | 2020 |
| PublicationTitle | Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) |
| PublicationTitleAbbrev | ICASSP |
| PublicationYear | 2023 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssj0008748 |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 1 |
| SubjectTerms | Acoustics; Multiple signal classification; Music Source Separation; Source separation; Speech processing; Training; Training data; Transformers |
| Title | Hybrid Transformers for Music Source Separation |
| URI | https://ieeexplore.ieee.org/document/10096956 |
| hasFullText | 1 |
| inHoldings | 1 |
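
The abstract reports its results in dB of SDR (signal-to-distortion ratio). As a rough illustration of what such a figure measures, the sketch below computes the simple energy-ratio form, 10·log10(‖s‖² / ‖s − ŝ‖²), for a single pair of signals. This is an assumption made for illustration: the record does not state the exact evaluation protocol behind the reported numbers, and the function name `sdr_db`, the `eps` stabiliser, and the toy signals are all hypothetical.

```python
import math


def sdr_db(reference, estimate, eps=1e-10):
    """Energy-ratio SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    `eps` (an assumption of this sketch) avoids division by zero and log(0).
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    # Energy of the true source signal.
    signal_energy = sum(r * r for r in reference)
    # Energy of the residual between the true source and the estimate.
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Toy signals: the estimate is the reference scaled by 0.9, so the residual
# is 10% of the reference in amplitude, i.e. 1% in energy.
reference = [math.sin(0.01 * n) for n in range(1000)]
estimate = [0.9 * r for r in reference]
print(round(sdr_db(reference, estimate), 2))  # 20.0
```

Under this definition, a gain of 0.45 dB corresponds to a signal-to-error energy ratio that is larger by a factor of 10^0.045, or roughly 11%.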