DDSP-BASED SINGING VOCODERS: A NEW SUBTRACTIVE-BASED SYNTHESIZER AND A COMPREHENSIVE EVALUATION.

Bibliographic Details
Title: DDSP-BASED SINGING VOCODERS: A NEW SUBTRACTIVE-BASED SYNTHESIZER AND A COMPREHENSIVE EVALUATION.
Authors: Da-Yi Wu; Wen-Yi Hsiao; Fu-Rong Yang; Oscar Friedman; Warren Jackson; Scott Bruzenak; Yi-Wen Liu; Yi-Hsuan Yang
Source: International Society for Music Information Retrieval Conference Proceedings; 2022, pp. 76-83 (8 pages)
Subject terms: VOCODER, DIGITAL signal processing, MUSICAL analysis, SYNTHESIZER music, HARMONY in music
Abstract: A vocoder is a conditional audio generation model that converts acoustic features such as mel-spectrograms into waveforms. Taking inspiration from Differentiable Digital Signal Processing (DDSP), we propose a new vocoder named SawSing for singing voices. SawSing synthesizes the harmonic part of singing voices by filtering a sawtooth source signal with a linear time-variant finite impulse response filter whose coefficients are estimated from the input mel-spectrogram by a neural network. As this approach enforces phase continuity, SawSing can generate singing voices without the phase-discontinuity glitch of many existing vocoders. Moreover, the source-filter assumption provides an inductive bias that allows SawSing to be trained on a small amount of data. Our evaluation shows that SawSing converges much faster and outperforms state-of-the-art generative adversarial network- and diffusion-based vocoders in a resource-limited scenario with only 3 training recordings and a 3-hour training time. [ABSTRACT FROM AUTHOR]
Copyright of International Society for Music Information Retrieval Conference Proceedings is the property of Ubiquity Press and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Complementary Index
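Illustrative sketch: the abstract describes the harmonic part as a sawtooth source filtered by a linear time-variant FIR filter. The NumPy code below is a minimal illustration of that idea only, not the authors' implementation. The function names, the frame-wise time-domain overlap-add, and the hand-made filter taps (a stand-in for the coefficients a neural network would estimate from the mel-spectrogram) are all assumptions made for the example; the noise part of the voice is not modeled.

```python
import numpy as np

def sawtooth_source(f0_hz, sample_rate):
    """Sawtooth in [-1, 1) from a per-sample f0 track (hypothetical helper)."""
    # Accumulating phase keeps the waveform continuous when the pitch changes.
    phase = np.cumsum(f0_hz / sample_rate)  # phase measured in cycles
    return 2.0 * (phase % 1.0) - 1.0

def ltv_fir_filter(source, frame_filters, hop):
    """Filter each hop-sized frame with its own FIR taps, then overlap-add."""
    n_frames, n_taps = frame_filters.shape
    if len(source) != n_frames * hop:
        raise ValueError("source length must equal n_frames * hop")
    out = np.zeros(len(source) + n_taps - 1)
    for i in range(n_frames):
        frame = source[i * hop:(i + 1) * hop]
        filtered = np.convolve(frame, frame_filters[i])
        # The convolution tail spills into the following frames.
        out[i * hop:i * hop + len(filtered)] += filtered
    return out[:len(source)]

sample_rate, hop, n_taps = 16000, 160, 64
f0 = np.full(sample_rate, 220.0)  # one second at a fixed 220 Hz pitch
source = sawtooth_source(f0, sample_rate)

# Placeholder taps: a smoothing window whose gain fades over time.
# In the described vocoder these would come from a neural network instead.
window = np.hanning(n_taps)
window /= window.sum()
gains = np.linspace(1.0, 0.2, len(source) // hop)
frame_filters = gains[:, None] * window[None, :]

harmonic = ltv_fir_filter(source, frame_filters, hop)
```

When every frame uses the same taps, this reduces to an ordinary convolution of the whole signal, which is a convenient sanity check. Frame-wise filtering in the frequency domain is a common alternative; the time-domain form is used here only for brevity.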