A Reverberation-Time-Aware Approach to Speech Dereverberation Based on Deep Neural Networks

Bibliographic Details
Published in:	IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 25, No. 1, pp. 102-111
Main Authors:	Wu, Bo; Li, Kehuang; Yang, Minglei; Lee, Chin-Hui
Format:	Journal Article
Language:	English
Published:	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2017
Subjects:	Acoustic context; Activation; Artificial neural networks; Context; deep neural networks (DNNs); Feature extraction; frame shift; linear output layer; mean-variance normalization; Reverberation; Reverberation time; reverberation-time-aware (RTA); Speech; speech dereverberation; Speech processing; Training
ISSN:	2329-9290, 2329-9304

Description
Summary:	A reverberation-time-aware deep-neural-network (DNN)-based speech dereverberation framework is proposed to handle a wide range of reverberation times. There are three key steps in designing a robust system. First, in contrast to sigmoid activation and min-max normalization in state-of-the-art algorithms, a linear activation function at the output layer and global mean-variance normalization of target features are adopted to learn the complicated nonlinear mapping function from reverberant to anechoic speech and to improve the restoration of the low-frequency and intermediate-frequency contents. Next, two key design parameters, namely, frame shift size in speech framing and acoustic context window size at the DNN input, are investigated to show that RT60-dependent parameters are needed in the DNN training stage in order to optimize the system performance in diverse reverberant environments. Finally, the reverberation time is estimated to select the proper frame shift and context window sizes for feature extraction before feeding the log-power spectrum features to the trained DNNs for speech dereverberation. Our experimental results indicate that the proposed framework outperforms the conventional DNNs without taking the reverberation time into account, while achieving a performance only slightly worse than the oracle cases with known reverberation times even for extremely weak and severe reverberant conditions. It also generalizes well to unseen room sizes, loudspeaker and microphone positions, and recorded room impulse responses.
DOI:	10.1109/TASLP.2016.2623559
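
The feature-extraction steps named in the summary (RT60-dependent frame shift and context window, log-power spectrum features, global mean-variance normalization) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The lookup table `RTA_TABLE`, its RT60 bands, and every number in it are hypothetical placeholders, including the direction in which the parameters change with RT60. The function names, the Hann window, and the edge padding are likewise assumptions made only for illustration.

```python
import numpy as np

# Hypothetical lookup table: (upper RT60 bound in seconds,
# frame shift in samples, context frames per side).
# Placeholder values for illustration only, not taken from the paper.
RTA_TABLE = [
    (0.3, 256, 1),
    (0.6, 128, 3),
    (float("inf"), 64, 5),
]


def select_params(rt60_est):
    """Return (frame_shift, context) for the first band covering rt60_est."""
    for upper, shift, ctx in RTA_TABLE:
        if rt60_est <= upper:
            return shift, ctx
    raise ValueError("rt60_est must be a real number")


def log_power_spectrum(signal, frame_len, frame_shift):
    """Frame the signal, window each frame, return log-power spectra
    with shape (n_frames, frame_len // 2 + 1)."""
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-12)  # small floor avoids log(0)


def global_mvn(features, mean, std):
    """Mean-variance normalization with per-bin statistics that are
    computed once (for example over a training set) and then reused."""
    return (features - mean) / std


def splice_context(features, ctx):
    """Concatenate each frame with ctx neighbours on each side.
    Border frames are repeated at the edges (an assumption)."""
    padded = np.pad(features, ((ctx, ctx), (0, 0)), mode="edge")
    n = features.shape[0]
    return np.hstack([padded[i : i + n] for i in range(2 * ctx + 1)])
```

In use, an RT60 estimate from any estimator would be passed to `select_params`, and the returned frame shift and context size would drive `log_power_spectrum` and `splice_context` before the spliced features are fed to a trained network. The sketch leaves out the network itself and its linear output layer.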