Semi-Supervised Multichannel Speech Enhancement With a Deep Speech Prior

This paper describes a semi-supervised multichannel speech enhancement method that uses clean speech data for prior training. Although multichannel nonnegative matrix factorization (MNMF) and its constrained variant called independent low-rank matrix analysis (ILRMA) have successfully been used for...

Celý popis

Uloženo v:
Podrobná bibliografie
Vydáno v:IEEE/ACM transactions on audio, speech, and language processing Ročník 27; číslo 12; s. 2197 - 2212
Hlavní autoři: Sekiguchi, Kouhei, Bando, Yoshiaki, Nugraha, Aditya Arie, Yoshii, Kazuyoshi, Kawahara, Tatsuya
Médium: Journal Article
Jazyk:angličtina
Vydáno: Piscataway IEEE 01.12.2019
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Témata:
ISSN:2329-9290, 2329-9304
On-line přístup:Získat plný text
Tagy: Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
Popis
Shrnutí:This paper describes a semi-supervised multichannel speech enhancement method that uses clean speech data for prior training. Although multichannel nonnegative matrix factorization (MNMF) and its constrained variant called independent low-rank matrix analysis (ILRMA) have successfully been used for unsupervised speech enhancement, the low-rank assumption on the power spectral densities (PSDs) of all sources (speech and noise) does not hold in reality. To solve this problem, we replace a low-rank speech model with a deep generative speech model, i.e., formulate a probabilistic model of noisy speech by integrating a deep speech model, a low-rank noise model, and a full-rank or rank-1 model of spatial characteristics of speech and noise. The deep speech model is trained from clean speech data in an unsupervised auto-encoding variational Bayesian manner. Given multichannel noisy speech spectra, the full-rank or rank-1 spatial covariance matrices and PSDs of speech and noise are estimated in an unsupervised maximum-likelihood manner. Experimental results showed that the full-rank version of the proposed method was significantly better than MNMF, ILRMA, and the rank-1 version. We confirmed that the initialization-sensitivity and local-optimum problems of MNMF with many spatial parameters can be solved by incorporating the precise speech model.
Bibliografie:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ISSN:2329-9290
2329-9304
DOI:10.1109/TASLP.2019.2944348