Semi-Supervised Multichannel Speech Enhancement With a Deep Speech Prior

Bibliographic details
Published in:	IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 27, No. 12, pp. 2197-2212
Main authors:	Sekiguchi, Kouhei; Bando, Yoshiaki; Nugraha, Aditya Arie; Yoshii, Kazuyoshi; Kawahara, Tatsuya
Format:	Journal Article
Language:	English
Published:	Piscataway: IEEE, 1 December 2019; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Covariance matrix; Data models; deep learning; Matrix methods; Maximum likelihood estimation; Multichannel speech enhancement; Noise; Noise measurement; nonnegative matrix factorization; Parameter sensitivity; Probabilistic logic; Probabilistic models; Speech; Speech enhancement; Speech processing; Time-frequency analysis; variational autoencoder
ISSN:	2329-9290, 2329-9304

Description
Summary:	This paper describes a semi-supervised multichannel speech enhancement method that uses clean speech data for prior training. Although multichannel nonnegative matrix factorization (MNMF) and its constrained variant called independent low-rank matrix analysis (ILRMA) have been used successfully for unsupervised speech enhancement, the low-rank assumption on the power spectral densities (PSDs) of all sources (speech and noise) does not hold in reality. To solve this problem, we replace the low-rank speech model with a deep generative speech model, i.e., we formulate a probabilistic model of noisy speech by integrating a deep speech model, a low-rank noise model, and a full-rank or rank-1 model of the spatial characteristics of speech and noise. The deep speech model is trained from clean speech data in an unsupervised auto-encoding variational Bayesian manner. Given multichannel noisy speech spectra, the full-rank or rank-1 spatial covariance matrices and the PSDs of speech and noise are estimated in an unsupervised maximum-likelihood manner. Experimental results showed that the full-rank version of the proposed method performed significantly better than MNMF, ILRMA, and the rank-1 version. We confirmed that the initialization-sensitivity and local-optimum problems of MNMF with many spatial parameters can be solved by incorporating the precise speech model.
DOI:	10.1109/TASLP.2019.2944348
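
Illustrative note: the summary describes combining a deep speech PSD model, a low-rank noise PSD model, and full-rank spatial covariance matrices. The NumPy sketch below shows one way these three pieces could be composed into a per-bin mixture covariance, followed by a standard multichannel Wiener filter. It is not the authors' implementation. All sizes, the `toy_decoder` stand-in for a trained decoder, the random parameters, and the Wiener filtering step are assumptions made for illustration; the paper's estimation procedure is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: frequency bins, frames, microphones, noise bases, latent dim
F, T, M, K, D = 4, 5, 3, 2, 6

def toy_decoder(z, weights):
    """Stand-in for a trained decoder: latents (D, T) -> nonnegative speech PSDs (F, T)."""
    return np.exp(weights @ z)

def random_scm(n, m):
    """Random Hermitian positive-definite matrices of shape (n, m, m)."""
    a = rng.standard_normal((n, m, m)) + 1j * rng.standard_normal((n, m, m))
    return a @ a.conj().transpose(0, 2, 1) + 1e-3 * np.eye(m)

def mixture_covariance(psd_s, psd_n, scm_s, scm_n):
    """Per-bin covariances: PSD (F, T) times SCM (F, M, M) -> (F, T, M, M)."""
    cov_s = psd_s[:, :, None, None] * scm_s[:, None]
    cov_n = psd_n[:, :, None, None] * scm_n[:, None]
    return cov_s, cov_s + cov_n

def wiener_filter(x, cov_s, cov_mix):
    """Multichannel Wiener estimate of the speech image from the mixture x (F, T, M)."""
    gain = cov_s @ np.linalg.inv(cov_mix)          # (F, T, M, M)
    return np.einsum("ftmn,ftn->ftm", gain, x)

# Deep speech model (placeholder): PSDs generated from latent variables
z = rng.standard_normal((D, T))
psd_speech = toy_decoder(z, 0.1 * rng.standard_normal((F, D)))

# Low-rank noise model: PSDs as a product of nonnegative bases and activations
W, H = rng.random((F, K)), rng.random((K, T))
psd_noise = W @ H

# Full-rank spatial covariance matrices, one per source and frequency bin
scm_speech, scm_noise = random_scm(F, M), random_scm(F, M)

# Placeholder observation; real input would be an STFT of multichannel noisy speech
x = rng.standard_normal((F, T, M)) + 1j * rng.standard_normal((F, T, M))

cov_speech, cov_mix = mixture_covariance(psd_speech, psd_noise, scm_speech, scm_noise)
speech_estimate = wiener_filter(x, cov_speech, cov_mix)
print(speech_estimate.shape)
```

In this sketch the speech and noise models differ only in how their PSDs are parameterized, which is the substitution the summary describes: swapping the low-rank speech PSD for a decoder output leaves the spatial model and the filtering step unchanged.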