Semi-Supervised Multichannel Speech Enhancement With a Deep Speech Prior


Full Description

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 27, No. 12, pp. 2197-2212
Authors: Sekiguchi, Kouhei; Bando, Yoshiaki; Nugraha, Aditya Arie; Yoshii, Kazuyoshi; Kawahara, Tatsuya
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.12.2019
ISSN: 2329-9290, 2329-9304
Online access: Full text
Description
Abstract: This paper describes a semi-supervised multichannel speech enhancement method that uses clean speech data for prior training. Although multichannel nonnegative matrix factorization (MNMF) and its constrained variant called independent low-rank matrix analysis (ILRMA) have successfully been used for unsupervised speech enhancement, the low-rank assumption on the power spectral densities (PSDs) of all sources (speech and noise) does not hold in reality. To solve this problem, we replace the low-rank speech model with a deep generative speech model, i.e., we formulate a probabilistic model of noisy speech by integrating a deep speech model, a low-rank noise model, and a full-rank or rank-1 model of the spatial characteristics of speech and noise. The deep speech model is trained from clean speech data in an unsupervised auto-encoding variational Bayesian manner. Given multichannel noisy speech spectra, the full-rank or rank-1 spatial covariance matrices and the PSDs of speech and noise are estimated in an unsupervised maximum-likelihood manner. Experimental results showed that the full-rank version of the proposed method was significantly better than MNMF, ILRMA, and the rank-1 version. We confirmed that the initialization-sensitivity and local-optimum problems of MNMF, which has many spatial parameters, can be solved by incorporating the precise speech model.
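
As a quick orientation for the model the abstract describes, the following is a minimal mathematical sketch in LaTeX. The notation (z_t, sigma_theta, g_t, w_fk, h_kt, G_f, M channels, K noise bases) is assumed for illustration and is not necessarily the paper's own; it simply instantiates the stated combination of a deep (VAE) speech prior, a low-rank NMF noise model, and full-rank or rank-1 spatial covariance matrices.

% Minimal sketch of the generative model outlined in the abstract.
% All symbol names here are assumptions made for illustration.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}

Deep speech prior: a VAE decoder $\sigma_\theta^2$ maps a latent code to a speech PSD, with a frame-wise gain $g_t$:
\begin{align}
  \mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad
  \lambda^{(\mathrm{s})}_{ft} = g_t \big[\sigma_\theta^2(\mathbf{z}_t)\big]_f .
\end{align}

Low-rank (NMF) noise PSD with $K$ bases:
\begin{align}
  \lambda^{(\mathrm{n})}_{ft} = \sum_{k=1}^{K} w_{fk} h_{kt}, \qquad w_{fk}, h_{kt} \ge 0 .
\end{align}

Multichannel observation $\mathbf{x}_{ft} \in \mathbb{C}^{M}$ with spatial covariance matrices
$\mathbf{G}^{(\mathrm{s})}_f, \mathbf{G}^{(\mathrm{n})}_f \in \mathbb{C}^{M \times M}$
(full rank in general, or rank-1 as $\mathbf{G}_f = \mathbf{a}_f \mathbf{a}_f^{\mathsf{H}}$):
\begin{align}
  \mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}\!\Big(\mathbf{0},\;
    \lambda^{(\mathrm{s})}_{ft} \mathbf{G}^{(\mathrm{s})}_f
    + \lambda^{(\mathrm{n})}_{ft} \mathbf{G}^{(\mathrm{n})}_f \Big) .
\end{align}

The decoder is pre-trained on clean speech spectra $\mathbf{s}_t$ with the standard variational lower bound:
\begin{align}
  \mathcal{L}(\theta, \phi) =
  \mathbb{E}_{q_\phi(\mathbf{z}_t \mid \mathbf{s}_t)}\!\big[\log p_\theta(\mathbf{s}_t \mid \mathbf{z}_t)\big]
  - \mathrm{KL}\big(q_\phi(\mathbf{z}_t \mid \mathbf{s}_t) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})\big) .
\end{align}

\end{document}

At enhancement time, the abstract states that the spatial covariance matrices and the speech and noise PSDs (under a parameterization like the one above, the latent codes, gains, and NMF parameters) are fitted to the observed multichannel spectra by maximum-likelihood estimation, without paired noisy/clean training data.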
DOI: 10.1109/TASLP.2019.2944348