Multimodal Weibull Variational Autoencoder for Jointly Modeling Image-Text Data

Bibliographic Details
Published in: IEEE Transactions on Cybernetics, Vol. 52, No. 10, pp. 11156 - 11171
Main Authors: Wang, Chaojie, Chen, Bo, Xiao, Sucheng, Wang, Zhengjue, Zhang, Hao, Wang, Penghui, Han, Ning, Zhou, Mingyuan
Format: Journal Article
Language: English
Published: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), United States, 01.10.2022
Subjects:
ISSN: 2168-2267, 2168-2275
Description
Summary: For multimodal representation learning, traditional black-box approaches often fall short of extracting interpretable multilayer hidden structures, which help visualize the connections between different modalities at multiple semantic levels. To extract interpretable multimodal latent representations and visualize the hierarchical semantic relationships between different modalities, based on deep topic models, we develop a novel multimodal Poisson gamma belief network (mPGBN) that tightly couples the observations of different modalities by imposing sparse connections between their modality-specific hidden layers. To avoid the time-consuming Gibbs sampling that traditional topic models require at test time, we construct a Weibull-based variational inference network (encoder) that directly maps the observations to their latent representations, and combine it with the mPGBN (decoder), resulting in a novel multimodal Weibull variational autoencoder (MWVAE), which is fast in out-of-sample prediction and can handle large-scale multimodal datasets. Qualitative evaluations on bimodal data consisting of image-text pairs show that the developed MWVAE can successfully extract expressive multimodal latent representations for downstream tasks such as missing-modality imputation and multimodal retrieval. Further extensive quantitative results demonstrate that both MWVAE and its supervised extension sMWVAE achieve state-of-the-art performance on various multimodal benchmarks.
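The speed-up at test time comes from the encoder's use of a reparameterizable Weibull distribution: if u ~ Uniform(0, 1), then z = lam * (-ln(1 - u))^(1/k) follows Weibull(k, lam), so a latent sample is a single deterministic, differentiable transform of the network outputs. The sketch below is a minimal, illustrative PyTorch version of such a Weibull inference network, not the authors' released implementation; the class name WeibullEncoder, the layer sizes, and the single-hidden-layer architecture are assumptions made for illustration.

# Minimal sketch (not the authors' code): a Weibull inference network using
# the inverse-CDF reparameterization z = lam * (-log(1 - u)) ** (1 / k),
# where u ~ Uniform(0, 1) gives z ~ Weibull(k, lam).
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeibullEncoder(nn.Module):
    """Maps an observation x to Weibull parameters (k, lam) and draws a
    reparameterized latent sample z for fast amortized inference."""
    def __init__(self, in_dim: int, latent_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.hidden = nn.Linear(in_dim, hidden_dim)
        self.k_head = nn.Linear(hidden_dim, latent_dim)    # Weibull shape
        self.lam_head = nn.Linear(hidden_dim, latent_dim)  # Weibull scale

    def forward(self, x):
        h = F.relu(self.hidden(x))
        k = F.softplus(self.k_head(h)) + 1e-6      # keep shape positive
        lam = F.softplus(self.lam_head(h)) + 1e-6  # keep scale positive
        u = torch.rand_like(lam)                   # u ~ Uniform(0, 1)
        z = lam * (-torch.log(1.0 - u + 1e-8)) ** (1.0 / k)  # inverse-CDF sample
        return z, k, lam

if __name__ == "__main__":
    enc = WeibullEncoder(in_dim=784, latent_dim=32)
    x = torch.rand(8, 784)
    z, k, lam = enc(x)
    print(z.shape)  # torch.Size([8, 32])

Because the sample z is a differentiable function of k and lam, the evidence lower bound can be optimized end to end with standard stochastic gradients, which is what makes out-of-sample prediction fast compared with running a Gibbs sampler for each test observation.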
DOI: 10.1109/TCYB.2021.3070881