Cross-modal emotion hashing network: Efficient binary coding for large-scale multimodal emotion retrieval



Bibliographic Details
Published in: Knowledge-Based Systems, Vol. 331, p. 114771
Main Authors: Li, Chenhao; Huang, Wenti; Yang, Zhan; Long, Jun
Format: Journal Article
Language: English
Published: Elsevier B.V., 03.12.2025
ISSN:0950-7051
Description
Summary:
•Reframe multimodal emotion recognition as affect-oriented binary hashing with the Cross-Modal Emotion Hashing Network (CEHN).
•Employ a three-hop gated cross-modal attention hierarchy to fuse audio, visual, and textual cues efficiently.
•Use polarity-aware contrastive distillation to produce balanced 32–128-bit codes whose Hamming distances preserve emotion similarity.
•On MELD and CMU-MOSI, a 64-bit CEHN variant matches dense-vector baselines while shrinking memory up to 256× and cutting CPU latency to <3 ms.
•Enable billion-scale, privacy-preserving retrieval on commodity hardware.

Multimodal emotion recognition (MER) remains computationally heavy because speech, vision, and language features are high-dimensional. We reframe MER as affect-oriented retrieval and propose the Cross-Modal Emotion Hashing Network (CEHN), which encodes each utterance into compact binary codes that preserve fine-grained emotional semantics. CEHN couples pre-trained acoustic, visual, and textual encoders with gated cross-modal attention, then applies a polarity-aware distillation loss that aligns continuous affect vectors with a temperature-controlled sign layer. The result is 32–128-bit hash codes whose Hamming distances approximate emotion similarity. On MELD and CMU-MOSI, CEHN matches or surpasses dense state-of-the-art models, improving F1 by up to 1.9 points while shrinking representation size by up to 16× and reducing CPU inference to <2 ms per utterance. Latency-matched quantization baselines trail by as much as 5 mAP points. Ablations confirm the critical roles of gated attention and polarity-aware distillation. By uniting efficient hashing with modality-aware fusion, CEHN enables real-time, privacy-preserving emotion understanding for dialogue systems, recommendation engines, and safety-critical moderation on resource-constrained devices.
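The abstract names two mechanisms without detail: gated cross-modal attention that fuses the acoustic, visual, and textual encoder outputs, and a temperature-controlled sign layer trained with polarity-aware distillation. The sketch below is a minimal PyTorch illustration of both under assumed shapes; the module names (GatedCrossModalAttention, TemperatureSignHash), the layer sizes, the two-hop fusion order, and the tanh relaxation with a straight-through estimator are illustrative assumptions, not the authors' exact architecture, and the polarity-aware distillation loss itself is omitted.

```python
# Minimal sketch, not the paper's implementation: gated cross-modal attention
# plus a temperature-controlled sign layer for producing binary-like codes.
import torch
import torch.nn as nn


class GatedCrossModalAttention(nn.Module):
    """Attend from a query modality to a context modality, then gate the result."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query, context, context)      # cross-modal attention
        g = self.gate(torch.cat([query, attended], dim=-1))   # learned gate in [0, 1]
        return g * attended + (1.0 - g) * query               # gated residual fusion


class TemperatureSignHash(nn.Module):
    """tanh(x / T) relaxes sign(); annealing T toward 0 sharpens codes to ±1."""

    def __init__(self, dim: int, bits: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, bits)

    def forward(self, x: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
        logits = self.proj(x)
        relaxed = torch.tanh(logits / temperature)   # differentiable surrogate for sign()
        hard = torch.sign(logits)                    # hard ±1 codes
        # Straight-through: forward pass emits hard codes, gradients flow through tanh.
        return relaxed + (hard - relaxed).detach()


if __name__ == "__main__":
    dim, bits = 256, 64
    fuse_text_audio = GatedCrossModalAttention(dim)   # text attends to audio
    fuse_text_vision = GatedCrossModalAttention(dim)  # then to vision
    hasher = TemperatureSignHash(dim, bits)

    text = torch.randn(8, 20, dim)     # (batch, seq, dim) per-utterance features
    audio = torch.randn(8, 50, dim)
    vision = torch.randn(8, 30, dim)

    fused = fuse_text_vision(fuse_text_audio(text, audio), vision).mean(dim=1)
    codes = hasher(fused, temperature=0.5)            # (8, 64) values in {-1, +1}
    print(codes.shape)
```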
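The efficiency claim rests on retrieval over packed binary codes: a 64-bit code occupies 8 bytes and Hamming distance reduces to XOR plus popcount, which is why memory and CPU latency stay small even at large corpus sizes. The NumPy sketch below illustrates that lookup; the corpus size and the random codes are placeholders, not the paper's data or retrieval code.

```python
# Minimal sketch of Hamming-distance retrieval over packed 64-bit codes.
import numpy as np

rng = np.random.default_rng(0)
bits, n = 64, 1_000_000                                  # illustrative corpus size

# Binary codes from the hashing network, packed to 8 bytes per item
# (versus kilobytes for a dense float embedding).
gallery = np.packbits(rng.integers(0, 2, size=(n, bits), dtype=np.uint8), axis=1)
query = np.packbits(rng.integers(0, 2, size=(1, bits), dtype=np.uint8), axis=1)

# Hamming distance = popcount of the XOR between packed codes.
xor = np.bitwise_xor(gallery, query)                     # query broadcast over the gallery
dist = np.unpackbits(xor, axis=1).sum(axis=1)            # per-item popcount
nearest = np.argsort(dist)[:10]                          # top-10 emotionally similar items
print(nearest, dist[nearest])
```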
DOI:10.1016/j.knosys.2025.114771