Cross-modal emotion hashing network: Efficient binary coding for large-scale multimodal emotion retrieval
Saved in:
| Published in: | Knowledge-based systems, Vol. 331, p. 114771 |
|---|---|
| Main authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V., 03.12.2025 |
| Subjects: | |
| ISSN: | 0950-7051 |
| Online access: | Get full text |
Summary:

• Reframe multimodal emotion recognition as affect-oriented binary hashing with the Cross-Modal Emotion Hashing Network (CEHN).
• Employ a three-hop gated cross-modal attention hierarchy to fuse audio, visual, and textual cues efficiently.
• Use polarity-aware contrastive distillation to produce balanced 32–128-bit codes whose Hamming distances preserve emotion similarity.
• On MELD and CMU-MOSI, a 64-bit CEHN variant matches dense-vector baselines while shrinking memory up to 256× and cutting CPU latency to <3 ms.
• Enable billion-scale, privacy-preserving retrieval on commodity hardware.

Multimodal emotion recognition (MER) remains computationally heavy because speech, vision, and language features are high-dimensional. We reframe MER as affect-oriented retrieval and propose the Cross-Modal Emotion Hashing Network (CEHN), which encodes each utterance into compact binary codes that preserve fine-grained emotional semantics. CEHN couples pre-trained acoustic, visual, and textual encoders with gated cross-modal attention, then applies a polarity-aware distillation loss that aligns continuous affect vectors with a temperature-controlled sign layer. The result is 32–128-bit hash codes whose Hamming distances approximate emotion similarity. On MELD and CMU-MOSI, CEHN matches or surpasses dense state-of-the-art models, improving F1 by up to 1.9 points while shrinking representation size by up to 16× and reducing CPU inference to <2 ms per utterance. Latency-matched quantization baselines trail by as much as 5 mAP points. Ablations confirm the critical roles of gated attention and polarity-aware distillation. By uniting efficient hashing with modality-aware fusion, CEHN enables real-time, privacy-preserving emotion understanding for dialogue systems, recommendation engines, and safety-critical moderation on resource-constrained devices.
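The efficiency claims in the summary rest on two standard hashing ingredients: a smooth, temperature-controlled surrogate for the sign function during training, and XOR-plus-popcount Hamming search over bit-packed codes at query time. The sketch below illustrates both in NumPy; the function names, the tanh relaxation, and the fixed 64-bit code length are illustrative assumptions, not the paper's published implementation.

```python
import numpy as np

def tanh_sign(h, tau=0.5):
    """Differentiable surrogate for sign(h) used during training.

    tanh(h / tau) approaches the hard sign as tau -> 0; the exact
    relaxation and temperature schedule in CEHN are assumptions here.
    """
    return np.tanh(h / tau)

def to_packed_codes(h):
    """Binarize continuous affect vectors of shape (n, 64) into one
    uint64 per utterance, so Hamming search reduces to XOR + popcount."""
    bits = (h > 0).astype(np.uint8)                 # (n, 64) in {0, 1}
    return np.packbits(bits, axis=1).view(np.uint64).ravel()

def hamming_search(query, db, k=5):
    """Indices of the k codes in `db` closest to `query` in Hamming distance."""
    x = np.bitwise_xor(db, query)                   # set bits mark disagreements
    # Popcount each uint64 by unpacking its 8 bytes into individual bits.
    dist = np.unpackbits(x.view(np.uint8).reshape(-1, 8), axis=1).sum(axis=1)
    idx = np.argpartition(dist, k)[:k]
    return idx[np.argsort(dist[idx])]

# Hypothetical usage: stand-ins for CEHN's fused affect vectors.
rng = np.random.default_rng(0)
affect = rng.standard_normal((10_000, 64))
relaxed = tanh_sign(affect)                         # what training would see
db = to_packed_codes(relaxed)                       # sign-compatible with `affect`
neighbors = hamming_search(db[0], db, k=5)          # db[0] retrieves itself first
```

Because each utterance collapses to a single machine word and distance computation is a vectorized XOR plus popcount, this style of search is where the memory and CPU-latency savings reported in the summary come from.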
| ISSN: | 0950-7051 |
|---|---|
| DOI: | 10.1016/j.knosys.2025.114771 |