Distributed Audio-Visual Parsing Based On Multimodal Transformer and Deep Joint Source Channel Coding


Detailed Description

Bibliographic Details
Published in: Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998), pp. 4623-4627
Main Authors: Wang, Penghong; Li, Jiahui; Ma, Mengyao; Fan, Xiaopeng
Format: Conference Proceedings
Language: English
Published: IEEE, 23.05.2022
Subjects:
ISSN: 2379-190X
Online Access: Full Text
Description
Summary: Audio-visual parsing (AVP) is a recently emerged multimodal perception task that detects and classifies audio-visual events in video. However, most existing AVP networks use only a simple attention mechanism to model audio-visual multimodal events, and are implemented at a single end. This makes them unable to effectively capture the relationships between audio-visual events, and unsuitable for network transmission scenarios. In this paper, we focus on these problems and propose a distributed audio-visual parsing network (DAVPNet) based on multimodal transformers and deep joint source-channel coding (DJSCC). Multimodal transformers are used to enhance the attention computation between audio-visual events, and DJSCC is used to apply distributed AVP tasks to network transmission scenarios. Finally, the Look, Listen, and Parse (LLP) dataset is used to evaluate the algorithm, and the experimental results show that DAVPNet achieves superior parsing performance.
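The abstract's core technique, cross-modal attention between audio and visual streams via a multimodal transformer, can be illustrated with a minimal sketch. This is not the authors' implementation; the feature dimensions, segment counts, and function names below are illustrative assumptions, showing only a single scaled dot-product cross-attention step in which visual segment features attend over audio segment features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feats, key_feats, d_k):
    """One scaled dot-product cross-attention step.

    query_feats: (T_q, d_k) features of one modality (e.g. visual segments)
    key_feats:   (T_k, d_k) features of the other modality (e.g. audio segments)
    Returns the attended features (T_q, d_k) and the attention weights (T_q, T_k).
    """
    scores = query_feats @ key_feats.T / np.sqrt(d_k)  # (T_q, T_k) similarity
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    return weights @ key_feats, weights

# Hypothetical example: 10 one-second video/audio segments, 64-dim features.
rng = np.random.default_rng(0)
video = rng.normal(size=(10, 64))
audio = rng.normal(size=(10, 64))
attended, w = cross_modal_attention(video, audio, 64)
print(attended.shape, w.shape)  # (10, 64) (10, 10)
```

In a full multimodal transformer this step would run in both directions (audio attending to video and vice versa), with learned query/key/value projections and multiple heads, so that temporally co-occurring audio and visual events can reinforce each other's representations.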
DOI: 10.1109/ICASSP43922.2022.9746660