Distributed Audio-Visual Parsing Based On Multimodal Transformer and Deep Joint Source Channel Coding
Audio-visual parsing (AVP) is a recently emerged multimodal perception task that detects and classifies audio-visual events in video. However, most existing AVP networks use only a simple attention mechanism to guide audio-visual multimodal events and are implemented at a single end. This makes it u...
| Published in: | Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) pp. 4623 - 4627 |
|---|---|
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 23.05.2022 |
| ISSN: | 2379-190X |