Multimodal Token Fusion for Vision Transformers

Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources like images. Intuitively, feeding multiple modalities of data to vision transformers could improve performance, yet the inner-modal attentive we...

Bibliographic Details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 12176-12185
Main Authors: Wang, Yikai, Chen, Xinghao, Cao, Lele, Huang, Wenbing, Sun, Fuchun, Wang, Yunhe
Format: Conference Proceeding
Language: English
Published: IEEE 01.06.2022
ISSN: 1063-6919