Multimodal Token Fusion for Vision Transformers

Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 12176-12185
Main authors: Wang, Yikai; Chen, Xinghao; Cao, Lele; Huang, Wenbing; Sun, Fuchun; Wang, Yunhe
Format: Conference paper
Language: English
Published: IEEE, 1 June 2022
Subjects: Computer architecture; Deep learning architectures and techniques; Image segmentation; Object detection; Point cloud compression; Recognition: detection, categorization, retrieval; Segmentation, grouping and shape analysis; Semantics; Shape; Three-dimensional displays; Vision + X
ISSN: 1063-6919
Online access: https://ieeexplore.ieee.org/document/9879076
Abstract: Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources like images. Intuitively, feeding multiple modalities of data to vision transformers could improve the performance, yet the inner-modal attentive weights may be diluted, which could greatly undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of the inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact. Extensive experiments are conducted on a variety of homogeneous and heterogeneous modalities and demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point cloud and images. Code: https://github.com/huawei-noah/noah-research and https://gitee.com/mindspore/models/tree/master/research/cv/TokenFusion.
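
The fusion step the abstract describes, scoring each token and swapping low-scoring ones for projected tokens of the other modality, is compact enough to sketch directly. Below is a minimal illustration for two spatially aligned transformer branches (e.g., RGB and depth), assuming per-token scores from a small MLP and a fixed threshold; the module and names (TokenExchange, score_mlp, threshold) are ours for illustration, not the authors' released implementation.

```python
# Illustrative sketch of token substitution in the spirit of TokenFusion.
# Assumptions (not from the paper's code): a sigmoid MLP scorer, a fixed
# threshold, and a single exchange step between two aligned branches.
import torch
import torch.nn as nn

class TokenExchange(nn.Module):
    """Replace uninformative tokens of branch A with projected tokens from
    branch B, then re-add A's positional embedding (a stand-in for the
    paper's residual positional alignment)."""

    def __init__(self, dim: int, threshold: float = 0.02):
        super().__init__()
        # Per-token importance score in [0, 1], predicted from the token itself.
        self.score_mlp = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(),
            nn.Linear(dim // 4, 1), nn.Sigmoid(),
        )
        # Projection mapping tokens of the other modality into this branch.
        self.proj = nn.Linear(dim, dim)
        self.threshold = threshold

    def forward(self, x_a, x_b, pos_a):
        # x_a, x_b: (batch, tokens, dim), assumed spatially aligned;
        # pos_a: positional embedding of branch A, shape (1, tokens, dim).
        score = self.score_mlp(x_a)                  # (B, N, 1)
        keep = (score >= self.threshold).float()     # 1 = informative token
        # Keep high-scoring tokens of A; substitute the rest with projected B.
        fused = keep * x_a + (1.0 - keep) * self.proj(x_b)
        # Residual positional alignment: re-inject A's positions after fusion.
        return fused + pos_a

# Usage: one exchange step at a given depth of the two branches.
exchange = TokenExchange(dim=768)
x_rgb, x_depth = torch.randn(2, 196, 768), torch.randn(2, 196, 768)
pos = torch.zeros(1, 196, 768)  # a learned embedding in practice
x_rgb_fused = exchange(x_rgb, x_depth, pos)
```

A full implementation would repeat this exchange across transformer layers and train the scoring network end to end; consult the linked repositories for the authors' actual design.
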
Authors:
1. Yikai Wang (wangyk17@mails.tsinghua.edu.cn); Tsinghua University, Beijing National Research Center for Information Science and Technology (BNRist), State Key Lab on Intelligent Technology and Systems, Department of Computer Science and Technology
2. Xinghao Chen (xinghao.chen@huawei.com); Huawei Noah's Ark Lab
3. Lele Cao (caolele@gmail.com); Tsinghua University, Beijing National Research Center for Information Science and Technology (BNRist), State Key Lab on Intelligent Technology and Systems, Department of Computer Science and Technology
4. Wenbing Huang (hwenbing@126.com); Institute for AI Industry Research (AIR), Tsinghua University
5. Fuchun Sun (fuchuns@tsinghua.edu.cn); Tsinghua University, Beijing National Research Center for Information Science and Technology (BNRist), State Key Lab on Intelligent Technology and Systems, Department of Computer Science and Technology
6. Yunhe Wang (yunhe.wang@huawei.com); Huawei Noah's Ark Lab
CODEN: IEEPAD
DOI: 10.1109/CVPR52688.2022.01187
EISBN: 1665469463; 9781665469463
EISSN: 1063-6919
Cited in Web of Science: 160 times