Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding

Detailed bibliography
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), Volume 2025, pp. 26147-26159
Main authors: Tang, Feilong; Liu, Chengzhi; Xu, Zhongxing; Hu, Ming; Huang, Zile; Xue, Haochen; Chen, Ziyang; Peng, Zelin; Yang, Zhiwei; Zhou, Sijin; Li, Wenxue; Li, Yulong; Song, Wenxuan; Su, Shiyan; Feng, Wei; Su, Jionglong; Lin, Minquan; Peng, Yifan; Cheng, Xuelian; Razzak, Imran; Ge, Zongyuan
Format: Conference Proceeding; Journal Article
Language: English
Published: United States, IEEE, 01.06.2025
ISSN: 1063-6919
Abstract Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.
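The register idea described in the abstract, extra attention slots that soak up attention otherwise captured by outlier tokens, can be illustrated with a generic toy sketch. This is a hypothetical numpy illustration of the general register/sink mechanism, not the FarSight implementation; the scores, register count `k`, and bias value are all invented for the example:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy attention scores of one query over 6 context tokens.
# Token 2 is an "outlier" that attracts a disproportionate score.
scores = np.array([1.0, 0.8, 6.0, 1.1, 0.9, 1.2])
baseline = softmax(scores)

# Register variant: append k extra slots with a fixed score bias.
# The registers carry no content; they only absorb probability mass
# that would otherwise concentrate on the outlier token, so the
# remaining context tokens keep a larger relative share.
k, bias = 2, 3.0
with_regs = softmax(np.concatenate([scores, np.full(k, bias)]))

print(f"outlier share: {baseline[2]:.3f} -> {with_regs[2]:.3f}")
```

In a full decoder this redistribution would have to live inside the attention masking itself (the abstract places the register structure in the otherwise unused upper-triangular region of the causal mask), but the softmax arithmetic above is the underlying effect being exploited.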
Author_xml – sequence: 1
  givenname: Feilong
  surname: Tang
  fullname: Tang, Feilong
  email: Feilong.Tang@monash.edu
  organization: Monash University
– sequence: 2
  givenname: Chengzhi
  surname: Liu
  fullname: Liu, Chengzhi
  organization: XJTLU
– sequence: 3
  givenname: Zhongxing
  surname: Xu
  fullname: Xu, Zhongxing
  organization: Monash University
– sequence: 4
  givenname: Ming
  surname: Hu
  fullname: Hu, Ming
  organization: Monash University
– sequence: 5
  givenname: Zile
  surname: Huang
  fullname: Huang, Zile
  organization: XJTLU
– sequence: 6
  givenname: Haochen
  surname: Xue
  fullname: Xue, Haochen
  organization: XJTLU
– sequence: 7
  givenname: Ziyang
  surname: Chen
  fullname: Chen, Ziyang
  organization: Northwestern Polytechnical University
– sequence: 8
  givenname: Zelin
  surname: Peng
  fullname: Peng, Zelin
  organization: Shanghai Jiaotong University
– sequence: 9
  givenname: Zhiwei
  surname: Yang
  fullname: Yang, Zhiwei
  organization: Fudan University
– sequence: 10
  givenname: Sijin
  surname: Zhou
  fullname: Zhou, Sijin
  organization: Monash University
– sequence: 11
  givenname: Wenxue
  surname: Li
  fullname: Li, Wenxue
  organization: Monash University
– sequence: 12
  givenname: Yulong
  surname: Li
  fullname: Li, Yulong
  organization: MBZUAI
– sequence: 13
  givenname: Wenxuan
  surname: Song
  fullname: Song, Wenxuan
  organization: Monash University
– sequence: 14
  givenname: Shiyan
  surname: Su
  fullname: Su, Shiyan
  organization: Monash University
– sequence: 15
  givenname: Wei
  surname: Feng
  fullname: Feng, Wei
  organization: Monash University
– sequence: 16
  givenname: Jionglong
  surname: Su
  fullname: Su, Jionglong
  organization: XJTLU
– sequence: 17
  givenname: Minquan
  surname: Lin
  fullname: Lin, Minquan
  organization: University of Minnesota
– sequence: 18
  givenname: Yifan
  surname: Peng
  fullname: Peng, Yifan
  organization: Cornell University
– sequence: 19
  givenname: Xuelian
  surname: Cheng
  fullname: Cheng, Xuelian
  organization: Monash University
– sequence: 20
  givenname: Imran
  surname: Razzak
  fullname: Razzak, Imran
  organization: MBZUAI
– sequence: 21
  givenname: Zongyuan
  surname: Ge
  fullname: Ge, Zongyuan
  organization: Monash University
BackLink https://www.ncbi.nlm.nih.gov/pubmed/40951258 (View this record in MEDLINE/PubMed)
CODEN IEEPAD
ContentType Conference Proceeding
Journal Article
DOI 10.1109/CVPR52734.2025.02435
Discipline Applied Sciences
Computer Science
EISBN 9798331543648
EISSN 1063-6919
EndPage 26159
ExternalDocumentID 40951258
11092478
Genre orig-research
Journal Article
GrantInformation_xml – fundername: NCI NIH HHS
  grantid: R01 CA289249
– fundername: NHLBI NIH HHS
  grantid: 75N92020D00021
ISSN 1063-6919
IsPeerReviewed false
IsScholarly true
Language English
PMID 40951258
PQID 3250396898
PQPubID 23479
PageCount 13
PublicationCentury 2000
PublicationDate 2025-Jun
PublicationDateYYYYMMDD 2025-06-01
PublicationDate_xml – month: 06
  year: 2025
  text: 2025-Jun
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationTitleAlternate Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit
PublicationYear 2025
Publisher IEEE
Publisher_xml – name: IEEE
StartPage 26147
SubjectTerms Data mining
Decoding
Encoding
Interference
Large language models
Question answering (information retrieval)
Registers
Video sequences
Visualization
Title Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding
URI https://ieeexplore.ieee.org/document/11092478
https://www.ncbi.nlm.nih.gov/pubmed/40951258
https://www.proquest.com/docview/3250396898
Volume 2025