Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding
Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations....
| Published in: | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) Vol. 2025; pp. 26147-26159 |
|---|---|
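| Main authors: | Tang, Feilong; Liu, Chengzhi; Xu, Zhongxing; Hu, Ming; Huang, Zile; Xue, Haochen; Chen, Ziyang; Peng, Zelin; Yang, Zhiwei; Zhou, Sijin; Li, Wenxue; Li, Yulong; Song, Wenxuan; Su, Shiyan; Feng, Wei; Su, Jionglong; Lin, Minquan; Peng, Yifan; Cheng, Xuelian; Razzak, Imran; Ge, Zongyuan |
| Format: | Conference Proceeding, Journal Article |
| Language: | English |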
| Publication details: | United States: IEEE, 1 June 2025 |
| Subjects: | |
| ISSN: | 1063-6919 |
| Online access: | Get full text |
| Abstract | Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. |
|---|---|
| AbstractList | Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness. |
| Author | Xue, Haochen Tang, Feilong Chen, Ziyang Cheng, Xuelian Peng, Zelin Liu, Chengzhi Ge, Zongyuan Li, Wenxue Yang, Zhiwei Lin, Minquan Zhou, Sijin Li, Yulong Su, Shiyan Feng, Wei Peng, Yifan Song, Wenxuan Razzak, Imran Hu, Ming Su, Jionglong Xu, Zhongxing Huang, Zile |
| Author_xml | 1. Tang, Feilong (Monash University; Feilong.Tang@monash.edu); 2. Liu, Chengzhi (XJTLU); 3. Xu, Zhongxing (Monash University); 4. Hu, Ming (Monash University); 5. Huang, Zile (XJTLU); 6. Xue, Haochen (XJTLU); 7. Chen, Ziyang (Northwestern Polytechnical University); 8. Peng, Zelin (Shanghai Jiaotong University); 9. Yang, Zhiwei (Fudan University); 10. Zhou, Sijin (Monash University); 11. Li, Wenxue (Monash University); 12. Li, Yulong (MBZUAI); 13. Song, Wenxuan (Monash University); 14. Su, Shiyan (Monash University); 15. Feng, Wei (Monash University); 16. Su, Jionglong (XJTLU); 17. Lin, Minquan (University of Minnesota); 18. Peng, Yifan (Cornell University); 19. Cheng, Xuelian (Monash University); 20. Razzak, Imran (MBZUAI); 21. Ge, Zongyuan (Monash University) |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/40951258$$D View this record in MEDLINE/PubMed |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding Journal Article |
| DBID | 6IE 6IH CBEJK RIE RIO NPM 7X8 |
| DOI | 10.1109/CVPR52734.2025.02435 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan (POP) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP) 1998-present PubMed MEDLINE - Academic |
| DatabaseTitle | PubMed MEDLINE - Academic |
| DatabaseTitleList | MEDLINE - Academic PubMed |
| Database_xml | – sequence: 1 dbid: NPM name: PubMed url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database – sequence: 2 dbid: RIE name: IEEE Xplore url: https://ieeexplore.ieee.org/ sourceTypes: Publisher – sequence: 3 dbid: 7X8 name: MEDLINE - Academic url: https://search.proquest.com/medline sourceTypes: Aggregation Database |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Applied Sciences Computer Science |
| EISBN | 9798331543648 |
| EISSN | 1063-6919 |
| EndPage | 26159 |
| ExternalDocumentID | 40951258 11092478 |
| Genre | orig-research Journal Article |
| GrantInformation_xml | – fundername: NCI NIH HHS grantid: R01 CA289249 – fundername: NHLBI NIH HHS grantid: 75N92020D00021 |
| GroupedDBID | 6IE 6IH 6IL 6IN AAWTH ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IJVOP OCL RIE RIL RIO 23M 29F 29O 6IK ABDPE ACGFS IPLJI M43 NPM RNS 7X8 |
| ID | FETCH-LOGICAL-i149t-fb2642edfe5d6ce36f6a52682f912a271fefad272dfbf0a5cf2943b22862ffac3 |
| IEDL.DBID | RIE |
| ISSN | 1063-6919 |
| IngestDate | Mon Sep 15 18:47:20 EDT 2025 Wed Sep 17 02:13:19 EDT 2025 Wed Nov 05 07:16:39 EST 2025 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-i149t-fb2642edfe5d6ce36f6a52682f912a271fefad272dfbf0a5cf2943b22862ffac3 |
| Notes | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23 |
| PMID | 40951258 |
| PQID | 3250396898 |
| PQPubID | 23479 |
| PageCount | 13 |
| ParticipantIDs | ieee_primary_11092478 pubmed_primary_40951258 proquest_miscellaneous_3250396898 |
| PublicationCentury | 2000 |
| PublicationDate | 2025-Jun |
| PublicationDateYYYYMMDD | 2025-06-01 |
| PublicationDate_xml | – month: 06 year: 2025 text: 2025-Jun |
| PublicationDecade | 2020 |
| PublicationPlace | United States |
| PublicationPlace_xml | – name: United States |
| PublicationTitle | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) |
| PublicationTitleAbbrev | CVPR |
| PublicationTitleAlternate | Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit |
| PublicationYear | 2025 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssj0003211698 ssj0023720 |
| Score | 2.5169914 |
| Snippet | Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often... |
| SourceID | proquest pubmed ieee |
| SourceType | Aggregation Database Index Database Publisher |
| StartPage | 26147 |
| SubjectTerms | Data mining; Decoding; Encoding; Heart; Interference; Large language models; Question answering (information retrieval); Registers; Video sequences; Visualization |
| Title | Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding |
| URI | https://ieeexplore.ieee.org/document/11092478 https://www.ncbi.nlm.nih.gov/pubmed/40951258 https://www.proquest.com/docview/3250396898 |
| Volume | 2025 |
| hasFullText | 1 |
| inHoldings | 1 |
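
The abstract in this record describes placing an "attention register" in the upper triangular part of the causal mask so that attention otherwise drawn to outlier tokens is captured there. The sketch below is only one possible reading of that sentence, not the authors' implementation: the linear decay, the `decay` parameter, the post-softmax zeroing, and the function names are all assumptions made for illustration. It contrasts a standard causal mask with this hypothetical register variant, using plain Python lists.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(scores):
    """Standard causal attention: future positions (j > i) are set to -inf
    before the softmax, so each row distributes all of its mass over
    positions 0..i."""
    n = len(scores)
    out = []
    for i in range(n):
        row = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        out.append(softmax(row))
    return out


def register_attention(scores, decay=1.0):
    """Hypothetical register variant (an assumption, not from the paper):
    future positions receive finite, decaying values instead of -inf, so they
    absorb part of the softmax mass. They are zeroed afterwards, so no future
    token is actually attended to."""
    n = len(scores)
    out = []
    for i in range(n):
        # Assumed decay form: the value shrinks linearly with distance j - i.
        row = [scores[i][j] if j <= i else -decay * (j - i) for j in range(n)]
        weights = softmax(row)
        out.append([w if j <= i else 0.0 for j, w in enumerate(weights)])
    return out


if __name__ == "__main__":
    uniform = [[0.0] * 4 for _ in range(4)]
    for name, fn in (("causal", causal_attention), ("register", register_attention)):
        print(name)
        for row in fn(uniform):
            print("  ", [round(w, 3) for w in row])
```

In this sketch, rows of the register variant that still have future positions sum to less than 1, because part of the mass has been absorbed by the register slots; the last row has no such slots and matches the standard causal result.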