Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, p. 1
Main Authors: Sun, Yinan; Min, Xiongkuo; Zhang, Zicheng; Gao, Yixuan; Cao, Yuqin; Zhai, Guangtao
Format: Journal Article
Language: English
Published: IEEE 2025
ISSN: 1051-8215, 1558-2205
Abstract The rapid development of multimodal large language models has brought remarkable advances in visual perception and understanding, consolidating several tasks into a single visual question-answering framework. However, these models are prone to hallucinations, which limit their reliability as artificial intelligence systems. While this issue has been extensively researched in natural language processing and image captioning, hallucinations in Low-level Visual Perception and Understanding (HLPU) remain underexplored, especially in the context of image quality assessment tasks. We argue that these hallucinations arise from an absence of clear self-awareness within the models. To address this issue, we first introduce the HLPU instruction database, the first instruction database specifically focused on hallucinations in low-level vision tasks. This database contains approximately 200K question-answer pairs and comprises four subsets, each covering different types of instructions. Subsequently, we propose the Self-Awareness Failure Elimination (SAFEQA) model, which uses image features, salient region features and quality features to improve the model's perception and comprehension in low-level vision tasks. Furthermore, we propose the Enhancing Self-Awareness Preference Optimization (ESA-PO) framework to increase the model's awareness of its knowledge boundaries, thereby reducing the incidence of hallucination. Finally, we conduct comprehensive experiments on low-level vision tasks; the results demonstrate that the proposed method significantly enhances the model's self-awareness in these tasks and reduces hallucinations. Notably, the proposed method improves both the accuracy and the self-awareness of the model and outperforms closed-source models across various evaluation metrics. This research contributes to the advancement of self-awareness capabilities in multimodal large language models, particularly for low-level visual perception and understanding tasks.
Author_xml – sequence: 1
  givenname: Yinan
  orcidid: 0009-0004-3054-4268
  surname: Sun
  fullname: Sun, Yinan
– sequence: 2
  givenname: Xiongkuo
  orcidid: 0000-0001-5693-0416
  surname: Min
  fullname: Min, Xiongkuo
– sequence: 3
  givenname: Zicheng
  orcidid: 0000-0002-7247-7938
  surname: Zhang
  fullname: Zhang, Zicheng
– sequence: 4
  givenname: Yixuan
  orcidid: 0000-0002-6292-0529
  surname: Gao
  fullname: Gao, Yixuan
– sequence: 5
  givenname: Yuqin
  surname: Cao
  fullname: Cao, Yuqin
– sequence: 6
  givenname: Guangtao
  orcidid: 0000-0001-8165-9322
  surname: Zhai
  fullname: Zhai, Guangtao
CODEN ITCTEM
ContentType Journal Article
DOI 10.1109/TCSVT.2025.3619558
DatabaseName IEEE All-Society Periodicals Package (ASPP) 2005–Present
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE/IET Electronic Library (IEL)
CrossRef
DatabaseTitle CrossRef
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
EISSN 1558-2205
EndPage 1
ExternalDocumentID 10_1109_TCSVT_2025_3619558
11197535
Genre orig-research
IEDL.DBID RIE
ISSN 1051-8215
IngestDate Sat Nov 29 07:11:06 EST 2025
Wed Oct 15 14:20:46 EDT 2025
IsPeerReviewed true
IsScholarly true
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
LinkModel DirectLink
ORCID 0000-0001-5693-0416
0009-0004-3054-4268
0000-0001-8165-9322
0000-0002-7247-7938
0000-0002-6292-0529
PageCount 1
ParticipantIDs crossref_primary_10_1109_TCSVT_2025_3619558
ieee_primary_11197535
PublicationCentury 2000
PublicationDate 2025-00-00
PublicationDateYYYYMMDD 2025-01-01
PublicationDate_xml – year: 2025
  text: 2025-00-00
PublicationDecade 2020
PublicationTitle IEEE Transactions on Circuits and Systems for Video Technology
PublicationTitleAbbrev TCSVT
PublicationYear 2025
Publisher IEEE
Publisher_xml – name: IEEE
SourceID crossref
ieee
SourceType Index Database
Publisher
StartPage 1
SubjectTerms Accuracy
Analytical models
Feature extraction
hallucination
Image quality
image quality assessment
Large language models
low-level vision
Multimodal large language models
Reliability
Training
Visual databases
Visual perception
Visualization
Title Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy
URI https://ieeexplore.ieee.org/document/11197535
hasFullText 1
inHoldings 1
journalDatabaseRights – providerCode: PRVIEE
  databaseName: IEEE Electronic Library (IEL)
  customDbUrl:
  eissn: 1558-2205
  dateEnd: 99991231
  omitProxy: false
  ssIdentifier: ssj0014847
  issn: 1051-8215
  databaseCode: RIE
  dateStart: 19910101
  isFulltext: true
  titleUrlDefault: https://ieeexplore.ieee.org/
  providerName: IEEE