An fMRI visual neural encoding method with multimodal large language model



Detailed Bibliography
Published in: Knowledge-Based Systems, Vol. 326, p. 114049
Main Authors: Ma, Shuxiao; Wang, Linyuan; Hou, Libin; Hou, Senbao; Yan, Bin
Format: Journal Article
Language: English
Published: Elsevier B.V., 27.09.2025
ISSN: 0950-7051
Description
Summary:
• In summary, our contributions are primarily threefold:
• To our knowledge, we establish the first multimodal framework combining an MLLM with fMRI visual neural encoding, introducing a systematic three-phase training paradigm specifically optimized for neural encoding tasks.
• Building upon the Vicuna architecture, we develop an 8-billion-parameter foundation model that demonstrates dual advantages in parameter efficiency and task performance. Specifically, our method achieves 5th place in the Algonauts 2023 Challenge, establishing a new benchmark in large-scale visual encoding.
• Ablation studies demonstrate that introducing the MLLM module yields a 2.87 % performance gain. Through our multi-stage training paradigm, we achieve a 2.61 % performance improvement by fine-tuning only 1.33 % of the parameters (the Q-former) via parameter-efficient fine-tuning, striking a balance between computational efficiency and performance.

Multimodal large language models (MLLMs) are revolutionizing the field of artificial intelligence by integrating diverse data types, such as text and images, into a unified representational framework. These foundation models leverage deep learning and attention mechanisms to capture the intricate relationships between modalities, enhancing feature extraction and understanding. However, the potential of MLLMs for brain visual information processing has not been fully explored. Research has found that textual semantic information is an important source of additional information in the brain's visual information processing. Building on these developments, this paper proposes a functional magnetic resonance imaging (fMRI) visual encoding method based on an MLLM. The proposed method comprises an MLLM-based visual encoding model and a three-stage training/fine-tuning paradigm. It constructs the MLLM module by combining Vicuna with stimulus images and user instructions, then performs fMRI-based visual encoding through multi-subject fusion and voxel mapping modules. Our method exploits the strengths of MLLMs in cross-modal information processing: it integrates textual semantic information with visual information, captures key features of the visual input more accurately, and thereby achieves more efficient visual encoding. Our method achieves a 2.87 % improvement in neural response prediction accuracy compared to non-LLM approaches. The experimental results indicate that the multimodal method achieves strong performance in the MIT-hosted Algonauts 2023 Challenge, a benchmark for modeling human brain responses to visual stimuli. This method validates the effectiveness of MLLMs and brings a new solution for visual neural encoding.
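To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of the stages downstream of the MLLM: stimulus features from the (frozen) Vicuna-based MLLM are passed through a multi-subject fusion module and then a per-subject voxel mapping head. All module names, dimensions, and voxel counts here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiSubjectFusion(nn.Module):
    """Hypothetical fusion: shared projection plus a learned per-subject embedding."""

    def __init__(self, feat_dim: int, n_subjects: int, hidden: int = 1024):
        super().__init__()
        self.shared = nn.Linear(feat_dim, hidden)
        self.subject_embed = nn.Embedding(n_subjects, hidden)

    def forward(self, feats, subject_id):
        # feats: (batch, feat_dim), subject_id: (batch,)
        return torch.relu(self.shared(feats) + self.subject_embed(subject_id))


class VoxelMapper(nn.Module):
    """One linear readout head per subject, mapping fused features to voxel responses."""

    def __init__(self, hidden: int, voxels_per_subject):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden, v) for v in voxels_per_subject)

    def forward(self, fused, subject_id):
        # Each sample is decoded by its own subject-specific head.
        return [self.heads[int(s)](f) for f, s in zip(fused, subject_id)]


# Random stand-ins for MLLM features; in the actual method these would come
# from the MLLM given the stimulus image and user instruction.
mllm_feats = torch.randn(4, 4096)          # assumed feature dimension
subject_id = torch.tensor([0, 1, 2, 0])    # subject index of each sample
fusion = MultiSubjectFusion(feat_dim=4096, n_subjects=3)
mapper = VoxelMapper(hidden=1024, voxels_per_subject=[15000, 16000, 14000])  # placeholder voxel counts
pred = mapper(fusion(mllm_feats, subject_id), subject_id)
print([p.shape for p in pred])
```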
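The abstract also highlights parameter-efficient fine-tuning in which only the Q-former (about 1.33 % of all parameters) is updated. A hedged sketch of that idea, assuming the Q-former can be identified by name in the model (the helper and the toy model below are hypothetical, not the paper's code):

```python
import torch.nn as nn


def freeze_all_but_qformer(model: nn.Module) -> float:
    """Freeze every parameter whose name does not mention the Q-former."""
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = "qformer" in name.lower()
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable / total


# Toy stand-in model with a submodule named "qformer"; a real run would pass
# the full MLLM-based encoder instead.
toy = nn.ModuleDict({"qformer": nn.Linear(32, 32), "backbone": nn.Linear(32, 2048)})
print(f"trainable fraction: {freeze_all_but_qformer(toy):.2%}")
```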
DOI: 10.1016/j.knosys.2025.114049