FusionCLM: enhanced molecular property prediction via knowledge fusion of chemical language models

Bibliographic details
Published in: Journal of cheminformatics Vol. 17; No. 1; p. 133 - 12
Main authors: Lu, Yutong, Li, Yan Yi, Sun, Yan, Hu, Pingzhao
Format: Journal Article
Language: English
Published: Cham Springer International Publishing 29 August 2025
BioMed Central Ltd
Springer Nature B.V
BMC
ISSN: 1758-2946
Online access: Full text
Description
Summary: Chemical Language Models (CLMs) have demonstrated the ability to extract patterns and make predictions from vast volumes of Simplified Molecular Input Line Entry System (SMILES) strings, a notation used to represent molecular structures. Different CLMs, built on different architectures, can provide unique insights into molecular properties. To harness the uniqueness of different CLMs, we propose FusionCLM, a novel stacking-ensemble learning algorithm that integrates the outputs of multiple CLMs into a unified framework. FusionCLM first generates SMILES embeddings, predictions, and losses from each CLM. Auxiliary models are trained on these first-level predictions and embeddings to estimate test losses during inference. The losses and predictions are then concatenated to create an integrated feature matrix, which is used to train second-level meta-models for the final predictions. Empirical testing on five datasets demonstrates that FusionCLM outperforms both the individual first-level CLMs and three advanced multimodal deep learning frameworks, showcasing FusionCLM’s potential in advancing molecular property prediction. Scientific Contribution: FusionCLM uses a stacking-ensemble learning method that integrates the distinct representation learning of multiple CLMs, allowing more comprehensive learning from molecular SMILES data. This yields more accurate molecular property prediction, which can help facilitate the early discovery and development of promising drug candidates. Evaluated against individual CLMs and existing multimodal deep learning frameworks, FusionCLM demonstrates improvements in prediction accuracy, distinguishing itself from prior models in this domain.
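The two-level pipeline described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the closed-form ridge regressors stand in for the fine-tuned CLMs, the auxiliary models, and the meta-model; the random matrices stand in for SMILES embeddings; and the use of per-sample squared error as the "loss" is an assumption. All function and variable names (`fit_ridge`, `predict_ridge`, `fusion_sketch`) are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an appended intercept column."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict_ridge(w, X):
    """Apply weights returned by fit_ridge to new inputs."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def fusion_sketch(train_embs, y_train, test_embs):
    """train_embs / test_embs: one embedding matrix per first-level model."""
    train_cols, test_cols = [], []
    for E_tr, E_te in zip(train_embs, test_embs):
        # First level: stand-in for one CLM's prediction head.
        w1 = fit_ridge(E_tr, y_train)
        p_tr, p_te = predict_ridge(w1, E_tr), predict_ridge(w1, E_te)
        # Per-sample training loss (squared error is an assumption).
        loss_tr = (y_train - p_tr) ** 2
        # Auxiliary model: maps [embedding, prediction] to the loss, so a
        # loss can be estimated at inference time without test labels.
        w_aux = fit_ridge(np.column_stack([E_tr, p_tr]), loss_tr)
        loss_te = predict_ridge(w_aux, np.column_stack([E_te, p_te]))
        train_cols += [loss_tr, p_tr]
        test_cols += [loss_te, p_te]
    # Integrated feature matrix: losses and predictions, concatenated.
    Z_tr, Z_te = np.column_stack(train_cols), np.column_stack(test_cols)
    # Second level: meta-model trained on the integrated features.
    w_meta = fit_ridge(Z_tr, y_train)
    return predict_ridge(w_meta, Z_te), Z_te

# Synthetic demo: three hypothetical "CLMs" with different embedding sizes.
rng = np.random.default_rng(0)
n_train, n_test = 80, 20
dims = [8, 6, 4]
train_embs = [rng.normal(size=(n_train, d)) for d in dims]
test_embs = [rng.normal(size=(n_test, d)) for d in dims]
y_train = sum(E[:, 0] for E in train_embs) + 0.1 * rng.normal(size=n_train)

final_pred, Z_test = fusion_sketch(train_embs, y_train, test_embs)
print(final_pred.shape, Z_test.shape)
```

In this sketch each first-level model contributes two columns to the integrated feature matrix (an estimated loss and a prediction), so three models give six meta-features per molecule. A real setup would likely use held-out or cross-validated first-level predictions to limit leakage into the meta-model; the summary does not state how this is handled.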
DOI: 10.1186/s13321-025-01073-6