FusionCLM: enhanced molecular property prediction via knowledge fusion of chemical language models
| Published in: | Journal of Cheminformatics, Volume 17; Issue 1; p. 133 - 12 |
|---|---|
| Main authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Cham: Springer International Publishing, 29.08.2025; BioMed Central Ltd; Springer Nature B.V; BMC |
| Subjects: | |
| ISSN: | 1758-2946 |
| Summary: | Chemical Language Models (CLMs) have demonstrated the ability to extract patterns and make predictions from vast volumes of Simplified Molecular Input Line Entry System (SMILES) strings, a notation used to represent molecular structures. Different CLMs, developed from various architectures, can provide unique insights into molecular properties. To harness the uniqueness of different CLMs, we propose FusionCLM, a novel stacking-ensemble learning algorithm that integrates the outputs of multiple CLMs into a unified framework. FusionCLM first generates SMILES embeddings, predictions, and losses from each CLM. Auxiliary models are trained on these first-level predictions and embeddings to estimate test losses during inference. The losses and predictions are then concatenated to create an integrated feature matrix, which is used to train second-level meta-models for final predictions. Empirical testing on five datasets demonstrates that FusionCLM outperforms the individual first-level CLMs and three advanced multimodal deep learning frameworks, showcasing FusionCLM’s potential in advancing molecular property prediction.
Scientific Contribution
FusionCLM uses a stacking-ensemble learning method that integrates the distinct representation learning of multiple CLMs, allowing more comprehensive learning from molecular SMILES data. This yields more accurate molecular property prediction, which can help facilitate the early discovery and development of promising drug candidates. Evaluated against individual CLMs and existing multimodal deep learning frameworks, FusionCLM demonstrates improvements in prediction accuracy, distinguishing itself from prior models in this domain. |
|---|---|
| DOI: | 10.1186/s13321-025-01073-6 |
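
The stacking flow described in the summary (first-level predictions and losses, auxiliary loss estimators, then a second-level meta-model) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the "CLM embeddings" are hypothetical random projections of synthetic data, every model is a closed-form ridge regression, and the names `ridge_fit`, `ridge_predict`, and `fusion_stack` are invented for this sketch. It also assumes true losses are used in the training feature matrix and estimated losses at test time, and it uses in-sample first-level predictions for brevity where a careful stacking setup would likely use out-of-fold predictions.

```python
import numpy as np

def ridge_fit(X, y, alpha=1e-2):
    """Closed-form ridge regression with an appended bias column."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def ridge_predict(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def fusion_stack(train_embs, y_train, test_embs):
    """train_embs / test_embs: one (n, d_k) embedding array per first-level model."""
    train_feats, test_feats = [], []
    for E_tr, E_te in zip(train_embs, test_embs):
        # First level: a prediction head on each model's embeddings.
        w = ridge_fit(E_tr, y_train)
        p_tr, p_te = ridge_predict(w, E_tr), ridge_predict(w, E_te)
        loss_tr = (p_tr - y_train) ** 2  # per-sample squared error

        # Auxiliary model: estimate the loss from [prediction, embedding],
        # since true losses are unavailable for test molecules.
        w_aux = ridge_fit(np.column_stack([p_tr, E_tr]), loss_tr)
        loss_te_hat = ridge_predict(w_aux, np.column_stack([p_te, E_te]))

        train_feats += [p_tr, loss_tr]
        test_feats += [p_te, loss_te_hat]

    # Second level: meta-model on the concatenated predictions and losses.
    Z_tr = np.column_stack(train_feats)
    Z_te = np.column_stack(test_feats)
    w_meta = ridge_fit(Z_tr, y_train)
    return ridge_predict(w_meta, Z_te), Z_te

# Synthetic stand-in data (hypothetical; no real SMILES or CLMs involved).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=200)
embs = [np.tanh(X @ rng.normal(size=(8, 16))) for _ in range(3)]
preds, Z_te = fusion_stack([E[:150] for E in embs], y[:150],
                           [E[150:] for E in embs])
print(preds.shape, Z_te.shape)
```

With three first-level models, the integrated feature matrix has two columns per model (one prediction, one loss), which is the concatenation step the summary refers to.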