The Goldilocks paradigm: comparing classical machine learning, large language models, and few-shot learning for drug discovery applications

Bibliographic Details
Published in:Communications chemistry Vol. 7; no. 1; pp. 134 - 11
Main Authors: Snyder, Scott H., Vignaux, Patricia A., Ozalp, Mustafa Kemal, Gerlach, Jacob, Puhl, Ana C., Lane, Thomas R., Corbett, John, Urbina, Fabio, Ekins, Sean
Format: Journal Article
Language:English
Published: London Nature Publishing Group UK 12.06.2024
Nature Publishing Group
Nature Portfolio
ISSN:2399-3669
Description
Summary:Recent advances in machine learning (ML) have led to newer model architectures, including transformers (large language models, LLMs), which show state-of-the-art results in text generation and image analysis, and few-shot learning (FSLC) models, which offer predictive power with extremely small datasets. These new architectures may offer promise, yet the ‘no free lunch’ theorem suggests that no single algorithm can outperform all others at every possible task. Here, we explore the capabilities of classical (SVR), FSLC, and transformer (MolBART) models over a range of dataset tasks and show a ‘Goldilocks zone’ for each model type, in which dataset size and feature distribution (i.e. dataset “diversity”) determine the optimal algorithm strategy. When datasets are small (<50 molecules), FSLC models tend to outperform both classical ML and transformers. When datasets are small-to-medium sized (50-240 molecules) and diverse, transformers outperform both classical models and few-shot learning. Finally, when datasets are larger and of sufficient size, classical models perform best, suggesting that the optimal model likely depends on the available dataset, its size, and its diversity. These findings may help to answer the perennial question of which ML algorithm to use when faced with a new dataset. Machine learning (ML) is a powerful tool in drug discovery, and new models are continuously being developed; however, rational selection of the most appropriate model for a given task remains challenging. Here, the authors explore the capabilities of classical ML algorithms and newer models over a range of dataset tasks and show an optimal zone for each model type, developing a predictive model to aid in the selection of a modeling method based on dataset size and diversity.
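The size and diversity zones described in the summary can be read as a simple lookup rule. The Python sketch below is only an illustration of that reading, not the authors' predictive model: the function name `suggest_model`, the boolean `is_diverse` flag, and the treatment of anything above 240 molecules as "larger" are assumptions made here. The summary does not say how diversity is measured, nor which model is preferred for 50-240 molecule datasets that are not diverse, so that case returns `None`.

```python
from typing import Optional

# Thresholds quoted in the summary: "<50 molecules" and "50-240 molecules".
SMALL_MAX = 50    # datasets below this size count as "small"
MEDIUM_MAX = 240  # upper end of the "small-to-medium" range


def suggest_model(n_molecules: int, is_diverse: bool) -> Optional[str]:
    """Map dataset size and diversity to the model family the summary favors.

    Hypothetical helper: `is_diverse` stands in for whatever diversity
    measure the paper uses, which the summary does not specify.
    """
    if n_molecules < 0:
        raise ValueError("n_molecules must be non-negative")
    if n_molecules < SMALL_MAX:
        return "few-shot learning (FSLC)"
    if n_molecules <= MEDIUM_MAX:
        # The summary only covers the diverse case in this size range.
        return "transformer (MolBART)" if is_diverse else None
    # Assumption: "larger and of sufficient size" is read as > 240 molecules.
    return "classical ML (SVR)"


if __name__ == "__main__":
    for size, diverse in [(30, True), (120, True), (120, False), (5000, False)]:
        print(size, diverse, suggest_model(size, diverse))
```

A hard-threshold rule like this is the coarsest possible reading; the summary itself describes a predictive model for method selection, which would presumably weigh size and diversity jointly rather than through fixed cut-offs.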
DOI:10.1038/s42004-024-01220-4