MICER: a pre-trained encoder–decoder architecture for molecular image captioning

Automatic recognition of chemical structures from molecular images provides an important avenue for the rediscovery of chemicals. Traditional rule-based approaches that rely on expert knowledge and fail to consider all the stylistic variations of molecular images usually suffer from cumbersome recog...

Full description

Saved in:
Bibliographic Details
Published in:Bioinformatics (Oxford, England) Vol. 38; no. 19; pp. 4562 - 4572
Main Authors: Yi, Jiacai, Wu, Chengkun, Zhang, Xiaochen, Xiao, Xinyi, Qiu, Yanlong, Zhao, Wentao, Hou, Tingjun, Cao, Dongsheng
Format: Journal Article
Language:English
Published: England 30.09.2022
Subjects:
ISSN:1367-4803, 1367-4811, 1367-4811
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Automatic recognition of chemical structures from molecular images provides an important avenue for the rediscovery of chemicals. Traditional rule-based approaches that rely on expert knowledge and fail to consider all the stylistic variations of molecular images usually suffer from cumbersome recognition processes and low generalization ability. Deep learning-based methods that integrate different image styles and automatically learn valuable features are flexible, but currently under-researched and have limitations, and are therefore not fully exploited. MICER, an encoder-decoder-based, reconstructed architecture for molecular image captioning, combines transfer learning, attention mechanisms and several strategies to strengthen effectiveness and plasticity in different datasets. The effects of stereochemical information, molecular complexity, data volume and pre-trained encoders on MICER performance were evaluated. Experimental results show that the intrinsic features of the molecular images and the sub-model match have a significant impact on the performance of this task. These findings inspire us to design the training dataset and the encoder for the final validation model, and the experimental results suggest that the MICER model consistently outperforms the state-of-the-art methods on four datasets. MICER was more reliable and scalable due to its interpretability and transfer capacity and provides a practical framework for developing comprehensive and accurate automated molecular structure identification tools to explore unknown chemical space. https://github.com/Jiacai-Yi/MICER. Supplementary data are available at Bioinformatics online.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
ISSN:1367-4803
1367-4811
1367-4811
DOI:10.1093/bioinformatics/btac545