AbSet: A Standardized Data Set of Antibody Structures for Machine Learning Applications

Bibliographic Details
Title: AbSet: A Standardized Data Set of Antibody Structures for Machine Learning Applications
Authors: Diego S. Almeida, Matheus V. Almeida, Jean V. Sampaio, Eduardo M. Gaieta, Andrielly H. S. Costa, Francisco F. A. Rabelo, César L. Cavalcante, Geraldo R. Sartori, João H. M. Silva
Publication Year: 2025
Subject Terms: Biophysics, Biochemistry, Biotechnology, Immunology, Cancer, Hematology, Infectious Diseases, Biological Sciences not elsewhere classified, Chemical Sciences not elsewhere classified, Information Systems not elsewhere classified, specific databases often, reference experimental complexes, provide molecular descriptors, machine learning applications, incorrect quality based, corresponding molecular descriptors, accompanying scripts hosted, protein data bank, publicly available via, generate structural variants, curated dataset comprising, standardized data set, data sets, structural similarity, decoy set, available structures, zenodo repository, therapeutic antibodies
Description: Machine learning (ML) algorithms, trained on data sets of sequences and/or structures, have played a fundamental role in the development of therapeutic antibodies. However, structural data sets remain limited, especially those that include antibody–antigen complexes. Additionally, many of the available structures are not standardized, and antibody-specific databases often do not provide molecular descriptors that could enhance ML models. To address this gap, we introduce AbSet, a curated data set comprising over 800,000 antibody structures and corresponding molecular descriptors, including both experimentally determined and in silico-generated antibody–antigen complexes. We systematically retrieved antibody structures from the Protein Data Bank (PDB), applied rigorous standardization protocols, and expanded the data set through large-scale protein–protein docking to generate structural variants of antibody–antigen interactions. Each docked model was classified as high, medium, acceptable, or incorrect quality based on its structural similarity to reference experimental complexes. This classification enables both the construction of a decoy set of confirmed non-binders and the generation of high-confidence augmented structural data for machine learning applications. AbSet is publicly available via the Zenodo repository, with accompanying scripts hosted on GitHub (https://github.com/SFBBGroup/AbSet.git).
Document Type: article in journal/newspaper
Language: unknown
Relation: https://figshare.com/articles/journal_contribution/AbSet_A_Standardized_Data_Set_of_Antibody_Structures_for_Machine_Learning_Applications/29031922
DOI: 10.1021/acs.jcim.5c00410.s002
Availability: https://doi.org/10.1021/acs.jcim.5c00410.s002
https://figshare.com/articles/journal_contribution/AbSet_A_Standardized_Data_Set_of_Antibody_Structures_for_Machine_Learning_Applications/29031922
Rights: CC BY-NC 4.0
Accession Number: edsbas.7AB13308
Database: BASE
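
Note on the quality classes: the Description says each docked model is labeled high, medium, acceptable, or incorrect by structural similarity to a reference complex, but this record does not give the metric or the cut-offs. The sketch below is only an illustration of how such a four-way labeling could work. It assumes a DockQ-style similarity score in [0, 1] with the commonly cited DockQ cut-offs (0.23, 0.49, 0.80); the score choice, the function names classify and group_by_quality, and the model identifiers are all hypothetical and are not taken from AbSet or its scripts.

```python
from collections import defaultdict

# Assumed DockQ-style lower bounds for each class, checked from best to worst.
# These cut-offs are an assumption; the record does not state AbSet's criteria.
THRESHOLDS = [(0.80, "high"), (0.49, "medium"), (0.23, "acceptable")]


def classify(score: float) -> str:
    """Map a similarity score in [0, 1] to one of the four quality labels."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must lie in [0, 1], got {score}")
    for lower_bound, label in THRESHOLDS:
        if score >= lower_bound:
            return label
    return "incorrect"


def group_by_quality(scores: dict[str, float]) -> dict[str, list[str]]:
    """Group hypothetical model identifiers by their quality label."""
    groups: dict[str, list[str]] = defaultdict(list)
    for model_id, score in scores.items():
        groups[classify(score)].append(model_id)
    return dict(groups)


if __name__ == "__main__":
    # Made-up scores for illustration only.
    example = {"model_a": 0.91, "model_b": 0.55, "model_c": 0.30, "model_d": 0.05}
    groups = group_by_quality(example)
    # Under this sketch, "incorrect" models would be candidate decoys.
    print(groups.get("incorrect", []))
```

In a workflow like the one described, models labeled incorrect would feed the decoy set and the better-scoring classes would serve as augmented training structures; which classes count as "high-confidence" is left open here because the record does not say.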