DiscHPO: Generative Models and Sentence Transformers for the Recognition and Normalisation of Continuous and Discontinuous Phenotype Mentions

Detailed bibliography
Title: DiscHPO: Generative Models and Sentence Transformers for the Recognition and Normalisation of Continuous and Discontinuous Phenotype Mentions
Authors: Alhassan, Areej, Schlegel, Viktor, Aloud, Monira, Batista-Navarro, Riza Theresa, Nenadic, Goran
Source: Alhassan, A, Schlegel, V, Aloud, M, Batista-Navarro, R T & Nenadic, G 2025, 'DiscHPO: Generative Models and Sentence Transformers for the Recognition and Normalisation of Continuous and Discontinuous Phenotype Mentions', JMIR medical informatics.
Publisher information: JMIR Publications Inc, 2025.
Publication year: 2025
Subjects: Named Entity Normalisation, Discontinuous NER, Sentence Transformers, Human Phenotype Ontology, LLMs, Clinical Information Extraction, Named Entity Recognition
Description: Background: Extracting genetic phenotype mentions from clinical reports and normalising them to standardised concepts within the Human Phenotype Ontology (HPO) is essential for consistent interpretation and representation of genetic conditions. This is particularly important in fields such as dysmorphology and plays a key role in advancing personalised healthcare. However, modern clinical Named Entity Recognition (NER) methods struggle to accurately identify discontinuous mentions (i.e., entity spans that are interrupted by unrelated words), which occur in these clinical reports. Objective: This study aims to develop a system that can accurately extract and normalise genetic phenotypes, specifically from physical examination reports related to dysmorphology assessment. These mentions appear in both continuous and discontinuous lexical forms, and the study focuses on the challenging disjoint (discontinuous) entity spans. Methods: We introduce DiscHPO, a two-phase pipeline consisting of (1) a sequence-to-sequence NER model for span extraction, and (2) an entity normaliser that employs a Sentence Transformer bi-encoder for candidate generation and a cross-encoder re-ranker for selecting the best candidate as the normalised concept. This system was tested as part of our participation in Track 3 of the BioCreative VIII shared task. Results: On the test set overall, the top-performing model for entity normalisation achieved an F1 score of 0.7229, while the best span extraction model reached an F1 score of 0.6647. Both scores surpassed those of two baseline models using the same dataset, indicating superior efficacy in handling both continuous and discontinuous spans. Approximately 14% of entity mentions in the dataset are disjoint spans. On the validation set, we demonstrated our system's ability to recognise these mentions, with the model achieving an F1 score of 0.6235 for exact match on discontinuous spans only. Conclusions: The findings suggest that exact extraction of entity spans may not always be necessary for successful normalisation. Partial mention matches can be sufficient as long as they capture the essential concept information, supporting the system’s utility in clinical downstream tasks.
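Illustrative note: the two-stage normalisation described in the Methods (candidate generation, then re-ranking) can be sketched in a few lines. This is a toy sketch under stated assumptions, not the authors' implementation: the lexicon, its `TOY:` identifiers, and all function names are hypothetical; character-trigram cosine similarity stands in for the Sentence Transformer bi-encoder; token overlap stands in for the cross-encoder; and the sequence-to-sequence NER step is replaced by hand-written character offsets.

```python
import math
from collections import Counter

# Hypothetical mini-lexicon: IDs are placeholders, not real HPO identifiers.
LEXICON = {
    "TOY:0001": "low-set ears",
    "TOY:0002": "posteriorly rotated ears",
    "TOY:0003": "small jaw",
    "TOY:0004": "widely spaced eyes",
}

def join_fragments(text, fragments):
    """Build the surface form of a (possibly discontinuous) mention from character offsets."""
    return " ".join(text[start:end] for start, end in fragments)

def embed(s):
    """Bi-encoder stand-in: each string is encoded on its own (character trigram counts)."""
    s = f"  {s.lower()}  "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def generate_candidates(mention, k=2):
    """Stage 1: rank every lexicon label by embedding similarity and keep the top k IDs."""
    query = embed(mention)
    ranked = sorted(LEXICON.items(), key=lambda item: cosine(query, embed(item[1])), reverse=True)
    return [concept_id for concept_id, _ in ranked[:k]]

def pair_score(mention, label):
    """Cross-encoder stand-in: scores the (mention, label) pair jointly via token Jaccard overlap."""
    m, l = set(mention.lower().split()), set(label.lower().split())
    return len(m & l) / len(m | l)

def normalise(mention, k=2):
    """Stage 2: re-rank the shortlisted candidates and return the best concept ID."""
    candidates = generate_candidates(mention, k)
    return max(candidates, key=lambda cid: pair_score(mention, LEXICON[cid]))

text = "Ears are low-set and posteriorly rotated."
# Two mentions share the fragment "Ears"; the second one is discontinuous.
mention_a = join_fragments(text, [(0, 4), (9, 16)])    # "Ears low-set"
mention_b = join_fragments(text, [(0, 4), (21, 40)])   # "Ears posteriorly rotated"

print(mention_a, "->", normalise(mention_a))
print(mention_b, "->", normalise(mention_b))
# A partial span that drops "Ears" can still resolve to the same concept.
print("posteriorly rotated", "->", normalise("posteriorly rotated"))
```

In this toy setting the partial span "posteriorly rotated" resolves to the same placeholder concept as the full discontinuous mention, which mirrors the abstract's conclusion that exact span extraction is not always required for normalisation.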
Document type: Article
Language: English
ISSN: 2291-9694
Access URL: https://research.manchester.ac.uk/en/publications/c45b0f93-3c06-4a3a-bf6e-822796d8ad42
Accession number: edsair.dedup.wf.002..e9ec4e44dd977e407425431cbb87b24c
Database: OpenAIRE