pyMeSHSim: an integrative python package for biomedical named entity recognition, normalization and comparison

Bibliographic details
Published in: bioRxiv
Main authors: Luo, Zhi-Hui; Shi, Meng-Wei; Yang, Zhuang; Zhang, Hong-Yu; Chen, Zhen-Xia
Format: Paper
Language: English
Publication details: Cold Spring Harbor Laboratory, 15 March 2019
Edition: 1.3
ISSN: 2692-8205
Description
Summary: A growing number of disease-causal genes have been identified through different methods, yet there are still no uniform biomedical named entity (bio-NE) annotations of disease phenotypes. Furthermore, semantic similarity comparison between two bio-NE annotations, such as disease descriptions, has become important for data integration and systems genetics analysis. The package pyMeSHSim performs bio-NE recognition using MetaMap, a natural language processing tool that produces Unified Medical Language System (UMLS) concepts. To map the UMLS concepts to MeSH, pyMeSHSim embeds an in-house dataset containing the Medical Subject Headings (MeSH) main headings (MHs), supplementary concept records (SCRs), and the relations between them. Based on this dataset, pyMeSHSim implements four information content (IC) based algorithms and one graph-based algorithm to measure the semantic similarity between two MeSH terms. To evaluate its performance, we used pyMeSHSim to parse OMIM and GWAS phenotypes. The inclusion of SCRs and the curation strategy for non-MeSH-synonymous UMLS concepts improved the performance of pyMeSHSim in the recognition of OMIM phenotypes. In the curation of GWAS phenotypes, pyMeSHSim and previous manual work recognized the same MeSH terms for 276 of 461 GWAS phenotypes, and the correlation between the semantic similarities calculated by pyMeSHSim and by another semantic analysis tool, meshes, ranged from 0.53 to 0.97. With an embedded dataset that includes both MeSH MHs and SCRs, the integrative MeSH tool pyMeSHSim supports disease recognition, normalization, and comparison in biomedical text mining. The package's source code and test datasets are available under the GPLv3 license at https://github.com/luozhhub/pyMeSHSim
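To illustrate what an information content (IC) based similarity measure computes, the following is a minimal sketch using two well-known measures, Resnik and Lin. It does not use the pyMeSHSim API, and the summary does not state which four IC-based algorithms the package implements, so the choice of measures here is an assumption. The term hierarchy, the annotation counts, and all function names are hypothetical toy values invented for this example, not real MeSH data.

```python
import math

# Hypothetical toy hierarchy (child -> list of parents); NOT real MeSH data.
PARENTS = {
    "disease": [],
    "neoplasm": ["disease"],
    "carcinoma": ["neoplasm"],
    "sarcoma": ["neoplasm"],
    "infection": ["disease"],
}

# Hypothetical number of direct annotations per term.
COUNTS = {"disease": 1, "neoplasm": 2, "carcinoma": 5, "sarcoma": 2, "infection": 10}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = -log p(t), where p(t) covers the term and all its descendants."""
    total = sum(COUNTS.values())
    cumulative = sum(c for t, c in COUNTS.items() if term in ancestors(t))
    return -math.log(cumulative / total)


def resnik(a, b):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(a) & ancestors(b)
    return max(information_content(t) for t in common)


def lin(a, b):
    """Lin similarity: shared IC normalized by the IC of both terms."""
    denom = information_content(a) + information_content(b)
    # Two root-level terms carry no information; treat them as identical.
    return 2 * resnik(a, b) / denom if denom else 1.0


if __name__ == "__main__":
    print(resnik("carcinoma", "sarcoma"))
    print(lin("carcinoma", "sarcoma"))
    print(lin("carcinoma", "infection"))
```

In this toy setup, two terms that share only the root have a similarity of zero, while a term compared with itself has a Lin similarity of 1. A graph-based measure, by contrast, would score terms from the structure of the hierarchy rather than from annotation frequencies.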
DOI: 10.1101/459172