pyMeSHSim: an integrative python package for biomedical named entity recognition, normalization and comparison
| Published in: | bioRxiv |
|---|---|
| Main authors: | Luo, Zhi-Hui; Shi, Meng-Wei; Yang, Zhuang; Zhang, Hong-Yu; Chen, Zhen-Xia |
| Format: | Paper |
| Language: | English |
| Publication details: | Cold Spring Harbor Laboratory, 15.03.2019 |
| Edition: | 1.3 |
| Subject: | |
| ISSN: | 2692-8205 |
| Online access: | Get full text |
| Abstract | A growing number of disease-causal genes have been identified through different methods, yet there are still no uniform biomedical named entity (bio-NE) annotations of disease phenotypes. Furthermore, semantic similarity comparison between two bio-NE annotations, such as disease descriptions, has become important for data integration and systems genetics analysis.
The package pyMeSHSim performs bio-NE recognition using MetaMap, which produces Unified Medical Language System (UMLS) concepts through natural language processing. To map the UMLS concepts to MeSH, pyMeSHSim embeds an in-house dataset containing the Medical Subject Headings (MeSH) main headings (MHs), supplementary concept records (SCRs), and the relations between them. Based on this dataset, pyMeSHSim implements four information content (IC)-based algorithms and one graph-based algorithm to measure the semantic similarity between two MeSH terms.
To evaluate its performance, we used pyMeSHSim to parse OMIM and GWAS phenotypes. The inclusion of SCRs and the curation strategy for non-MeSH-synonymous UMLS concepts improved the performance of pyMeSHSim in recognizing OMIM phenotypes. In the curation of GWAS phenotypes, pyMeSHSim and previous manual work recognized the same MeSH terms for 276 of 461 GWAS phenotypes, and the correlation between the semantic similarities calculated by pyMeSHSim and by another semantic analysis tool, meshes, ranged from 0.53 to 0.97.
With an embedded dataset that includes both MeSH MHs and SCRs, the integrative MeSH tool pyMeSHSim supports disease recognition, normalization, and comparison in biomedical text mining.
The package's source code and test datasets are available under the GPLv3 license at https://github.com/luozhhub/pyMeSHSim |
|---|---|
| Author | Shi, Meng-Wei Yang, Zhuang Chen, Zhen-Xia Luo, Zhi-Hui Zhang, Hong-Yu |
| Author_xml | – sequence: 1 givenname: Zhi-Hui surname: Luo fullname: Luo, Zhi-Hui organization: Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University – sequence: 2 givenname: Meng-Wei surname: Shi fullname: Shi, Meng-Wei organization: Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University – sequence: 3 givenname: Zhuang surname: Yang fullname: Yang, Zhuang organization: Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University – sequence: 4 givenname: Hong-Yu surname: Zhang fullname: Zhang, Hong-Yu email: zhen-xia.chen@mail.hzau.edu.cn organization: Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University – sequence: 5 givenname: Zhen-Xia surname: Chen fullname: Chen, Zhen-Xia email: zhen-xia.chen@mail.hzau.edu.cn organization: Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University |
| ContentType | Paper |
| Copyright | 2019, Posted by Cold Spring Harbor Laboratory |
| Copyright_xml | – notice: 2019, Posted by Cold Spring Harbor Laboratory |
| DBID | FX. |
| DOI | 10.1101/459172 |
| DatabaseName | bioRxiv |
| DatabaseTitleList | |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Biology |
| EISSN | 2692-8205 |
| Edition | 1.3 |
| ExternalDocumentID | 459172v3 |
| GroupedDBID | 8FE 8FH AFKRA ALMA_UNASSIGNED_HOLDINGS BBNVY BENPR BHPHI FX. HCIFZ LK8 M7P NQS PIMPY PROAC RHI |
| ID | FETCH-LOGICAL-b542-e6427a5f799d90b65f382cc23f4571f69d7a39c75d2e9278b4869ce28e79bb0e3 |
| IngestDate | Tue Jan 07 18:59:54 EST 2025 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | false |
| IsScholarly | false |
| Keywords | UMLS MeSH named entity recognition semantic similarity disease supplementary concept records |
| Language | English |
| License | This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at http://creativecommons.org/licenses/by-nc-nd/4.0 |
| LinkModel | OpenURL |
| MergedId | FETCHMERGED-LOGICAL-b542-e6427a5f799d90b65f382cc23f4571f69d7a39c75d2e9278b4869ce28e79bb0e3 |
| OpenAccessLink | https://www.biorxiv.org/content/10.1101/459172 |
| PageCount | 19 |
| ParticipantIDs | biorxiv_primary_459172 |
| PublicationCentury | 2000 |
| PublicationDate | 20190315 |
| PublicationDateYYYYMMDD | 2019-03-15 |
| PublicationDate_xml | – month: 3 year: 2019 text: 20190315 day: 15 |
| PublicationDecade | 2010 |
| PublicationTitle | bioRxiv |
| PublicationYear | 2019 |
| Publisher | Cold Spring Harbor Laboratory |
| Publisher_xml | – name: Cold Spring Harbor Laboratory |
| SSID | ssj0002961374 |
| Score | 1.5576339 |
| SecondaryResourceType | preprint |
| SourceID | biorxiv |
| SourceType | Open Access Repository |
| SubjectTerms | Bioinformatics |
| Title | pyMeSHSim: an integrative python package for biomedical named entity recognition, normalization and comparison |
| URI | https://www.biorxiv.org/content/10.1101/459172 |
| hasFullText | 1 |
| inHoldings | 1 |
| linkProvider | ProQuest |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=pyMeSHSim%3A+an+integrative+python+package+for+biomedical+named+entity+recognition%2C+normalization+and+comparison&rft.jtitle=bioRxiv&rft.au=Luo%2C+Zhi-Hui&rft.au=Shi%2C+Meng-Wei&rft.au=Yang%2C+Zhuang&rft.au=Zhang%2C+Hong-Yu&rft.date=2019-03-15&rft.pub=Cold+Spring+Harbor+Laboratory&rft.eissn=2692-8205&rft_id=info:doi/10.1101%2F459172&rft.externalDocID=459172v3 |
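
The abstract states that pyMeSHSim implements four information content (IC)-based similarity algorithms, but this record does not name them or describe the package's API. As a rough illustration of what an IC-based measure computes, the following is a minimal sketch of two well-known measures (Resnik and Lin) on a toy hierarchy. The term names, the counts, and every function name here are hypothetical and are not taken from pyMeSHSim or from MeSH.

```python
import math

# Hypothetical toy hierarchy: term -> list of parent terms (not real MeSH data).
PARENTS = {
    "disease": [],
    "neoplasm": ["disease"],
    "carcinoma": ["neoplasm"],
    "sarcoma": ["neoplasm"],
    "infection": ["disease"],
}

# Hypothetical annotation counts per term (not real corpus frequencies).
COUNTS = {"disease": 1, "neoplasm": 1, "carcinoma": 3, "sarcoma": 2, "infection": 1}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = -log p(t), where p(t) covers the term and all of its descendants."""
    total = sum(COUNTS.values())
    cumulative = sum(c for t, c in COUNTS.items() if term in ancestors(t))
    return -math.log(cumulative / total)


def resnik(a, b):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(a) & ancestors(b)
    return max(information_content(t) for t in common)


def lin(a, b):
    """Lin similarity: Resnik score normalized by the ICs of both terms."""
    denom = information_content(a) + information_content(b)
    # Both terms are the root (IC = 0): treat them as identical.
    return 2 * resnik(a, b) / denom if denom > 0 else 1.0


if __name__ == "__main__":
    for pair in [("carcinoma", "sarcoma"), ("carcinoma", "infection")]:
        print(pair, round(resnik(*pair), 4), round(lin(*pair), 4))
```

In this sketch, two terms that share only the root score zero, while siblings under a more specific parent score higher. This is the kind of behavior a semantic comparison of disease terms relies on.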