pyMeSHSim: an integrative python package for biomedical named entity recognition, normalization and comparison

Published in: bioRxiv
Main authors: Luo, Zhi-Hui; Shi, Meng-Wei; Yang, Zhuang; Zhang, Hong-Yu; Chen, Zhen-Xia
Format: Paper
Language: English
Publication details: Cold Spring Harbor Laboratory, 15 March 2019
Edition: 1.3
ISSN: 2692-8205
Abstract An increasing number of disease-causal genes have been identified through different methods, yet there is still no uniform biomedical named entity (bio-NE) annotation of disease phenotypes. Furthermore, semantic similarity comparison between two bio-NE annotations, such as disease descriptions, has become important for data integration and systems genetics analysis. The package pyMeSHSim performs bio-NE recognition using MetaMap, a natural language processing tool that produces Unified Medical Language System (UMLS) concepts. To map the UMLS concepts to MeSH, pyMeSHSim embeds an in-house dataset containing the Medical Subject Headings (MeSH) main headings (MHs), supplementary concept records (SCRs), and the relations between them. Based on this dataset, pyMeSHSim implements four information content (IC) based algorithms and one graph-based algorithm to measure the semantic similarity between two MeSH terms. To evaluate its performance, we used pyMeSHSim to parse OMIM and GWAS phenotypes. The inclusion of SCRs and the curation strategy for non-MeSH-synonymous UMLS concepts improved the performance of pyMeSHSim in recognizing OMIM phenotypes. In the curation of GWAS phenotypes, pyMeSHSim and previous manual work recognized the same MeSH terms for 276 of 461 GWAS phenotypes, and the correlation between the semantic similarities calculated by pyMeSHSim and by another semantic analysis tool, meshes, ranged from 0.53 to 0.97. With an embedded dataset that includes both MeSH MHs and SCRs, the integrative MeSH tool pyMeSHSim provides disease recognition, normalization, and comparison for biomedical text mining. The package's source code and test datasets are available under the GPLv3 license at https://github.com/luozhhub/pyMeSHSim
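To make the idea of information content (IC) based semantic similarity concrete, the following is a minimal sketch and not pyMeSHSim's own code or API. It assumes a tiny hypothetical term hierarchy and made-up annotation counts (PARENTS and COUNTS are illustrative, not real MeSH data) and computes two well-known IC-based measures, Resnik and Lin similarity. Whether these match the exact variants implemented in the package is an assumption.

```python
import math

# Hypothetical toy hierarchy (term -> list of parents); NOT real MeSH data.
PARENTS = {
    "disease": [],
    "neoplasm": ["disease"],
    "carcinoma": ["neoplasm"],
    "sarcoma": ["neoplasm"],
    "infection": ["disease"],
}

# Hypothetical number of direct annotations per term in some corpus.
COUNTS = {"disease": 1, "neoplasm": 2, "carcinoma": 5, "sarcoma": 3, "infection": 4}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = log(total / freq(t)), where freq(t) counts t and its descendants."""
    total = sum(COUNTS.values())
    freq = {term: 0 for term in PARENTS}
    for term, count in COUNTS.items():
        for anc in ancestors(term):
            freq[anc] += count
    return {term: math.log(total / freq[term]) for term in PARENTS}


IC = information_content()


def resnik(a, b):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(a) & ancestors(b)
    return max(IC[t] for t in common)


def lin(a, b):
    """Lin similarity: 2 * IC(MICA) / (IC(a) + IC(b)), a value in [0, 1]."""
    denom = IC[a] + IC[b]
    if denom == 0:  # both terms are the root, which carries no information
        return 0.0
    return 2 * resnik(a, b) / denom


if __name__ == "__main__":
    print(resnik("carcinoma", "sarcoma"))
    print(lin("carcinoma", "sarcoma"))
    print(lin("carcinoma", "infection"))
```

In this toy hierarchy, two terms that share only the root receive a Lin similarity of zero, while siblings under a more specific parent score higher. A graph-based measure would instead compare the ancestor sets of the two terms directly, without corpus counts.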
Author_xml – sequence: 1
  givenname: Zhi-Hui
  surname: Luo
  fullname: Luo, Zhi-Hui
  organization: Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University
– sequence: 2
  givenname: Meng-Wei
  surname: Shi
  fullname: Shi, Meng-Wei
  organization: Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University
– sequence: 3
  givenname: Zhuang
  surname: Yang
  fullname: Yang, Zhuang
  organization: Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University
– sequence: 4
  givenname: Hong-Yu
  surname: Zhang
  fullname: Zhang, Hong-Yu
  email: zhen-xia.chen@mail.hzau.edu.cn
  organization: Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University
– sequence: 5
  givenname: Zhen-Xia
  surname: Chen
  fullname: Chen, Zhen-Xia
  email: zhen-xia.chen@mail.hzau.edu.cn
  organization: Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University
ContentType Paper
Copyright 2019, Posted by Cold Spring Harbor Laboratory
DOI 10.1101/459172
DatabaseName bioRxiv
Discipline Biology
EISSN 2692-8205
Edition 1.3
ExternalDocumentID 459172v3
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Keywords UMLS
MeSH
named entity recognition
semantic similarity
disease
supplementary concept records
Language English
License This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at http://creativecommons.org/licenses/by-nc-nd/4.0
OpenAccessLink https://www.biorxiv.org/content/10.1101/459172
PageCount 19
SubjectTerms Bioinformatics