An Automated Approach for Domain-Specific Knowledge Graph Generation─Graph Measures and Characterization

Bibliographic details
Published in: Journal of chemical information and modeling, Vol. 65, No. 3, p. 1243
Main authors: O'Ryan, Connor; Hayes, Kevin D; VanGessel, Francis G; Doherty, Ruth M; Wilson, William; Fischer, John; Boukouvalas, Zois; Chung, Peter W
Format: Journal Article
Language: English
Published: United States, 10 February 2025
ISSN: 1549-960X
Description
Summary: In 2020, nearly 3 million scientific and engineering papers were published worldwide (White, K. Publications Output: U.S. Trends And International Comparisons). The vastness of the literature that already exists, the increasing rate of appearance of new publications, and the timely translation of artificial intelligence methods into scientific and engineering communities have ushered in the development of automated methods for mining and extracting information from technical documents. However, domain-specific approaches for extracting knowledge graph representations from semantic information remain limited. In this paper, we develop a natural language processing (NLP) approach to extract knowledge graphs resulting in a semantically structured network (SSN) that can be queried. After a detailed exposition of the modeling method, the approach is demonstrated specifically for the synthetic chemistry of organic molecules from the text of approximately 100,000 full-length patents. In this paper, we focus specifically on characterizing the knowledge graph to develop insights into the linguistic patterns and trends within the data and to establish objective graph characteristics that may enable comparisons among other text-based knowledge graphs across domains. Graph characterization is performed for network motif structures, assortativity, and eigenvector centrality. The structural information provided by the measures reveals language tendencies commonly employed by authors in the text discourse for chemical reactions. These include observations of the prevalence of descriptions of specific compound names, that common solvents and drying agents cut across large numbers of chemical synthesis approaches, and that power-law trends clearly emerge in the limit of larger corpora. The findings provide important quantitative characterizations of knowledge graphs for use in validation in large data settings.
DOI: 10.1021/acs.jcim.4c01904