Natural Scientific Language Processing and Research Knowledge Graphs: First International Workshop, NSLP 2024, Hersonissos, Crete, Greece, May 27, 2024, Proceedings

This Open Access book constitutes the refereed proceedings of the First International Workshop on Natural Scientific Language Processing and Research Knowledge Graphs, NSLP 2024, held in Hersonissos, Crete, Greece, on May 27, 2024. The 10 full papers and 11 short papers included in this volume were...

Bibliographic details
Main authors: Rehm, Georg, Dietze, Stefan, Krüger, Frank
Format: E-book
Language: English
Published: Cham: Springer Nature, 2024
Springer
Edition: 1
Series: Lecture Notes in Computer Science; Lecture Notes in Artificial Intelligence
ISBN: 9783031657931, 3031657934, 3031657942, 9783031657948
Contents:
  • Intro -- Preface -- Organization -- Contents -- Scholarly Information Processing -- Scholarly Question Answering Using Large Language Models in the NFDI4DataScience Gateway -- 1 Introduction -- 2 Related Works -- 3 Methodological Framework -- 3.1 The Gateway - Federated Search -- 3.2 Scholarly Question Answering -- 4 Evaluation -- 4.1 Evaluation Dataset -- 4.2 Evaluation Metrics -- 4.3 Results -- 5 Limitations and Future Directions -- 6 Conclusion -- References -- Cite-worthiness Detection on Social Media: A Preliminary Study -- 1 Introduction -- 2 Related Work -- 3 Data -- 4 Experiments -- 4.1 Setting -- 4.2 Results -- 4.3 Discussion -- 5 Limitations -- 6 Conclusion -- References -- Towards a Novel Classification of Table Types in Scholarly Publications -- 1 Introduction -- 2 Related Work -- 3 Methodology -- 3.1 Data -- 3.2 Taxonomies Construction -- 3.3 Annotation -- 3.4 Models -- 3.5 Evaluation Metrics -- 4 Results -- 4.1 Dataset Analysis -- 4.2 Table Type Classification -- 5 Discussion -- 6 Limitations -- 7 Conclusion -- A Examples of Matrix, Horizontal Listing, and Vertical Listing Tables -- B Illustrations of Table Features -- References -- OCR Cleaning of Scientific Texts with LLMs -- 1 Introduction -- 2 Motivation -- 3 Related Work -- 4 The Training Data -- 4.1 The "JIM Corpus" -- 4.2 The Datamaker Pipeline -- 4.3 Synthetic Data -- 4.4 The Gold Standard Corpus -- 5 The Training Method -- 6 Results and Discussion -- 6.1 Metric Definitions -- 6.2 The SOTA -- 6.3 Performance Improvement -- 7 Conclusion -- References -- Identifying and Leveraging Research Software -- RTaC: A Generalized Framework for Tooling -- 1 Introduction -- 2 Related Works -- 2.1 Dataset and Tooling Benchmarks -- 2.2 Tooling LLMs -- 2.3 Prompting Methods -- 3 Method -- 3.1 RTaC (Reimagining Tooling as Coding) -- 3.2 Evaluation Metrics -- 4 Experiments
  • 4.1 Retrievers -- 4.2 Closed Source LLMs -- 4.3 Open Source LLMs -- 5 Results -- 6 Conclusion -- 7 Future Scope -- A Appendix -- A.1 JSON Converter -- A.2 Prompt for Section 4.3 -- A.3 Default Tools Used to Generate New Tools -- References -- Scientific Software Citation Intent Classification Using Large Language Models -- 1 Introduction -- 2 Related Work -- 2.1 Research Software Studies -- 2.2 Citation Intent Classification -- 3 Methods -- 3.1 Citation Intent Classes -- 3.2 Data -- 3.3 Training Models -- 4 Results -- 4.1 Results of BERT Models -- 4.2 Results of GPT-3.5/GPT-4 -- 5 Data and Code Availability Statement -- 6 Discussion -- 7 Conclusion -- References -- RepoFromPaper: An Approach to Extract Software Code Implementations from Scientific Publications -- 1 Introduction -- 2 Related Work -- 3 RepoFromPaper: Methodology -- 3.1 PDF-to-Text Conversion -- 3.2 Sentence Extraction -- 3.3 Sentence Classification -- 3.4 Sentence Ranking -- 3.5 Repository Link Search -- 4 Evaluation Methods -- 4.1 Mean Reciprocal Rank (MRR) -- 4.2 Precision, Recall, and F1 Score -- 4.3 Training and Testing Corpora -- 5 Results -- 6 A Corpus of Papers and Their Corresponding Implementations -- 7 Discussion -- 8 Conclusions and Future Work -- References -- Automated Extraction of Research Software Installation Instructions from README Files: An Initial Analysis -- 1 Introduction -- 2 Related Work -- 3 PlanStep: Extracting Installation Instructions from README Files -- 3.1 Classical Planning: Software Installation Instructions -- 3.2 PlanStep Methodology -- 3.3 PlanStep Corpus Creation -- 3.4 Ground Truth Extraction for PlanStep -- 3.5 Distribution of the Installation Instructions of README Files -- 3.6 PlanStep Prompting -- 4 Experiments -- 4.1 Experimental Setup -- 4.2 Evaluation Metrics -- 4.3 Evaluation Results -- 4.4 Analysis -- 5 Discussion
  • 6 Conclusion and Future Work -- References -- A Technical/Scientific Document Management Platform -- 1 Introduction -- 2 Related Work -- 3 A Brief Overview of DANKE -- 4 DANKE-U -- 5 Use Case -- 6 Conclusions and Directions for Future Work -- A Appendix -- References -- Research Knowledge Graphs -- The Effect of Knowledge Graph Schema on Classifying Future Research Suggestions -- 1 Introduction -- 1.1 Problem Statement -- 1.2 Research Questions -- 2 Related Work -- 3 Methodology -- 3.1 Schema and Dataset -- 3.2 Graph Classification of Challenges and Directions -- 3.3 Experimental Setup -- 4 Results -- 4.1 Graph Classification -- 5 Discussion -- 5.1 Limitations and Future Research -- 6 Conclusion -- A Tables and Figures -- References -- Assessing the Overlap of Science Knowledge Graphs: A Quantitative Analysis -- 1 Introduction -- 2 A Methodology for Assessing SKG Overlap -- 2.1 Data Preparation -- 2.2 Category Alignment -- 3 Initial SKG Overlap Assessment: OpenAIRE and OpenAlex -- 3.1 Paper-Category Schema Representation -- 3.2 SKG Overlap Analysis: Data Preparation -- 3.3 SKG Overlap Analysis: Category Alignment -- 4 Results -- 5 Related Work -- 5.1 KG Alignment Based on Embeddings -- 5.2 KG Alignment Based on Machine Learning -- 6 Conclusions and Future Work -- References -- Shared Task: FoRC -- FoRC@NSLP2024: Overview and Insights from the Field of Research Classification Shared Task -- 1 Introduction -- 2 Related Work -- 3 Tasks Description -- 4 Shared Task Datasets -- 4.1 Subtask I -- 4.2 Subtask II -- 5 Results -- 5.1 Baselines -- 5.2 Subtask I -- 5.3 Subtask II -- 6 Discussion -- 7 Conclusion -- References -- NRK at FoRC 2024 Subtask I: Exploiting BERT-Based Models for Multi-class Classification of Scholarly Papers -- 1 Introduction -- 2 Related Work -- 3 Approach -- 4 Experimental Setup -- 4.1 Data -- 4.2 Configuration Settings
  • 5 Results and Discussion -- 5.1 Baseline Performance -- 5.2 Results -- 5.3 Error Analysis -- 6 Conclusion and Future Work -- References -- Advancing Automatic Subject Indexing: Combining Weak Supervision with Extreme Multi-label Classification -- 1 Introduction -- 2 Related Work -- 3 Data Analysis -- 4 Experiments -- 5 Results and Discussion -- 6 Conclusion and Future Work -- References -- Single-Label Multi-modal Field of Research Classification -- 1 Introduction -- 2 Related Work -- 3 The SLAMFORC System -- 3.1 Multi-modal Data -- 3.2 Classifier -- 4 Experiments -- 5 Conclusions -- References -- Enriched BERT Embeddings for Scholarly Publication Classification -- 1 Introduction and Background -- 2 Dataset -- 3 Enrichment -- 4 Approaches -- 4.1 BERT-Embeddings -- 4.2 Combined BERT-Embeddings (TwinBERT) -- 5 Results -- 6 Discussion and Conclusion -- References -- Shared Task: SOMD -- SOMD@NSLP2024: Overview and Insights from the Software Mention Detection Shared Task -- 1 Introduction -- 2 Related Work -- 3 Tasks Description -- 4 Dataset -- 5 Results -- 5.1 Subtask I -- 5.2 Subtask II -- 5.3 Subtask III -- 6 Conclusion -- References -- Software Mention Recognition with a Three-Stage Framework Based on BERTology Models at SOMD 2024 -- 1 Introduction -- 2 Related Work -- 3 Approach -- 3.1 Approach 1: Token Classification with BERTs -- 3.2 Approach 2: Two-Stage Framework for Entity Extraction and Classification -- 3.3 Approach 3: Three-Stage Framework -- 4 Experimental Setup -- 4.1 Data and Evaluation Metrics -- 4.2 System Settings -- 5 Main Results -- 6 Conclusion and Future Work -- References -- ABCD Team at SOMD 2024: Software Mention Detection in Scholarly Publications with Large Language Models -- 1 Introduction -- 2 Related Work -- 3 Approach -- 3.1 Overview -- 3.2 Low-Rank Adaptation -- 4 Experimental Setup -- 4.1 Dataset and Evaluation Metrics
  • 4.2 System Settings -- 5 Main Results -- 6 Discussion -- 6.1 Challenges in Applying LLMs to NER Tasks -- 6.2 Conclusion and Future Work -- References -- Falcon 7b for Software Mention Detection in Scholarly Documents -- 1 Introduction -- 2 Related Work -- 2.1 Rule-Based and Classical Machine Learning Approaches -- 2.2 Deep Learning-Based Approaches -- 2.3 Large Language Model-Based Approaches -- 3 Method -- 4 Experimental Results -- 4.1 Results -- 5 Conclusion -- References -- Enhancing Software-Related Information Extraction via Single-Choice Question Answering with Large Language Models -- 1 Introduction -- 2 Related Work -- 3 SOMD Shared Task -- 4 Using LLMs for Software Related IE-Tasks -- 4.1 Challenges in Applying LLMs to NER Tasks -- 4.2 Sample Retrieval for RAG on Various IE-Tasks -- 4.3 Extraction of Software Entities -- 4.4 Extraction of Software Attributes -- 4.5 Relation Extraction as Single-Choice Question Answering Task -- 5 Experiments -- 5.1 Models -- 5.2 Prompting -- 5.3 Train Sample Retrieval for Few-Shot Generation -- 5.4 Relation Extraction Baseline -- 6 Results -- 7 Conclusion -- A Dataset Overview -- B Similarity Search Examples -- B.1 Search by Entity Similarity -- B.2 Search by Sentence Similarity -- C Prompting Examples -- References -- Author Index