Geo-FuB: A method for constructing an Operator-Function knowledge base for geospatial code generation with large language models
| Published in: | Knowledge-based systems, Vol. 319, p. 113624 |
|---|---|
| Main authors: | , , , , |
| Format: | Journal Article |
| Language: | English |
| Publication details: | Elsevier B.V., 15.06.2025 |
| ISSN: | 0950-7051 |
| Summary: | The rapid growth of spatiotemporal data and the increasing demand for geospatial modeling have driven the automation of these tasks with large language models (LLMs) to enhance research efficiency. However, general-purpose LLMs often hallucinate when generating geospatial code because they lack domain-specific knowledge of geospatial functions and related operators. The retrieval-augmented generation (RAG) technique, integrated with an external operator-function knowledge base, provides an effective solution to this challenge. To date, no widely recognized framework exists for building such a knowledge base. This study presents a comprehensive framework for constructing the operator-function knowledge base, leveraging semantic and structural knowledge embedded in geospatial scripts. The framework consists of three core components: Function Semantic Framework Construction (Geo-FuSE), Frequent Operator Combination Statistics (Geo-FuST), and Combination and Semantic Framework Mapping (Geo-FuM). Geo-FuSE employs techniques such as Chain-of-Thought (CoT), TF-IDF, t-SNE, and Gaussian Mixture Models (GMM) to extract semantic features from scripts; Geo-FuST uses Abstract Syntax Trees (AST) and the Apriori algorithm to identify frequent operator combinations; Geo-FuM combines LLMs with a fuzzy matching algorithm to align these combinations with the semantic framework, forming the Geo-FuB knowledge base. An instance of Geo-FuB, named GEE-FuB, has been developed using 154,075 Google Earth Engine scripts and is available at https://github.com/whuhsy/GEE-FuB. Based on a set of well-defined evaluation metrics introduced in this study, the GEE-FuB construction achieved an overall accuracy of 88.89%, demonstrating a 31% to 34% reduction in hallucinations compared to mainstream LLMs without external knowledge integration. This research introduces a novel approach to knowledge mining and knowledge base construction specifically tailored for geospatial code generation tasks, broadening the applications of knowledge base construction and providing valuable theoretical insights, practical examples, and data resources for related research fields. |
| DOI: | 10.1016/j.knosys.2025.113624 |
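
The summary describes Geo-FuST as combining Abstract Syntax Trees with the Apriori algorithm to find frequent operator combinations. The following is only a minimal sketch of that general idea, not the authors' implementation. It assumes the scripts are written in Python-API style so that Python's standard `ast` module can parse them (the GEE-FuB scripts themselves may be JavaScript, which would need a different parser). The toy scripts, the helper names `extract_operators` and `frequent_itemsets`, and the support threshold are all hypothetical.

```python
import ast
from collections import Counter
from itertools import combinations


def extract_operators(source):
    """Return the set of function/method names called in a Python script."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):   # e.g. x.filterDate(...)
                names.add(func.attr)
            elif isinstance(func, ast.Name):      # e.g. print(...)
                names.add(func.id)
    return names


def frequent_itemsets(transactions, min_support, max_size=3):
    """Level-wise (Apriori-style) search for itemsets with support >= min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single operators.
    counts = Counter(item for t in transactions for item in t)
    current = {frozenset([i]) for i, c in counts.items() if c / n >= min_support}
    frequent, size = {}, 1
    while current and size <= max_size:
        for itemset in current:
            frequent[itemset] = support(itemset)
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune step: every (size-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical toy scripts; they are only parsed, never executed.
scripts = [
    "col = ee.ImageCollection('X').filterDate('2020-01-01', '2020-12-31').median()",
    "img = ee.ImageCollection('Y').filterDate('2021-01-01', '2021-06-30').median().clip(roi)",
    "img = ee.Image('Z').clip(roi)",
]
transactions = [extract_operators(s) for s in scripts]
frequent = frequent_itemsets(transactions, min_support=0.6)

for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(sup, 2))
```

Treating each script as a set of operators ignores call order and nesting. A knowledge base built on a real corpus would probably need to keep more of the AST structure than this sketch does.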