Geo-FuB: A method for constructing an Operator-Function knowledge base for geospatial code generation with large language models

Bibliographic details
Published in: Knowledge-Based Systems, Vol. 319, p. 113624
Main authors: Hou, Shuyang; Zhao, Anqi; Liang, Jianyuan; Shen, Zhangxiao; Wu, Huayi
Format: Journal Article
Language: English
Published: Elsevier B.V., 15 June 2025
ISSN:0950-7051
Description
Summary: The rapid growth of spatiotemporal data and the increasing demand for geospatial modeling have driven efforts to automate these tasks with large language models (LLMs) to improve research efficiency. However, general-purpose LLMs often hallucinate when generating geospatial code because they lack domain-specific knowledge of geospatial functions and related operators. The retrieval-augmented generation (RAG) technique, integrated with an external operator-function knowledge base, provides an effective solution to this challenge. To date, no widely recognized framework exists for building such a knowledge base. This study presents a comprehensive framework for constructing the operator-function knowledge base, leveraging the semantic and structural knowledge embedded in geospatial scripts. The framework consists of three core components: Function Semantic Framework Construction (Geo-FuSE), Frequent Operator Combination Statistics (Geo-FuST), and Combination and Semantic Framework Mapping (Geo-FuM). Geo-FuSE employs techniques such as Chain-of-Thought (CoT), TF-IDF, t-SNE, and Gaussian Mixture Models (GMM) to extract semantic features from scripts; Geo-FuST uses Abstract Syntax Trees (AST) and the Apriori algorithm to identify frequent operator combinations; Geo-FuM combines LLMs with a fuzzy matching algorithm to align these combinations with the semantic framework, forming the Geo-FuB knowledge base. An instance of Geo-FuB, named GEE-FuB, has been developed using 154,075 Google Earth Engine scripts and is available at https://github.com/whuhsy/GEE-FuB. Based on a set of well-defined evaluation metrics introduced in this study, the GEE-FuB construction achieved an overall accuracy of 88.89%, demonstrating a 31% to 34% reduction in hallucinations compared to mainstream LLMs without external knowledge integration. This research introduces a novel approach to knowledge mining and knowledge base construction specifically tailored for geospatial code generation tasks, broadening the applications of knowledge base construction and providing valuable theoretical insights, practical examples, and data resources for related research fields.
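
The Geo-FuST step described in the summary (AST parsing followed by Apriori mining of frequent operator combinations) can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `operator_name`, `operators_in`, and `frequent_itemsets`, the three toy scripts, and the support threshold are all hypothetical. It also assumes Python-syntax scripts so that the standard `ast` module can be used, whereas Google Earth Engine scripts are commonly written in JavaScript.

```python
import ast
from collections import Counter
from itertools import combinations


def operator_name(call):
    """Name a call node: 'ee.ImageCollection' for a dotted call,
    or just the method name ('filterDate') for a chained call."""
    func, parts = call.func, []
    while isinstance(func, ast.Attribute):
        parts.append(func.attr)
        func = func.value
    if isinstance(func, ast.Name):
        parts.append(func.id)
    return ".".join(reversed(parts))


def operators_in(script):
    """Parse one script and return the set of operators it calls."""
    calls = (n for n in ast.walk(ast.parse(script)) if isinstance(n, ast.Call))
    return frozenset(name for name in map(operator_name, calls) if name)


def frequent_itemsets(transactions, min_support, max_size=3):
    """Level-wise Apriori: keep itemsets whose support >= min_support."""
    n = len(transactions)
    counts = Counter(frozenset([i]) for t in transactions for i in t)
    current = {s for s, c in counts.items() if c / n >= min_support}
    result = {s: counts[s] / n for s in current}
    k = 2
    while current and k <= max_size:
        items = sorted(set().union(*current))
        # Apriori pruning: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            frozenset(c)
            for c in combinations(items, k)
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        current = {c for c, cnt in counts.items() if cnt / n >= min_support}
        result.update({c: counts[c] / n for c in current})
        k += 1
    return result


# Toy scripts (hypothetical); they are only parsed, never executed.
scripts = [
    'img = ee.ImageCollection("A").filterDate("2020", "2021").median()',
    'img = ee.ImageCollection("B").filterDate("2019", "2020").mean()',
    'img = ee.ImageCollection("C").filterBounds(roi).median()',
]
transactions = [operators_in(s) for s in scripts]
frequent = frequent_itemsets(transactions, min_support=0.6)

for itemset, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(support, 2))
```

Each script is treated as one transaction (its set of operators), so a combination's support is the fraction of scripts in which all of its operators co-occur. A real pipeline over a large corpus would more likely use an optimized Apriori implementation and a parser for the scripts' actual language.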
DOI:10.1016/j.knosys.2025.113624