Query-guided expansion and contraction of document sets

Document set expansion is a fundamental problem in information retrieval, involving the augmentation of an initial document set with additional relevant documents from a larger corpus. Query reformulation techniques, such as query expansion and refinement, offer effective means to modify initial que...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Information sciences Jg. 726; S. 122779
Hauptverfasser: Deloose, Arne, De Baets, Bernard, Verwaeren, Jan
Format: Journal Article
Sprache:Englisch
Veröffentlicht: Elsevier Inc 01.02.2026
Schlagworte:
ISSN:0020-0255
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Document set expansion is a fundamental problem in information retrieval, involving the augmentation of an initial document set with additional relevant documents from a larger corpus. Query reformulation techniques, such as query expansion and refinement, offer effective means to modify initial queries and retrieve supplementary documents, but they are inherently limited by dependence on an initial query and relevance feedback. In contrast, embedding techniques represent documents in a feature space, enabling expansion or contraction by exploiting document similarities. However, existing approaches often fail to reconcile the advantages of both strategies. To address this, we propose a novel method that integrates query reformulation and embedding techniques into a unified framework. Our method aims to augment document sets—allowing both expansion and contraction—while satisfying desirable properties, including high intra-set document embedding similarity, fidelity to the initial set, and simplicity through low-complexity queries. While we show that our proposal admits a Mixed Integer Quadratic Programming formulation, which can be globally optimized for small problem sizes, we also present a computationally efficient heuristic search algorithm for larger instances. We illustrate our method using a running toy example and demonstrate its efficacy through experimental evaluation on a benchmark corpus of newspaper articles. •Combines query reformulation and embeddings for retrieval.•Introduces MIQP and heuristic methods for document expansion.•Achieves high precision and recall across benchmark datasets.•Balances interpretability with computational efficiency.•Applicable to tracking evolving themes and AI retrieval systems.
ISSN:0020-0255
DOI:10.1016/j.ins.2025.122779