Query-guided expansion and contraction of document sets

Bibliographic details
Published in: Information Sciences, Vol. 726, p. 122779
Main authors: Deloose, Arne; De Baets, Bernard; Verwaeren, Jan
Format: Journal Article
Language: English
Published: Elsevier Inc, 1 February 2026
ISSN: 0020-0255
Description
Summary: Document set expansion is a fundamental problem in information retrieval, involving the augmentation of an initial document set with additional relevant documents from a larger corpus. Query reformulation techniques, such as query expansion and refinement, offer effective means to modify initial queries and retrieve supplementary documents, but they are inherently limited by dependence on an initial query and relevance feedback. In contrast, embedding techniques represent documents in a feature space, enabling expansion or contraction by exploiting document similarities. However, existing approaches often fail to reconcile the advantages of both strategies. To address this, we propose a novel method that integrates query reformulation and embedding techniques into a unified framework. Our method aims to augment document sets (allowing both expansion and contraction) while satisfying desirable properties, including high intra-set document embedding similarity, fidelity to the initial set, and simplicity through low-complexity queries. While we show that our proposal admits a Mixed Integer Quadratic Programming formulation, which can be globally optimized for small problem sizes, we also present a computationally efficient heuristic search algorithm for larger instances. We illustrate our method using a running toy example and demonstrate its efficacy through experimental evaluation on a benchmark corpus of newspaper articles.
• Combines query reformulation and embeddings for retrieval.
• Introduces MIQP and heuristic methods for document expansion.
• Achieves high precision and recall across benchmark datasets.
• Balances interpretability with computational efficiency.
• Applicable to tracking evolving themes and AI retrieval systems.
DOI: 10.1016/j.ins.2025.122779