Query-guided expansion and contraction of document sets
| Published in: | Information Sciences, Vol. 726, p. 122779 |
|---|---|
| Main authors: | , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier Inc, 1 February 2026 |
| Subjects: | |
| ISSN: | 0020-0255 |
| Summary: | Document set expansion is a fundamental problem in information retrieval, involving the augmentation of an initial document set with additional relevant documents from a larger corpus. Query reformulation techniques, such as query expansion and refinement, offer effective means to modify initial queries and retrieve supplementary documents, but they are inherently limited by dependence on an initial query and relevance feedback. In contrast, embedding techniques represent documents in a feature space, enabling expansion or contraction by exploiting document similarities. However, existing approaches often fail to reconcile the advantages of both strategies. To address this, we propose a novel method that integrates query reformulation and embedding techniques into a unified framework. Our method aims to augment document sets, allowing both expansion and contraction, while satisfying desirable properties, including high intra-set document embedding similarity, fidelity to the initial set, and simplicity through low-complexity queries. While we show that our proposal admits a Mixed Integer Quadratic Programming formulation, which can be globally optimized for small problem sizes, we also present a computationally efficient heuristic search algorithm for larger instances. We illustrate our method using a running toy example and demonstrate its efficacy through experimental evaluation on a benchmark corpus of newspaper articles. Highlights: • Combines query reformulation and embeddings for retrieval. • Introduces MIQP and heuristic methods for document expansion. • Achieves high precision and recall across benchmark datasets. • Balances interpretability with computational efficiency. • Applicable to tracking evolving themes and AI retrieval systems. |
|---|---|
| DOI: | 10.1016/j.ins.2025.122779 |