Query-guided expansion and contraction of document sets
| Published in: | Information sciences Vol. 726; p. 122779 |
|---|---|
| Main Authors: | , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier Inc, 01.02.2026 |
| Subjects: | |
| ISSN: | 0020-0255 |
| Summary: | Document set expansion is a fundamental problem in information retrieval, involving the augmentation of an initial document set with additional relevant documents from a larger corpus. Query reformulation techniques, such as query expansion and refinement, offer effective means to modify initial queries and retrieve supplementary documents, but they are inherently limited by dependence on an initial query and relevance feedback. In contrast, embedding techniques represent documents in a feature space, enabling expansion or contraction by exploiting document similarities. However, existing approaches often fail to reconcile the advantages of both strategies. To address this, we propose a novel method that integrates query reformulation and embedding techniques into a unified framework. Our method aims to augment document sets—allowing both expansion and contraction—while satisfying desirable properties, including high intra-set document embedding similarity, fidelity to the initial set, and simplicity through low-complexity queries. While we show that our proposal admits a Mixed Integer Quadratic Programming formulation, which can be globally optimized for small problem sizes, we also present a computationally efficient heuristic search algorithm for larger instances. We illustrate our method using a running toy example and demonstrate its efficacy through experimental evaluation on a benchmark corpus of newspaper articles.
• Combines query reformulation and embeddings for retrieval. • Introduces MIQP and heuristic methods for document expansion. • Achieves high precision and recall across benchmark datasets. • Balances interpretability with computational efficiency. • Applicable to tracking evolving themes and AI retrieval systems. |
|---|---|
| DOI: | 10.1016/j.ins.2025.122779 |
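
The summary names three desirable properties (high intra-set embedding similarity, fidelity to the initial set, and low-complexity queries) and a heuristic search for larger instances. The toy sketch below illustrates how such a trade-off might be scored and searched greedily. It is not the article's MIQP formulation or its heuristic: the conjunctive keyword queries, the Jaccard-based fidelity term, the weights `alpha`, `beta`, `gamma`, and all function names and data are assumptions made for illustration only.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, docs):
    """Ids of documents containing every query term (assumed conjunctive query)."""
    return {doc_id for doc_id, (terms, _) in docs.items() if query <= terms}


def score(query, docs, initial, alpha=1.0, beta=1.0, gamma=0.1):
    """Hypothetical objective: cohesion + fidelity - query complexity."""
    hits = retrieve(query, docs)
    if not hits:
        return float("-inf")
    pairs = list(combinations(sorted(hits), 2))
    # Mean pairwise embedding similarity; a single document counts as fully cohesive.
    cohesion = (
        sum(cosine(docs[a][1], docs[b][1]) for a, b in pairs) / len(pairs)
        if pairs
        else 1.0
    )
    # Jaccard overlap between the retrieved set and the initial set.
    fidelity = len(hits & initial) / len(hits | initial)
    return alpha * cohesion + beta * fidelity - gamma * len(query)


def greedy_search(docs, initial, vocabulary, **weights):
    """Add one term at a time while the hypothetical objective improves."""
    query = frozenset()
    best = score(query, docs, initial, **weights)
    improved = True
    while improved:
        improved = False
        for term in sorted(vocabulary - query):
            candidate = query | {term}
            value = score(candidate, docs, initial, **weights)
            if value > best:
                best, best_query, improved = value, candidate, True
        if improved:
            query = best_query
    return query, retrieve(query, docs), best


# Made-up corpus: id -> (terms, 2-d embedding).
docs = {
    "d1": ({"election", "vote"}, (1.0, 0.0)),
    "d2": ({"election", "poll"}, (0.9, 0.1)),
    "d3": ({"football", "vote"}, (0.0, 1.0)),
    "d4": ({"election", "vote", "poll"}, (0.95, 0.05)),
}
initial = {"d1", "d2"}
vocabulary = {"election", "vote", "poll", "football"}

query, result, value = greedy_search(docs, initial, vocabulary)
print(sorted(query), sorted(result))
```

On this made-up corpus the search settles on the single-term query `election`, which keeps both initial documents and adds `d4`. Raising `gamma` penalizes longer queries more heavily, while raising `beta` keeps the result closer to the initial set.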