Query-guided expansion and contraction of document sets

Bibliographic Details
Published in: Information sciences Vol. 726; p. 122779
Main Authors: Deloose, Arne; De Baets, Bernard; Verwaeren, Jan
Format: Journal Article
Language: English
Published: Elsevier Inc 01.02.2026
ISSN: 0020-0255
Description
Summary: Document set expansion is a fundamental problem in information retrieval, involving the augmentation of an initial document set with additional relevant documents from a larger corpus. Query reformulation techniques, such as query expansion and refinement, offer effective means to modify initial queries and retrieve supplementary documents, but they are inherently limited by dependence on an initial query and relevance feedback. In contrast, embedding techniques represent documents in a feature space, enabling expansion or contraction by exploiting document similarities. However, existing approaches often fail to reconcile the advantages of both strategies. To address this, we propose a novel method that integrates query reformulation and embedding techniques into a unified framework. Our method aims to augment document sets, allowing both expansion and contraction, while satisfying desirable properties, including high intra-set document embedding similarity, fidelity to the initial set, and simplicity through low-complexity queries. While we show that our proposal admits a Mixed Integer Quadratic Programming formulation, which can be globally optimized for small problem sizes, we also present a computationally efficient heuristic search algorithm for larger instances. We illustrate our method using a running toy example and demonstrate its efficacy through experimental evaluation on a benchmark corpus of newspaper articles.
• Combines query reformulation and embeddings for retrieval.
• Introduces MIQP and heuristic methods for document expansion.
• Achieves high precision and recall across benchmark datasets.
• Balances interpretability with computational efficiency.
• Applicable to tracking evolving themes and AI retrieval systems.
DOI: 10.1016/j.ins.2025.122779
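
Illustrative sketch (not taken from the article): the summary names three properties a query-defined document set should have, namely high intra-set embedding similarity, fidelity to the initial set, and low query complexity. The Python below shows one way such a trade-off could be scored on a toy corpus. Everything in it is an assumption made for illustration: the disjunctive keyword query, Jaccard overlap as the fidelity measure, term count as the complexity measure, the weights, and the exhaustive enumeration, which merely stands in for the MIQP and heuristic search mentioned in the summary.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_terms, docs):
    """Indices of documents containing at least one query term.
    A disjunctive keyword query is an assumption of this sketch."""
    return {i for i, d in enumerate(docs) if query_terms & d["terms"]}


def mean_pairwise_similarity(indices, docs):
    """Average cosine similarity over all document pairs in the set."""
    pairs = list(combinations(sorted(indices), 2))
    if not pairs:
        return 0.0
    return sum(cosine(docs[i]["vec"], docs[j]["vec"]) for i, j in pairs) / len(pairs)


def jaccard(a, b):
    """Overlap between the retrieved set and the initial set (assumed fidelity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def score(query_terms, initial, docs, w_sim=1.0, w_fid=1.0, w_cx=0.1):
    """Hypothetical objective: reward similarity and fidelity, penalize query length."""
    selected = retrieve(query_terms, docs)
    return (w_sim * mean_pairwise_similarity(selected, docs)
            + w_fid * jaccard(selected, initial)
            - w_cx * len(query_terms))


def best_query(vocabulary, initial, docs, max_terms=2):
    """Exhaustively score every query of up to max_terms terms (toy sizes only)."""
    candidates = (frozenset(c)
                  for k in range(1, max_terms + 1)
                  for c in combinations(sorted(vocabulary), k))
    return max(candidates, key=lambda q: score(q, initial, docs))


# Toy corpus: each document has a term set and a 2-D embedding (made-up values).
docs = [
    {"terms": {"election", "vote"}, "vec": (1.0, 0.0)},
    {"terms": {"election", "poll"}, "vec": (0.9, 0.1)},
    {"terms": {"vote", "football"}, "vec": (0.1, 0.9)},
    {"terms": {"football", "goal"}, "vec": (0.0, 1.0)},
]
initial = {0}
vocabulary = {"election", "vote", "poll", "football", "goal"}

best = best_query(vocabulary, initial, docs)
expanded = retrieve(best, docs)
print(sorted(best), sorted(expanded))
```

On this toy corpus the single-term query "election" scores highest: it expands the initial set {0} to {0, 1}, whose two embeddings are nearly parallel, whereas "vote" would pull in document 2, which is far from document 0 in the embedding space.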