Augmented decoding method using semantic diverse beam search for language generation model

Bibliographic Details
Published in: Knowledge-Based Systems, Vol. 329, p. 114400
Main Authors: Na, HyungSun, Jun, Hee-Gook, Ahn, Jinhyun, Im, Dong-Hyuk
Format: Journal Article
Language: English
Published: Elsevier B.V. 04.11.2025
ISSN: 0950-7051
Description
Summary: Image captioning, the task of automatically generating natural language descriptions from visual content, has achieved remarkable accuracy in recent years. However, current approaches face a critical limitation in semantic diversity. Most diversity-oriented methods evaluate similarity at the surface lexical level, incorrectly treating lexically different but semantically equivalent phrases (e.g., 'dog runs' vs 'canine sprints') as meaningfully diverse outputs. This superficial approach fails to capture true semantic variation. Consequently, generated captions appear different but convey essentially identical meanings. To address this fundamental limitation, we propose Semantic Diverse Beam Search (SDBS), an augmented decoding algorithm that operates in semantic space rather than surface lexical space. SDBS integrates four key innovations: knowledge graph-based semantic similarity scoring, adaptive thresholding for important word focus, statistics-based stratified top-k sampling, and beam size normalization. Additionally, we introduce an early-stop strategy that significantly reduces computational complexity while maintaining generation quality, making SDBS practically viable for real-world applications. Comprehensive experiments demonstrate that SDBS achieves superior performance on both traditional metrics and modern evaluation approaches (BARTScore++, LLM-based assessment), generating captions with genuine semantic diversity while maintaining high accuracy and computational efficiency.
DOI: 10.1016/j.knosys.2025.114400
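
Illustrative note: the summary contrasts lexical and semantic similarity using 'dog runs' vs 'canine sprints'. The sketch below is not the SDBS algorithm from the article, whose details this record does not give. It is a minimal toy illustration of the general idea, assuming a hand-made word-to-concept table (`CONCEPT`) as a stand-in for a knowledge graph, Jaccard overlap as the similarity measure, and a simple diversity penalty of the kind used in diverse beam search. All names, table entries, and scores are hypothetical.

```python
# Toy stand-in for a knowledge graph: maps a word to a concept label.
# These entries are invented for illustration only.
CONCEPT = {
    "dog": "canine", "canine": "canine",
    "runs": "run", "sprints": "run",
    "sleeps": "sleep",
}


def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lexical_similarity(c1, c2):
    """Similarity over surface words only."""
    return jaccard(set(c1.lower().split()), set(c2.lower().split()))


def semantic_similarity(c1, c2):
    """Similarity over concepts; unknown words map to themselves."""
    def concepts(caption):
        return {CONCEPT.get(w, w) for w in caption.lower().split()}
    return jaccard(concepts(c1), concepts(c2))


def rerank(candidates, selected, sim, penalty=1.0):
    """Subtract penalty * (max similarity to already selected captions)
    from each candidate's log-probability, then sort best first."""
    rescored = []
    for caption, logprob in candidates:
        max_sim = max((sim(caption, s) for s in selected), default=0.0)
        rescored.append((caption, logprob - penalty * max_sim))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


selected = ["dog runs"]
candidates = [("canine sprints", -1.0), ("dog sleeps", -1.5)]  # made-up scores

best_lexical = rerank(candidates, selected, lexical_similarity)[0][0]
best_semantic = rerank(candidates, selected, semantic_similarity)[0][0]
print(best_lexical)   # the paraphrase is not penalized by the lexical measure
print(best_semantic)  # the semantic measure prefers the caption with new meaning
```

With the lexical measure, 'canine sprints' shares no words with 'dog runs' and so is treated as fully diverse; with the concept-level measure it is treated as a duplicate and the caption that adds new meaning is preferred.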