Interactive Image Caption Generation Reflecting User Intent from Trace Using a Diffusion Language Model

Detailed bibliography
Published in: Journal of Advanced Computational Intelligence and Intelligent Informatics, Vol. 29, No. 6, pp. 1417-1426
Main authors: Hirano, Satoko; Kobayashi, Ichiro
Format: Journal Article
Language: English
Published: Tokyo: Fuji Technology Press Co. Ltd, November 20, 2025
ISSN: 1343-0130, 1883-8014
Description
Summary: This study proposes an image captioning method designed to incorporate user-specific explanatory intentions into the generated text, as signaled by the user's trace on the image. We extract areas of interest from dense sections of the trace, determine the order of explanations by tracking changes in the pen-tip coordinates, and assess the degree of interest in each area by analyzing the time spent on it. Additionally, a diffusion language model is used to generate sentences in a non-autoregressive manner, allowing control over sentence length based on the temporal data of the trace. In an actual caption generation task, the proposed method achieved higher string similarity than conventional methods, including autoregressive models, and successfully captured user intent from the trace, faithfully reflecting it in the generated text.
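The trace-analysis pipeline the summary describes — grouping dense trace sections into areas of interest, ordering them by pen-tip movement, and weighting them by time spent — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `(x, y, t)` sample format, the `regions_from_trace` name, and the simple distance-threshold grouping (a stand-in for the paper's density-based extraction) are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Region:
    points: list          # (x, y, t) pen-tip samples assigned to this region
    order: int = 0        # explanation order, by first-visit time
    dwell: float = 0.0    # total time spent here, a proxy for degree of interest

def regions_from_trace(trace, radius=30.0):
    """Group consecutive pen-tip samples into regions of interest.

    A new region starts whenever the pen moves farther than `radius`
    from the running centroid of the current region (an illustrative
    stand-in for extracting dense sections of the trace).
    """
    regions, current = [], []
    for x, y, t in trace:
        if current:
            cx = sum(p[0] for p in current) / len(current)
            cy = sum(p[1] for p in current) / len(current)
            if ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 > radius:
                regions.append(Region(current))
                current = []
        current.append((x, y, t))
    if current:
        regions.append(Region(current))
    # Order regions chronologically; dwell time = span of timestamps inside.
    for i, r in enumerate(sorted(regions, key=lambda r: r.points[0][2])):
        r.order = i
        r.dwell = r.points[-1][2] - r.points[0][2]
    return regions
```

In a setup like this, each region's `dwell` value could then set the sentence-length budget that the diffusion language model's non-autoregressive decoding respects, matching the summary's claim of controlling sentence length from the trace's temporal data.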
DOI: 10.20965/jaciii.2025.p1417