Leveraging Visual Captions for Enhanced Zero-Shot HOI Detection

Detailed Bibliography
Published in: Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (2025), pp. 1-5
Main authors: Zeng, Yanqing; Mao, Yunyao; Lu, Zhenbo; Zhou, Wengang; Li, Houqiang
Format: Conference paper
Language: English
Published: IEEE, 6 April 2025
ISSN: 2379-190X
Description
Summary: Zero-shot Human-Object Interaction (HOI) detection aims to identify both seen and unseen HOI categories in an image. Most existing methods rely on semantic knowledge distilled from CLIP to find novel interactions but fail to fully exploit the powerful generalization ability of vision-language models, leading to impaired transferability. In this paper, we introduce a novel framework for zero-shot HOI detection. We first utilize vision-language models (VLMs) to generate visual captions from multiple perspectives, including humans, objects, and environments, to enhance interaction understanding. Then, we propose a multi-modal fusion encoder to fully leverage these visual captions. Additionally, to equip the HOI detector with a thorough consideration of contextual information in the image, we design a novel multi-branch HOI network that aggregates features at the instance, union, and global levels. Experiments on prevalent benchmarks demonstrate that our model achieves promising performance under a variety of zero-shot settings. The source code is available at https://github.com/aqingcv/VC-HOI.
DOI: 10.1109/ICASSP49660.2025.10888344
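
As a rough illustration of the aggregation described in the summary, the following PyTorch-style sketch fuses caption embeddings with instance-, union-, and global-level visual features through cross-attention before HOI classification. All class names, dimensions, and the choice of nn.MultiheadAttention are assumptions made for illustration only; they are a minimal sketch and do not reflect the released VC-HOI code.

# Hypothetical sketch (not the authors' released code): fusing caption text
# embeddings with instance-, union-, and global-level visual features for
# HOI classification, assuming pre-extracted features of matching dimension.
import torch
import torch.nn as nn

class MultiBranchHOIHead(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_hoi_classes=600):
        super().__init__()
        # Cross-attention lets each visual branch attend to caption tokens
        # (human / object / environment descriptions from a text encoder).
        self.caption_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(3 * dim, dim)   # merge instance, union, global branches
        self.classifier = nn.Linear(dim, num_hoi_classes)

    def forward(self, inst_feat, union_feat, global_feat, caption_tokens):
        # inst_feat, union_feat, global_feat: (B, dim); caption_tokens: (B, T, dim)
        branches = []
        for feat in (inst_feat, union_feat, global_feat):
            query = feat.unsqueeze(1)                               # (B, 1, dim)
            attended, _ = self.caption_attn(query, caption_tokens, caption_tokens)
            branches.append(attended.squeeze(1) + feat)             # residual fusion
        fused = self.fuse(torch.cat(branches, dim=-1))
        return self.classifier(fused)                               # HOI logits

# Toy usage with random tensors standing in for detector / VLM outputs.
head = MultiBranchHOIHead()
logits = head(torch.randn(2, 512), torch.randn(2, 512),
              torch.randn(2, 512), torch.randn(2, 8, 512))
print(logits.shape)  # torch.Size([2, 600])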