Leveraging Visual Captions for Enhanced Zero-Shot HOI Detection

Detailed bibliography
Published in: Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998), pp. 1-5
Main authors: Zeng, Yanqing; Mao, Yunyao; Lu, Zhenbo; Zhou, Wengang; Li, Houqiang
Format: Conference paper
Language: English
Publication details: IEEE, 06.04.2025
ISSN: 2379-190X
Description
Summary: Zero-shot Human-Object Interaction (HOI) detection aims to identify both seen and unseen HOI categories in an image. Most existing methods rely on semantic knowledge distilled from CLIP to find novel interactions but fail to fully exploit the powerful generalization ability of vision-language models, which impairs transferability. In this paper, we introduce a novel framework for zero-shot HOI detection. We first use vision-language models (VLMs) to generate visual captions from multiple perspectives, covering humans, objects, and the environment, to enhance interaction understanding. We then propose a multi-modal fusion encoder to fully leverage these visual captions. Additionally, to give the HOI detector a thorough view of the contextual information in the image, we design a novel multi-branch HOI network that aggregates features at the instance, union, and global levels. Experiments on widely used benchmarks demonstrate that our model achieves promising performance under a variety of zero-shot settings. The source code is available at https://github.com/aqingcv/VC-HOI.
DOI: 10.1109/ICASSP49660.2025.10888344
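
Below is a minimal, illustrative sketch of the two ideas summarized in the abstract: fusing per-perspective caption embeddings (human, object, environment) with region features, and aggregating HOI predictions from instance-, union-, and global-level branches. All module names, dimensions, and the simple cross-attention/additive fusion scheme are assumptions made for illustration only; the authors' actual implementation is at https://github.com/aqingcv/VC-HOI.

```python
# Hypothetical sketch of caption fusion and multi-branch HOI aggregation.
# Shapes, module names, and the fusion scheme are illustrative assumptions,
# not the paper's implementation.
import torch
import torch.nn as nn


class CaptionFusionEncoder(nn.Module):
    """Cross-attends visual region features to caption (text) embeddings."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # visual: (B, Nv, D) region features; captions: (B, Nc, D) caption embeddings
        attended, _ = self.cross_attn(query=visual, key=captions, value=captions)
        fused = self.norm1(visual + attended)
        return self.norm2(fused + self.ffn(fused))


class MultiBranchHOIHead(nn.Module):
    """Aggregates instance-, union-, and global-level features into HOI logits."""

    def __init__(self, dim: int = 512, num_hoi_classes: int = 600):
        super().__init__()
        self.instance_proj = nn.Linear(dim, dim)
        self.union_proj = nn.Linear(dim, dim)
        self.global_proj = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, num_hoi_classes)

    def forward(self, instance_feat, union_feat, global_feat):
        # Each input: (B, N_pairs, D); global_feat is broadcast over pairs.
        agg = (
            self.instance_proj(instance_feat)
            + self.union_proj(union_feat)
            + self.global_proj(global_feat)
        )
        return self.classifier(agg)


if __name__ == "__main__":
    B, Nv, Nc, Np, D = 2, 16, 3, 5, 512  # batch, regions, captions, human-object pairs, dim
    fusion = CaptionFusionEncoder(D)
    head = MultiBranchHOIHead(D)

    regions = torch.randn(B, Nv, D)        # detector region features
    caption_embs = torch.randn(B, Nc, D)   # human / object / environment caption embeddings
    fused = fusion(regions, caption_embs)  # caption-aware region features

    instance = torch.randn(B, Np, D)       # per-pair human+object instance features
    union = torch.randn(B, Np, D)          # per-pair union-box features
    global_ctx = fused.mean(dim=1, keepdim=True).expand(-1, Np, -1)  # global context per pair
    logits = head(instance, union, global_ctx)  # (B, Np, num_hoi_classes)
    print(logits.shape)
```

The sketch uses a single additive combination of the three branches purely to keep the example short; the paper's actual aggregation and zero-shot classification strategy should be taken from the linked repository.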