Leveraging Visual Captions for Enhanced Zero-Shot HOI Detection
Saved in:
| Published in: | Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (2025), pp. 1 - 5 |
|---|---|
| Main Authors: | |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 06.04.2025 |
| Subjects: | |
| ISSN: | 2379-190X |
| Online Access: | Get full text |
| Summary: | Zero-shot Human-Object Interaction (HOI) detection aims to identify both seen and unseen HOI categories in an image. Most existing methods rely on semantic knowledge distilled from CLIP to find novel interactions but fail to fully exploit the powerful generalization ability of vision-language models, leading to impaired transferability. In this paper, we introduce a novel framework for zero-shot HOI detection. We first utilize vision-language models (VLMs) to generate visual captions from multiple perspectives, including humans, objects, and environments, to enhance interaction understanding. Then, we propose a multi-modal fusion encoder to fully leverage these visual captions. Additionally, to equip the HOI detector with a thorough consideration of contextual information in the image, we design a novel multi-branch HOI network that aggregates features at the instance, union, and global levels. Experiments on prevalent benchmarks demonstrate that our model achieves promising performance under a variety of zero-shot settings. The source codes are available at https://github.com/aqingcv/VC-HOI. |
| DOI: | 10.1109/ICASSP49660.2025.10888344 |
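
The summary above describes a multi-branch HOI network that aggregates features at the instance, union, and global levels. The sketch below illustrates that aggregation idea in PyTorch; it is a minimal illustration, not the authors' released code (see the repository linked in the summary for that). The class name `MultiBranchHOIHead`, the dimensions `d_model` and `num_hoi_classes`, and the simple additive fusion are all assumptions made for clarity.

```python
# A minimal sketch (NOT the authors' released implementation) of the
# multi-branch aggregation idea described in the summary: features pooled
# at the instance, union, and global levels are fused before HOI scoring.
# All names and dimensions here are illustrative assumptions.

import torch
import torch.nn as nn


class MultiBranchHOIHead(nn.Module):
    """Fuses instance-, union-, and global-level features for HOI scoring."""

    def __init__(self, d_model: int = 256, num_hoi_classes: int = 600):
        super().__init__()
        # One projection per context level, mirroring the three branches
        # mentioned in the summary (instance / union / global).
        self.instance_proj = nn.Linear(2 * d_model, d_model)  # human + object
        self.union_proj = nn.Linear(d_model, d_model)
        self.global_proj = nn.Linear(d_model, d_model)
        self.classifier = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, num_hoi_classes),
        )

    def forward(
        self,
        human_feat: torch.Tensor,   # (N, d_model) pooled human-box features
        object_feat: torch.Tensor,  # (N, d_model) pooled object-box features
        union_feat: torch.Tensor,   # (N, d_model) pooled human-object union boxes
        global_feat: torch.Tensor,  # (1, d_model) image-level context
    ) -> torch.Tensor:
        instance = self.instance_proj(torch.cat([human_feat, object_feat], dim=-1))
        union = self.union_proj(union_feat)
        # Broadcast the single global vector to every human-object pair.
        img_ctx = self.global_proj(global_feat).expand_as(instance)
        fused = instance + union + img_ctx  # simple additive aggregation
        return self.classifier(fused)      # per-pair HOI logits


if __name__ == "__main__":
    head = MultiBranchHOIHead()
    n_pairs = 4
    logits = head(
        torch.randn(n_pairs, 256),
        torch.randn(n_pairs, 256),
        torch.randn(n_pairs, 256),
        torch.randn(1, 256),
    )
    print(logits.shape)  # torch.Size([4, 600])
```

In a zero-shot setting such as the one the paper targets, the fused pair representations would presumably be matched against CLIP-style text embeddings of HOI labels rather than a fixed linear classifier; the plain classification head here is only a stand-in.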