Leveraging Visual Captions for Enhanced Zero-Shot HOI Detection

Bibliographic Details
Published in: Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (2025), pp. 1-5
Main Authors: Zeng, Yanqing; Mao, Yunyao; Lu, Zhenbo; Zhou, Wengang; Li, Houqiang
Format: Conference Proceeding
Language: English
Published: IEEE, 06.04.2025
ISSN: 2379-190X
Description
Summary: Zero-shot Human-Object Interaction (HOI) detection aims to identify both seen and unseen HOI categories in an image. Most existing methods rely on semantic knowledge distilled from CLIP to find novel interactions but fail to fully exploit the powerful generalization ability of vision-language models, leading to impaired transferability. In this paper, we introduce a novel framework for zero-shot HOI detection. We first utilize vision-language models (VLMs) to generate visual captions from multiple perspectives, including humans, objects, and environments, to enhance interaction understanding. Then, we propose a multi-modal fusion encoder to fully leverage these visual captions. Additionally, to equip the HOI detector with a thorough consideration of contextual information in the image, we design a novel multi-branch HOI network that aggregates features at the instance, union, and global levels. Experiments on widely used benchmarks demonstrate that our model achieves promising performance under a variety of zero-shot settings. The source code is available at https://github.com/aqingcv/VC-HOI.
DOI: 10.1109/ICASSP49660.2025.10888344
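
The summary above names three components: VLM-generated captions from human, object, and environment perspectives; a multi-modal fusion encoder; and a multi-branch network that aggregates instance-, union-, and global-level features. The following is a minimal PyTorch-style sketch of the latter two ideas, not the authors' implementation (their code is at the GitHub link above); the module names, dimensions, and the cross-attention fusion choice are assumptions made for illustration only.

import torch
import torch.nn as nn

class MultiModalFusionEncoder(nn.Module):
    # Fuses visual tokens with embedded caption tokens via cross-attention
    # (a common way to inject text context; the paper's exact fusion may differ).
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, caption_tokens):
        # visual_tokens: (B, Nv, D); caption_tokens: (B, Nt, D)
        attended, _ = self.cross_attn(visual_tokens, caption_tokens, caption_tokens)
        return self.norm(visual_tokens + attended)

class MultiBranchHOIHead(nn.Module):
    # Aggregates instance- (human + object), union-, and global-level features
    # before predicting HOI class logits for each human-object pair.
    def __init__(self, dim=512, num_hoi_classes=600):
        super().__init__()
        self.instance_proj = nn.Linear(2 * dim, dim)
        self.union_proj = nn.Linear(dim, dim)
        self.global_proj = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, num_hoi_classes)

    def forward(self, human_feat, object_feat, union_feat, global_feat):
        instance = self.instance_proj(torch.cat([human_feat, object_feat], dim=-1))
        fused = instance + self.union_proj(union_feat) + self.global_proj(global_feat)
        return self.classifier(fused)

if __name__ == "__main__":
    B, D = 2, 512
    fusion = MultiModalFusionEncoder(dim=D)
    head = MultiBranchHOIHead(dim=D)
    region_tokens = torch.randn(B, 10, D)    # detected human/object region features
    caption_tokens = torch.randn(B, 24, D)   # embedded captions (human/object/environment views)
    tokens = fusion(region_tokens, caption_tokens)
    logits = head(tokens[:, 0], tokens[:, 1],              # one human-object pair
                  tokens.mean(dim=1), tokens.mean(dim=1))  # union/global stand-ins
    print(logits.shape)   # torch.Size([2, 600])

In this sketch the union- and global-level inputs are simple stand-ins (mean-pooled tokens); in an actual detector they would come from the union box of each human-object pair and from the whole image, as the summary describes.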