VrdONE: One-stage Video Visual Relation Detection

Bibliographic details
Title: VrdONE: One-stage Video Visual Relation Detection
Authors: Xinjie Jiang, Chenxi Zheng, Xuemiao Xu, Bangzhen Liu, Weiying Zheng, Huaidong Zhang, Shengfeng He
Source: Proceedings of the 32nd ACM International Conference on Multimedia. :1437-1446
Publication Status: Preprint
Publisher information: ACM, 2024.
Publication year: 2024
Subjects: FOS: Computer and information sciences, Artificial Intelligence and Robotics, One-stage, Video understanding, Spatiotemporally synergism, Computer Vision and Pattern Recognition (cs.CV), Scene understanding, Graphics and Human Computer Interfaces, Computer Science - Computer Vision and Pattern Recognition, Video relation detection, Set prediction
Description: Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for gaining deeper insights into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relation categories are present and another for determining their temporal boundaries. This split overlooks the inherent connection between these elements. Addressing the need to recognize entity pairs' spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet efficacious one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup allows for both relation category identification and binary mask generation in one go, eliminating the need for extra steps like proposal generation or post-processing. VrdONE facilitates the interaction of features across various frames, adeptly capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) module, enhancing how subjects and objects perceive each other before combining. VrdONE achieves state-of-the-art performance on the VidOR benchmark and ImageNet-VidVRD, showcasing its superior capability in discerning relations across different temporal scales. The code is available at https://github.com/lucaspk512/vrdone.
12 pages, 8 figures, accepted by ACM Multimedia 2024
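Illustrative note: the abstract frames predicate detection as 1D instance segmentation, where each relation instance pairs a category with a binary mask over frames. The sketch below is not the authors' implementation. It only shows, under assumed inputs, how such an output could be read: the names `mask_to_segments` and `decode_relation`, the 0.5 threshold, and the example labels are all hypothetical.

```python
from typing import List, Sequence, Tuple


def mask_to_segments(mask_probs: Sequence[float],
                     threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Turn per-frame mask probabilities into inclusive (start, end) frame spans.

    Hypothetical helper: the threshold value is an assumption, not from the paper.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for t, p in enumerate(mask_probs):
        active = p >= threshold
        if active and start is None:
            start = t                      # a new span opens at frame t
        elif not active and start is not None:
            segments.append((start, t - 1))  # the span closed at the previous frame
            start = None
    if start is not None:                  # span runs to the last frame
        segments.append((start, len(mask_probs) - 1))
    return segments


def decode_relation(class_scores: Sequence[float],
                    mask_probs: Sequence[float],
                    labels: Sequence[str],
                    threshold: float = 0.5) -> Tuple[str, List[Tuple[int, int]]]:
    """Pick the highest-scoring predicate label and read its temporal spans."""
    best = max(range(len(class_scores)), key=lambda i: class_scores[i])
    return labels[best], mask_to_segments(mask_probs, threshold)


if __name__ == "__main__":
    # Assumed toy values for one subject-object pair over five frames.
    labels = ["chase", "ride", "watch"]
    print(decode_relation([0.2, 0.7, 0.1], [0.1, 0.9, 0.8, 0.2, 0.7], labels))
```

The point of the sketch is the representation: one prediction carries both the relation category and its temporal extent, so a short-lived and a long-lasting relation differ only in the length of the active span in the mask.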
Document type: Article
DOI: 10.1145/3664647.3680833
DOI: 10.48550/arxiv.2408.09408
Access URL: http://arxiv.org/abs/2408.09408
Rights: arXiv Non-Exclusive Distribution
URL: https://www.acm.org/publications/policies/copyright_policy#Background
Accession number: edsair.doi.dedup.....1a1b58a5b8f0e820cc467e68309e7f49
Database: OpenAIRE