OmniViD: A Generative Framework for Universal Video Understanding
| Published in: | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 18209-18220 |
|---|---|
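| Main authors: | Wang, Junke; Chen, Dongdong; Luo, Chong; He, Bo; Yuan, Lu; Wu, Zuxuan; Jiang, Yu-Gang |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 16 June 2024 |
| Subjects: | Benchmark testing; Computer architecture; Location awareness; Robustness; Training; Visualization; Vocabulary |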
| ISSN: | 1063-6919 |
| Abstract | The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution. Despite sharing a common goal, different tasks often rely on distinct model architectures and annotation formats. In contrast, natural language processing benefits from a unified output space, i.e., text sequences, which simplifies the training of powerful foundational language models, such as GPT-3, with extensive training corpora. Inspired by this, we seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens. In this way, a variety of video tasks could be formulated as video-grounded token generation. This enables us to address various types of video tasks, including classification (such as action recognition), captioning (covering clip captioning, video question answering, and dense video captioning), and localization tasks (such as visual object tracking) within a fully shared encoder-decoder architecture, following a generative framework. Through comprehensive experiments, we demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results on seven video benchmarks, providing a novel perspective for more universal video understanding. Code is available at https://github.com/wangjk666/OmniVid. |
|---|---|
| Author | Wang, Junke; Wu, Zuxuan; Jiang, Yu-Gang; Chen, Dongdong; Luo, Chong; Yuan, Lu; He, Bo |
| Author_xml | 1. Wang, Junke (School of CS, Fudan University, Shanghai Key Lab of Intell. Info. Processing); 2. Chen, Dongdong (Microsoft Cloud + AI); 3. Luo, Chong (Microsoft Research Asia); 4. He, Bo (University of Maryland, College Park); 5. Yuan, Lu (Microsoft Cloud + AI); 6. Wu, Zuxuan (School of CS, Fudan University, Shanghai Key Lab of Intell. Info. Processing); 7. Jiang, Yu-Gang (School of CS, Fudan University, Shanghai Key Lab of Intell. Info. Processing) |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IH CBEJK RIE RIO |
| DOI | 10.1109/CVPR52733.2024.01724 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan (POP) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP) 1998-present |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Applied Sciences |
| EISBN | 9798350353006 |
| EISSN | 1063-6919 |
| EndPage | 18220 |
| ExternalDocumentID | 10654930 |
| Genre | orig-research |
| GroupedDBID | 6IE 6IH 6IL 6IN AAWTH ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IJVOP OCL RIE RIL RIO |
| IEDL.DBID | RIE |
| ISICitedReferencesCount | 8 |
| IngestDate | Wed Aug 27 01:55:18 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| LinkModel | DirectLink |
| PageCount | 12 |
| ParticipantIDs | ieee_primary_10654930 |
| PublicationCentury | 2000 |
| PublicationDate | 2024-June-16 |
| PublicationDateYYYYMMDD | 2024-06-16 |
| PublicationDate_xml | – month: 06 year: 2024 text: 2024-June-16 day: 16 |
| PublicationDecade | 2020 |
| PublicationTitle | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) |
| PublicationTitleAbbrev | CVPR |
| PublicationYear | 2024 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssj0003211698 |
| Score | 2.417665 |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 18209 |
| SubjectTerms | Benchmark testing; Computer architecture; Location awareness; Robustness; Training; Visualization; Vocabulary |
| Title | OmniViD: A Generative Framework for Universal Video Understanding |
| URI | https://ieeexplore.ieee.org/document/10654930 |
| inHoldings | 1 |
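
The abstract describes casting video tasks as token generation by adding time and box tokens to a text vocabulary. The following is a minimal sketch of that idea, not the authors' implementation: the bin count `NUM_BINS`, the token spellings `<box_k>` and `<time_k>`, and all function names are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

# Hypothetical number of quantization bins; the record does not state the real value.
NUM_BINS = 300


def quantize(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a value in [0, 1] to an integer bin index in [0, num_bins - 1]."""
    clipped = min(max(value, 0.0), 1.0)
    return min(int(clipped * num_bins), num_bins - 1)


def box_tokens(box: Tuple[float, float, float, float]) -> List[str]:
    """Turn a normalized (x1, y1, x2, y2) box into four discrete box tokens."""
    return [f"<box_{quantize(c)}>" for c in box]


def time_tokens(start: float, end: float, duration: float) -> List[str]:
    """Turn a [start, end] segment (seconds) into two discrete time tokens."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return [f"<time_{quantize(t / duration)}>" for t in (start, end)]


def dense_caption_target(
    events: Sequence[Tuple[float, float, str]], duration: float
) -> List[str]:
    """Build one target sequence: time tokens followed by caption words per event."""
    tokens: List[str] = []
    for start, end, caption in events:
        tokens += time_tokens(start, end, duration) + caption.split()
    return tokens
```

Under these assumptions, a tracking target would be a short sequence of box tokens and a dense-captioning target would interleave time tokens with words, so a single decoder could in principle emit either kind of output.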