OmniViD: A Generative Framework for Universal Video Understanding

The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution. Despite sharing a common goal, different tasks often rely on distinct model architectures and annotation formats. In c...

Bibliographic Details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 18209-18220
Main authors: Wang, Junke; Chen, Dongdong; Luo, Chong; He, Bo; Yuan, Lu; Wu, Zuxuan; Jiang, Yu-Gang
Format: Conference paper
Language: English
Published: IEEE, 16 June 2024
ISSN: 1063-6919
Abstract The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution. Despite sharing a common goal, different tasks often rely on distinct model architectures and annotation formats. In contrast, natural language processing benefits from a unified output space, i.e., text sequences, which simplifies the training of powerful foundational language models, such as GPT-3, with extensive training corpora. Inspired by this, we seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens. In this way, a variety of video tasks could be formulated as video-grounded token generation. This enables us to address various types of video tasks, including classification (such as action recognition), captioning (covering clip captioning, video question answering, and dense video captioning), and localization tasks (such as visual object tracking) within a fully shared encoder-decoder architecture, following a generative framework. Through comprehensive experiments, we demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results on seven video benchmarks, providing a novel perspective for more universal video understanding. Code is available at https://github.com/wangjk666/OmniVid.
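Illustrative note: the abstract describes casting classification, captioning, and tracking as generation of one token sequence made of words plus time and box tokens. The sketch below shows one way such targets could be serialized. It is not taken from the OmniViD code base; the token spellings (`<time_k>`, `<box_k>`), the bin count `NUM_BINS`, and every function name are assumptions made purely for illustration.

```python
# Hypothetical sketch of a unified token output space for video tasks.
# Token names, the bin count, and all helpers are assumptions, not the
# authors' implementation.

NUM_BINS = 100  # assumed number of discrete bins for time and coordinates


def quantize(value, max_value, num_bins=NUM_BINS):
    """Map a value in [0, max_value] to a bin index in [0, num_bins - 1]."""
    ratio = min(max(value / max_value, 0.0), 1.0)
    return min(int(ratio * num_bins), num_bins - 1)


def time_token(seconds, duration):
    """Discretize a timestamp relative to the clip duration."""
    return f"<time_{quantize(seconds, duration)}>"


def box_tokens(box, width, height):
    """Discretize an (x1, y1, x2, y2) box relative to the frame size."""
    x1, y1, x2, y2 = box
    return [
        f"<box_{quantize(x1, width)}>",
        f"<box_{quantize(y1, height)}>",
        f"<box_{quantize(x2, width)}>",
        f"<box_{quantize(y2, height)}>",
    ]


def classification_target(label):
    """Action recognition: the class name itself is the target text."""
    return label.split()


def dense_caption_target(events, duration):
    """Dense captioning: each event is start token, end token, caption words."""
    tokens = []
    for start, end, caption in events:
        tokens += [time_token(start, duration), time_token(end, duration)]
        tokens += caption.split()
    return tokens


def tracking_target(boxes, width, height):
    """Tracking: one group of four box tokens per frame."""
    tokens = []
    for box in boxes:
        tokens += box_tokens(box, width, height)
    return tokens
```

Under these assumptions, all three task types produce a flat list of strings, so a single decoder vocabulary (words plus `NUM_BINS` time tokens plus `NUM_BINS` box tokens) could in principle cover them.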
Author Wang, Junke
Wu, Zuxuan
Jiang, Yu-Gang
Chen, Dongdong
Luo, Chong
Yuan, Lu
He, Bo
Author_xml – sequence: 1
  givenname: Junke
  surname: Wang
  fullname: Wang, Junke
  organization: School of CS, Fudan University, Shanghai Key Lab of Intell. Info. Processing
– sequence: 2
  givenname: Dongdong
  surname: Chen
  fullname: Chen, Dongdong
  organization: Microsoft Cloud + AI
– sequence: 3
  givenname: Chong
  surname: Luo
  fullname: Luo, Chong
  organization: Microsoft Research Asia
– sequence: 4
  givenname: Bo
  surname: He
  fullname: He, Bo
  organization: University of Maryland, College Park
– sequence: 5
  givenname: Lu
  surname: Yuan
  fullname: Yuan, Lu
  organization: Microsoft Cloud + AI
– sequence: 6
  givenname: Zuxuan
  surname: Wu
  fullname: Wu, Zuxuan
  organization: School of CS, Fudan University, Shanghai Key Lab of Intell. Info. Processing
– sequence: 7
  givenname: Yu-Gang
  surname: Jiang
  fullname: Jiang, Yu-Gang
  organization: School of CS, Fudan University, Shanghai Key Lab of Intell. Info. Processing
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/CVPR52733.2024.01724
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Applied Sciences
EISBN 9798350353006
EISSN 1063-6919
EndPage 18220
ExternalDocumentID 10654930
Genre orig-research
GroupedDBID 6IE
6IH
6IL
6IN
AAWTH
ABLEC
ADZIZ
ALMA_UNASSIGNED_HOLDINGS
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CBEJK
CHZPO
IEGSK
IJVOP
OCL
RIE
RIL
RIO
ID FETCH-LOGICAL-i176t-9d8901038a48eceffb2a740c61b00d45fb8aed0d936cc491df7162ce8326412e3
IEDL.DBID RIE
ISICitedReferencesCount 8
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001342515501053&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
IngestDate Wed Aug 27 01:55:18 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i176t-9d8901038a48eceffb2a740c61b00d45fb8aed0d936cc491df7162ce8326412e3
PageCount 12
ParticipantIDs ieee_primary_10654930
PublicationCentury 2000
PublicationDate 2024-June-16
PublicationDateYYYYMMDD 2024-06-16
PublicationDate_xml – month: 06
  year: 2024
  text: 2024-June-16
  day: 16
PublicationDecade 2020
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0003211698
Score 2.417665
Snippet The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze...
SourceID ieee
SourceType Publisher
StartPage 18209
SubjectTerms Benchmark testing
Computer architecture
Location awareness
Robustness
Training
Visualization
Vocabulary
Title OmniViD: A Generative Framework for Universal Video Understanding
URI https://ieeexplore.ieee.org/document/10654930
WOSCitedRecordID wos001342515501053&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
linkProvider IEEE
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=Proceedings+%28IEEE+Computer+Society+Conference+on+Computer+Vision+and+Pattern+Recognition.+Online%29&rft.atitle=OmniViD%3A+A+Generative+Framework+for+Universal+Video+Understanding&rft.au=Wang%2C+Junke&rft.au=Chen%2C+Dongdong&rft.au=Luo%2C+Chong&rft.au=He%2C+Bo&rft.date=2024-06-16&rft.pub=IEEE&rft.eissn=1063-6919&rft.spage=18209&rft.epage=18220&rft_id=info:doi/10.1109%2FCVPR52733.2024.01724&rft.externalDocID=10654930