Distributed Audio-Visual Parsing Based On Multimodal Transformer and Deep Joint Source Channel Coding

Audio-visual parsing (AVP) is a newly emerged multimodal perception task, which detects and classifies audio-visual events in video. However, most existing AVP networks only use a simple attention mechanism to guide audio-visual multimodal events, and are implemented in a single end. This makes it u...

Bibliographic details
Published in: Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) pp. 4623-4627
Main authors: Wang, Penghong, Li, Jiahui, Ma, Mengyao, Fan, Xiaopeng
Format: Conference paper
Language: English
Published: IEEE 23 May 2022
Subjects:
ISSN: 2379-190X
Online access: Get full text
Abstract Audio-visual parsing (AVP) is a recently emerged multimodal perception task that detects and classifies audio-visual events in video. However, most existing AVP networks use only a simple attention mechanism to guide audio-visual multimodal events and are implemented at a single end. As a result, they cannot effectively capture the relationship between audio-visual events and are not suitable for implementation in network transmission scenarios. In this paper, we focus on these problems and propose a distributed audio-visual parsing network (DAVPNet) based on a multimodal transformer and deep joint source channel coding (DJSCC). Multimodal transformers are used to enhance the attention calculation between audio-visual events, and DJSCC is used to apply DAVP tasks to network transmission scenarios. Finally, the Look, Listen, and Parse (LLP) dataset is used to test the algorithm's performance, and the experimental results show that DAVPNet has superior parsing performance.
Author Ma, Mengyao
Wang, Penghong
Li, Jiahui
Fan, Xiaopeng
Author_xml – sequence: 1
  givenname: Penghong
  surname: Wang
  fullname: Wang, Penghong
  organization: Harbin Institute of Technology,School of Computer Science and Technology,Harbin,China,150001
– sequence: 2
  givenname: Jiahui
  surname: Li
  fullname: Li, Jiahui
  organization: Huawei,Wireless Technology Lab,Shenzhen,China,518129
– sequence: 3
  givenname: Mengyao
  surname: Ma
  fullname: Ma, Mengyao
  organization: Huawei,Wireless Technology Lab,Shenzhen,China,518129
– sequence: 4
  givenname: Xiaopeng
  surname: Fan
  fullname: Fan, Xiaopeng
  organization: Harbin Institute of Technology,School of Computer Science and Technology,Harbin,China,150001
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/ICASSP43922.2022.9746660
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
EISBN 1665405406
9781665405409
EISSN 2379-190X
EndPage 4627
ExternalDocumentID 9746660
Genre orig-research
GrantInformation_xml – fundername: Research and Development
  funderid: 10.13039/100006190
– fundername: National Science Foundation
  funderid: 10.13039/100000001
GroupedDBID 23M
6IE
6IF
6IH
6IK
6IL
6IM
6IN
AAJGR
AAWTH
ABLEC
ACGFS
ADZIZ
ALMA_UNASSIGNED_HOLDINGS
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CBEJK
CHZPO
IEGSK
IJVOP
IPLJI
M43
OCL
RIE
RIL
RIO
RNS
ID FETCH-LOGICAL-i203t-e02e2d69d744c1432dbd7daa9fd4f47a4de9a5df2ef5fefee57dba553a95c25a3
IEDL.DBID RIE
ISICitedReferencesCount 9
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000864187904182&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
IngestDate Wed Aug 27 02:25:03 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i203t-e02e2d69d744c1432dbd7daa9fd4f47a4de9a5df2ef5fefee57dba553a95c25a3
PageCount 5
ParticipantIDs ieee_primary_9746660
PublicationCentury 2000
PublicationDate 2022-May-23
PublicationDateYYYYMMDD 2022-05-23
PublicationDate_xml – month: 05
  year: 2022
  text: 2022-May-23
  day: 23
PublicationDecade 2020
PublicationTitle Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998)
PublicationTitleAbbrev ICASSP
PublicationYear 2022
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0008748
Score 2.2839684
Snippet Audio-visual parsing (AVP) is a newly emerged multimodal perception task, which detects and classifies audio-visual events in video. However, most existing AVP...
SourceID ieee
SourceType Publisher
StartPage 4623
SubjectTerms Channel coding
Decoding
deep joint source channel coding
distributed audio-visual parsing network
multimodal transformer
Signal processing algorithms
Speech coding
Training
Transformers
Visualization
Title Distributed Audio-Visual Parsing Based On Multimodal Transformer and Deep Joint Source Channel Coding
URI https://ieeexplore.ieee.org/document/9746660
WOSCitedRecordID wos000864187904182
hasFullText 1
inHoldings 1
isFullTextHit
isPrint