Distributed Audio-Visual Parsing Based On Multimodal Transformer and Deep Joint Source Channel Coding
Audio-visual parsing (AVP) is a newly emerged multimodal perception task, which detects and classifies audio-visual events in video. However, most existing AVP networks only use a simple attention mechanism to guide audio-visual multimodal events, and are implemented in a single end. This makes it u...
| Published in: | Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) pp. 4623-4627 |
|---|---|
| Main authors: | Wang, Penghong; Li, Jiahui; Ma, Mengyao; Fan, Xiaopeng |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 23 May 2022 |
| Subjects: | |
| ISSN: | 2379-190X |
| Online access: | Get full text |
| Abstract | Audio-visual parsing (AVP) is a newly emerged multimodal perception task, which detects and classifies audio-visual events in video. However, most existing AVP networks only use a simple attention mechanism to guide audio-visual multimodal events, and are implemented in a single end. This makes it unable to effectively capture the relationship between audio-visual events, and is not suitable for implementation in the network transmission scenario. In this paper, we focus on these problems and propose a distributed audio-visual parsing network (DAVPNet) based on multimodal transformer and deep joint source channel coding (DJSCC). Multimodal transformers are used to enhance the attention calculation between audio-visual events, and DJSCC is used to apply DAVP tasks to network transmission scenarios. Finally, the Look, Listen, and Parse (LLP) dataset is used to test the algorithm performance, and the experimental results show that the DAVPNet has superior parsing performance. |
|---|---|
| Author | Ma, Mengyao; Wang, Penghong; Li, Jiahui; Fan, Xiaopeng |
| Author_xml | 1. Wang, Penghong (Harbin Institute of Technology, School of Computer Science and Technology, Harbin, China, 150001); 2. Li, Jiahui (Huawei, Wireless Technology Lab, Shenzhen, China, 518129); 3. Ma, Mengyao (Huawei, Wireless Technology Lab, Shenzhen, China, 518129); 4. Fan, Xiaopeng (Harbin Institute of Technology, School of Computer Science and Technology, Harbin, China, 150001) |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IH CBEJK RIE RIO |
| DOI | 10.1109/ICASSP43922.2022.9746660 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan (POP) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP) 1998-present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Engineering |
| EISBN | 1665405406; 9781665405409 |
| EISSN | 2379-190X |
| EndPage | 4627 |
| ExternalDocumentID | 9746660 |
| Genre | orig-research |
| GrantInformation_xml | – fundername: Research and Development funderid: 10.13039/100006190 – fundername: National Science Foundation funderid: 10.13039/100000001 |
| IEDL.DBID | RIE |
| ISICitedReferencesCount | 9 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000864187904182&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| IngestDate | Wed Aug 27 02:25:03 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| LinkModel | DirectLink |
| PageCount | 5 |
| ParticipantIDs | ieee_primary_9746660 |
| PublicationCentury | 2000 |
| PublicationDate | 2022-May-23 |
| PublicationDateYYYYMMDD | 2022-05-23 |
| PublicationDate_xml | – month: 05 year: 2022 text: 2022-May-23 day: 23 |
| PublicationDecade | 2020 |
| PublicationTitle | Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) |
| PublicationTitleAbbrev | ICASSP |
| PublicationYear | 2022 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 4623 |
| SubjectTerms | Channel coding; Decoding; deep joint source channel coding; distributed audio-visual parsing network; multimodal transformer; Signal processing algorithms; Speech coding; Training; Transformers; Visualization |
| Title | Distributed Audio-Visual Parsing Based On Multimodal Transformer and Deep Joint Source Channel Coding |
| URI | https://ieeexplore.ieee.org/document/9746660 |
| WOSCitedRecordID | wos000864187904182&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
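
The abstract names two building blocks: attention between the audio and visual streams, and deep joint source channel coding (DJSCC) for sending features over a network. The record does not describe the DAVPNet architecture itself, so the following is only a minimal illustrative sketch of those two ideas, not the authors' method. Everything in it is an assumption made for illustration: the function names, the feature sizes (10 segments of 64 dimensions), the 10 dB SNR, the additive white Gaussian noise channel, and the use of plain scaled dot-product attention without the learned projections a real transformer would have.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(query_feats, context_feats):
    """Scaled dot-product attention: each query segment attends over
    all segments of the other modality (no learned projections here)."""
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)  # (Tq, Tc)
    weights = softmax(scores, axis=-1)                   # rows sum to 1
    return weights @ context_feats, weights


def power_normalize(z):
    """Scale features to unit average power before transmission."""
    return z / np.sqrt(np.mean(z ** 2))


def awgn_channel(z, snr_db, rng):
    """Add white Gaussian noise at the given SNR, assuming unit signal power."""
    noise_power = 10.0 ** (-snr_db / 10.0)
    return z + rng.normal(0.0, np.sqrt(noise_power), size=z.shape)


rng = np.random.default_rng(0)

# Hypothetical per-segment features, standing in for encoder outputs.
audio = rng.normal(size=(10, 64))
visual = rng.normal(size=(10, 64))

# Transmitter side: normalize power, then pass through the noisy channel.
tx_audio = power_normalize(audio)
tx_visual = power_normalize(visual)
rx_audio = awgn_channel(tx_audio, snr_db=10.0, rng=rng)
rx_visual = awgn_channel(tx_visual, snr_db=10.0, rng=rng)

# Receiver side: audio segments attend over the received visual segments.
audio_ctx, a2v_weights = cross_modal_attention(rx_audio, rx_visual)
```

In a trained system the encoder, decoder, and attention weights would be learned end to end with the channel in the loop; here random features stand in for them so that the data flow can be followed on its own.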