AASD: Accelerate Inference by Aligning Speculative Decoding in Multimodal Large Language Models
| Published in: | 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7 |
|---|---|
| Main Authors: | Chaoqun Yang, Ran Chen, Muyang Zhang, Weiguang Pang, Yuzhi Chen, Rongtao Xu, Kexue Fu, Changwei Wang, Longxiang Gao |
| Format: | Conference Proceeding |
| Language: | English |
| Publisher: | IEEE, 22 June 2025 |
| DOI: | 10.1109/DAC63849.2025.11132960 |
| EISBN: | 9798331503048 |
| Subjects: | Computational modeling; Decoding; Faces; Feature extraction; Focusing; Inference acceleration; Large language models; Model compression; Multimodal Large Language Model; Question answering (information retrieval); Speculative decoding; Tuning; Visualization |
| Funding: | Ministry of Education; National Natural Science Foundation of China |
| Online Access: | Get full text: https://ieeexplore.ieee.org/document/11132960 |
Abstract:
Multimodal Large Language Models (MLLMs) have achieved notable success in visual instruction tuning, yet their inference is time-consuming due to the auto-regressive decoding of the Large Language Model (LLM) backbone. Traditional methods for accelerating inference, including model compression and migration from language-model acceleration, often compromise output quality or struggle to integrate multimodal features effectively. To address these issues, we propose AASD, a novel framework for Accelerating inference with refined KV Cache and Aligning Speculative Decoding in MLLMs. Our approach leverages the target model's cached Key-Value (KV) pairs to extract vital information for generating draft tokens, enabling efficient speculative decoding. To reduce the computational burden associated with long multimodal token sequences, we introduce a KV Projector that compresses the KV Cache while maintaining representational fidelity. Additionally, we design a Target-Draft Attention mechanism that optimizes the alignment between the draft model and the target model, achieving the benefits of real inference scenarios with minimal computational overhead. Extensive experiments on mainstream MLLMs demonstrate that our method achieves up to a 2× inference speedup without sacrificing accuracy. This study not only provides an effective and lightweight solution for accelerating MLLM inference but also introduces a novel alignment strategy for speculative decoding in multimodal contexts, laying a strong foundation for future research in efficient MLLMs. Code is available at https://github.com/transcend-0/ASD
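To make the mechanisms named in the abstract more concrete, below is a minimal sketch of (1) a projector that compresses cached KV pairs, (2) a cross-attention by which a draft model could read that compressed cache, and (3) one draft-then-verify speculative decoding round with greedy acceptance. Everything here — the names `KVProjector`, `TargetDraftAttention`, `speculative_step`, the stand-in `target_logits_fn`/`draft_logits_fn` callables, and all shapes — is a hypothetical simplification for illustration, not the authors' implementation; their released code is at the GitHub link in the abstract.

```python
import torch
import torch.nn as nn


class KVProjector(nn.Module):
    """Hypothetical projector: maps each cached key/value vector from
    width d down to d_c, shrinking the KV cache the draft model reads."""

    def __init__(self, d: int, d_c: int):
        super().__init__()
        self.proj_k = nn.Linear(d, d_c)
        self.proj_v = nn.Linear(d, d_c)

    def forward(self, k: torch.Tensor, v: torch.Tensor):
        # k, v: (batch, seq_len, d) -> (batch, seq_len, d_c)
        return self.proj_k(k), self.proj_v(v)


class TargetDraftAttention(nn.Module):
    """Hypothetical cross-attention: draft hidden states attend to the
    compressed target KV cache, keeping the draft aligned with the target."""

    def __init__(self, d_c: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_c, n_heads, batch_first=True)

    def forward(self, draft_h, k_c, v_c):
        # draft_h: (batch, q_len, d_c); k_c, v_c: (batch, kv_len, d_c)
        out, _ = self.attn(draft_h, k_c, v_c)
        return out


@torch.no_grad()
def speculative_step(target_logits_fn, draft_logits_fn,
                     prefix: torch.Tensor, k_draft: int = 4) -> torch.Tensor:
    """One draft-then-verify round with greedy acceptance.

    target_logits_fn / draft_logits_fn map a (1, seq) token tensor to
    (1, seq, vocab) logits; they stand in for the real models.
    """
    # 1) The draft model proposes k_draft tokens autoregressively (greedy).
    drafted = prefix.clone()
    for _ in range(k_draft):
        nxt = draft_logits_fn(drafted)[:, -1].argmax(-1, keepdim=True)
        drafted = torch.cat([drafted, nxt], dim=-1)

    # 2) The target model scores every drafted position in ONE forward
    #    pass; tgt[:, i] is the target's greedy token for position
    #    prefix_len + i, plus one bonus position at the end.
    prefix_len = prefix.size(1)
    tgt = target_logits_fn(drafted)[:, prefix_len - 1:].argmax(-1)

    # 3) Accept the longest prefix on which draft and target agree, then
    #    append the target's own token at the first disagreement (or the
    #    bonus token if every proposal was accepted).
    proposed = drafted[:, prefix_len:]
    agree = int((proposed == tgt[:, :k_draft]).long().cumprod(-1).sum())
    return torch.cat([prefix, proposed[:, :agree],
                      tgt[:, agree:agree + 1]], dim=-1)
```

With greedy acceptance as above, the accepted-plus-corrected output matches what greedy decoding of the target alone would produce, so the speedup comes only from verifying several tokens per target forward pass; the projector and cross-attention modules sketch how a draft model could cheaply condition on the target's compressed context.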
Author Affiliations:

| # | Author | Affiliation |
|---|---|---|
| 1 | Chaoqun Yang | Qilu University of Technology (Shandong Academy of Sciences), Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center |
| 2 | Ran Chen | Peking University, School of Mathematical Sciences, Department of Information and Computational Sciences |
| 3 | Muyang Zhang | University of Chinese Academy of Sciences, School of Artificial Intelligence |
| 4 | Weiguang Pang | Qilu University of Technology (Shandong Academy of Sciences), Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center |
| 5 | Yuzhi Chen | University of Chinese Academy of Sciences, School of Artificial Intelligence |
| 6 | Rongtao Xu | University of Chinese Academy of Sciences, School of Artificial Intelligence |
| 7 | Kexue Fu | Qilu University of Technology (Shandong Academy of Sciences), Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center |
| 8 | Changwei Wang (chanweiwang@sdas.org) | Qilu University of Technology (Shandong Academy of Sciences), Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center |
| 9 | Longxiang Gao (gaolx@sdas.org) | Qilu University of Technology (Shandong Academy of Sciences), Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center |