AASD: Accelerate Inference by Aligning Speculative Decoding in Multimodal Large Language Models


Saved in:
Detailed bibliography
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main authors: Yang, Chaoqun; Chen, Ran; Zhang, Muyang; Pang, Weiguang; Chen, Yuzhi; Xu, Rongtao; Fu, Kexue; Wang, Changwei; Gao, Longxiang
Format: Conference paper
Language: English
Published: IEEE, 22 June 2025
Abstract Multimodal Large Language Models (MLLMs) have achieved notable success in visual instruction tuning, yet their inference is time-consuming due to the auto-regressive decoding of the Large Language Model (LLM) backbone. Traditional methods for accelerating inference, including model compression and migration from language-model acceleration, often compromise output quality or face challenges in effectively integrating multimodal features. To address these issues, we propose AASD, a novel framework for Accelerating inference with a refined KV Cache and Aligning Speculative Decoding in MLLMs. Our approach leverages the target model's cached Key-Value (KV) pairs to extract vital information for generating draft tokens, enabling efficient speculative decoding. To reduce the computational burden associated with long multimodal token sequences, we introduce a KV Projector that compresses the KV Cache while maintaining representational fidelity. Additionally, we design a Target-Draft Attention mechanism that optimizes the alignment between the draft model and the target model, approximating real inference scenarios with minimal computational overhead. Extensive experiments on mainstream MLLMs demonstrate that our method achieves up to a 2× inference speedup without sacrificing accuracy. This study not only provides an effective and lightweight solution for accelerating MLLM inference but also introduces a novel alignment strategy for speculative decoding in multimodal contexts, laying a strong foundation for future research in efficient MLLMs. Code is available at https://github.com/transcend-0/ASD
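For context, the abstract builds on the standard draft-then-verify loop of speculative decoding. The following is a minimal greedy-decoding sketch of that generic loop only; all names are illustrative, and this is not the AASD implementation (which additionally reuses and compresses the target model's KV Cache and aligns the draft model via Target-Draft Attention):

```python
# Generic speculative decoding, draft-then-verify, greedy variant.
# Illustrative sketch only; NOT the AASD method from the paper.

def speculative_decode_step(draft_model, target_model, prefix, k=4):
    """Propose k draft tokens with the cheap model, then verify them
    with the target model.

    draft_model / target_model are assumed to be callables mapping a
    token sequence to the (greedy) next token. Real systems sample,
    batch the verification into one forward pass, and reuse KV caches.
    """
    # 1) Draft phase: the small model proposes k tokens autoregressively.
    drafts = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        drafts.append(t)
        ctx.append(t)

    # 2) Verify phase: the target model checks each draft token in turn.
    #    The first mismatch is replaced by the target's own prediction
    #    and the remaining draft tokens are discarded.
    accepted = list(prefix)
    for t in drafts:
        t_target = target_model(accepted)
        accepted.append(t_target)
        if t_target != t:
            break  # reject the rest of the draft
    return accepted
```

When the draft model agrees with the target, several tokens are committed per target-model pass instead of one, which is where the speedup comes from; the cost of a misaligned draft model is wasted draft tokens, motivating the alignment strategy the abstract describes.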
Authors and affiliations:
1. Chaoqun Yang, Qilu University of Technology (Shandong Academy of Sciences), Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center
2. Ran Chen, Peking University, School of Mathematical Sciences, Department of Information and Computational Sciences
3. Muyang Zhang, University of Chinese Academy of Sciences, School of Artificial Intelligence
4. Weiguang Pang, Qilu University of Technology (Shandong Academy of Sciences), Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center
5. Yuzhi Chen, University of Chinese Academy of Sciences, School of Artificial Intelligence
6. Rongtao Xu, University of Chinese Academy of Sciences, School of Artificial Intelligence
7. Kexue Fu, Qilu University of Technology (Shandong Academy of Sciences), Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center
8. Changwei Wang (chanweiwang@sdas.org), Qilu University of Technology (Shandong Academy of Sciences), Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center
9. Longxiang Gao (gaolx@sdas.org), Qilu University of Technology (Shandong Academy of Sciences), Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center
ContentType Conference Proceeding
DOI 10.1109/DAC63849.2025.11132960
EISBN 9798331503048
EndPage 7
ExternalDocumentID 11132960
Genre orig-research
GrantInformation:
- Ministry of Education (funder ID 10.13039/501100002701)
- National Natural Science Foundation of China (funder ID 10.13039/501100001809)
IsPeerReviewed false
IsScholarly true
Language English
PageCount 7
PublicationDate 2025-06-22
PublicationTitle 2025 62nd ACM/IEEE Design Automation Conference (DAC)
PublicationTitleAbbrev DAC
PublicationYear 2025
Publisher IEEE
StartPage 1
SubjectTerms Computational modeling
Decoding
Faces
Feature extraction
Focusing
Inference Acceleration
Large language models
Model compression
Multimodal Large Language Model
Question answering (information retrieval)
Speculative Decoding
Tuning
Visualization
Title AASD: Accelerate Inference by Aligning Speculative Decoding in Multimodal Large Language Models
URI https://ieeexplore.ieee.org/document/11132960