OOPS: Outlier-Aware and Quadratic Programming Based Structured Pruning for Large Language Models

Bibliographic Details
Published in: Neural Networks, Vol. 196, p. 108332
Main Authors: Wei, Jiateng; Li, Siqi; Xiang, Jingyang; Yang, Jiandang; Chen, Jun; Wei, Xiaobin; Jiang, Yunliang; Liu, Yong
Format: Journal Article
Language: English
Published: United States: Elsevier Ltd, 25 November 2025
ISSN: 0893-6080, 1879-2782

Description
Summary: The large model size and resource consumption of Large Language Models (LLMs) limit their deployment and application in many scenarios. Structured pruning offers a solution to this challenge. Depending on whether retraining is required after pruning, structured pruning methods for LLMs fall into two categories: retraining-free and retraining-based. Retraining-free methods often result in significant performance degradation, while retraining-based methods may require substantial computational resources. To address these limitations, we propose a structured pruning framework named OOPS (Outlier-Aware and Quadratic PrOgramming-Based Structured Pruning). It comprises three key components: outlier-aware pruning unit selection, quadratic programming-based reconstruction, and layer-wise distillation. Using the first two components, OOPS prunes models without any retraining and outperforms existing retraining-free methods. When layer-wise distillation is further incorporated to train the pruned layers individually, OOPS surpasses other retraining-based methods at lower computational cost. We evaluate the effectiveness of OOPS on 11 models from 4 LLM families across multiple tasks, demonstrating its superior performance compared to state-of-the-art methods in both retraining-free and retraining-based settings.
DOI: 10.1016/j.neunet.2025.108332
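
The summary names outlier-aware pruning unit selection and quadratic programming-based reconstruction but, being an abstract, gives no formulas. As a rough illustration only, the NumPy sketch below scores the input channels of a linear layer with a weight-norm-times-activation-norm proxy while always keeping channels whose activations are outliers, then refits the surviving weights by solving the least-squares quadratic program min_{W_r} ||W X - W_r X_r||_F^2 over calibration activations. The function names, the scoring rule, the outlier_tau threshold, and the ridge term are all illustrative assumptions, not the method published in the paper.

import numpy as np

def outlier_aware_channel_selection(W, X, keep_ratio=0.5, outlier_tau=5.0):
    # Hypothetical selection rule (an assumption, not the paper's criterion):
    # score channel j by ||W[:, j]|| * ||X[j, :]||, and always keep channels
    # whose activation norm is far above the median (activation outliers).
    act_norm = np.linalg.norm(X, axis=1)          # per-channel activation norm, shape (in,)
    w_norm = np.linalg.norm(W, axis=0)            # per-channel weight norm, shape (in,)
    score = w_norm * act_norm
    keep = np.zeros(X.shape[0], dtype=bool)
    keep[act_norm > outlier_tau * np.median(act_norm)] = True   # protect outlier channels
    k = max(int(keep_ratio * X.shape[0]), int(keep.sum()))      # channel budget
    for j in np.argsort(-score):                  # fill the budget by descending score
        if keep.sum() >= k:
            break
        keep[j] = True
    return keep

def qp_reconstruct(W, X, keep):
    # Refit the kept weights via the convex QP  min_{W_r} ||W X - W_r X_r||_F^2,
    # whose normal equations give  W_r = (W X) X_r^T (X_r X_r^T)^{-1}.
    Y = W @ X                                     # original layer outputs, shape (out, n)
    Xr = X[keep]                                  # activations of kept channels, (k, n)
    G = Xr @ Xr.T + 1e-6 * np.eye(Xr.shape[0])    # Gram matrix with a small ridge for stability
    return np.linalg.solve(G, Xr @ Y.T).T         # reconstructed weights, shape (out, k)

# Toy usage: prune half the input channels of a random 64 x 128 linear layer.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
X = rng.standard_normal((128, 256))               # 256 calibration samples
X[7] *= 20.0                                      # inject one activation outlier
keep = outlier_aware_channel_selection(W, X, keep_ratio=0.5)
Wr = qp_reconstruct(W, X, keep)
err = np.linalg.norm(W @ X - Wr @ X[keep]) / np.linalg.norm(W @ X)
print(f"kept {keep.sum()}/{len(keep)} channels, relative output error {err:.3f}")

The unconstrained problem above has the closed-form normal-equations solution used in qp_reconstruct; constrained variants (for example, bounding how far each weight may move) would turn it into a general quadratic program requiring an iterative solver.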