OOPS: Outlier-Aware and Quadratic Programming Based Structured Pruning for Large Language Models

Bibliographic Details
Published in: Neural Networks, Vol. 196, Article 108332
Authors: Wei, Jiateng; Li, Siqi; Xiang, Jingyang; Yang, Jiandang; Chen, Jun; Wei, Xiaobin; Jiang, Yunliang; Liu, Yong
Format: Journal Article
Language: English
Published: United States: Elsevier Ltd, 25.11.2025
Subjects:
ISSN: 0893-6080, 1879-2782
Online access: Full text
Description
Abstract: The large model size and resource consumption of Large Language Models (LLMs) limit their deployment and application in many scenarios. Structured pruning offers a solution to this challenge. Depending on whether retraining is required after pruning, structured pruning methods for LLMs fall into two categories: retraining-free and retraining-based. Retraining-free methods often result in significant performance degradation, while retraining-based methods may require substantial computational resources. To address these limitations, we propose a structured pruning framework named OOPS (Outlier-Aware and Quadratic PrOgramming-Based Structured Pruning). It comprises three key components: outlier-aware pruning unit selection, quadratic programming-based reconstruction, and layer-wise distillation. Using only the first two components, OOPS prunes models without retraining and outperforms existing retraining-free methods. When layer-wise distillation is further incorporated to train the pruned layers individually, OOPS surpasses other retraining-based methods at lower computational cost. We evaluate OOPS on 11 models from 4 LLM families across multiple tasks, demonstrating superior performance compared to state-of-the-art methods in both retraining-free and retraining-based settings.
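
The abstract does not spell out how the quadratic programming-based reconstruction works. As a rough illustration only, the sketch below assumes the step amounts to a ridge-regularized least-squares (quadratic) refit of the weights that survive structured pruning, so that the pruned layer reproduces the dense layer's outputs on calibration activations; the function name reconstruct_pruned_weights, its arguments, and the toy dimensions are hypothetical and not the authors' implementation.

import numpy as np

def reconstruct_pruned_weights(W, X, keep_idx, ridge=1e-4):
    """Hypothetical reconstruction: refit the kept input channels of a linear layer
    so that X[:, keep_idx] @ Wk.T approximates the dense outputs X @ W.T.

    W        -- dense weight matrix, shape (out_dim, in_dim)
    X        -- calibration activations, shape (n_samples, in_dim)
    keep_idx -- indices of input channels kept after structured pruning
    ridge    -- small Tikhonov term for numerical stability
    """
    Y = X @ W.T                                       # dense-layer outputs to match
    Xk = X[:, keep_idx]                               # activations of surviving channels
    # Minimizer of the quadratic objective ||Xk @ Wk.T - Y||_F^2 + ridge * ||Wk||_F^2
    G = Xk.T @ Xk + ridge * np.eye(len(keep_idx))
    Wk = np.linalg.solve(G, Xk.T @ Y).T
    return Wk                                         # shape (out_dim, len(keep_idx))

# Toy usage: drop every other input channel of a random layer and refit the rest.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
X = rng.standard_normal((256, 16))
keep = np.arange(0, 16, 2)
Wk = reconstruct_pruned_weights(W, X, keep)
rel_err = np.linalg.norm(X[:, keep] @ Wk.T - X @ W.T) / np.linalg.norm(X @ W.T)
print(f"relative reconstruction error: {rel_err:.3f}")

Closed-form refits of this kind are attractive in retraining-free settings because they require only a small calibration set and no gradient updates, which matches the abstract's claim that the first two components avoid retraining.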
DOI: 10.1016/j.neunet.2025.108332