OOPS: Outlier-Aware and Quadratic Programming Based Structured Pruning for Large Language Models
| Published in: | Neural Networks, Volume 196, Article 108332 |
|---|---|
| Main authors: | Wei, Jiateng; Li, Siqi; Xiang, Jingyang; Yang, Jiandang; Chen, Jun; Wei, Xiaobin; Jiang, Yunliang; Liu, Yong |
| Format: | Journal Article |
| Language: | English |
| Publication details: | United States: Elsevier Ltd, 25 November 2025 |
| Subjects: | Large Language Model; Model Compression; Network Pruning |
| ISSN: | 0893-6080, 1879-2782 |
| DOI: | 10.1016/j.neunet.2025.108332 |
| Abstract | The large model size and resource consumption of Large Language Models (LLMs) limit their deployment and application in many scenarios. Structured pruning offers a solution to this challenge. Based on the need for retraining after pruning, structured pruning methods for LLMs fall into two categories: retraining-free and retraining-based. Retraining-free methods often result in significant performance degradation, while retraining-based methods may require substantial computational resources. To address these limitations, we propose a structured pruning framework named OOPS (Outlier-Aware and Quadratic PrOgramming-Based Structured Pruning). It comprises three key components: outlier-aware pruning unit selection, quadratic programming-based reconstruction, and layer-wise distillation. By employing the first two components, OOPS prunes models without the requirement of retraining, outperforming existing retraining-free methods. When further incorporating layer-wise distillation to train the pruned layers individually, OOPS surpasses other retraining-based methods with lower computational costs. We evaluate the effectiveness of OOPS on 11 models from 4 LLM families across multiple tasks, demonstrating its superior performance compared to state-of-the-art methods in both retraining-free and retraining-based settings. |
|---|---|
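To make the abstract's pipeline more concrete, the sketch below illustrates, on a single linear layer, the general idea behind two of the named components: choosing which input channels to keep with an activation-aware saliency proxy, and then reconstructing the remaining weights by solving a small ridge-regularized least-squares problem (an unconstrained quadratic program) so the pruned layer matches the dense layer's outputs on calibration data. This is a minimal, hypothetical illustration assuming NumPy; the function name `prune_and_reconstruct`, the saliency proxy, and the ridge term are assumptions for illustration, not the paper's actual outlier criterion or QP formulation.

```python
import numpy as np

def prune_and_reconstruct(W, X, keep_ratio=0.5, ridge=1e-4):
    """Illustrative sketch (not the paper's exact algorithm).

    W: (d_in, d_out) weight matrix of a linear layer, y = x @ W
    X: (n, d_in) calibration activations
    Returns the kept input-channel indices and reconstructed weights W_hat
    of shape (k, d_out) such that X[:, idx] @ W_hat approximates X @ W.
    """
    d_in = W.shape[0]
    k = max(1, int(round(keep_ratio * d_in)))

    # 1) Saliency proxy: weight magnitude scaled by each input channel's
    #    activation norm (similar in spirit to activation-aware criteria
    #    from the pruning literature).
    act_norm = np.linalg.norm(X, axis=0)             # (d_in,)
    saliency = np.abs(W).sum(axis=1) * act_norm      # (d_in,)
    idx = np.sort(np.argsort(-saliency)[:k])         # channels to keep

    # 2) Reconstruction as an unconstrained quadratic program:
    #    minimize ||X_S @ W_hat - X @ W||_F^2 + ridge * ||W_hat||_F^2,
    #    solved in closed form via regularized least squares.
    X_s = X[:, idx]                                  # (n, k)
    target = X @ W                                   # (n, d_out) dense output
    gram = X_s.T @ X_s + ridge * np.eye(k)
    W_hat = np.linalg.solve(gram, X_s.T @ target)    # (k, d_out)
    return idx, W_hat

# Tiny usage example on random data.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))
W = rng.standard_normal((64, 16))
idx, W_hat = prune_and_reconstruct(W, X, keep_ratio=0.5)
err = np.linalg.norm(X[:, idx] @ W_hat - X @ W) / np.linalg.norm(X @ W)
print(f"kept {len(idx)}/{W.shape[0]} input channels, relative error {err:.3f}")
```

On the i.i.d. random data above much of the dropped signal is unrecoverable; the intuition behind reconstruction-based, retraining-free pruning is that real LLM activations are highly correlated and concentrated in a few outlier channels, so a well-chosen subset can reproduce most of a layer's output.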