Bibliographic Details
| Title: |
基于小规模异构语言模型一致性委员会的数据剪枝方法. (Chinese) |
| Alternate Title: |
Data pruning method based on consistency committee of small-scale heterogeneous language models. (English) |
| Authors: |
王凯文, 王蕴哲, 谈威, 傅启明, 陆悠, 陈建平 |
| Source: |
Application Research of Computers / Jisuanji Yingyong Yanjiu; Jan2026, Vol. 43 Issue 1, p110-119, 10p |
| Subject Terms: |
LANGUAGE models, EVALUATION methodology, DATA curation, MATHEMATICAL optimization, ALGORITHMS, COMPUTATIONAL complexity, METADATA |
| Abstract (English): |
The fine-tuning performance of large language models (LLMs) depends strongly on the quality of the training data. Existing single-model perplexity-based data evaluation methods have limitations, including perplexity bias, where low-perplexity samples may still be mispredicted, and cross-model divergence, where different models produce inconsistent perplexity scores for the same samples. To address these issues, this study proposed a method based on the consistency of a committee of heterogeneous small language models, evaluating data value from two perspectives: it calculated the coefficient of variation of perplexity across multiple models to measure inter-model divergence, and it computed prediction difficulty from the similarity between predicted outputs and reference answers. Based on these evaluations, the method proposed the MMCS (multi-model consistency score) metric to select high-quality training data. Experimental results show that data filtered by MMCS achieves better fine-tuning performance than traditional methods on two mainstream LLMs and three public datasets, obtaining the best result in 27 of 36 comparative experiments. This provides a new approach to efficient data pruning and shows that multi-model-divergence-based evaluation is effective at improving the marginal utility of training data. [ABSTRACT FROM AUTHOR] |
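The pipeline described in the abstract (per-model perplexity, cross-model coefficient of variation, and a combined consistency score used to rank and prune samples) can be sketched as follows. The paper's exact MMCS formula is not given in this record, so the combination step (`mmcs_score`, the weighting `alpha`, and the score direction) is an illustrative assumption, not the authors' definition:

```python
import math

def coefficient_of_variation(values):
    """CV = population standard deviation / mean; quantifies how much
    the committee models disagree on a sample's perplexity."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(var) / mean

def mmcs_score(perplexities, answer_similarity, alpha=0.5):
    """Hypothetical combination of divergence and difficulty.

    Assumption for illustration only: a lower CV (models agree) and a
    higher similarity between the predicted output and the reference
    answer (lower prediction difficulty) yield a higher score.
    """
    divergence = coefficient_of_variation(perplexities)
    difficulty = 1.0 - answer_similarity  # similarity assumed in [0, 1]
    return alpha * (1.0 - divergence) + (1.0 - alpha) * (1.0 - difficulty)

def prune(samples, keep_ratio=0.5):
    """Rank samples by the (hypothetical) score and keep the top fraction."""
    ranked = sorted(samples,
                    key=lambda s: mmcs_score(s["ppl"], s["sim"]),
                    reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]
```

For example, a sample whose perplexities `[10, 10.5, 9.8]` agree closely and whose prediction matches the reference (`sim=0.9`) would outrank one with divergent perplexities `[5, 40, 80]` and a poor match (`sim=0.2`), so only the former survives pruning at `keep_ratio=0.5`.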
| Abstract (Chinese, translated to English): |
The fine-tuning performance of large language models (LLMs) depends heavily on the quality of the training data, but existing single-model perplexity-based data evaluation methods suffer from perplexity bias (low-perplexity samples may still be mispredicted) and cross-model divergence (different models assign inconsistent perplexity to the same sample). To address this, the study proposed a method based on the consistency of a committee of heterogeneous small language models, evaluating data value from two aspects: computing the coefficient of variation of multiple models' perplexity on the same sample to quantify inter-model divergence, and combining the similarity between predicted outputs and reference answers to compute prediction difficulty. Integrating these two evaluations, the MMCS (multi-model consistency) metric was proposed for selecting high-quality training data. Experimental results show that data filtered by MMCS yields better fine-tuning performance than traditional methods on two mainstream LLMs and three public datasets, achieving the best result in 27 of 36 comparative experiments, providing a new approach to efficient data pruning and confirming the effectiveness of multi-model-divergence-based evaluation in improving the marginal utility of training data. [ABSTRACT FROM AUTHOR] |
|
Copyright of Application Research of Computers / Jisuanji Yingyong Yanjiu is the property of Application Research of Computers Edition and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) |
| Database: |
Complementary Index |