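Configuration Type Identification Integrating Multi-Granularity Code Features and Isolation Forest Algorithm (融合多粒度代码特征和孤立森林算法的配置类型识别)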

Bibliographic Details
Title:	融合多粒度代码特征和孤立森林算法的配置类型识别. (Chinese)
Alternate Title:	Configuration Type Identification Integrating Multi-Granularity Code Features and Isolation Forest Algorithm. (English)
Authors:	刘源, 刘大伟, 张玉秀, 吴明磊
Source:	Journal of Computer Engineering & Applications; Jul 2025, Vol. 61, Issue 13, pp. 185-199 (15 pages)
Subject Terms:	SOURCE code, RESEARCH personnel, EMPIRICAL research, PROJECT evaluation, ALGORITHMS
Abstract (English):	The widespread adoption of the "high cohesion and low coupling" design principle means that source code often contains types dedicated to managing configuration options or configuration methods, called configuration types. Configuration types help researchers understand configuration mechanisms from both attribute and behavioral perspectives, and they supply the option sets and option data-flow information that configuration error-handling techniques require. However, configuration types remain under-researched, and identifying them still relies on manual retrieval. To address this, a configuration type identification method that integrates multi-granularity code features with the isolation forest algorithm is proposed. First, a configuration type dataset is manually constructed from ten representative open-source software projects. An empirical study of the distribution and classification of configuration types, and of the factors that influence their identification, yields nine findings that guide identification. Then, based on these findings, four type-level coarse-grained features and three method-level fine-grained features are selected, covering lexical, structural, semantic, and syntactic information in the code, and a quantization algorithm is designed for each feature. Finally, because configuration types suffer from an imbalanced class distribution, identification is recast as an anomaly detection problem: the isolation forest algorithm recommends candidate configuration types, and heuristic rules reduce the number of false positives. Experimental results on five evaluation software projects show that the proposed method identifies the configuration types in every project, with a mean average precision of 0.86 and an average time overhead of 21 minutes, indicating that it can, to a preliminary degree, replace manual identification. [ABSTRACT FROM AUTHOR]
Abstract (Chinese):	"高内聚、低耦合"设计原则的普及应用, 使得代码中通常存在着专门管理配置选项或配置方法的特殊类型, 称为配置类型。配置类型有助于研究人员从属性角度和行为角度增进对配置机制的理解, 并为配置错误处理技术提供必要的选项集合以及选项数据流信息。然而, 配置类型研究尚不充分, 其识别仍依赖于人工检索。提出一种融合多粒度代码特征和孤立森林算法的配置类型识别方法。基于10 个具有代表性的开源软件, 手动构建配置类型数据集, 通过实证调研配置类型的分布、分类和识别影响因素, 总结得到9 个调研结果, 用于指导配置类型识别。基于调研结果, 选取覆盖代码词汇、结构、语义和语法信息的4 个类型级粗粒度特征和3 个方法级细粒度特征, 并为每个特征设计量化算法。考虑到配置类型存在样本类别分布不平衡问题, 将识别问题转化为异常检测问题, 利用孤立森林算法推荐配置类型, 同时设计启发规则减少误报数量。在5 个评估软件上的实验结果表明, 该方法能识别出每个软件的配置类型, 平均精度均值为0.86, 平均时间开销为21 min, 已初步具备代替人工识别的能力. [ABSTRACT FROM AUTHOR]
	Copyright of Journal of Computer Engineering & Applications is the property of Beijing Journal of Computer Engineering & Applications Journal Co Ltd. and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database:	Complementary Index

ISSN:	10028331
DOI:	10.3778/j.issn.1002-8331.2408-0377