
Exploring Class Mapping as Data Fusion Technique in Machine Learning for Research Classification.


Detailed bibliography
Title:	Exploring Class Mapping as Data Fusion Technique in Machine Learning for Research Classification.
Authors:	Huang, Chien-Chih; Chen, Kuang-Hua
Source:	Journal of Library & Information Studies; Dec 2025, Vol. 23, Issue 2, pp. 119-143 (25 pp.)
Abstract:	Access to sufficient, high-quality data is essential for effectively training and validating machine learning classifiers. This study investigates class mapping as a data fusion strategy to enhance training data for research classification. Two versions of the Australian and New Zealand Standard Research Classification, ANZSRC 2008 FoR and ANZSRC 2020 FoR, are used to organize 179,431 documents from eight institutional repositories into plain and mapped datasets. Each dataset is divided into subsets corresponding to the division, group, and field levels of the classification schemes. Results show that 49% to 63% of documents are successfully mapped between schemes. Classifiers by Support Vector Machines (SVM), SciBERT, ModernBERT-base, and ModernBERT-large are trained to assess the effectiveness of this data fusion approach on classification performance. All models show improved performance at the three levels. ModernBERT-large achieved the greatest performance gains, with the improvements in validation F1 scores of 1.0% and 2.5% at the division level, 4.4% and 2.2% at the group level, and 9.9% and 11.5% at the field level. An emergent ability was observed, as performance in non-augmented classes improved with ModernBERT-large but not with ModernBERT-base. Overall, this study demonstrates that class mapping effectively enriches training datasets, enhances classification performance, and underscores the importance of model size and architecture. These findings offer a practical and scalable strategy for improving machine learning performance in research classification tasks. [ABSTRACT FROM AUTHOR]
	Copyright of Journal of Library & Information Studies is the property of Department of Library & Information Science, National Taiwan University and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database:	Complementary Index


ISSN:	16067509
DOI:	10.6182/jlis.202512_23(2).119