Automatic classification of research data sets into the Chinese Library Classification with generative large language model.

Bibliographic details
Title: Automatic classification of research data sets into the Chinese Library Classification with generative large language model.
Authors: Luo, Pengcheng; Hong, Lingzi; Nie, Lei
Source: Electronic Library; 2025, Vol. 43 Issue 4, p600-618, 19p
Abstract: Purpose: Research data sets are typically distributed across different data repositories and lack standardized classification information, which hinders effective discovery and access. This study aims to develop an automated method that assigns Chinese Library Classification (CLC) codes to data sets to facilitate user searching and browsing data sets. Design/methodology/approach: This study experiments with a three-step method for the automatic classification of research data sets: firstly, a multilingual classification model is trained to identify data sets with valid descriptions; subsequently, a multilingual generative large language model fine-tuned with book bibliographic data is used to generate CLC codes for data sets based on their valid descriptions; and, finally, the generated CLC codes are validated and corrected by a prefix tree constructed with valid CLC codes. Findings: Experimental results demonstrate that the proposed three-step method effectively classifies data sets. The CLC codes generated by the model are highly consistent with the classification information provided by the data set contributors, achieving a classification accuracy of 0.8520 for the first-level category and 0.4080 at the full CLC code level. Originality/value: This study proposes a method for the hierarchical classification of multilingual research data sets by accurately identifying data sets with valid descriptions, generating classification codes and correcting faulty codes. It provides a scalable and effective solution for data set classification and management. [ABSTRACT FROM AUTHOR]
Copyright of Electronic Library is the property of Emerald Publishing Limited and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Complementary Index
Description
ISSN: 0264-0473
DOI: 10.1108/EL-02-2025-0042
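
Illustrative note: the abstract's third step validates and corrects generated CLC codes with a prefix tree built from valid codes, but it does not say how the correction is done. The sketch below is one possible reading, not the authors' implementation: it assumes that an invalid code is corrected by truncating it to its longest prefix that is itself a valid code. The class name `CodeTrie`, its methods, and the sample codes are hypothetical; the sample codes only imitate the shape of CLC notation and have not been checked against the official schedule.

```python
# Minimal sketch of prefix-tree validation and correction of generated codes.
# Assumption: "correction" means falling back to the longest valid prefix.

_END = object()  # sentinel key marking that a complete valid code ends here


class CodeTrie:
    """Character-level prefix tree over a set of valid classification codes."""

    def __init__(self, codes):
        self.root = {}
        for code in codes:
            node = self.root
            for ch in code:
                node = node.setdefault(ch, {})
            node[_END] = True

    def is_valid(self, code):
        """Return True only if `code` exactly matches a stored code."""
        node = self.root
        for ch in code:
            if ch not in node:
                return False
            node = node[ch]
        return _END in node

    def correct(self, code):
        """Return the longest prefix of `code` that is a stored code, or None."""
        node = self.root
        best = None
        for i, ch in enumerate(code):
            if ch not in node:
                break
            node = node[ch]
            if _END in node:
                best = code[: i + 1]
        return best


# Toy code list (hypothetical, CLC-shaped only).
trie = CodeTrie(["T", "TP", "TP3", "TP39", "TP391", "G", "G2", "G25"])
print(trie.is_valid("TP391"))  # exact match against the stored codes
print(trie.correct("TP4"))     # falls back to the longest valid prefix
print(trie.correct("X1"))      # no valid prefix exists
```

Under this assumption, a faulty code still yields a usable coarser class whenever its leading characters are valid, which is in line with the abstract reporting higher accuracy at the first-level category than at the full code level.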