Enhanced interpretation of novel datasets by summarizing clustering results using deep-learning based linguistic models

Bibliographic details
Published in: Applied Intelligence (Dordrecht, Netherlands), Vol. 55, No. 5, p. 317
Main authors: K, Natarajan; Verma, Srikar; Kumar, Dheeraj
Format: Journal Article
Language: English
Publication details: Boston, Springer Nature B.V., 01.04.2025
ISSN:0924-669X, 1573-7497
Description
Summary: In today’s technology-driven era, data is proliferating across every domain. In engineering, science, and business, particularly where big data is involved, analyzing this data can yield actionable insights that may transform a field. During data management and analysis, groups of interconnected data points, commonly called clusters, frequently emerge. Each cluster is a distinct subset of closely related data points whose characteristics differ from those of other clusters in the same dataset. Across disciplines such as physics, biology, business, and sales, clustering is important for understanding the essential characteristics of novel datasets, developing complex statistical models, and testing hypotheses. However, interpreting the characteristics and physical implications of clusters generated by different clustering algorithms is challenging for researchers unfamiliar with the inner workings of these algorithms. This research addresses the difficulty of understanding data clustering, cluster attributes, and evaluation metrics, especially for people without a background in clustering or related disciplines such as statistics. The primary objective of the study is to simplify cluster analysis by giving users and analysts from diverse domains succinct linguistic summaries of clustering results, avoiding the need for intricate numerical or mathematical terms. To achieve this, deep learning techniques based on large language models are employed, such as encoder-decoder models (for example, T5) and generative pre-trained transformers (GPTs). The study aims to build a summarization model that takes data clusters as input and produces a condensed overview of the insights they contain in a simple, easily understandable linguistic form.
The evaluation revealed a clear preference among evaluators for the summaries generated by GPT, with the T5 summaries following closely behind. Both the GPT and T5 summaries scored well on fluency, demonstrating their ability to capture the original content in a human-like manner. In contrast, the linguistic protoform-based approach, while providing a structured framework for summarization, still needs improvement to match the quality and coherence of the GPT and T5 summaries.
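Illustration only, not the authors' method: the abstract does not give implementation details, so the following is a minimal Python sketch of a simple template-based cluster summary, loosely in the spirit of the protoform-style baseline mentioned above. The function names (level_word, summarize_clusters), the "thirds of the global range" thresholds, and the sentence template are all assumptions made for this example. A sentence like the ones it produces could also serve as input text for a T5 or GPT model, but that step is not shown here.

```python
from statistics import mean


def level_word(value, lo, hi):
    """Map a value to 'low', 'moderate', or 'high' by thirds of [lo, hi].

    The thirds rule is a hypothetical threshold chosen for illustration.
    """
    span = hi - lo
    if value < lo + span / 3:
        return "low"
    if value > lo + 2 * span / 3:
        return "high"
    return "moderate"


def summarize_clusters(points, labels, feature_names):
    """Return one plain-language sentence per cluster (hypothetical template)."""
    # Global range of each feature, used as the scale for the level words.
    ranges = []
    for j in range(len(feature_names)):
        column = [p[j] for p in points]
        ranges.append((min(column), max(column)))

    summaries = []
    for cluster in sorted(set(labels)):
        # Collect the points assigned to this cluster.
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        phrases = []
        for j, name in enumerate(feature_names):
            # Describe the cluster's average value of each feature in words.
            avg = mean(p[j] for p in members)
            lo, hi = ranges[j]
            phrases.append(f"{name} is {level_word(avg, lo, hi)}")
        summaries.append(
            f"Cluster {cluster} ({len(members)} of {len(points)} points): "
            + ", ".join(phrases) + "."
        )
    return summaries
```

Calling summarize_clusters with a list of numeric tuples, a matching list of cluster labels, and a list of feature names returns one sentence per cluster, replacing raw cluster means with the words "low", "moderate", and "high".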
DOI:10.1007/s10489-025-06250-6