Grouping the Unstructured – A Comparison of Methods for Unsupervised Document Clustering of a Specialised Corpus

Bibliographic details
Title: Grouping the Unstructured – A Comparison of Methods for Unsupervised Document Clustering of a Specialised Corpus
Authors: Schlander, Anna; Bartsch, Sabine; Gius, Evelyn; Müller, Marcus; Rapp, Andrea; Weitin, Thomas
Publisher information: UNSPECIFIED, 2025.
Year of publication: 2025
Description: The rapid growth of digital corpora creates a need for methods that automatically organise large collections of domain-specific texts into meaningful, interpretable groups. This study evaluates the effectiveness of document vectorisation combined with clustering for this purpose, comparing three prominent embedding approaches (Word2Vec, FastText, and Sentence-BERT) in combination with the clustering algorithms k-means, DBSCAN, and hierarchical agglomerative clustering. The corpus processed in this work is the PubMed Abstracts corpus, which consists of academic abstracts from the field of neuroscience. The study examines the characteristics, pitfalls, and specific advantages of the vectorisation and clustering methods. Combining vectorisation and clustering with methods from corpus linguistics and statistics further allows us to seek and identify the "linguistic triggers" that lead to specific behaviour of embeddings and clustering algorithms. A qualitative analysis framework is applied to assess cluster coherence and interpretability. Quantitative measures are presented alongside visual analyses of clustering results, including statistics for cluster-based subcorpora, inferred qualitative categories, and their distribution across clusters. Cramér’s V is employed to quantify associations between clustering methods and category assignments. The observations demonstrate distinct operational characteristics and trade-offs across vectorisation–clustering combinations. The findings inform methodological selection for large-scale text analysis and offer a framework for exploring scalable, interpretable, and linguistically informed clustering approaches. Ultimately, this work discusses and answers the question of whether cluster analysis can, given limited prior knowledge, create meaningful groups of documents and improve the accessibility of domain-specific corpora, a task that gains relevance as digital corpora grow.
Document type: Book
DOI: 10.26083/tuprints-00031119
Rights: CC BY
Accession number: edsair.doi...........a0c103dd412879f95fdaae1d0f9d19a3
Database: OpenAIRE
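Note on Cramér’s V: the abstract names Cramér’s V as the association measure between clustering results and category assignments, but this record does not say how it was computed. The following is a minimal sketch of the standard, uncorrected statistic, not the authors' implementation. The function name `cramers_v` and both example tables are hypothetical illustrations, and the sketch assumes every row and column of the table has a non-zero total.

```python
import numpy as np


def cramers_v(table):
    """Uncorrected Cramér's V for a contingency table.

    Rows are assumed to be clusters and columns qualitative categories;
    every row and column total must be non-zero.
    """
    observed = np.asarray(table, dtype=float)
    n = observed.sum()
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts under independence of cluster and category.
    expected = row_totals @ col_totals / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    # Normalise by sample size and the smaller table dimension minus one.
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Hypothetical counts: each cluster holds exactly one category.
perfect = [[10, 0], [0, 10]]
# Hypothetical counts: categories spread evenly over the clusters.
independent = [[5, 5], [5, 5]]

print(cramers_v(perfect))      # 1.0
print(cramers_v(independent))  # 0.0
```

Values near 1 indicate that cluster membership almost determines the category, while values near 0 indicate no association. For small tables the uncorrected statistic tends to overestimate the association, so a bias-corrected variant may be preferable.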