Fine-Tuning Large Language Models for Kazakh Text Simplification

This paper addresses text simplification task for Kazakh, a morphologically rich, low-resource language, by introducing KazSim, an instruction-tuned model built on multilingual large language models (LLMs). First, we develop a heuristic pipeline to identify complex Kazakh sentences, manually validat...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	Applied sciences Jg. 15; H. 15; S. 8344
Hauptverfasser:	Toleu, Alymzhan, Tolegen, Gulmira, Ualiyeva, Irina
Format:	Journal Article
Sprache:	Englisch
Veröffentlicht:	Basel MDPI AG 26.07.2025
Schlagworte:	Benchmarks Datasets English language Error analysis fine-tuning Fluency Heuristic Kazakh language Language Language modeling Large language models Linear programming Linguistics Machine learning Methods Morphology Multilingualism Natural language processing Parallel corpora Preservation Readability Semantic change Simplicity Simplification Simplified language Syntax Teaching Tense text simplification Word sense disambiguation
ISSN:	2076-3417, 2076-3417
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	This paper addresses text simplification task for Kazakh, a morphologically rich, low-resource language, by introducing KazSim, an instruction-tuned model built on multilingual large language models (LLMs). First, we develop a heuristic pipeline to identify complex Kazakh sentences, manually validating its performance on 400 examples and comparing it against a purely LLM-based selection method; we then use this pipeline to assemble a parallel corpus of 8709 complex–simple pairs via LLM augmentation. For the simplification task, we benchmark KazSim against standard Seq2Seq systems, domain-adapted Kazakh LLMs, and zero-shot instruction-following models. On an automatically constructed test set, KazSim (Llama-3.3-70B) achieves BLEU 33.50, SARI 56.38, and F1 87.56 with a length ratio of 0.98, outperforming all baselines. We also explore prompt language (English vs. Kazakh) and conduct human evaluation with three native speakers: KazSim scores 4.08 for fluency, 4.09 for meaning preservation, and 4.42 for simplicity—significantly above GPT-4o-mini. Error analysis shows that remaining failures cluster into tone change, tense change, and semantic drift, reflecting Kazakh’s agglutinative morphology and flexible syntax.
Bibliographie:	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14
ISSN:	2076-3417 2076-3417
DOI:	10.3390/app15158344