Fine-Tuning Large Language Models for Kazakh Text Simplification
This paper addresses text simplification task for Kazakh, a morphologically rich, low-resource language, by introducing KazSim, an instruction-tuned model built on multilingual large language models (LLMs). First, we develop a heuristic pipeline to identify complex Kazakh sentences, manually validat...
Saved in:
| Published in: | Applied sciences Vol. 15; no. 15; p. 8344 |
|---|---|
| Main Authors: | , , |
| Format: | Journal Article |
| Language: | English |
| Published: |
Basel
MDPI AG
26.07.2025
|
| Subjects: | |
| ISSN: | 2076-3417, 2076-3417 |
| Online Access: | Get full text |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| Summary: | This paper addresses text simplification task for Kazakh, a morphologically rich, low-resource language, by introducing KazSim, an instruction-tuned model built on multilingual large language models (LLMs). First, we develop a heuristic pipeline to identify complex Kazakh sentences, manually validating its performance on 400 examples and comparing it against a purely LLM-based selection method; we then use this pipeline to assemble a parallel corpus of 8709 complex–simple pairs via LLM augmentation. For the simplification task, we benchmark KazSim against standard Seq2Seq systems, domain-adapted Kazakh LLMs, and zero-shot instruction-following models. On an automatically constructed test set, KazSim (Llama-3.3-70B) achieves BLEU 33.50, SARI 56.38, and F1 87.56 with a length ratio of 0.98, outperforming all baselines. We also explore prompt language (English vs. Kazakh) and conduct human evaluation with three native speakers: KazSim scores 4.08 for fluency, 4.09 for meaning preservation, and 4.42 for simplicity—significantly above GPT-4o-mini. Error analysis shows that remaining failures cluster into tone change, tense change, and semantic drift, reflecting Kazakh’s agglutinative morphology and flexible syntax. |
|---|---|
| Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 |
| ISSN: | 2076-3417 2076-3417 |
| DOI: | 10.3390/app15158344 |