Enhancing Language Models via HTML DOM Tree for Text Structure Understanding
| Published in: | IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 1653-1663 |
|---|---|
| Main authors: | , , , , , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | IEEE, 2025 |
| Subjects: | |
| ISSN: | 2998-4173, 2998-4173 |
| Online access: | Full text |
| Summary: | Understanding text structure, which enables an automated system to parse the structure of long texts, is crucial for various natural language processing applications such as information extraction, summarization, and question answering. Although previous methods have advanced text structure parsing effectively, they face challenges such as not leveraging the abundance of unlabelled data and focusing mainly on content-inferred information. To address this deficiency, this paper introduces a novel Text Structure Language Model (TSLM), an LM pre-training framework that employs ubiquitous HTML documents and considers the text structure among text units. HTML documents are composed by experts, and their hierarchies can reflect the structure of documents. The learning framework is designed to equip the LM with awareness of two complementary kinds of structure from HTML documents. It encourages the model to learn local structure, which helps in understanding the immediate connection between two units, by reconstructing the structure of the DOM tree, and global structure, which shapes the overall organization and thematic development, by predicting the optimal content-fitting tree. Extensive experiments with structure-related downstream tasks, including text segmentation and table of contents generation, validate the effectiveness of TSLM. |
|---|---|
| DOI: | 10.1109/TASLPRO.2025.3555098 |
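
The summary refers to the connection between two text units in the DOM tree but does not say how such relations are read off an HTML page. The following is a minimal sketch of one possible way to do this with Python's standard `html.parser`, and it is not the TSLM pipeline. The names `UNIT_TAGS`, `VOID_TAGS`, `UnitCollector`, `extract_units` and `tree_distance` are hypothetical, and the choice of which tags count as text units is an assumption made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as text units; the record does not name any.
UNIT_TAGS = {"h1", "h2", "h3", "p", "li"}
# Elements without a closing tag are skipped so they never stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class UnitCollector(HTMLParser):
    """Collects text units with their position (child-index path) in the DOM tree."""

    def __init__(self):
        super().__init__()
        self.stack = []          # open elements: [tag, index among siblings, children seen]
        self.root_children = 0   # number of top-level elements seen so far
        self.units = []          # (index path, tag, text) triples

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.stack:
            index = self.stack[-1][2]
            self.stack[-1][2] += 1
        else:
            index = self.root_children
            self.root_children += 1
        self.stack.append([tag, index, 0])

    def handle_endtag(self, tag):
        # Pop back to the matching open element; tolerates unclosed inner tags.
        if any(entry[0] == tag for entry in self.stack):
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        # Simplification: only text sitting directly inside a unit tag is kept,
        # so text wrapped in inline children such as <b> is ignored.
        if text and self.stack and self.stack[-1][0] in UNIT_TAGS:
            path = tuple(entry[1] for entry in self.stack)
            self.units.append((path, self.stack[-1][0], text))


def extract_units(html):
    """Return (index path, tag, text) triples in document order."""
    parser = UnitCollector()
    parser.feed(html)
    parser.close()
    return parser.units


def tree_distance(path_a, path_b):
    """Number of tree edges between two nodes given their root-to-node index paths."""
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return len(path_a) + len(path_b) - 2 * common
```

Under these assumptions, a heading and a paragraph in the same `<section>` are two edges apart, while paragraphs in sibling sections are four edges apart. Pairwise relations of this kind are one way to picture a local structure signal; how TSLM actually encodes them is not stated in this record.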