Enhancing Language Models via HTML DOM Tree for Text Structure Understanding

Bibliographic Details
Published in: IEEE Transactions on Audio, Speech and Language Processing, Vol. 33, pp. 1653-1663
Authors: Xing, Hangdi; Shao, Zirui; Gao, Feiyu; Bu, Jiajun; Yu, Zhi; Zheng, Qi; Gu, Jingjun; Liu, Xiaozhong
Format: Journal Article
Language: English
Published: IEEE, 2025
ISSN: 2998-4173
DOI: 10.1109/TASLPRO.2025.3555098

Abstract
Understanding text structure, which enables automated systems to parse the structure of long texts, is crucial for various natural language processing applications such as information extraction, summarization, and question answering. Although previous methods have advanced text structure parsing effectively, they face challenges such as not leveraging the abundance of unlabelled data and focusing mainly on content-inferred information. To address this deficiency, this paper introduces a novel Text Structure Language Model (TSLM), an LM pre-training framework that employs ubiquitous HTML documents and considers the text structure among text units. HTML documents are composed by experts, and their hierarchies can reflect the structure of the documents they present. Our learning framework is designed to equip the LM with awareness of two complementary kinds of structure found in HTML documents. It encourages the model to learn local structure, which captures the immediate connection between two units, by reconstructing the structure of the DOM tree, and global structure, which shapes the overall organization and thematic development, by predicting the optimal content-fitting tree. Extensive experiments on structure-related downstream tasks, including text segmentation and table-of-contents generation, validate the effectiveness of TSLM.
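
The abstract derives structural supervision from HTML DOM trees: local structure from the nesting around individual text units, and global structure from the tree that best fits the whole document. As a rough illustration of the kind of signal involved, the sketch below uses Python's standard html.parser to pull text units out of an HTML page together with their DOM depth and tag path; the actual preprocessing pipeline of TSLM is not described in this record, so the class, field names, and toy document here are assumptions for illustration only.

from html.parser import HTMLParser

class DomTextExtractor(HTMLParser):
    """Collects (text, depth, tag_path) triples while walking an HTML page.
    Hypothetical helper, not taken from the TSLM paper."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, e.g. ["html", "body", "div", "h2"]
        self.units = []   # extracted (text, depth, tag_path) triples

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching tag; tolerant of sloppy nesting.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (self.stack and self.stack[-1] in self.SKIP):
            self.units.append((text, len(self.stack), "/".join(self.stack)))

html_doc = """
<html><body>
  <h1>Report</h1>
  <div><h2>Background</h2><p>First paragraph.</p></div>
  <div><h2>Method</h2><p>Second paragraph.</p></div>
</body></html>
"""

parser = DomTextExtractor()
parser.feed(html_doc)
for text, depth, path in parser.units:
    # e.g. depth=4 path=html/body/div/h2 text=Background
    print(f"depth={depth} path={path:<20} text={text}")

Pairs of such units, together with relations read off the DOM (shared parent, relative depth, subtree membership), are the sort of labels one could feed to objectives like the local DOM-tree reconstruction and global content-fitting-tree prediction described in the abstract.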