Multi-granularity hierarchical topic-based segmentation of structured, digital library resources
Purpose Current segmentation systems almost invariably focus on linear segmentation and can only divide text into linear sequences of segments. This suits cohesive text such as news feed but not coherent texts such as documents of a digital library which have hierarchical structures. To overcome the...
Saved in:
| Published in: | Electronic library Vol. 35; no. 1; pp. 99 - 120 |
|---|---|
| Main Authors: | , , |
| Format: | Journal Article |
| Language: | English |
| Published: |
Oxford
Emerald Publishing Limited
01.01.2017
Emerald Group Publishing Limited |
| Subjects: | |
| ISSN: | 0264-0473, 1758-616X, 1758-616X |
| Online Access: | Get full text |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| Summary: | Purpose
Current segmentation systems almost invariably focus on linear segmentation and can only divide text into linear sequences of segments. This suits cohesive text such as news feed but not coherent texts such as documents of a digital library which have hierarchical structures. To overcome the focus on linear segmentation in document segmentation and to realize the purpose of hierarchical segmentation for a digital library’s structured resources, this paper aimed to propose a new multi-granularity hierarchical topic-based segmentation system (MHTSS) to decide section breaks.
Design/methodology/approach
MHTSS adopts up-down segmentation strategy to divide a structured, digital library document into a document segmentation tree. Specifically, it works in a three-stage process, such as document parsing, coarse segmentation based on document access structures and fine-grained segmentation based on lexical cohesion.
Findings
This paper analyzed limitations of document segmentation methods for the structured, digital library resources. Authors found that the combination of document access structures and lexical cohesion techniques should complement each other and allow for a better segmentation of structured, digital library resources. Based on this finding, this paper proposed the MHTSS for the structured, digital library resources. To evaluate it, MHTSS was compared to the TT and C99 algorithms on real-world digital library corpora. Through comparison, it was found that the MHTSS achieves top overall performance.
Practical implications
With MHTSS, digital library users can get their relevant information directly in segments instead of receiving the whole document. This will improve retrieval performance as well as dramatically reduce information overload.
Originality/value
This paper proposed MHTSS for the structured, digital library resources, which combines the document access structures and lexical cohesion techniques to decide section breaks. With this system, end-users can access a document by sections through a document structure tree. |
|---|---|
| Bibliography: | Includes illustrations, references, tables Includes links to related electronic resources ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 |
| ISSN: | 0264-0473 1758-616X 1758-616X |
| DOI: | 10.1108/EL-06-2015-0108 |