XML2HBase: Storing and querying large collections of XML documents using a NoSQL database system

Bibliographic details
Published in: Journal of Parallel and Distributed Computing, Volume 161, pp. 83-99
Main authors: Bao, Liang; Yang, Jin; Wu, Chase Q.; Qi, Haiyang; Zhang, Xin; Cai, Shunda
Format: Journal Article
Language: English
Published: Elsevier Inc, 1 March 2022
ISSN: 0743-7315, 1096-0848
Description
Summary:
•An encoding scheme named Pathed-Dewey Order for nodes in XML documents.
•An XML-to-HBase mapping strategy based on Pathed-Dewey Order.
•An algorithm to translate XPath into query operations on HBase tables.
•Supports efficient storage and query of large collections of small XML documents.
•Superior performance of XML2HBase over existing solutions in target use cases.

Many big data applications, such as smart transportation, healthcare, and e-commerce, need to store and query large collections of small XML documents, and doing so efficiently has become a fundamental problem. However, existing solutions do not deliver satisfactory query performance in such circumstances. In this paper, we propose a framework named XML2HBase to address this problem using HBase, a widely deployed NoSQL database. Within this framework, we design a novel encoding scheme called Pathed-Dewey Order and a two-layer mapping method to store XML documents in HBase tables. XML queries, which are represented as XPath expressions, are evaluated by translating them into queries over HBase tables. Based on an in-depth analysis of the characteristics of the proposed approach, we design and integrate four optimization strategies to reduce storage space and query response time. Extensive experiments on two well-known XML benchmarks demonstrate the superior performance of XML2HBase over three state-of-the-art methods.
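For readers unfamiliar with Dewey-style node encodings, the following is a minimal sketch of classic Dewey order labeling, with each label paired with the element's root-to-node path. This record does not describe how Pathed-Dewey Order or the two-layer HBase mapping are actually defined, so this is only an illustration of the general idea, not the paper's scheme. The function name `dewey_labels` and the sample document are hypothetical, and a plain Python list stands in for a table.

```python
import xml.etree.ElementTree as ET


def dewey_labels(xml_text):
    """Return (dewey_label, path, text) for every element, in document order.

    Classic Dewey order: the root is "1" and the i-th child of a node
    labeled L is "L.i". This is an assumption-based illustration, not the
    Pathed-Dewey Order scheme proposed in the paper.
    """
    root = ET.fromstring(xml_text)
    rows = []

    def visit(node, label, path):
        text = (node.text or "").strip()
        rows.append((label, path, text))
        for i, child in enumerate(node, start=1):
            visit(child, f"{label}.{i}", f"{path}/{child.tag}")

    visit(root, "1", f"/{root.tag}")
    return rows


# Hypothetical sample document.
doc = (
    "<order>"
    "<item><name>pen</name><qty>2</qty></item>"
    "<item><name>ink</name></item>"
    "</order>"
)
rows = dewey_labels(doc)

# A simple path query such as /order/item/name becomes a filter on the path column.
names = [text for _, path, text in rows if path == "/order/item/name"]
```

Storing the path next to each label lets a simple path expression be answered by matching on the path string, while the label still records each node's position in the document tree.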
DOI:10.1016/j.jpdc.2021.11.003