TextBenDS: a Generic Textual Data Benchmark for Distributed Systems

Bibliographic Details
Published in:	Information systems frontiers, Volume 23, Issue 1, pp. 81-100
Main Authors:	Truică, Ciprian-Octavian; Apostol, Elena-Simona; Darmont, Jérôme; Assent, Ira
Format:	Journal Article
Language:	English
Published:	New York Springer US 01.02.2021 Springer Nature B.V Springer Verlag
Series:	Breakthroughs on Cross-Cutting Data Management, Data Analytics and Applied Data Science
Subjects:	Algorithms; Benchmarks; Big Data; Business and Management; Computation; Computer networks; Computer Science; Control; Data base management systems; Data mining; Datasets; Document and Text Processing; Ecosystems; Errors; Information systems; IT in Business; Keywords; Machine learning; Management of Computing and Information Systems; Multidimensional approach; Operations Research/Decision Theory; Performance evaluation; Queries; Retrieval; Scores; Subsets; Systems Theory; Weighting; Weighting schemes; Distributed DBMSs; Top keywords; Distributed frameworks; documents; Benchmark; Top-k documents; Top-k keywords
ISSN:	1387-3326, 1572-9419

Description
Summary:	Extracting top-k keywords and documents using weighting schemes is a popular technique in text mining and machine learning for various analysis and retrieval tasks. The weights are usually computed in the data preprocessing step, because it is costly to update them and to keep track of all the modifications performed on the dataset. Furthermore, calculation errors are introduced when only subsets of the dataset are analyzed: the weights come out wrong because weighting schemes use the number of documents to score keywords and documents. Therefore, in a Big Data context, it is crucial to lower the runtime of computing weighting schemes without hindering the analysis process or the accuracy of the machine learning algorithms. To address this requirement for the task of computing top-k keywords and documents (which largely relies on weighting schemes), it is customary to design benchmarks that compare weighting schemes within various configurations of distributed frameworks and database management systems. Thus, we propose TextBenDS, a generic document-oriented benchmark for storing textual data and constructing weighting schemes. Our benchmark offers a generic data model designed with a multidimensional approach for storing text documents. We also propose using aggregation queries of various complexities and selectivities to construct the term weighting schemes that are used in extracting top-k keywords and documents. We evaluate the computing performance of the queries on several distributed environments set within the Apache Hadoop ecosystem. In our experiments, for example, MongoDB shows the best overall performance, while Spark's execution time remains almost constant regardless of the weighting scheme.
DOI:	10.1007/s10796-020-09999-y
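
The summary's point that weights computed on a subset of the dataset come out wrong can be illustrated with a minimal sketch. It assumes plain TF-IDF as the weighting scheme, which the summary does not name; the sample documents and the helper names `tfidf_scores` and `top_k` are hypothetical and are not taken from the benchmark.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sample corpus, for illustration only.
DOCS = [
    "spark runs queries",
    "mongodb runs queries",
    "hive runs queries fast",
    "spark stores documents",
]


def tfidf_scores(docs):
    """Sum a plain TF-IDF weight per term over all documents in `docs`."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = defaultdict(float)
    for tokens in tokenized:
        tf = Counter(tokens)
        for term, count in tf.items():
            # The IDF depends on n_docs, so it changes with the subset used.
            idf = math.log(n_docs / df[term])
            scores[term] += (count / len(tokens)) * idf
    return dict(scores)


def top_k(scores, k):
    """Return the k best (term, score) pairs; ties are broken alphabetically."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


full = tfidf_scores(DOCS)        # weights over the whole dataset
subset = tfidf_scores(DOCS[:2])  # weights over the first two documents only
```

In this sketch the term "runs" occurs in every document of the subset, so its subset weight collapses to zero, while over the full corpus it keeps a positive weight. That is the kind of discrepancy the summary attributes to scoring with the number of documents.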