HFIM: a Spark-based hybrid frequent itemset mining algorithm for big data processing

Bibliographic details
Published in:	The Journal of Supercomputing, Vol. 73, No. 8, pp. 3652-3668
Main authors:	Sethi, Krishan Kumar; Ramesh, Dharavath
Format:	Journal Article
Language:	English
Published:	New York: Springer US, 1 August 2017; Springer Nature B.V.
Subjects:	Algorithms; Big Data; Compilers; Computer Science; Data management; Data mining; Data processing; Datasets; Interpreters; Iterative algorithms; Processor Architectures; Programming Languages; Scanning; Frequent pattern mining; Apriori algorithm; Big data; Apache Spark
ISSN:	0920-8542, 1573-0484

Description
Summary:	Frequent itemset mining is a data mining technique for discovering frequent patterns, used in prediction, association rule mining, classification, and related tasks. The Apriori algorithm is an iterative algorithm used to find frequent itemsets in a transactional dataset. It scans the complete dataset in each iteration to generate the frequent itemsets of each cardinality, which works for small data but is not feasible for big data. The MapReduce framework provides a distributed environment in which to run Apriori on big transactional data. However, MapReduce is not well suited to iterative processing, and performance degrades. We introduce a novel algorithm named Hybrid Frequent Itemset Mining (HFIM), which uses the vertical layout of the dataset to avoid scanning the dataset in each iteration. The vertical dataset carries the information needed to find the support of each itemset. We also include some enhancements to reduce the number of candidate itemsets. The proposed algorithm is implemented on the Spark framework, which incorporates the concept of resilient distributed datasets and performs in-memory processing to reduce execution time. We compare the performance of HFIM with another Spark-based implementation of the Apriori algorithm on various datasets. Experimental results show that HFIM performs better in terms of execution time and space consumption.
DOI:	10.1007/s11227-017-1963-4
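
The summary above describes using a vertical dataset layout so that itemset support can be found without rescanning the transactions in every iteration. The following is a minimal single-machine sketch of that general idea in plain Python, not the authors' HFIM implementation and not Spark code: each item is mapped to the set of transaction IDs that contain it, and the support of a candidate is the size of the intersection of those sets. The function names `to_vertical` and `frequent_itemsets`, the absolute `min_support` count, and the sample data are all assumptions made for illustration.

```python
from itertools import combinations


def to_vertical(transactions):
    """Map each item to the set of transaction IDs (tidset) that contain it."""
    vertical = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) mining over a vertical layout.

    Hypothetical helper: min_support is an absolute transaction count.
    Returns a dict mapping each frequent itemset (frozenset) to its support.
    """
    # The raw transactions are read only here, to build the vertical layout.
    vertical = to_vertical(transactions)

    # Level 1: frequent single items together with their tidsets.
    level = {
        frozenset([item]): tids
        for item, tids in vertical.items()
        if len(tids) >= min_support
    }

    result = {}
    k = 1
    while level:
        result.update({itemset: len(tids) for itemset, tids in level.items()})
        k += 1
        next_level = {}
        # Join pairs of frequent (k-1)-itemsets into k-item candidates.
        for (a, tids_a), (b, tids_b) in combinations(level.items(), 2):
            candidate = a | b
            if len(candidate) != k or candidate in next_level:
                continue
            # Apriori pruning: every (k-1)-subset must itself be frequent.
            if any(frozenset(sub) not in level
                   for sub in combinations(candidate, k - 1)):
                continue
            # Support comes from intersecting tidsets, not from a data scan.
            tids = tids_a & tids_b
            if len(tids) >= min_support:
                next_level[candidate] = tids
        level = next_level
    return result
```

Calling `frequent_itemsets(transactions, min_support)` with a list of item sets returns the frequent itemsets and their supports. In a distributed setting the tidsets would presumably be partitioned across workers, but how HFIM does that is not stated in this record.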