A greedy feature selection algorithm for Big Data of high dimensionality

Detailed bibliography
Published in:	Machine Learning, Volume 108, Issue 2, pp. 149-202
Main authors:	Tsamardinos, Ioannis, Borboudakis, Giorgos, Katsogridakis, Pavlos, Pratikakis, Polyvios, Christophides, Vassilis
Format:	Journal Article
Language:	English
Published:	New York Springer US 01.02.2019 Springer Nature B.V Springer Verlag
Subjects:	Algorithms; Artificial Intelligence; Bayesian analysis; Big Data; Computer Science; Computer simulation; Control; Data management; Empirical analysis; Greedy algorithms; Iterative methods; Machine Learning; Mechatronics; Natural Language Processing (NLP); Parallel processing; Partitions; Polymorphism; Pruning; Robotics; Simulation and Modeling; Feature selection; Data analytics; Forward selection; Variable selection
ISSN:	0885-6125, 1573-0565

Description
Summary:	We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix in terms of both rows and columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, and linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be applied to other greedy-type FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
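To illustrate how p-values computed locally on separate row partitions might be merged into one decision, the sketch below uses Fisher's combined probability test, a standard meta-analysis technique. This is an assumption made for illustration: the summary does not state which combination rule PFBP uses. The function names `fisher_combine` and `early_drop`, the threshold `alpha`, and the rule "drop a feature when its combined p-value exceeds `alpha`" are likewise hypothetical and are not taken from the article.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    Fisher's statistic is X = -2 * sum(ln p_i), which follows a chi-square
    distribution with 2k degrees of freedom under the null hypothesis.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    # Half of Fisher's statistic; clamp p to avoid log(0).
    half_stat = -sum(math.log(max(p, 1e-300)) for p in p_values)
    # For even degrees of freedom 2k the chi-square survival function has a
    # closed form: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def early_drop(local_p_values, alpha=0.05):
    """Split features into kept and dropped using combined p-values.

    local_p_values maps a feature name to the list of p-values obtained on
    each row partition. The drop rule (combined p > alpha) is an assumed,
    simplified stand-in for an Early Dropping heuristic.
    """
    kept, dropped = {}, []
    for feature, p_list in local_p_values.items():
        combined = fisher_combine(p_list)
        if combined > alpha:
            dropped.append(feature)
        else:
            kept[feature] = combined
    return kept, dropped


if __name__ == "__main__":
    # Hypothetical p-values from three row partitions for two features.
    kept, dropped = early_drop({
        "x1": [0.01, 0.02, 0.03],
        "x2": [0.60, 0.70, 0.80],
    })
    print("kept:", kept)
    print("dropped:", dropped)
```

Only one p-value per feature and partition has to be communicated in this sketch, which is the kind of low-communication combination step the summary describes.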
Bibliography:	PMCID: PMC6399683
DOI:	10.1007/s10994-018-5748-7