A greedy feature selection algorithm for Big Data of high dimensionality
| Published in: | Machine learning, Vol. 108, No. 2, pp. 149-202 |
|---|---|
| Main authors: | , , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | New York: Springer US, 1 February 2019; Springer Nature B.V; Springer Verlag |
| Subjects: | |
| ISSN: | 0885-6125, 1573-0565 |
| Summary: | We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix in terms of both rows and columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, and linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be applied to other greedy-type FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case. |
|---|---|
| Bibliography: | PMCID: PMC6399683 |
| DOI: | 10.1007/s10994-018-5748-7 |
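
The summary describes combining p-values computed locally on each data partition through meta-analysis, and dropping features early. The following is a minimal sketch of that idea, not the authors' implementation. It assumes Fisher's combined probability test as the meta-analysis technique (the summary does not name one), and the function names `fisher_combine` and `early_drop`, the threshold `alpha`, and the example p-values are all hypothetical.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (an assumed choice).

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(p_values).
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    # half_stat is X / 2; clamp p-values to avoid log(0).
    half_stat = -sum(math.log(max(p, 1e-300)) for p in p_values)
    # Closed-form chi-square survival function for even degrees of freedom:
    # sf(X) = exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def early_drop(local_p_values, alpha=0.05):
    """Illustrative Early Dropping rule (hypothetical, not from the paper).

    local_p_values maps each feature name to the list of p-values that the
    row partitions computed independently. Features whose combined p-value
    exceeds alpha are dropped from later iterations.
    """
    combined = {f: fisher_combine(ps) for f, ps in local_p_values.items()}
    kept = {f: p for f, p in combined.items() if p <= alpha}
    dropped = sorted(f for f in combined if f not in kept)
    return kept, dropped


# Hypothetical per-partition p-values for two features over three partitions.
local = {"x1": [0.01, 0.03, 0.02], "x2": [0.6, 0.8, 0.4]}
kept, dropped = early_drop(local)
```

Only the per-partition p-values need to be communicated in this sketch, which is the property the summary attributes to PFBP; the actual decision rules and test statistics are defined in the article itself.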