A greedy feature selection algorithm for Big Data of high dimensionality
We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix by both rows and columns. By employing the concepts of p-values of conditional independence tests and meta-analysis...
| Published in: | Machine learning Vol. 108; no. 2; pp. 149 - 202 |
|---|---|
| Main Authors: | , , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | New York; Springer US; 01.02.2019; Springer Nature B.V; Springer Verlag |
| ISSN: | 0885-6125, 1573-0565 |
| Summary: | We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix by both rows and columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, and linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be applied to other greedy-type FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case. |
|---|---|
| Bibliography: | PMCID: PMC6399683 |
| DOI: | 10.1007/s10994-018-5748-7 |
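
The summary describes combining p-values computed locally on each data partition through meta-analysis, then using the combined values for early decisions such as Early Dropping. The sketch below illustrates that idea only; it is not the authors' implementation. It assumes Fisher's combined probability test as the meta-analysis technique (the summary does not name one), and the function names `fisher_combine` and `early_drop`, the threshold `alpha`, and the input layout are hypothetical.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (an assumed choice).

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null, where k is the number of p-values.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    tiny = 1e-300  # clamp to avoid log(0)
    half_stat = -sum(math.log(max(p, tiny)) for p in p_values)  # X / 2
    k = len(p_values)
    # Chi-square survival function for even degrees of freedom (2k):
    # sf(X) = exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def early_drop(local_p_by_feature, alpha=0.05):
    """Keep only features whose combined p-value is at most alpha.

    local_p_by_feature maps a feature name to the list of p-values obtained
    from independence tests run separately on each data partition
    (hypothetical layout). Features above the threshold are dropped from
    consideration in later iterations.
    """
    combined = {f: fisher_combine(ps) for f, ps in local_p_by_feature.items()}
    return {f: p for f, p in combined.items() if p <= alpha}


# Two partitions, two candidate features (made-up p-values for illustration).
local = {"x1": [0.01, 0.02], "x2": [0.6, 0.7]}
kept = early_drop(local)
winner = min(kept, key=kept.get)  # feature with the smallest combined p-value
```

Because each partition only needs to report its local p-values, the amount of data exchanged between workers is small relative to the data itself, which is the communication saving the summary refers to.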