Scalable random forest with data-parallel computing

Bibliographic Details
Title: Scalable random forest with data-parallel computing
Authors: Vázquez-Novoa, Fernando, Conejero Bañón, Francisco Javier, Tatu, Cristian, Badia Sala, Rosa Maria
Contributors: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center
Publisher Information: Springer Cham
Publication Year: 2023
Collection: Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge
Subject Terms: Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors::Arquitectures distribuïdes, Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Aprenentatge automàtic, High performance computing, Machine learning, Electronic data processing -- Distributed processing, Random forest, PyCOMPSs, COMPSs, Parallelism, Distributed computing, Dislib, HPC, Càlcul intensiu (Informàtica), Aprenentatge automàtic, Processament distribuït de dades
Description: In recent years, there has been a significant increase in the quantity of available data and computational resources. This has led the scientific and industrial communities to pursue more accurate and efficient Machine Learning (ML) models. Random Forest is a well-known algorithm in the ML field due to the good results it obtains on a wide range of problems. Our objective is to create a parallel version of the algorithm that can generate a model using data distributed across different processors and that scales computationally with the available resources. This paper presents two novel proposals for this algorithm with a data-parallel approach. The first version is implemented using the PyCOMPSs framework and its failure-management mechanism, while the second variant uses the new PyCOMPSs nesting paradigm, where parallel tasks can generate other tasks within them. Both approaches are compared with each other and against the Apache Spark MLlib Random Forest using strong and weak scaling tests. Our findings indicate that while the MLlib implementation is faster when executed on a small number of nodes, the scalability of both new variants is superior. We conclude that the proposed data-parallel approaches to the Random Forest algorithm can effectively generate accurate and efficient models in a distributed computing environment and offer improved scalability over existing methods.
This work has been supported by the Spanish Government (PID2019-107255GB) and by MCIN/AEI/10.13039/501100011033 (CEX2021-001148-S), by the Departament de Recerca i Universitats de la Generalitat de Catalunya to the Research Group MPiEDist (2021 SGR 00412), by the European Commission's Horizon 2020 Framework Programme and the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558, by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR (PCI2021-121957, project eFlows4HPC), and by the European Commission through the Horizon Europe Research and Innovation programme under Grant Agreement No. ...
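The abstract describes training trees as parallel tasks over data partitions, with predictions combined by majority vote. A minimal, purely illustrative sketch of that data-parallel structure follows — this is not the paper's implementation: `concurrent.futures` stands in for the PyCOMPSs task runtime, a one-level decision stump stands in for a full decision tree, and all data, names, and parameters here are hypothetical.

```python
# Illustrative data-parallel random-forest sketch (NOT the paper's PyCOMPSs code).
# Each "task" bootstrap-samples one data partition and fits a decision stump;
# the driver launches one task per (partition, tree) pair and collects the forest.
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def train_stump(partition, seed):
    # One training task: bootstrap-sample the local partition, then pick the
    # (feature, threshold) split that classifies the sample most accurately.
    rng = random.Random(seed)
    sample = [partition[rng.randrange(len(partition))] for _ in partition]
    best = None  # (accuracy, feature, threshold, left_label, right_label)
    n_features = len(sample[0][0])
    for f in range(n_features):
        for x, _ in sample:  # candidate thresholds = observed feature values
            t = x[f]
            left = [y for xs, y in sample if xs[f] <= t]
            right = [y for xs, y in sample if xs[f] > t]
            if not left or not right:
                continue
            ly = Counter(left).most_common(1)[0][0]
            ry = Counter(right).most_common(1)[0][0]
            acc = sum((xs[f] <= t and y == ly) or (xs[f] > t and y == ry)
                      for xs, y in sample)
            if best is None or acc > best[0]:
                best = (acc, f, t, ly, ry)
    if best is None:  # degenerate bootstrap: fall back to a constant predictor
        maj = Counter(y for _, y in sample).most_common(1)[0][0]
        return (0, float("inf"), maj, maj)
    return best[1:]


def predict(forest, x):
    # Forest prediction = majority vote over the trees' individual predictions.
    votes = [ly if x[f] <= t else ry for f, t, ly, ry in forest]
    return Counter(votes).most_common(1)[0][0]


def fit_forest(partitions, trees_per_partition=4):
    # Data-parallel driver: one independent training task per partition/tree.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(train_stump, part, seed=31 * i + j)
                   for i, part in enumerate(partitions)
                   for j in range(trees_per_partition)]
        return [f.result() for f in futures]


# Hypothetical toy data: two partitions; the class is 0 iff feature 0 < 0.5.
PARTITIONS = [
    [((0.10, 0.9), 0), ((0.20, 0.1), 0), ((0.30, 0.5), 0),
     ((0.70, 0.3), 1), ((0.80, 0.7), 1), ((0.90, 0.2), 1)],
    [((0.15, 0.6), 0), ((0.25, 0.4), 0), ((0.05, 0.8), 0),
     ((0.75, 0.1), 1), ((0.85, 0.9), 1), ((0.95, 0.5), 1)],
]
forest = fit_forest(PARTITIONS)
```

In the paper's actual setting, each training call would be a PyCOMPSs `@task` operating on a dislib distributed array block, and the nesting variant would let those tasks spawn further tasks; the sketch only mirrors the task-per-partition shape of that design.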
Document Type: conference object
File Description: 14 p.; application/pdf
Language: English
Relation: https://link.springer.com/chapter/10.1007/978-3-031-39698-4_27; http://hdl.handle.net/2117/393206
DOI: 10.1007/978-3-031-39698-4_27
Availability: http://hdl.handle.net/2117/393206; https://doi.org/10.1007/978-3-031-39698-4_27
Rights: Open Access
Accession Number: edsbas.21944D8A
Database: BASE