Towards Distribution Transparency for Supervised ML With Oblivious Training Functions

Saved in:
Bibliographic Details
Title: Towards Distribution Transparency for Supervised ML With Oblivious Training Functions
Authors: Meister, Moritz, Sheikholeslami, Sina, Andersson, Robin, Ormenisan, Alexandru-Adrian, Dowling, Jim
Publisher Information: KTH, Programvaruteknik och datorsystem, SCS, 2020.
Publication Year: 2020
Subjects: Networked, Parallel and Distributed Computing; Computer Sciences
Description: Building and productionizing Machine Learning (ML) models is a process of interdependent steps of iterative code updates, including exploratory model design, hyperparameter tuning, ablation experiments, and model training. Industrial-strength ML involves doing this at scale, using many compute resources, and this requires rewriting the training code to account for distribution. As a result, moving from a single-host program to a cluster hinders iterative development, since multiple versions of the software would have to be maintained and kept consistent. In this paper, we introduce the distribution-oblivious training function as an abstraction for ML development in Python, whereby developers can reuse the same training function when running a notebook on a laptop or performing scale-out hyperparameter search and distributed training on clusters. Programs written in our framework look like industry-standard ML programs as we factor out dependencies using best-practice programming idioms (such as functions to generate models and data batches). We believe that our approach takes a step towards unifying single-host and distributed ML development. (An illustrative sketch of such a training function follows the record below.)
Publication Type: Conference object
File Description: application/pdf
Language: English
Access URL: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-360717
Document Code: edsair.dedup.wf.002..7fc3ac892bcb004e237d1e65139c6bc8
Database: OpenAIRE
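
Illustrative sketch (not the paper's actual framework or API): the abstract describes a training function whose only dependencies are factored-out builder functions for the model and the data batches, so the identical function can be called directly in a notebook or handed to a cluster runner for hyperparameter search. The names train_fn, model_fn, data_fn, the toy linear-regression task, and the loop standing in for a cluster runner below are all hypothetical and serve only to show the pattern.

import random
from typing import Iterable, List, Tuple


def model_fn() -> List[float]:
    # Hypothetical model builder: initial weight and bias of a 1-D linear model.
    return [0.0, 0.0]


def data_fn(batch_size: int = 8) -> Iterable[List[Tuple[float, float]]]:
    # Hypothetical data builder: yields synthetic (x, y) batches for y = 3x + 1.
    while True:
        batch = []
        for _ in range(batch_size):
            x = random.uniform(-1.0, 1.0)
            batch.append((x, 3.0 * x + 1.0 + random.gauss(0.0, 0.01)))
        yield batch


def train_fn(lr: float = 0.1, steps: int = 200) -> float:
    # "Distribution-oblivious" training function: it only calls model_fn/data_fn
    # and returns a metric, so the same unchanged code could run in a laptop
    # notebook or be invoked by a cluster runner that injects hyperparameters
    # (here, lr) and collects the returned metric.
    w, b = model_fn()
    batches = data_fn()
    loss = 0.0
    for _ in range(steps):
        batch = next(batches)
        grad_w = grad_b = loss = 0.0
        for x, y in batch:
            err = (w * x + b) - y
            loss += err * err / len(batch)
            grad_w += 2.0 * err * x / len(batch)
            grad_b += 2.0 * err / len(batch)
        w -= lr * grad_w
        b -= lr * grad_b
    return loss


if __name__ == "__main__":
    # Single-host use, as in an interactive notebook:
    print("final loss:", train_fn(lr=0.1))

    # Scale-out use (sketch only): a hypothetical runner would call the same
    # function once per hyperparameter configuration, e.g. on cluster workers.
    for lr in (0.01, 0.1, 0.3):
        print(f"lr={lr}: loss={train_fn(lr=lr):.5f}")

Factoring out model_fn and data_fn mirrors the "best-practice programming idioms" mentioned in the abstract: the training function itself carries no distribution logic, which is what would let a framework reuse it unchanged across single-host and cluster execution.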