Towards Distribution Transparency for Supervised ML With Oblivious Training Functions

Saved in:
Bibliographic Details
Title: Towards Distribution Transparency for Supervised ML With Oblivious Training Functions
Authors: Meister, Moritz, Sheikholeslami, Sina, Andersson, Robin, Ormenisan, Alexandru-Adrian, Dowling, Jim
Publisher Information: KTH, Programvaruteknik och datorsystem, SCS, 2020.
Publication Year: 2020
Subjects: Networked, Parallel and Distributed Computing; Computer Sciences
Description: Building and productionizing Machine Learning (ML) models is a process of interdependent steps of iterative code updates, including exploratory model design, hyperparameter tuning, ablation experiments, and model training. Industrial-strength ML involves doing this at scale, using many compute resources, which requires rewriting the training code to account for distribution. As a result, moving from a single-host program to a cluster hinders iterative development, since multiple versions of the software would have to be maintained and kept consistent. In this paper, we introduce the distribution-oblivious training function as an abstraction for ML development in Python, whereby developers can reuse the same training function when running a notebook on a laptop or performing scale-out hyperparameter search and distributed training on clusters. Programs written in our framework look like industry-standard ML programs, as we factor out dependencies using best-practice programming idioms (such as functions to generate models and data batches). We believe that our approach takes a step towards unifying single-host and distributed ML development. (An illustrative sketch of this pattern appears after the record below.)
Document Type: Conference object
File Description: application/pdf
Language: English
Access URL: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-360717
Accession Number: edsair.dedup.wf.002..7fc3ac892bcb004e237d1e65139c6bc8
Database: OpenAIRE
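
The abstract above describes factoring model construction and data-batch generation into functions so that one training function can run unchanged on a laptop or on a cluster. Below is a minimal, hypothetical Python sketch of that pattern; the names (build_model, batch_generator, training_function, launcher.run) are illustrative assumptions, not the actual API of the paper's framework.

# Minimal, hypothetical sketch of a distribution-oblivious training function.
# All names here are illustrative assumptions, not the paper's framework API.

def build_model():
    # Best-practice idiom: the model is produced by a factory function, so a
    # framework can instantiate it wherever the code happens to run.
    return {"weight": 0.5}

def batch_generator(num_batches=5, batch_size=4):
    # Best-practice idiom: data batches come from a generator function rather
    # than being loaded eagerly in the surrounding script.
    for i in range(num_batches):
        yield [float(i + j) for j in range(batch_size)]

def training_function(model_fn, data_fn, lr=0.01):
    # The same function body is reused unchanged whether it is called directly
    # in a notebook or handed to a cluster launcher for scale-out execution.
    model = model_fn()
    loss = 0.0
    for batch in data_fn():
        predictions = [model["weight"] * x for x in batch]
        loss = sum((p - x) ** 2 for p, x in zip(predictions, batch)) / len(batch)
        model["weight"] -= lr * loss  # toy update, for illustration only
    return loss

if __name__ == "__main__":
    # Single-host use: call the training function directly.
    print("local loss:", training_function(build_model, batch_generator))
    # Scale-out use (hypothetical): pass the *same* function to a launcher,
    # e.g. launcher.run(training_function, build_model, batch_generator).

The point of the sketch is that only the launcher changes between single-host and distributed execution; the training function itself stays the same.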