The Product Beyond the Model - An Empirical Study of Repositories of Open-Source ML Products

Bibliographic details
Published in: Proceedings / International Conference on Software Engineering, pp. 1540-1552
Main authors: Nahar, Nadia; Zhang, Haoran; Lewis, Grace; Zhou, Shurui; Kastner, Christian
Format: Conference paper
Language: English
Publication details: IEEE, 26 April 2025
ISSN: 1558-1225
Abstract Machine learning (ML) components are increasingly incorporated into software products for end-users, but developers face challenges in transitioning from ML prototypes to products. Academics have limited access to the source of commercial ML products, hindering research progress to address these challenges. In this study, first and foremost, we contribute a dataset of 262 open-source ML products for end users (not just models), identified among more than half a million ML-related projects on GitHub. Then, we qualitatively and quantitatively analyze 30 open-source ML products to answer six broad research questions about development practices and system architecture. We find that the majority of the ML products in our sample represent more startup-style development than reported in past interview studies. We report 21 findings, including limited involvement of data scientists in many open-source ML products, unusually low modularity between ML and non-ML code, diverse architectural choices on incorporating models into products, and limited prevalence of industry best practices such as model testing, pipeline automation, and monitoring. Additionally, we discuss seven implications of this study on research, development, and education, including the need for tools to assist teams without data scientists, education opportunities, and open-source-specific research for privacy-preserving telemetry.
Author Lewis, Grace
Zhang, Haoran
Kastner, Christian
Nahar, Nadia
Zhou, Shurui
Author_xml – sequence: 1
  givenname: Nadia
  surname: Nahar
  fullname: Nahar, Nadia
  email: nadian@andrew.cmu.edu
  organization: Carnegie Mellon University
– sequence: 2
  givenname: Haoran
  surname: Zhang
  fullname: Zhang, Haoran
  organization: Carnegie Mellon University
– sequence: 3
  givenname: Grace
  surname: Lewis
  fullname: Lewis, Grace
  organization: Carnegie Mellon Software Engineering Institute
– sequence: 4
  givenname: Shurui
  surname: Zhou
  fullname: Zhou, Shurui
  organization: University of Toronto
– sequence: 5
  givenname: Christian
  surname: Kastner
  fullname: Kastner, Christian
  organization: Carnegie Mellon University
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/ICSE55347.2025.00006
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Xplore
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Education
Computer Science
EISBN 9798331505691
EISSN 1558-1225
EndPage 1552
ExternalDocumentID 11029840
Genre orig-research
GroupedDBID -~X
.4S
.DC
29O
5VS
6IE
6IF
6IH
6IK
6IL
6IM
6IN
8US
AAJGR
AAWTH
ABLEC
ADZIZ
ALMA_UNASSIGNED_HOLDINGS
ARCSS
AVWKF
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CBEJK
CHZPO
EDO
FEDTE
I-F
IEGSK
IJVOP
IPLJI
M43
OCL
RIE
RIL
RIO
ID FETCH-LOGICAL-a260t-72df16dfcbcd29d4db93b045cfa66f0a02dfdcaa14ef38eff290d65fd459e5e53
IEDL.DBID RIE
ISICitedReferencesCount 0
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001538318100120&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
IngestDate Wed Aug 27 01:40:27 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-a260t-72df16dfcbcd29d4db93b045cfa66f0a02dfdcaa14ef38eff290d65fd459e5e53
PageCount 13
ParticipantIDs ieee_primary_11029840
PublicationCentury 2000
PublicationDate 2025-April-26
PublicationDateYYYYMMDD 2025-04-26
PublicationDate_xml – month: 04
  year: 2025
  text: 2025-April-26
  day: 26
PublicationDecade 2020
PublicationTitle Proceedings / International Conference on Software Engineering
PublicationTitleAbbrev ICSE
PublicationYear 2025
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0006499
Score 2.2965243
Snippet Machine learning (ML) components are increasingly incorporated into software products for end-users, but developers face challenges in transitioning from ML...
SourceID ieee
SourceType Publisher
StartPage 1540
SubjectTerms Data models
Education
Machine learning
machine learning products
mining software repositories
Open source dataset
Prototypes
SE4ML
Software
Software development management
Software engineering
software engineering for machine learning
Systems architecture
Telemetry
Testing
Title The Product Beyond the Model - An Empirical Study of Repositories of Open-Source ML Products
URI https://ieeexplore.ieee.org/document/11029840
WOSCitedRecordID wos001538318100120&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
linkProvider IEEE
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=Proceedings+%2F+International+Conference+on+Software+Engineering&rft.atitle=The+Product+Beyond+the+Model+-+An+Empirical+Study+of+Repositories+of+Open-Source+ML+Products&rft.au=Nahar%2C+Nadia&rft.au=Zhang%2C+Haoran&rft.au=Lewis%2C+Grace&rft.au=Zhou%2C+Shurui&rft.date=2025-04-26&rft.pub=IEEE&rft.eissn=1558-1225&rft.spage=1540&rft.epage=1552&rft_id=info:doi/10.1109%2FICSE55347.2025.00006&rft.externalDocID=11029840