Automated Insights into GitHub Collaboration Dynamics

Bibliographic Details
Published in:	IEEE Access, Vol. 13; p. 1
Main Authors:	Bian, Jie; Arefev, Nikolay; Muhlhauser, Max; Welzl, Michael
Format:	Journal Article
Language:	English
Published:	Piscataway: IEEE, 01.01.2025; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Ablation; Adaptation models; Benchmark testing; Benchmarks; Codes; Collaboration; Computer bugs; Datasets; Documents; GitHub; Informatics; Information Retrieval; Issue; Knowledge management; LMs; Open source software; Performance evaluation; Programming languages; Pull Request; Repositories; Search engines; Semantics; Software development; Software development management; Source code; Stakeholders; Technical information
ISSN:	2169-3536

Description
Summary:	Today, GitHub™ is the most widely used platform for open-source software development. Large projects may comprise hundreds of distributed collaborators and thousands of GitHub "issues" (structured discussions). GitHub includes basic support for dealing with issues via "Pull Requests" (PRs): document changes that can manually be defined to "close" issues (i.e., they address and thereby conclude the issue discussion). Unresolved issues can pile up. For example, at the time of writing, the Kubernetes repository has almost 2000 open issues; finding which ones a PR might close is a hard task by itself. We address this by automatically identifying issue-PR relationships using language models (LMs) and leveraging Information Retrieval (IR) techniques. To foster further research, we contribute a carefully curated novel dataset called CodeConvo, reflecting the most influential open-source repositories for code development as well as some technical document repositories. We use this dataset to benchmark several state-of-the-art (SOTA) non-proprietary models that show exceptional performance on the MTEB benchmark, as well as to train and evaluate the performance of Smart Insights into GitHub Issue-PR Relations (SIGIR), our tailored model. The best SIGIR model/data combination yields an average Mean Reciprocal Rank (MRR) above 0.7, around 20% higher than the best baseline performance. Notably, ablation studies revealed that knowledge transfer occurs not only between different programming languages but also between code and technical documents, albeit to a lesser extent. We believe that these results are encouraging and that they can stimulate the practical application of LMs for taming the complexity of very large projects, in GitHub and beyond. The models and dataset of this study can be found at Huggingface.
DOI:	10.1109/ACCESS.2025.3566309
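
Note on the reported metric: the summary states results as Mean Reciprocal Rank (MRR). The sketch below shows how MRR is commonly computed for a retrieval task of this kind. It is an illustration only, not the authors' evaluation code: the function name, the issue IDs, and the setup (each PR acts as a query with exactly one relevant issue) are assumptions, since this record does not describe the evaluation protocol.

```python
from typing import Hashable, Sequence


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    relevant: Sequence[Hashable],
) -> float:
    """Average of 1/rank of the relevant item per query (0 if it is missing)."""
    if len(rankings) != len(relevant):
        raise ValueError("need exactly one relevant item per ranking")
    if not rankings:
        return 0.0
    total = 0.0
    for ranked_ids, target in zip(rankings, relevant):
        # Ranks are 1-based: the top candidate contributes 1/1.
        for position, candidate in enumerate(ranked_ids, start=1):
            if candidate == target:
                total += 1.0 / position
                break
    return total / len(rankings)


# Hypothetical data (assumption): each PR is a query, each list ranks issue IDs.
example_rankings = [
    ["issue-7", "issue-2", "issue-9"],  # relevant issue at rank 1
    ["issue-4", "issue-1", "issue-3"],  # relevant issue at rank 2
    ["issue-5", "issue-8", "issue-6"],  # relevant issue not retrieved
]
example_relevant = ["issue-7", "issue-1", "issue-0"]

example_mrr = mean_reciprocal_rank(example_rankings, example_relevant)
print(round(example_mrr, 3))  # (1 + 1/2 + 0) / 3 = 0.5
```

Under this definition, an average MRR above 0.7 would mean that the correct issue typically appears at or near the top of the ranked list.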