Automated Insights into GitHub Collaboration Dynamics
| Published in: | IEEE Access, vol. 13, p. 1 |
|---|---|
| Main authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Piscataway: IEEE, 01.01.2025; The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| Keywords: | |
| ISSN: | 2169-3536 |
| Online access: | Full text |
| Abstract: | Today, GitHub™ is the most widely used platform for open-source software development. Large projects may comprise hundreds of distributed collaborators and thousands of GitHub "issues" (structured discussions). GitHub includes basic support for dealing with issues via "Pull Requests" (PRs): document changes that can be manually defined to "close" issues (i.e., they address and thereby conclude the issue discussion). Unresolved issues can pile up. For example, at the time of writing, the Kubernetes repository has almost 2000 open issues; finding which ones a PR might close is a hard task by itself. We address this by automatically identifying issue-PR relationships using language models (LMs) and leveraging Information Retrieval (IR) techniques. To foster further research, we contribute a carefully curated novel dataset called CodeConvo, reflecting the most influential open-source repositories for code development as well as some technical document repositories. We use this dataset to benchmark several state-of-the-art (SOTA) non-proprietary models that show exceptional performance on the MTEB benchmark, as well as to train and evaluate the performance of Smart Insights into GitHub Issue-PR Relations (SIGIR), our tailored model. The best SIGIR model/data combination yields an average Mean Reciprocal Rank (MRR) above 0.7, around 20% higher than the best baseline performance. Notably, ablation studies revealed that knowledge transfer occurs not only between different programming languages but also between code and technical documents, albeit to a lesser extent. We believe that these results are encouraging and that they can stimulate the practical application of LMs for taming the complexity of very large projects, in GitHub and beyond. The models and dataset of this study can be found at Hugging Face. |
|---|---|
| DOI: | 10.1109/ACCESS.2025.3566309 |
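
The abstract reports results as Mean Reciprocal Rank (MRR). As a reading aid, here is a minimal sketch of how MRR is commonly computed for a retrieval task of this kind. It assumes that each PR acts as a query and the candidate issues are returned as a ranked list. The record does not describe the paper's actual evaluation setup, so the function name `mean_reciprocal_rank` and the issue identifiers below are hypothetical.

```python
from typing import Hashable, Sequence, Set


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    relevant: Sequence[Set[Hashable]],
) -> float:
    """Average of 1/rank of the first relevant candidate, over all queries.

    rankings[i] is the ranked candidate list for query i (best first);
    relevant[i] is the set of correct candidates for query i.
    A query with no relevant candidate in its ranking contributes 0.
    """
    if len(rankings) != len(relevant):
        raise ValueError("rankings and relevant must have the same length")
    if not rankings:
        return 0.0

    total = 0.0
    for ranked, gold in zip(rankings, relevant):
        for position, candidate in enumerate(ranked, start=1):
            if candidate in gold:
                total += 1.0 / position
                break  # only the first relevant hit counts
    return total / len(rankings)


# Hypothetical example: three PRs, each with a ranked list of issue ids.
rankings = [
    ["issue-12", "issue-7", "issue-3"],   # correct issue at rank 1
    ["issue-5", "issue-9", "issue-1"],    # correct issue at rank 2
    ["issue-4", "issue-8", "issue-2"],    # correct issue not retrieved
]
relevant = [{"issue-12"}, {"issue-9"}, {"issue-99"}]

mrr = mean_reciprocal_rank(rankings, relevant)
print(f"MRR = {mrr:.3f}")
```

Under this definition, an MRR above 0.7 means the first correct candidate typically appears at or very near the top of the ranked list.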