A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments

Bibliographic details
Published in:	The Journal of systems and software, Volume 175; p. 110911
Main authors:	Golzadeh, Mehdi; Decan, Alexandre; Legay, Damien; Mens, Tom
Format:	Journal Article
Language:	English
Publisher:	Elsevier Inc, 01.05.2021
Subjects:	Bot identification; Classification model; Distributed software development; GitHub repositories; Text similarity
ISSN:	0164-1212, 1873-1228

Description
Summary:	Bots are frequently used in GitHub repositories to automate repetitive activities that are part of the distributed software development process. They communicate with human actors through comments. While detecting their presence is important for many reasons, no large and representative ground-truth dataset is available, nor are classification models to detect and validate bots on the basis of such a dataset. This paper proposes a ground-truth dataset, based on a manual analysis with high interrater agreement, of pull request and issue comments in 5,000 distinct GitHub accounts, of which 527 have been identified as bots. Using this dataset we propose an automated classification model to detect bots, taking as main features the number of empty and non-empty comments of each account, the number of comment patterns, and the inequality between comments within comment patterns. We obtained a very high weighted average precision, recall and F1-score of 0.98 on a test set containing 40% of the data. We integrated the classification model into an open-source command-line tool to allow practitioners to detect which accounts in a given GitHub repository actually correspond to bots.
• Bots can be detected based on pull request and issue comments in GitHub repositories.
• We proposed a ground-truth dataset of 5,000 GitHub accounts including 527 bots.
• We developed a classification model that detects bots based on their comments.
• We implemented an open source tool to detect bots in GitHub repositories.
DOI:	10.1016/j.jss.2021.110911
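
The summary names four per-account features: the number of empty comments, the number of non-empty comments, the number of comment patterns, and the inequality between comments within patterns. The Python sketch below illustrates how such features might be computed. It is not the authors' implementation. The record does not say how patterns are formed or how inequality is measured, so the sketch assumes a greedy grouping based on `difflib.SequenceMatcher` similarity with a hypothetical threshold of 0.8, and uses the Gini coefficient of pattern sizes as the inequality measure. The function names `group_into_patterns`, `gini` and `account_features` are illustrative only.

```python
from difflib import SequenceMatcher


def group_into_patterns(comments, threshold=0.8):
    """Greedily group comments by similarity to the first member of each group.

    Assumption: the similarity measure and the threshold are illustrative;
    the catalog record does not specify them.
    """
    patterns = []  # each pattern is a list of similar comments
    for text in comments:
        for pattern in patterns:
            if SequenceMatcher(None, pattern[0], text).ratio() >= threshold:
                pattern.append(text)
                break
        else:
            # No existing pattern was similar enough: start a new one.
            patterns.append([text])
    return patterns


def gini(sizes):
    """Gini coefficient of non-negative counts (0.0 means perfectly equal)."""
    n = len(sizes)
    total = sum(sizes)
    if n == 0 or total == 0:
        return 0.0
    ordered = sorted(sizes)
    weighted = sum((i + 1) * x for i, x in enumerate(ordered))
    return (2 * weighted) / (n * total) - (n + 1) / n


def account_features(comments, threshold=0.8):
    """Compute the four features named in the summary for one account."""
    non_empty = [c.strip() for c in comments if c and c.strip()]
    patterns = group_into_patterns(non_empty, threshold)
    return {
        "empty_comments": len(comments) - len(non_empty),
        "non_empty_comments": len(non_empty),
        "comment_patterns": len(patterns),
        # Assumption: inequality is measured over the sizes of the patterns.
        "pattern_inequality": gini([len(p) for p in patterns]),
    }


if __name__ == "__main__":
    sample = [
        "Thanks for the PR! CI run #1 passed.",
        "Thanks for the PR! CI run #2 passed.",
        "",
        "Could you add a unit test for the edge case?",
    ]
    print(account_features(sample))
```

Intuitively, an account that posts many comments falling into a few patterns looks more automated than one whose comments are all different. A feature dictionary like this could then be passed to any standard classifier; which classifier the paper uses is not stated in this record.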