Understanding Flaky Tests Through Linguistic Diversity: A Cross-Language and Comparative Machine Learning Study
| Published in: | IEEE Access, Vol. 13, pp. 54561-54584 |
|---|---|
| Main authors: | |
| Format: | Journal Article |
| Language: | English |
| Published: | Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2025 |
| Subjects: | |
| ISSN: | 2169-3536 |
| Online access: | Get full text |
| Summary: | Software development is significantly impeded by flaky tests, which intermittently pass or fail without any code modifications, eroding confidence in automated testing frameworks. Code smells, in either test or production code, are the primary cause of test flakiness. To ascertain the prevalence of test smells, researchers and practitioners have examined numerous programming languages; however, these experiments were conducted in isolation, each focusing on a single programming language. This study examines the predictive accuracy of a variety of machine learning classifiers in identifying flaky tests across several programming languages, including Java, Python, C++, Go, and JavaScript. We compare the performance of classifiers such as Random Forest, Decision Tree, Naive Bayes, Support Vector Machine, and Logistic Regression in both single-language and cross-language settings. To ascertain the impact of linguistic diversity on test-case flakiness, models were trained on a single language and subsequently tested on a variety of languages. Key findings indicate that Random Forest and Logistic Regression consistently outperform the other classifiers in terms of accuracy, adaptability, and generalizability, particularly in cross-language environments. Additionally, the investigation contrasts our findings with those of previous research, exhibiting improved precision and accuracy in the identification of flaky tests as a result of careful classifier selection. We conducted a thorough statistical analysis, including t-tests, to assess the significance of classifier performance differences in terms of accuracy and F1-score across the programming languages; this analysis highlights the substantial discrepancies between classifiers and their effectiveness in detecting flaky tests. To facilitate reproducibility, the datasets and experiment code used in this study are accessible through an open-source GitHub repository at: https://github.com/PELAB-LiU/FlakyCrossLanguage . Our results emphasize the effectiveness of probabilistic and ensemble classifiers in improving the reliability of automated testing, despite certain constraints, including potential biases introduced by language-specific structures and dataset variability. This research provides developers and researchers with practical insights for mitigating flaky tests in a variety of software environments. (A minimal illustrative sketch of the cross-language evaluation protocol follows the record below.) |
|---|---|
| ISSN: | 2169-3536 |
| DOI: | 10.1109/ACCESS.2025.3553626 |
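
The summary above describes training classifiers on flaky-test features from one programming language and evaluating them on others, then comparing classifiers with t-tests on accuracy and F1-score. Below is a minimal sketch of that evaluation protocol using scikit-learn and SciPy. The `synthetic_language_dataset()` helper, the feature vectors, and all hyperparameters are hypothetical placeholders, not the paper's actual setup; the real datasets and experiment code are in the linked GitHub repository.

```python
# Sketch of the cross-language flaky-test evaluation protocol (assumptions only):
# 1) train a classifier on one language's feature vectors, score it on another's;
# 2) compare two classifiers with a paired t-test over per-fold F1 scores.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

def synthetic_language_dataset(n=500, n_features=20):
    """Hypothetical stand-in for per-language test-smell feature vectors."""
    X = rng.normal(size=(n, n_features))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

# One dataset per "training language" and one per "target language".
X_java, y_java = synthetic_language_dataset()
X_python, y_python = synthetic_language_dataset()

classifiers = {
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Cross-language setting: fit on one language's features, score on another's.
for name, clf in classifiers.items():
    clf.fit(X_java, y_java)
    pred = clf.predict(X_python)
    print(f"{name}: acc={accuracy_score(y_python, pred):.3f}, "
          f"f1={f1_score(y_python, pred):.3f}")

def per_fold_f1(clf, X, y, n_splits=10):
    """F1 score per cross-validation fold (single-language setting)."""
    splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in splitter.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))
    return np.array(scores)

# Paired t-test: are the per-fold F1 differences between two classifiers significant?
f1_rf = per_fold_f1(RandomForestClassifier(n_estimators=200, random_state=0), X_java, y_java)
f1_lr = per_fold_f1(LogisticRegression(max_iter=1000), X_java, y_java)
t_stat, p_value = ttest_rel(f1_rf, f1_lr)
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.4f}")
```

Pairing the t-test on per-fold scores mirrors the abstract's significance testing of accuracy and F1 differences between classifiers; on the paper's real data the folds would come from the per-language datasets in the repository rather than synthetic features.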