Sosed: a tool for finding similar software projects

Bibliographic details
Published in: 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1316-1320
Main authors: Bogomolov, Egor; Golubev, Yaroslav; Lobanov, Artyom; Kovalenko, Vladimir; Bryksin, Timofey
Format: Conference paper
Language: English
Published: ACM, 1 September 2020
ISSN: 2643-1572
Description
Summary: In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute the embeddings of subtokens into a dense space for 120,000 GitHub projects in 200 languages. Then, we cluster the embeddings to identify groups of semantically similar subtokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their subtokens. The tool receives an arbitrary project as input, extracts subtokens in the 16 most popular programming languages, computes its cluster distribution, and finds the projects with the closest distribution in the search base. We labeled the subtoken clusters with short descriptions to enable Sosed to produce interpretable output. Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of subtokens is available separately at https://github.com/JetBrains-Research/buckwheat/.
DOI: 10.1145/3324884.3415291
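
The pipeline described in the summary (split identifiers into subtokens, map subtokens to clusters, compare cluster distributions) can be illustrated with a minimal Python sketch. It is not Sosed's implementation: the function names, the toy subtoken-to-cluster mapping, the two-project search base, and the use of cosine similarity as the closeness measure are all assumptions made for illustration. The summary does not state which measure the tool uses, and in Sosed the clusters come from fastText embeddings rather than a hand-written table.

```python
import math
import re
from collections import Counter


def split_subtokens(identifier):
    """Split an identifier on underscores and camelCase boundaries (illustrative rule)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]


def cluster_distribution(subtokens, cluster_of, n_clusters):
    """Return the share of known subtokens that falls into each cluster."""
    counts = Counter(cluster_of[t] for t in subtokens if t in cluster_of)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_clusters
    return [counts[c] / total for c in range(n_clusters)]


def cosine(p, q):
    """Cosine similarity between two vectors (assumed closeness measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def most_similar(query, search_base, k=1):
    """Rank projects in the search base by similarity to the query distribution."""
    ranked = sorted(search_base.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical toy data: cluster 0 is "networking", cluster 1 is "linear algebra".
CLUSTER_OF = {"http": 0, "request": 0, "response": 0,
              "matrix": 1, "vector": 1, "dot": 1}
SEARCH_BASE = {"web-server": [0.9, 0.1], "linear-algebra": [0.1, 0.9]}

identifiers = ["parseHttpResponse", "send_request", "dotProduct"]
subtokens = [t for name in identifiers for t in split_subtokens(name)]
query = cluster_distribution(subtokens, CLUSTER_OF, n_clusters=2)
print(query, most_similar(query, SEARCH_BASE))
```

Because each cluster carries a human-readable label in this toy setup, the resulting distribution doubles as an explanation of why two projects were matched, which mirrors the interpretable output mentioned in the summary.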