Active Learning Approaches for Labeling Text: Review and Assessment of the Performance of Active Learning Approaches

Bibliographic Details
Published in: Political Analysis, Volume 28, Issue 4, pp. 532-551
Main Authors: Miller, Blake; Linder, Fridolin; Mebane, Walter R.
Format: Journal Article
Language: English
Published: New York, USA: Cambridge University Press, 1 October 2020
ISSN: 1047-1987, 1476-4989
Description
Summary: Supervised machine learning methods are increasingly employed in political science. Such models require costly manual labeling of documents. In this paper, we introduce active learning, a framework in which data to be labeled by human coders are not chosen at random but rather targeted in such a way that the required amount of data to train a machine learning model can be minimized. We study the benefits of active learning using text data examples. We perform simulation studies that illustrate conditions where active learning can reduce the cost of labeling text data. We perform these simulations on three corpora that vary in size, document length, and domain. We find that in cases where the document class of interest is not balanced, researchers can label a fraction of the documents one would need using random sampling (or “passive” learning) to achieve equally performing classifiers. We further investigate how varying levels of intercoder reliability affect the active learning procedures and find that even with low reliability, active learning performs more efficiently than does random sampling.
DOI: 10.1017/pan.2020.4
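
The contrast the summary draws between targeted ("active") and random ("passive") selection of documents for labeling can be illustrated with a minimal sketch. It assumes uncertainty sampling as the targeting rule, which is one common active learning strategy; the summary does not say which strategies the article evaluates. The function names and the probability values below are hypothetical and serve only as illustration.

```python
import random


def uncertainty_sample(probs, k):
    """Active selection (assumed strategy: uncertainty sampling).

    probs: predicted positive-class probabilities for the unlabeled pool,
           as produced by some binary classifier (hypothetical input).
    Returns the indices of the k documents whose probability is closest
    to 0.5, i.e. the ones the classifier is least sure about.
    """
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


def random_sample(n_pool, k, seed=0):
    """Passive selection: draw k distinct pool indices uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(n_pool), k)


# Made-up classifier scores for six unlabeled documents.
pool_probs = [0.02, 0.48, 0.91, 0.55, 0.10, 0.70]

active_batch = uncertainty_sample(pool_probs, 2)       # documents sent to human coders
passive_batch = random_sample(len(pool_probs), 2)      # baseline: random draw
```

In a full loop, the selected documents would be labeled by human coders, added to the training set, and the classifier refit before the next batch is chosen; the passive baseline repeats the same steps with randomly drawn batches.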