Unsupervised Discovery of Biographical Structure from Text

Bibliographic details
Published in: Transactions of the Association for Computational Linguistics, Vol. 2, pp. 363-376
Main authors: Bamman, David; Smith, Noah A.
Format: Journal Article
Language: English
Published: One Rogers Street, Cambridge, MA 02142-1209, USA: MIT Press, 1 December 2014
ISSN: 2307-387X
Online access: Full text
Description
Summary: We present a method for discovering abstract event classes in biographies, based on a probabilistic latent-variable model. Taking as input timestamped text, we exploit latent correlations among events to learn a set of event classes (such as Born, Graduates High School, and Becomes Citizen), along with the typical times in a person's life when those events occur. In a quantitative evaluation at the task of predicting a person's age for a given event, we find that our generative model outperforms a strong linear regression baseline, along with simpler variants of the model that ablate some features. The abstract event classes that we learn allow us to perform a large-scale analysis of 242,970 Wikipedia biographies. Though it is known that women are greatly underrepresented on Wikipedia, not only as editors (Wikipedia, 2011) but also as subjects of articles (Reagle and Rhue, 2011), we find that there is a bias in their characterization as well, with biographies of women containing significantly more emphasis on events of marriage and divorce than biographies of men.
DOI: 10.1162/tacl_a_00189