Semi-supervised topic classification for low resource languages

Bibliographic Details
Published in:	2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5093-5096
Main Authors:	Daben Liu, McVeety, S., Prasad, R., Natarajan, P.
Format:	Conference Paper
Language:	English
Published:	IEEE, 1 March 2008
Subjects:	Broadcasting; Hidden Markov Model; Hidden Markov models; Humans; Internet; Malay; Natural languages; off-topic rejection; Runtime; Search engines; Testing; topic clustering; Topology; unsupervised topic discovery; Web sites
ISBN:	9781424414833, 1424414830
ISSN:	1520-6149

Description
Summary:	In this paper, we present a novel methodology for rapidly developing a topic-based document classification system for a language that has limited resources. Our hybrid approach combines supervised and unsupervised topic classification techniques. Because access to native speakers is fairly limited for low-resource languages, our approach requires annotating only a few broad "root" topics in the corpus. Next, an unsupervised topic discovery (UTD) technique is used to automatically determine finer topics within the root topics. Lastly, we use a recently developed unsupervised topic clustering technique to organize the corpus into a hierarchical structure that enables browsing documents at multiple levels of granularity. To reduce false alarms at runtime, we also describe rejection techniques for discarding off-topic documents.
DOI:	10.1109/ICASSP.2008.4518804