Semi-supervised topic classification for low resource languages

Bibliographic details
Published in: 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5093-5096
Main authors: Daben Liu, McVeety, S., Prasad, R., Natarajan, P.
Format: Conference paper
Language: English
Published: IEEE, 1 March 2008
ISBN: 9781424414833, 1424414830
ISSN: 1520-6149
Abstract
In this paper, we present a novel methodology for rapidly developing a topic-based document classification system for a language that has limited resources. Our hybrid approach combines supervised and unsupervised topic classification techniques. Because access to native speakers is fairly limited for low-resource languages, our approach requires annotating only a few broad "root" topics in the corpus. Next, an unsupervised topic discovery (UTD) technique is used to automatically determine finer topics within the root topics. Lastly, we use a recently developed unsupervised topic clustering technique to organize the corpus into a hierarchical structure that enables browsing documents at multiple levels of granularity. Recognizing the need to reduce false alarms at runtime, we describe rejection techniques for discarding off-topic documents.
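The abstract does not specify the classifier or the rejection rule, so the following is only a rough illustration of the general idea of classifying a document into a small set of annotated root topics and rejecting off-topic input. It is a sketch under stated assumptions, not the authors' method: it assumes a toy naive Bayes model with add-one smoothing, uniform topic priors, and whitespace tokenization. The names `train`, `classify`, and `min_margin` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per root topic from (topic, text) pairs (toy model, not the paper's)."""
    counts = defaultdict(Counter)
    for topic, text in labeled_docs:
        counts[topic].update(text.lower().split())
    vocab = {word for counter in counts.values() for word in counter}
    return counts, vocab


def log_likelihood(words, counter, vocab_size):
    """Log-likelihood of the words under one topic, with add-one smoothing."""
    total = sum(counter.values())
    return sum(math.log((counter[w] + 1) / (total + vocab_size)) for w in words)


def classify(text, counts, vocab, min_margin=1.0):
    """Return the best root topic, or None to reject the document as off-topic.

    Assumed rejection rule: reject when no known word occurs, or when the
    best topic beats the runner-up by less than min_margin in log-likelihood.
    """
    words = [w for w in text.lower().split() if w in vocab]
    if not words:
        return None
    scores = {t: log_likelihood(words, c, len(vocab)) for t, c in counts.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = ranked[0]
    if len(ranked) > 1 and scores[best] - scores[ranked[1]] < min_margin:
        return None
    return best


# Usage with two hypothetical root topics and made-up training text.
counts, vocab = train([
    ("sports", "football match goal team score"),
    ("politics", "election vote parliament minister"),
])
print(classify("the team scored a goal in the match", counts, vocab))  # sports
print(classify("weather forecast rain", counts, vocab))                # None (rejected)
```

The margin threshold trades missed documents against false alarms: raising `min_margin` rejects more borderline documents. In practice it would be tuned on held-out data.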
Author affiliations:
Daben Liu: BBN Technol., Cambridge, MA
McVeety, S.: BBN Technol., Cambridge, MA
Prasad, R.: BBN Technol., Cambridge, MA
Natarajan, P.: BBN Technol., Cambridge, MA
DOI: 10.1109/ICASSP.2008.4518804
Discipline: Engineering
EISBN: 9781424414840, 1424414849
Subject terms:
Broadcasting
Hidden Markov models
Humans
Internet
Malay
Natural languages
Off-topic rejection
Runtime
Search engines
Testing
Topic clustering
Topology
Unsupervised topic discovery
Web sites
URI: https://ieeexplore.ieee.org/document/4518804