The structure of unseen trigrams and its application to language models: A first investigation

In a series of preparatory experiments in 4 languages on subsets of the Europarl corpus, we show that a large number of unseen trigrams can be reconstructed by proportional analogy with trigrams having the lowest frequencies. We derive a very simple smoothing scheme from this empirical result and sh...

Full description

Bibliographic details
Published in: 2010 4th International Universal Communication Symposium, pp. 273-280
Main authors: Lepage, Y; Gosme, J; Lardilleux, A
Format: Conference proceeding
Language: English, Japanese
Published: IEEE, 2010-10-01
ISBN: 9781424478217, 1424478219
Online access: Full text
Abstract: In a series of preparatory experiments in 4 languages on subsets of the Europarl corpus, we show that a large number of unseen trigrams can be reconstructed by proportional analogy with trigrams having the lowest frequencies. We derive a very simple smoothing scheme from this empirical result and show that it outperforms Good-Turing and Kneser-Ney smoothing schemes on trigram models in all 11 languages on the common multilingual part of the Europarl corpus, except Finnish.
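To make the idea of reconstructing an unseen trigram by proportional analogy concrete, here is a minimal Python sketch. It rests on a deliberately simplified assumption that is not taken from the paper: an analogy a : b :: c : x between trigrams holds when b differs from a by a single word substitution and x applies that same substitution to c. The function names (`trigram_counts`, `solve_analogy`, `reconstructible`) and the `max_count` threshold are hypothetical, chosen only for illustration; the authors' actual procedure and smoothing scheme are not described in this record.

```python
from collections import Counter


def trigram_counts(tokens):
    """Count word trigrams in a list of tokens."""
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


def solve_analogy(a, b, c):
    """Solve a : b :: c : x for trigrams under a simplified assumption:
    a and b differ at exactly one position, and c carries the same word
    as a at that position. Returns x, or None if no solution exists."""
    diff = [i for i in range(3) if a[i] != b[i]]
    if len(diff) != 1:
        return None
    i = diff[0]
    if c[i] != a[i]:
        return None
    x = list(c)
    x[i] = b[i]  # apply the same single-word substitution to c
    return tuple(x)


def reconstructible(unseen, counts, max_count=1):
    """Return True if `unseen` solves some analogy a : b :: c : x whose
    three known terms are all low-frequency trigrams (count <= max_count).
    Brute force (cubic in the number of rare trigrams); toy data only."""
    rare = [t for t, n in counts.items() if n <= max_count]
    for a in rare:
        for b in rare:
            for c in rare:
                if solve_analogy(a, b, c) == unseen:
                    return True
    return False
```

For instance, with the three singleton trigrams "the cat sat", "the dog sat" and "the cat ran", the unseen trigram "the dog ran" is the solution of "the cat sat" : "the dog sat" :: "the cat ran" : x under this simplified definition.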
Author Gosme, J
Lepage, Y
Lardilleux, A
Author_xml – sequence: 1
  givenname: Y
  surname: Lepage
  fullname: Lepage, Y
  email: yves.lepage@aoni.waseda.jp
  organization: IPS Grad. Sch., Waseda Univ., Kitakyushu, Japan
– sequence: 2
  givenname: J
  surname: Gosme
  fullname: Gosme, J
  email: julien.gosme@info.unicaen.fr
  organization: GREYC, Univ. de Caen Basse-Normandie, Caen, France
– sequence: 3
  givenname: A
  surname: Lardilleux
  fullname: Lardilleux, A
  email: adrien.lardilleux@info.unicaen.fr
  organization: GREYC, Univ. de Caen Basse-Normandie, Caen, France
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/IUCS.2010.5666011
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE/IET Electronic Library
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9781424478194
9781424478200
1424478200
1424478197
EndPage 280
ExternalDocumentID 5666011
Genre orig-research
IEDL.DBID RIE
ISBN 9781424478217
1424478219
IngestDate Wed Aug 27 02:52:51 EDT 2025
IsPeerReviewed false
IsScholarly false
Language English
Japanese
LinkModel DirectLink
PageCount 8
ParticipantIDs ieee_primary_5666011
PublicationCentury 2000
PublicationDate 2010-10-01
PublicationDateYYYYMMDD 2010-10-01
PublicationDate_xml – month: 10
  year: 2010
  text: 2010-10-01
  day: 01
PublicationDecade 2010
PublicationTitle 2010 4th International Universal Communication Symposium
PublicationTitleAbbrev IUCS
PublicationYear 2010
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 273
SubjectTerms Connectors
Electronic mail
Estimation
Europarl
Morphology
Smoothing methods
structure of unseen trigrams
Training
Training data
Trigram language models
Title The structure of unseen trigrams and its application to language models: A first investigation
URI https://ieeexplore.ieee.org/document/5666011
hasFullText 1
inHoldings 1
isFullTextHit
isPrint