The structure of unseen trigrams and its application to language models: A first investigation
| Published in: | 2010 4th International Universal Communication Symposium, pp. 273-280 |
|---|---|
| Main authors: | Lepage, Y; Gosme, J; Lardilleux, A |
| Format: | Conference proceeding |
| Language: | English, Japanese |
| Published: | IEEE, 2010-10-01 |
| ISBN: | 9781424478217, 1424478219 |
| Online access: | Full text |

| Abstract | In a series of preparatory experiments in 4 languages on subsets of the Europarl corpus, we show that a large number of unseen trigrams can be reconstructed by proportional analogy with trigrams having the lowest frequencies. We derive a very simple smoothing scheme from this empirical result and show that it outperforms the Good-Turing and Kneser-Ney smoothing schemes on trigram models in all 11 languages of the common multilingual part of the Europarl corpus, except Finnish. |
|---|---|
| Author | Lepage, Y; Gosme, J; Lardilleux, A |
| Author_xml | 1. Lepage, Y (yves.lepage@aoni.waseda.jp), IPS Grad. Sch., Waseda Univ., Kitakyushu, Japan; 2. Gosme, J (julien.gosme@info.unicaen.fr), GREYC, Univ. de Caen Basse-Normandie, Caen, France; 3. Lardilleux, A (adrien.lardilleux@info.unicaen.fr), GREYC, Univ. de Caen Basse-Normandie, Caen, France |
| ContentType | Conference Proceeding |
| DOI | 10.1109/IUCS.2010.5666011 |
| EISBN | 9781424478194, 9781424478200, 1424478200, 1424478197 |
| EndPage | 280 |
| ExternalDocumentID | 5666011 |
| Genre | orig-research |
| ISBN | 9781424478217, 1424478219 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English, Japanese |
| PageCount | 8 |
| PublicationDate | 2010-10-01 |
| PublicationTitle | 2010 4th International Universal Communication Symposium |
| PublicationTitleAbbrev | IUCS |
| PublicationYear | 2010 |
| Publisher | IEEE |
| StartPage | 273 |
| SubjectTerms | Connectors; Electronic mail; Estimation; Europarl; Morphology; Smoothing methods; structure of unseen trigrams; Training; Training data; Trigram language models |
| Title | The structure of unseen trigrams and its application to language models: A first investigation |
| URI | https://ieeexplore.ieee.org/document/5666011 |

