The structure of unseen trigrams and its application to language models: A first investigation
In a series of preparatory experiments in 4 languages on subsets of the Europarl corpus, we show that a large number of unseen trigrams can be reconstructed by proportional analogy with trigrams having the lowest frequencies. We derive a very simple smoothing scheme from this empirical result and sh...
| Published in: | 2010 4th International Universal Communication Symposium pp. 273 - 280 |
|---|---|
| Main Authors: | Lepage, Y; Gosme, J; Lardilleux, A |
| Format: | Conference Proceeding |
| Language: | English Japanese |
| Published: | IEEE, 01.10.2010 |
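| Subjects: | Connectors; Electronic mail; Estimation; Europarl; Morphology; Smoothing methods; structure of unseen trigrams; Training; Training data; Trigram language models |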
| ISBN: | 9781424478217, 1424478219 |
| Abstract | In a series of preparatory experiments in 4 languages on subsets of the Europarl corpus, we show that a large number of unseen trigrams can be reconstructed by proportional analogy with trigrams having the lowest frequencies. We derive a very simple smoothing scheme from this empirical result and show that it outperforms the Good-Turing and Kneser-Ney smoothing schemes on trigram models in all 11 languages of the common multilingual part of the Europarl corpus, except Finnish. |
|---|---|
| Author | Gosme, J Lepage, Y Lardilleux, A |
| Author_xml | 1. Lepage, Y (yves.lepage@aoni.waseda.jp), IPS Grad. Sch., Waseda Univ., Kitakyushu, Japan; 2. Gosme, J (julien.gosme@info.unicaen.fr), GREYC, Univ. de Caen Basse-Normandie, Caen, France; 3. Lardilleux, A (adrien.lardilleux@info.unicaen.fr), GREYC, Univ. de Caen Basse-Normandie, Caen, France |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IL CBEJK RIE RIL |
| DOI | 10.1109/IUCS.2010.5666011 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Xplore POP ALL IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| EISBN | 9781424478194 9781424478200 1424478200 1424478197 |
| EndPage | 280 |
| ExternalDocumentID | 5666011 |
| Genre | orig-research |
| GroupedDBID | 6IE 6IF 6IH 6IK 6IL 6IN AAJGR AAWTH ADFMO ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK IEGSK IERZE OCL RIE RIL |
| ID | FETCH-LOGICAL-h219t-2ef1e5f5dd72ce67337bbb54d0a81c257cc9a52e6be3f132ea3c0e35fd7ca1f83 |
| IEDL.DBID | RIE |
| ISBN | 9781424478217 1424478219 |
| IngestDate | Wed Aug 27 02:52:51 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English Japanese |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-h219t-2ef1e5f5dd72ce67337bbb54d0a81c257cc9a52e6be3f132ea3c0e35fd7ca1f83 |
| PageCount | 8 |
| ParticipantIDs | ieee_primary_5666011 |
| PublicationCentury | 2000 |
| PublicationDate | 2010-10-01 |
| PublicationDateYYYYMMDD | 2010-10-01 |
| PublicationDate_xml | – month: 10 year: 2010 text: 2010-10-01 day: 01 |
| PublicationDecade | 2010 |
| PublicationTitle | 2010 4th International Universal Communication Symposium |
| PublicationTitleAbbrev | IUCS |
| PublicationYear | 2010 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssj0000527414 |
| Score | 1.4508115 |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 273 |
| SubjectTerms | Connectors; Electronic mail; Estimation; Europarl; Morphology; Smoothing methods; structure of unseen trigrams; Training; Training data; Trigram language models |
| Title | The structure of unseen trigrams and its application to language models: A first investigation |
| URI | https://ieeexplore.ieee.org/document/5666011 |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |

