Extracting parallel phrases from comparable data for machine translation

Mining parallel data from comparable corpora is a promising approach for overcoming the data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs....

Celý popis

Uloženo v:
Podrobná bibliografie
Vydáno v:Natural language engineering Ročník 22; číslo 4; s. 549 - 573
Hlavní autoři: HEWAVITHARANA, SANJIKA, VOGEL, STEPHAN
Médium: Journal Article
Jazyk:angličtina
Vydáno: Cambridge, UK Cambridge University Press 01.07.2016
Témata:
ISSN:1351-3249, 1469-8110
On-line přístup:Získat plný text
Tagy: Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
Popis
Shrnutí:Mining parallel data from comparable corpora is a promising approach for overcoming the data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs. We present a novel phrase alignment approach that is designed to only align parallel sections bypassing non-parallel sections of the sentence. We compare the proposed approach with two other alignment methods: (1) the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment, (2) a binary classifier to detect parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the accuracy of these approaches using a manually aligned data set, and show that the proposed approach outperforms the other two approaches. Finally, we demonstrate the effectiveness of the extracted phrase pairs by using them in Arabic–English and Urdu–English translation systems, which resulted in improvements upto 1.2 Bleu over the baseline. The main contributions of this paper are two-fold: (1) novel phrase alignment algorithms to extract parallel phrase pairs from comparable sentences, (2) evaluating the utility of the extracted phrases by using them directly in the MT decoder.
Bibliografie:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ISSN:1351-3249
1469-8110
DOI:10.1017/S1351324916000139