Extracting parallel phrases from comparable data for machine translation

Mining parallel data from comparable corpora is a promising approach for overcoming the data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs....

Celý popis

Uložené v:
Podrobná bibliografia
Vydané v:Natural language engineering Ročník 22; číslo 4; s. 549 - 573
Hlavní autori: HEWAVITHARANA, SANJIKA, VOGEL, STEPHAN
Médium: Journal Article
Jazyk:English
Vydavateľské údaje: Cambridge, UK Cambridge University Press 01.07.2016
Predmet:
ISSN:1351-3249, 1469-8110
On-line prístup:Získať plný text
Tagy: Pridať tag
Žiadne tagy, Buďte prvý, kto otaguje tento záznam!
Popis
Shrnutí:Mining parallel data from comparable corpora is a promising approach for overcoming the data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs. We present a novel phrase alignment approach that is designed to only align parallel sections bypassing non-parallel sections of the sentence. We compare the proposed approach with two other alignment methods: (1) the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment, (2) a binary classifier to detect parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the accuracy of these approaches using a manually aligned data set, and show that the proposed approach outperforms the other two approaches. Finally, we demonstrate the effectiveness of the extracted phrase pairs by using them in Arabic–English and Urdu–English translation systems, which resulted in improvements upto 1.2 Bleu over the baseline. The main contributions of this paper are two-fold: (1) novel phrase alignment algorithms to extract parallel phrase pairs from comparable sentences, (2) evaluating the utility of the extracted phrases by using them directly in the MT decoder.
Bibliografia:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ISSN:1351-3249
1469-8110
DOI:10.1017/S1351324916000139