How Is Syntax Helpful for Statistical Machine Translation?



Huei-Chi Lin
Laboratoire d'Informatique de l'Université du Maine

Phrase-based translation systems comprise two probabilistic models: a translation model and a language model. The translation model is derived from the probabilities of aligned source-target phrase pairs (not necessarily linguistic phrases) extracted from parallel corpora. It generates translation alternatives for a given source text, while the language model takes care of the fluency and grammaticality of the translation output. The language model is approximated by n-gram probabilities trained on large target-language monolingual corpora.
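As a rough illustration, a maximum-likelihood n-gram language model can be sketched as follows. The toy corpus and function names here are invented for the example; real systems use higher-order n-grams with smoothing, trained on far larger corpora.

```python
from collections import Counter

# Hypothetical toy corpus; a real system trains on huge monolingual corpora.
corpus = [
    ["<s>", "the", "cat", "sat", "</s>"],
    ["<s>", "the", "dog", "sat", "</s>"],
]

# Count unigrams and bigrams over the corpus.
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)

def bigram_prob(prev, word):
    """Maximum-likelihood estimate P(word | prev) = c(prev, word) / c(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sent):
    """Score a sentence as the product of its bigram probabilities."""
    p = 1.0
    for i in range(len(sent) - 1):
        p *= bigram_prob(sent[i], sent[i + 1])
    return p
```

For example, `sentence_prob(["<s>", "the", "cat", "sat", "</s>"])` scores a fluent sequence higher than an ungrammatical permutation of the same words, which is exactly the role the language model plays during decoding.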

Word alignment is estimated for each source-target sentence pair in the parallel corpus and then used to extract the phrase pairs consistent with that alignment. The extracted phrase pairs are compiled into a phrase table holding a large number of phrase pairs and their probabilities, as well as a reordering table that models short, local reordering of words and phrases. Together, the two tables constitute the translation model (Koehn, 2010). Translation models perform well when the source and target languages are close, because the translation output then requires only short, local reordering. When the order of syntactic constituents differs between source and target language, however, the parallel data is less monotonically aligned: alignment estimation becomes difficult and the resulting translation models are of low quality. For instance, basic phrase-based systems have little capacity to learn the word-reordering orientation of English-to-German bitexts; it is not easy to align single English verbs to German base verbs and their separable prefixes, which usually occur at the end of the sentence.
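The consistency criterion behind phrase extraction can be sketched as follows. This is a simplified version of the standard algorithm (it omits the usual extension to unaligned boundary words), and the function name is only illustrative.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Extract phrase pairs consistent with a word alignment.

    A phrase pair (src[i..j], tgt[k..l]) is consistent when every alignment
    link touching either side of the pair falls entirely inside it.
    `alignment` is a set of (src_index, tgt_index) links (0-based).
    """
    pairs = set()
    for i in range(len(src)):
        for j in range(i, min(i + max_len, len(src))):
            # Target positions linked to the source span [i, j].
            tgt_pos = [t for (s, t) in alignment if i <= s <= j]
            if not tgt_pos:
                continue
            k, l = min(tgt_pos), max(tgt_pos)
            if l - k >= max_len:
                continue
            # Consistency check: no link from tgt[k..l] may leave src[i..j].
            if all(i <= s <= j for (s, t) in alignment if k <= t <= l):
                pairs.add((" ".join(src[i:j + 1]), " ".join(tgt[k:l + 1])))
    return pairs
```

On the toy pair "das Haus" / "the house" with links {(0, 0), (1, 1)}, this yields the three consistent pairs ("das", "the"), ("Haus", "house"), and ("das Haus", "the house"); their counts over a whole corpus would be normalized into the phrase-table probabilities.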

One solution to this limitation is to pre-reorder the source sentence so that it resembles the expected order of the target sentence; the translation system then needs to perform less word movement. This is known as "pre-reordering" (Xia and McCord, 2004; Wang et al., 2007; Goto et al., 2012). The preprocessing can be introduced into phrase-based systems through parsing, since syntactic trees can represent the recursive structures of a language. Reordering rules learned directly from parsed data can be applied to every applicable sentence on the source side. When this approach is performed efficiently on the source data, the order of the translated words respects the syntax of the target language.
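A pre-reordering rule of this kind might be sketched as follows. The tree encoding, the rule, and the function names are all hypothetical, chosen only to illustrate the idea; real systems learn such rules from parsed, aligned data. The illustrative rule rotates the verb to the end of its verb phrase, mimicking German-style verb-final order.

```python
# Trees are nested lists: [label, child, child, ...]; leaves are strings.

def reorder(tree):
    """Recursively apply an illustrative verb-final rule to every VP node."""
    if isinstance(tree, str):
        return tree
    label, children = tree[0], [reorder(c) for c in tree[1:]]
    if (label == "VP" and len(children) >= 2
            and isinstance(children[0], list) and children[0][0] == "V"):
        # Rotate the verb to the end of its VP: V NP ... -> NP ... V
        children = children[1:] + children[:1]
    return [label] + children

def leaves(tree):
    """Read the reordered word sequence off the tree fringe."""
    if isinstance(tree, str):
        return [tree]
    return [w for c in tree[1:] for w in leaves(c)]
```

Applied to a parse of "she reads the book", the rule produces the sequence "she the book reads", so the downstream phrase-based system only needs monotone, local alignments instead of a long-distance verb movement.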

The syntactic pre-reordering method aims to rearrange a given source sentence so that its word order is closer to that of the target sentence. This reordering, based on parsed trees, rewrites as many matching patterns as possible in the source data, so that long-distance word reordering is decoupled from the translation system. As a result, word alignment, the reordering model, and the translation model all improve in quality, and so do the translations computed by these statistical systems.


References
Isao Goto, Masao Utiyama, and Eiichiro Sumita. 2012. Post-ordering by parsing for Japanese-English statistical machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 311–316, Jeju Island, Korea, July. Association for Computational Linguistics.
Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press.
Chao Wang, Michael Collins, and Philipp Koehn. 2007. Chinese syntactic reordering for statistical machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 737–745.
Fei Xia and Michael McCord. 2004. Improving a statistical MT system with automatically learned rewrite patterns. In Proceedings of Coling 2004, pages 508–514, Geneva, Switzerland, Aug 23–Aug 27. COLING.