Government of Canada

PORTAGE: Machine Learning for Translation

Technical Overview

The task of a machine translation (MT) system is to translate a text S in a source language into a text T in a target language. As with many other natural language processing and pattern recognition tasks, there are two broad approaches one could take:

  • build an expert system containing rules written by human experts for performing the MT task; or
  • build a system capable of learning rules from examples, then present it with a bilingual S-T parallel corpus, from which it will automatically generate rules for mapping new S sentences onto new T sentences.

Research into MT goes back to the 1950s. However, almost all of the research carried out until fairly recently has been based on the first, “expert system” approach, as are all but one of today’s commercial MT systems. Indeed, it was not until a series of brilliant papers was published in the early 1990s by a group of researchers at IBM’s Watson Research Center that the second, “machine learning” approach received serious attention (for instance, see this example).

These competing approaches can be compared as follows:

  • Expert systems for MT incorporate deep linguistic knowledge; they currently yield superior performance for well-studied language pairs (S,T) in non-specialized semantic domains. However, they are expensive to maintain and expand. As the number of manually generated rules grows, they may yield unexpected interactions and side-effects, so that software engineering issues become increasingly important as the system grows in complexity; also, porting an expert system to a new semantic domain requires many new rules (sometimes dealt with by means of a “dictionary” functionality). As for porting such a system to a new language pair, it requires not only a whole set of new rules, but also a new set of human experts to define and program these new rules, with all the complexity that involves. Finally, an expert system typically generates only one translation T per source text S, and is incapable of assessing the quality of other possible translations.
  • Existing MT systems based on machine learning, by contrast, typically incorporate only shallow, surface knowledge of the two languages involved. Thus, they are capable of absurd, counterintuitive blunders (e.g., completely leaving out the subject or main verb of a source sentence from the translation T) such as are rarely perpetrated by well-designed MT expert systems. However, they rarely create serious software engineering problems. They tend to be highly domain dependent, working well for text that resembles the kind of text they are trained on (e.g., text dealing with the same topic), but not for other kinds of text. Provided that bilingual text for a new domain or topic is available, though, they are highly portable: one simply retrains the system on the appropriate text. Similarly, to enable such a system to handle a new language pair, one does not need to hire a bevy of experts in the two languages, but simply to find a large parallel corpus for the pair (though an expert or two may still come in useful to handle text processing questions). Systems of this type can typically be configured to generate multiple translations for each input sentence in the source language.

The diagram below illustrates some of these differences for a French-to-English MT system. When a French sentence is entered (for the sake of an example, “Mais où sont les neiges d’antan?”), the system manually coded by experts generates a single translation, while the system based on machine learning generates multiple translations, each with a probability score. (This example is invented; in practice, neither the expert system nor the machine learning system would be likely to produce translations of such high quality for this type of sentence.)

In the last few years, the machine learning approach has garnered increasing interest. Note that from the economic point of view, the main inputs for this approach are computing power, computer memory, and bilingual data. All of these become significantly cheaper and more widely available each day, while human linguistic expertise – the main input for the competing approach – has a cost that increases slightly over time. From the point of view of related disciplines, the machine learning approach has come to dominate similar areas of natural language processing and pattern recognition, such as automatic speech recognition or face recognition. Finally, from the point of view of objectively evaluated performance, systems based at least partly on machine learning have outperformed those based purely on expert systems in recent quantitative evaluations of research MT systems conducted by the US National Institute of Standards and Technology (NIST).

Comparison of Expert System and Machine Learning System Approaches to Machine Translation.



In the expert system approach, a team of human experts writes rules for translating from the source language to the target language. These rules are incorporated in a computer program. In the other approach, a learning algorithm reads in a bilingual parallel corpus, consisting of a large text in the source language and its translation into the target language. From the bilingual parallel corpus, the learning algorithm automatically generates a set of translation rules. Unlike the translation rules typically written by human experts, the rules generated by machine learning typically include probabilities. When a source sentence is given to the expert system for translation, typically only one translation is generated. By contrast, because their rules carry probabilities, translation systems based on machine learning can typically generate more than one translation hypothesis for each source sentence; each of these hypotheses is associated with a probability score.
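
To make the idea of probability-scored hypotheses concrete, here is a minimal Python sketch. It is a toy illustration only, not a description of how PORTAGE works: the rule table `RULES`, its probabilities, and the function `n_best` are all invented for this example, the source sentence is assumed to be already segmented into phrases, and there is no reordering and no language model.

```python
from itertools import product
from math import prod

# Hypothetical rule table (invented for illustration):
# source phrase -> list of (target phrase, probability).
RULES = {
    "où sont": [("where are", 0.9), ("where is", 0.1)],
    "les neiges": [("the snows", 0.7), ("the snow", 0.3)],
}

def n_best(source_phrases, rules, n=3):
    """Return the n highest-scoring translations of a pre-segmented source."""
    options = [rules[phrase] for phrase in source_phrases]
    hypotheses = []
    # Enumerate every combination of one rule per source phrase.
    for choice in product(*options):
        text = " ".join(target for target, _ in choice)
        # Score a hypothesis as the product of its rule probabilities.
        score = prod(p for _, p in choice)
        hypotheses.append((text, score))
    hypotheses.sort(key=lambda h: h[1], reverse=True)
    return hypotheses[:n]

hyps = n_best(["où sont", "les neiges"], RULES)
for text, score in hyps:
    print(f"{score:.2f}  {text}")
```

With this invented table, the hypothesis “where are the snows” ranks first. A real system would search a far larger space and combine many more knowledge sources when scoring.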


NIST Evaluation of MT Systems

Until recently, it was difficult to evaluate the claims of competing approaches to machine translation, since no quantitative benchmarks for MT systems were available. In 2001, this situation was remedied by the initiation of the NIST (U.S. National Institute of Standards and Technology) MT project (there had been earlier NIST initiatives in this area in 1993 and 1994). This NIST project is modeled on the evaluations that have been conducted on automatic speech recognition (ASR) systems since the late 1980s (in fact, it is managed by the unit of NIST responsible for the ASR evaluations). The objective of the project is to promote progress in technologies that convert free text in a variety of languages into English. This is done by holding annual tests. On a fixed date, NIST sends a source-language text file to all participants. The participants must send their systems’ target-language (English) translations of the source back to NIST for scoring within a specified number of days. For these NIST evaluations, the main language directions being evaluated have been Chinese to English and Arabic to English.

The MT NIST project shares a key, highly positive feature with NIST’s ASR evaluations: sites from anywhere are free to participate, provided they agree that the other participating sites may be informed of their score, and promise to attend a workshop in which they will describe the details of their system. However, there are stricter restrictions on the public release of information for the NIST MT evaluations than for their ASR evaluations: each site may publicly disclose its own score, but may not publicly release the results for any other site without that site’s permission.

It is difficult to come up with a good quantitative metric for the performance of an MT system. The obvious approach is to hire human translators to score the performance of their machine counterparts, but it is very slow and painful – and thus expensive – for humans to read and analyze a large number of rather bad machine-generated translations. Instead, NIST has adopted a variant of the “BLEU” scoring methodology originally suggested by researchers at IBM, in which human translators are hired to produce a number of different translations of the source-language test file for the evaluation, and each machine-generated translation is scored for similarity to the human translations. Scores of this type have a strong correlation with human judgments. There is still a potential problem here – in theory, an automatic system could produce brilliant translations that happen to be rather different in style and word choice from those produced by the human translators; however, the performance of MT systems needs to improve a great deal before this problem is likely to arise.
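
The similarity scoring described above can be sketched in a few lines of Python. This is a simplified, sentence-level version of the BLEU idea (clipped n-gram precision against several references, combined with a brevity penalty). It is an illustration under our own assumptions, not the variant used by NIST: the function names `ngrams` and `bleu` are invented here, tokenization is plain whitespace splitting, and no smoothing is applied.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU of one candidate against several references."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(log(matched / total))
    # Brevity penalty against the reference length closest to the candidate's.
    ref_len = min((len(r) for r in refs), key=lambda m: (abs(m - len(cand)), m))
    bp = 1.0 if len(cand) >= ref_len else exp(1 - ref_len / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * exp(sum(log_precisions) / max_n)

references = ["the cat sat on the mat", "a cat was sitting on the mat"]
print(bleu("the cat sat on a mat", references))
```

A candidate identical to one of the references scores 1.0, and a candidate sharing no words with any reference scores 0.0. Official evaluations compute the statistics over a whole test set rather than sentence by sentence.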

Participants in the NIST MT evaluations are free to submit target sentences from any kind of MT system. In the most recent evaluations (2003-2009), systems based on machine learning outperformed those based on human expertise. One must take this with a grain of salt. The very best MT expert systems may not be participating in these competitions, since efforts for the development of such systems have been expended on language pairs other than the Chinese-English or Arabic-English pairs targeted by the NIST evaluation; the largest efforts spent on expert systems have been focused on pairs in which both languages are of European origin: English-French, English-German, etc. Intensively developed expert systems also tend to be implemented by companies rather than research institutions, and a company that sells MT software has little to gain and a lot to lose by submitting its system to a competitive evaluation by NIST. Nevertheless, the superior performance of the machine learning approach in these evaluations shows that, at the very least, this approach can yield reasonable MT performance for less-studied language pairs. In the most recent MT evaluations for European language pairs (the WSMT evaluations; see below), the same trend of machine learning systems catching up with and even surpassing expert systems has begun to appear.

In recent years, participants in the NIST MT evaluations have included research groups from all over the world – for instance:
USA: CMU, IBM, ISI/Language Weaver, Johns Hopkins, MIT, U Maryland, Systran;
Germany: RWTH Aachen University;
Great Britain: University of Edinburgh;
China: Hong Kong UST;
Japan: ATR, NTT;
Italy: IRST.

The PORTAGE system participated in these evaluations in 2005, 2006, 2008, and 2009 (there was no NIST MT evaluation in 2007), thus adding the National Research Council (NRC) of Canada to this list. In 2005, 2008, and 2009 PORTAGE participated in the Chinese to English evaluation; in 2006, PORTAGE participated in both the Chinese to English and Arabic to English evaluations. NRC obtained by far the best scores of any Canadian research group participating in the NIST evaluations during the period 2005-2009; also, NRC obtained the third highest score of any participating group in the 2009 Chinese-English evaluation.

Other Evaluations of MT Systems

Philipp Koehn of the University of Edinburgh (U.K.) has organized a series of evaluations of MT for certain European language pairs. Confusingly, the name of this evaluation has changed several times: the 2005 evaluation was called the “Workshop on Building and Using Parallel Texts” (WPT), the 2006 evaluation was called the “Workshop on Machine Translation” (WMT), and the 2007-2010 evaluations were called “The ACL Workshop on Statistical Machine Translation” (WSMT). The PORTAGE system participated in the 2005-2007 and 2010 evaluations.

In 2005, these language pairs were:

  • English to French and vice versa,
  • English to German and vice versa,
  • English to Spanish and vice versa, and
  • English to Finnish and vice versa.

The language pairs for 2006 were the same except for the omission of the English-Finnish and Finnish-English pairs. In both 2005 and 2006, PORTAGE tackled all language pairs in the evaluation.

In 2007 and 2010, the language pairs were:

  • English to French and vice versa,
  • English to German and vice versa,
  • English to Spanish and vice versa, and
  • English to Czech and vice versa.

PORTAGE tackled all of the above except English-Czech and Czech-English. In all evaluations, PORTAGE obtained very good scores. The 2007 evaluation was notable for our development of a hybrid “automatic post-editing” (or “statistical post-editing”) system which fused PORTAGE with a rule-based system from the company SYSTRAN (for English-French and French-English only). This hybrid system was rated particularly highly by human evaluators (see first document and second document).

An interesting feature of WPT/WMT was the use of human evaluations of the adequacy and fluency of MT outputs in addition to the BLEU metric. Results of the 2005 NAACL WPT evaluations can be found here; a full description of the 2006 NAACL WMT evaluations, including a link to the results, is also available; a full description of the 2007 ACL WSMT evaluations is available as well; a full description of the 2010 ACL WSMT is also available.

The PORTAGE system also participated in the 2006 Chinese to English evaluation for the TC-STAR Workshop (sponsored by the European Community).

The GALE Project

From October 2005 to June 2009, the National Research Council of Canada (NRC) received funding from the U.S. Defense Advanced Research Projects Agency (DARPA) as a participant in the multimillion-dollar GALE project. NRC was a member of the ‘Nightingale’ consortium, one of three consortia funded by GALE; the lead contractor in Nightingale was SRI International (California). The goal of GALE (Global Autonomous Language Exploitation) was to make foreign-language (Arabic and Chinese) speech and text accessible to English monolingual people, particularly in military settings. NRC's participation involved making the PORTAGE technology available to the GALE project, and taking part in the project’s internal evaluations. One of NRC’s most important contributions to the Nightingale consortium was providing, in collaboration with SYSTRAN, a hybrid system based on automatic post-editing; this hybrid system proved to be the most valuable single component of our consortium’s Chinese to English translation system. For further details about GALE, click here.

The History of PORTAGE

In September 2004, researchers in the Interactive Language Technologies Group of the NRC Institute for Information Technology began to build an advanced MT system. In line with the Group’s ambitious mandate, we set ourselves an ambitious goal: to build a world-class system capable of competing each year on equal terms with the other systems that have been participating in the NIST MT evaluations.

We decided that the system would be based on machine learning, because of the advantages of this approach, as outlined above. Within the machine learning camp, we are strong believers in models that can generate multiple hypotheses with quantitative scores, such as models based on probabilities. Like many of the world's other leading statistical MT systems, PORTAGE is "phrase-based", that is, it is based on multiword groups in the source language automatically found to correlate with multiword groups in the target language.

We named our project and the system itself PORTAGE: the English name is Probabilistically Optimized Rules for Translation Automatically Generated from Examples or Portable Omnilingual Robust Translation Agent (depending on whether one focuses on the research approach or the system itself), and the French name is Projet Objectif de Recherche en Traduction Automatique Générale par l’Exemple.

The Merriam-Webster online dictionary gives as one of the meanings of “portage”: “the carrying of boats or goods overland from one body of water to another or around an obstacle (as a rapids)”. It is a word used in this sense in both English and French. For many Canadians, it evokes their first canoe trip in the wilderness, when family members had to carry the canoe on their shoulders from one lake or river to another. It also evokes the rugged voyageurs who explored the Gatineau and Outaouais waterways – and much of the rest of Canada - in the 17th and 18th centuries. Finally, the name evokes some of the difficulty faced by MT systems that must carry information from one pool of words, idioms, and mental habits (the source language) to another (the target language).

The Strategy for PORTAGE

Early in the project, we had to decide what language pairs to focus on, and what R & D plan to adopt. We decided that the language priorities were as follows.


  1. English / French (in both directions):
    Our goal for this language pair is to build the best English-French MT system in the world. English and French are the two official languages of Canada, and most translation work performed in Canada today is between them. We hope that one day, our system will be used in hybrid human-machine translation applications to support Canada’s translators and help them deal with their very large workload while making the best use of their knowledge and expertise.

    Interestingly, the machine learning approach to MT has a historical connection to Canada’s linguistic duality. When this approach was pioneered by researchers in the USA, they focused on English / French MT because of the availability, large size, and high quality of the Canadian Hansard, the official record of Canada’s parliamentary debates, which was, at that time, the largest bilingual corpus in the world. More recently, research into machine-learning-based MT for this language pair has languished, partly because there is no NIST English / French MT evaluation. Fortunately, the European WPT/WMT/WSMT evaluation mentioned above covers this language pair (in both directions), and PORTAGE has usually ranked high in these evaluations.

  2. Chinese to English:
    Our goal for this language pair is to ensure that each year, our system has performance comparable to the best systems participating in the annual NIST Chinese-to-English evaluations. Calibrating our system against competing ones in this way also lets us take part in the NIST MT workshop, where high-level technical information is exchanged among participants, and compels us to steadily improve our technology to keep up with the state of the art. Wherever possible, our resulting system improvements will be propagated to the English/French system.

    Chinese-English is one of the two major language pairs covered by NIST evaluations, the other language pair being Arabic-English. Chinese and English are the two most widely written languages in the world. According to the census of 2006, dialects of Chinese constitute the third most important language in Canada (after English and French), with 3% of the Canadian population identifying Chinese as their mother tongue. From a research point of view, the Chinese-English pair has the advantage that the two languages are completely unrelated, both at the lexico-syntactic and the orthographic level, thus making this MT task unusually difficult. If PORTAGE can learn how to translate from one of these into the other, it should be able to handle most other language pairs; as a result, we typically use this language pair as a testbed for trying out new, language-independent MT algorithms.

  3. Arabic to English:
    This language pair has obvious political, military, and economic significance. For this language pair, we have focused on different ways of pre-processing the Arabic input in order to attain better MT performance.

  4. Other languages:
    Translation involving languages other than those noted above is also of interest to us, since these languages may be of importance to our collaborators. Provided a suitable bilingual training corpus is available, the PORTAGE system can be adapted accordingly.

PORTAGE Research Directions


The tricky part in defining our MT research directions is to find the right balance between work aimed at state-of-the-art performance and work aimed at significant innovation. In the short term, we could put all of our effort into achieving reasonable performance quickly at a low risk by implementing the most successful algorithms in the technical literature. However, if we were to do only this, we would always be half a step behind our competitors, and forfeit the intellectual satisfaction (and intellectual property) that can be earned by pioneering innovative approaches. On the other hand, if we focused solely on innovation, we might well end up with a system incorporating highly original ideas that few people would pay any attention to, because of mediocre performance as measured by the system’s scores in international MT evaluations. The PORTAGE research directions therefore combine both performance and innovation goals, as explained further below.

  1. Automatic post-editing.
    As mentioned above, MT systems like PORTAGE that are based on machine learning replaced an older generation of MT systems that are based on the expert system approach. However, this older type of MT system often contains linguistic and translation knowledge that is complementary to that extracted from bilingual corpora by pure machine learning systems. Rather than discarding the useful knowledge that is embedded in expert systems, we devised a method for fusing the two types of system that we call “automatic post-editing” or “statistical post-editing”. Over the last few years, we have been able to show that the resulting hybrid system often performs better than either system alone, especially when judged by human evaluators as demonstrated in the WSMT 2007 evaluation (see a description of the system and human evaluations of all participating systems). Our work in this area has been influential, as measured not only by its citations in the technical and scientific literature but also by its adoption by other R&D groups as well as by the international translation software company SYSTRAN.

  2. Adaptation.
    MT systems based on machine learning often work well on the domains they are trained on, but not nearly as well on other domains. For example, a German to English system trained on medical literature would produce a reasonable translation for a medical article written in German, but would fail completely at translating a German description of a tennis match. The goal is therefore to develop a system that can adjust smoothly to changes in the domain of the material it is translating. We have already developed some techniques which go part-way towards solving this challenge and we are pursuing our efforts to render PORTAGE adaptable to the content of the incoming text to be translated.

  3. Application of statistical techniques to phrase-based models.
    Phrase-based statistical MT relies on an initial alignment between groups of contiguous words ("phrases") in each pair of corresponding source-language and target-language sentences in the bilingual training corpus. Estimates of various kinds, which are integral to the process of translation, are derived from the analysis of these phrase alignments. The goal of this research is to apply statistical techniques to improve the probability estimates at hand. For instance, using a technique called significance testing, we can discard most of the phrase alignments, because they are likely to reflect mere random fluctuations in the data. We can then apply other statistical techniques to the remaining phrase alignments in order to smooth the remaining probability estimates. Combined, these techniques produce a smaller, more tractable translation model while, at the same time, improving the estimates. Surprisingly, until we looked into it, this issue, which is important from both theoretical and practical viewpoints, had received little attention from other MT groups.

  4. Confidence estimation of machine translations.
    One of the main goals of our work is to make MT a useful tool for human translators, allowing them not only to become more productive but also to focus on the best use of their expertise in their work. As we carry out tests with professional translators, we become more and more convinced that they will not accept MT unless it comes equipped with confidence estimations for the translations the system produces. Indeed, translators are irritated, and rightly so, when a machine translation system proposes a translation that is riddled with errors, causing translators to waste time in post-editing the proposed translation. With good confidence estimation techniques, one can eliminate such translations from the output rather than presenting them to the translators; in such cases, it would be better to simply display a message like “Sorry, I have no suggestion to make” than to waste a professional’s time. Thus, our ongoing work on confidence estimation will probably have the most important practical impact on our goal of getting PORTAGE into applications aimed at professional translators.
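
As an illustration of the significance-testing idea described under item 3 above, the following Python sketch discards phrase pairs whose co-occurrence could plausibly be due to chance. It is a sketch under stated assumptions rather than PORTAGE's actual procedure: we assume a one-sided Fisher exact test on sentence-pair counts as the significance test, and the function names, the threshold `alpha`, and the counts are all invented for the example. The smoothing step is not shown.

```python
from math import comb

def cooccurrence_p_value(joint, count_s, count_t, n_pairs):
    """One-sided Fisher exact test.

    Probability that a source phrase seen in count_s sentence pairs and a
    target phrase seen in count_t sentence pairs co-occur in at least
    `joint` of the n_pairs sentence pairs if they were independent.
    """
    denom = comb(n_pairs, count_t)
    tail = sum(
        comb(count_s, k) * comb(n_pairs - count_s, count_t - k)
        for k in range(joint, min(count_s, count_t) + 1)
    )
    return tail / denom

def prune(phrase_pairs, n_pairs, alpha=0.001):
    """Keep only phrase pairs whose p-value is at most alpha (an assumed threshold)."""
    return {
        pair: counts
        for pair, counts in phrase_pairs.items()
        if cooccurrence_p_value(*counts, n_pairs) <= alpha
    }

# Invented counts: (joint, count_s, count_t) over 1000 sentence pairs.
phrase_pairs = {
    ("maison", "house"): (20, 25, 30),  # co-occur far more often than chance
    ("la", "house"): (1, 50, 40),       # a single co-occurrence of frequent phrases
}
kept = prune(phrase_pairs, n_pairs=1000)
print(sorted(kept))
```

With these invented counts, only the first pair survives pruning. The threshold trades table size against coverage, so in practice it would be tuned on held-out data.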

Project Leaders

Roland Kuhn
Phone: 819-934-4222
Fax: 819-934-2607

George Foster
Phone: 819-934-3275
Fax: 819-934-2607


Business Contact

Michel Mellinger
Phone: 819-934-2602
Fax: 819-934-2607