The task of a machine translation (MT) system is to translate a text S in a source language into a text T in the target language. As for many other natural language processing and pattern recognition tasks, there are a few approaches one could take:
Research into MT goes back to the 1950s. However, almost all of the research carried out until fairly recently, and all but one of today’s commercial MT systems, has been based on the first, “expert system” approach. Indeed, it was not until a series of brilliant papers was published in the early 1990s by a group of researchers at IBM’s Watson Research Center that the second, “machine learning” approach received serious attention.
These competing approaches can be compared as follows:
In the last few years, the machine learning approach has garnered increasing interest. Note that from the economic point of view, the main inputs for this approach are computing power, computer memory, and bilingual data. All of these become significantly cheaper and more widely available each day, while human linguistic expertise – the main input for the competing approach – has a cost that increases slightly over time. From the point of view of related disciplines, the machine learning approach has come to dominate similar areas of natural language processing and pattern recognition, such as automatic speech recognition or face recognition. Finally, from the point of view of objectively evaluated performance, systems based at least partly on machine learning have outperformed those based purely on expert systems in recent quantitative evaluations of research MT systems conducted by the US National Institute of Standards and Technology (NIST).

Comparison of Expert System and Machine Learning System Approaches to Machine Translation.
In the expert system approach, a team of human experts writes rules for translating from the source language to the target language. These rules are incorporated in a computer program. In the machine learning approach, a learning algorithm reads in a bilingual parallel corpus, consisting of a large text in the source language and its translation into the target language. From the bilingual parallel corpus, the learning algorithm automatically generates a set of translation rules. Unlike the translation rules typically written by human experts, the rules generated by machine learning typically include probabilities. This difference shows up at translation time: when a source sentence is given to an expert system for translation, typically only one translation is generated. By contrast, translation systems based on machine learning can typically generate more than one translation hypothesis for each source sentence; each of these hypotheses is associated with a probability score.
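The idea of learned rules that carry probabilities can be illustrated with a minimal sketch. It assumes that phrase pairs have already been extracted from a parallel corpus (the toy `PHRASE_PAIRS` list below is invented for illustration) and estimates each rule's probability by simple relative frequency. The function names are hypothetical, and this is not a description of how PORTAGE or any other real system is implemented.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (source phrase, target phrase) pairs, as if they had
# been extracted from a bilingual parallel corpus. Invented for illustration.
PHRASE_PAIRS = [
    ("maison", "house"),
    ("maison", "house"),
    ("maison", "home"),
    ("bleue", "blue"),
]


def estimate_rule_probabilities(pairs):
    """Estimate P(target | source) for each phrase pair by relative frequency."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[source][target] += 1
    rules = {}
    for source, target_counts in counts.items():
        total = sum(target_counts.values())
        rules[source] = {t: n / total for t, n in target_counts.items()}
    return rules


def ranked_hypotheses(rules, source_phrase):
    """Return all known translations of a phrase, most probable first."""
    options = rules.get(source_phrase, {})
    return sorted(options.items(), key=lambda item: item[1], reverse=True)


rules = estimate_rule_probabilities(PHRASE_PAIRS)
hypotheses = ranked_hypotheses(rules, "maison")
for target, probability in hypotheses:
    print(f"maison -> {target}: {probability:.2f}")
```

A hand-written rule would typically map "maison" to a single translation; here both candidates survive as scored hypotheses, and later stages of a system could choose between them in context.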
Until recently, it was difficult to evaluate the claims of competing approaches to machine translation, since no quantitative benchmarks for MT systems were available. In 2001, this situation was mitigated by the initiation of the NIST (U.S. National Institute of Standards and Technology) MT project (there had been earlier NIST initiatives in this area in 1993 and 1994). This NIST project is modeled on the evaluations that have been conducted on automatic speech recognition (ASR) systems since the late 1980s (in fact, it is managed by the unit of NIST responsible for the ASR evaluations). The objective of the project is to promote progress in technologies that convert free text in a variety of languages into English. This is done by holding annual tests. On a fixed date, NIST sends a source-language text file to all participants. The participants must send their systems’ target-language (English) translations of the source back to NIST for scoring within a specified number of days. For these NIST evaluations, the main language directions being evaluated have been Chinese to English and Arabic to English.
The NIST MT project shares a key, highly positive feature with NIST’s ASR evaluations: sites from anywhere are free to participate, provided they agree that the other participating sites may be informed of their score, and promise to attend a workshop in which they will describe the details of their system. However, the restrictions on the public release of information are stricter for the NIST MT evaluations than for the ASR evaluations: each site may publicly disclose its own score, but may not publicly release the results for any other site without that site’s permission.
It is difficult to come up with a good quantitative metric for the performance of an MT system. The obvious approach is to hire human translators to score the performance of their machine counterparts, but it is very slow and painful – and thus expensive – for humans to read and analyze a large number of rather bad machine-generated translations. Instead, NIST has adopted a variant of the “BLEU” scoring methodology originally suggested by researchers at IBM, in which human translators are hired to produce a number of different translations of the source-language test file for the evaluation, and each machine-generated translation is scored for similarity to the human translations. Scores of this type have a strong correlation with human judgments. There is still a potential problem here – in theory, an automatic system could produce brilliant translations that happen to be rather different in style and word choice from those produced by the human translators; however, the performance of MT systems needs to improve a great deal before this problem is likely to arise.
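The principle of scoring a machine translation by its similarity to several human reference translations can be sketched as follows. This is a simplified BLEU-style score (clipped n-gram precision up to bigrams, combined with a brevity penalty), written under the assumption of whitespace tokenization and sentence-level scoring. It is not the exact variant used by NIST, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Fraction of candidate n-grams found in the references, with clipping.

    Each candidate n-gram is credited at most as many times as it occurs in
    the single reference where it is most frequent.
    """
    candidate_counts = ngrams(candidate, n)
    if not candidate_counts:
        return 0.0
    max_reference_counts = Counter()
    for reference in references:
        for gram, count in ngrams(reference, n).items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)
    clipped = sum(min(count, max_reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())


def bleu_like(candidate, references, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (shorter one on ties).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(bleu_like("the cat is on the mat".split(), references))
print(bleu_like("the the the the".split(), references))
```

Clipping is what stops a degenerate output such as "the the the the" from scoring well merely because "the" appears in the references, and the brevity penalty discourages very short outputs that would otherwise achieve high precision.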
Participants in the NIST MT evaluations are free to submit target sentences from any kind of MT system. In the most recent evaluations (2003-2009), systems based on machine learning outperformed those based on human expertise. One must take this with a grain of salt. The very best MT expert systems may not be participating in these competitions, since efforts for the development of such systems have been expended on language pairs other than the Chinese-English or Arabic-English pairs targeted by the NIST evaluation; the largest efforts spent on expert systems have been focused on pairs in which both languages are of European origin: English-French, English-German, etc. Intensively developed expert systems also tend to be implemented by companies rather than research institutions, and a company that sells MT software has little to gain and a lot to lose by submitting its system to a competitive evaluation by NIST. Nevertheless, the superior performance of the machine learning approach in these evaluations shows that at the very least, the machine learning approach can yield reasonable MT performance for less-studied language pairs. In the most recent MT evaluations for European language pairs (WSMT evaluations – see below) the same trend of machine learning systems catching up with and even surpassing expert systems has begun to appear.
In recent years, participants in the NIST MT evaluations have included research groups from all over the world, for instance:

- USA: CMU, IBM, ISI/Language Weaver, Johns Hopkins, MIT, U Maryland, Systran;
- Germany: RWTH University (in Aachen);
- Great Britain: University of Edinburgh;
- China: Hong Kong UST;
- Japan: ATR, NTT;
- Italy: IRST.

The PORTAGE system participated in these evaluations in 2005, 2006, 2008, and 2009 (there was no NIST MT evaluation in 2007), thus adding the National Research Council (NRC) of Canada to this list. In 2005, 2008, and 2009 PORTAGE participated in the Chinese to English evaluation; in 2006, PORTAGE participated in both the Chinese to English and Arabic to English evaluations. NRC obtained by far the best scores of any Canadian research group participating in the NIST evaluations during the period 2005-2009; also, NRC obtained the third highest score of any participating group in the 2009 Chinese-English evaluation.
Philipp Koehn of the University of Edinburgh (U.K.) has organized a series of evaluations of MT for certain European language pairs. Confusingly, the name of this evaluation has changed several times: the 2005 evaluation was called the "Workshop on Building and Using Parallel Text" (WPT), the 2006 evaluation was called the "Workshop on Machine Translation" (WMT), and the 2007-2010 evaluations were called "The ACL Workshop on Statistical Machine Translation" (WSMT). The PORTAGE system participated in the 2005-2007 and 2010 evaluations.
In 2005, these language pairs were:
The language pairs for 2006 were the same except for the omission of the English-Finnish and Finnish-English pairs. In both 2005 and 2006, PORTAGE tackled all language pairs in the evaluation.
In 2007 and 2010, the language pairs were:
PORTAGE tackled all of the above except English-Czech and Czech-English. In all evaluations, PORTAGE obtained very good scores. The 2007 evaluation was notable for our development of a hybrid “automatic post-editing” (or “statistical post-editing”) system which fused PORTAGE with a rule-based system from the company SYSTRAN (for English-French and French-English only). This hybrid system was rated particularly highly by human evaluators (see first document and second document).
An interesting feature of WPT/WMT was the use of human evaluations of the adequacy and fluency of MT outputs in addition to the BLEU metric. Results of the 2005 NAACL WPT evaluations can be found here; a full description of the 2006 NAACL WMT evaluations, including a link to the results, is also available; a full description of the 2007 ACL WSMT evaluations is available as well; a full description of the 2010 ACL WSMT is also available.
The PORTAGE system also participated in the 2006 Chinese to English evaluation for the TC-STAR Workshop (sponsored by the European Community).
From October 2005 to June 2009, the National Research Council of Canada (NRC) received funding from the U.S. Defense Advanced Research Projects Agency (DARPA) as a participant in the multimillion-dollar GALE project. NRC was a member of the ‘Nightingale’ consortium, one of three consortia funded by GALE; the lead contractor in Nightingale was SRI International (California). The goal of GALE (Global Autonomous Language Exploitation) was to make foreign-language (Arabic and Chinese) speech and text accessible to monolingual English speakers, particularly in military settings. NRC's participation involved making the PORTAGE technology available to the GALE project, and taking part in the project’s internal evaluations. One of NRC’s most important contributions to the Nightingale consortium was providing, in collaboration with SYSTRAN, a hybrid system based on automatic post-editing; this hybrid system proved to be the most valuable single component of our consortium’s Chinese to English translation system.
In September 2004, researchers of the Interactive Language Technologies Group in the NRC Institute for Information Technology began to build an advanced MT system. In line with the Group’s ambitious mandate, we set ourselves an ambitious goal: to build a world-class system capable of competing each year on equal terms with the other systems that have been participating in the NIST MT evaluations.
We decided that the system would be based on machine learning, because of the advantages of this approach, as outlined above. Within the machine learning camp, we are strong believers in models that can generate multiple hypotheses with quantitative scores, such as models based on probabilities. Like many of the world's other leading statistical MT systems, PORTAGE is "phrase-based", that is, it is based on multiword groups in the source language automatically found to correlate with multiword groups in the target language.
We named our project and the system itself PORTAGE: the English name is Probabilistically Optimized Rules for Translation Automatically Generated from Examples or Portable Omnilingual Robust Translation Agent (depending on whether one focuses on the research approach or the system itself), and the French name is Projet Objectif de Recherche en Traduction Automatique Générale par l’Exemple.
The Merriam-Webster online dictionary gives as one of the meanings of “portage”: “the carrying of boats or goods overland from one body of water to another or around an obstacle (as a rapids)”. It is a word used in this sense in both English and French. For many Canadians, it evokes their first canoe trip in the wilderness, when family members had to carry the canoe on their shoulders from one lake or river to another. It also evokes the rugged voyageurs who explored the Gatineau and Outaouais waterways – and much of the rest of Canada – in the 17th and 18th centuries. Finally, the name evokes some of the difficulty faced by MT systems that must carry information from one pool of words, idioms, and mental habits (the source language) to another (the target language).
Early in the project, we had to decide what language pairs to focus on, and what R & D plan to adopt. We decided that the language priorities were as follows.
The tricky part in defining our MT research directions is to find the right balance between work aimed at state-of-the-art performance and work aimed at significant innovation. In the short term, we could put all of our effort into achieving reasonable performance quickly at a low risk by implementing the most successful algorithms in the technical literature. However, if we were to do only this, we would always be half a step behind our competitors, and forfeit the intellectual satisfaction (and intellectual property) that can be earned by pioneering innovative approaches. On the other hand, if we focused solely on innovation, we might well end up with a system incorporating highly original ideas that few people would pay any attention to, because of mediocre performance as measured by the system’s scores in international MT evaluations. The PORTAGE research directions therefore combine both performance and innovation goals, as explained further below.
Roland Kuhn
Phone: 819-934-4222
Fax: 819-934-2607
George Foster
Phone: 819-934-3275
Fax: 819-934-2607
Michel Mellinger
Phone: 819-934-2602
Fax: 819-934-2607