Government of Canada

Statistical Language Processing

Almost every human activity involves the use of natural language in some way. Computers today are limited in their ability to process natural language, and thus limited in their ability to help us in our daily activities. Computers are adept at working at the surface level of language – whether a word appears or not – but developing software that can go beyond the surface, to the underlying meaning of language – semantics and pragmatics – remains a major research challenge.

The objective of the Statistical Language Processing project is to investigate statistical approaches to processing natural language. Statistical methods have recently surpassed older techniques for computer processing of natural language in all major areas (machine translation, parsing, information extraction, information retrieval, text mining, knowledge acquisition, knowledge discovery, etc.).

This project consists of two subprojects: Statistical Semantics and Probabilistic Models of Language Structure.

Statistical Semantics is the application of statistical techniques to extract semantics (meaning) from text. Currently, computers process language without any understanding of what the words mean. For example, a search for "car" will not find a web page that only mentions "automobile", although the two words have the same meaning. The aim of this subproject is to develop new algorithms for extracting meaning by examining patterns of word usage in large collections of text.

We use the word “Statistical” to emphasize that the work is based on large collections of textual data rather than on intuitions about language or on linguistic approaches that use only a small number of examples. We use the word “Semantics” to emphasize that the regularities that we wish to study and exploit are ‘deeper’ than phonology or morphology or other surface form characteristics.
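To make the "car" and "automobile" example concrete, here is a minimal sketch of one common statistical approach: represent each word by counts of the words that appear near it, then compare those count vectors with cosine similarity. This is an illustration only, not the project's own algorithm. The toy corpus, the window size, and the function names `context_vectors` and `cosine` are all assumptions made for this example.

```python
from collections import Counter, defaultdict
import math


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus; a real system would use a large text collection.
corpus = [
    "she drove the car to work",
    "she drove the automobile to work",
    "he parked the car near the garage",
    "he parked the automobile near the garage",
    "they ate the banana for breakfast",
    "they peeled the banana for lunch",
]

vectors = context_vectors(corpus)
sim_auto = cosine(vectors["car"], vectors["automobile"])
sim_banana = cosine(vectors["car"], vectors["banana"])

# "car" shares its contexts with "automobile" far more than with "banana".
print(sim_auto > sim_banana)
```

In this toy corpus the only context that "car" and "banana" share is the very frequent word "the". A more realistic version would likely down-weight such frequent words and draw on a much larger collection of text.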

Probabilistic Models of Language Structure is the application of statistical techniques to modelling syntactic patterns in language use (structural, grammatical regularities in language). Probabilistic models can be used to improve the quality of machine translation, to improve the quality of parsing (grammatical analysis), or to model morphology (the various forms that a word can take, such as "go", "gone", "went", "going", "goes").
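As a rough illustration of a probabilistic model of word-order regularities, the sketch below trains a bigram model with add-one smoothing and uses it to score sentences. It is a deliberately simple stand-in, not the models developed in this subproject. The training sentences and the names `train_bigram`, `bigram_prob`, and `sentence_prob` are assumptions made for this example.

```python
from collections import Counter


def train_bigram(sentences):
    """Count word pairs and their left-hand contexts, with sentence boundary markers."""
    contexts = Counter()
    bigrams = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])          # every token that starts a bigram
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(tokens[1:])              # every token that can be predicted
    return {"contexts": contexts, "bigrams": bigrams, "vocab": vocab}


def bigram_prob(model, prev, word):
    """P(word | prev) with add-one (Laplace) smoothing."""
    count_pair = model["bigrams"][(prev, word)]
    count_prev = model["contexts"][prev]
    return (count_pair + 1) / (count_prev + len(model["vocab"]))


def sentence_prob(model, sentence):
    """Probability of a whole sentence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(model, prev, word)
    return prob


# Hypothetical toy training data.
training = [
    "the dog goes home",
    "the dog went home",
    "she goes home",
    "she went home",
]

model = train_bigram(training)
p_fluent = sentence_prob(model, "the dog goes home")
p_scrambled = sentence_prob(model, "home goes dog the")

# The model prefers the word order it has seen in training.
print(p_fluent > p_scrambled)
```

Scores of this kind are one way a system could rank candidate outputs, for example when choosing between alternative translations. Models of syntax or morphology would need richer structure than adjacent word pairs.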

Project Goals and Impact

The goals of this project are to identify new techniques relevant to NRC-IIT application areas and to publish those new techniques alongside scientific evaluations. Since the NRC-IIT Interactive Information Group works mainly with information in the form of natural language, it is essential to the future of the group to take a leadership position in research in statistical approaches to processing natural language.

The immediate impact of the research will be to maintain and enhance the group's reputation as a technology leader and to ensure that the group stays well informed about state-of-the-art, high-performance language technologies. The longer-term impact will be commercial applications and new tools and methods in support of the group's other projects.
