ARCHIVED - NRC Researchers Build Smarter Search Engines
February 06, 2006— Ottawa, Ontario
Google a particular car type or disease and you're bound to spend time sorting through hundreds of more and less useful results. How do you sort the wheat from the chaff in a world of too much electronic information? NRC researchers are helping make the search for electronic information easier, and one key to this is teaching computers to understand language.
"The more computers are able to understand words, the more helpful they'll be to us in every daily task," says Peter Turney, an Ottawa-based research scientist with the Interactive Information Group of the NRC Institute for Information Technology. The Interactive Information Group focuses on developing software tools to increase access to electronic information.
Turney's computer science speciality is the area of lexical semantics, or word meaning. At present, our desktop and laptop computers are linguistic toddlers. Spam filters, and other software such as editing tools, are able to distinguish and make decisions based on single words like "Viagra", but there's no sense of meaning. It's like learning a second language but not knowing what the words mean or how they create meaning together.
So the race is on to create software that goes beyond single word recognition to extract deeper understanding. One example is sentiment analysis. This is software that can determine whether the words in a sentence are positive or negative. Sentiment analysis is being used to create a kind of Googling for feelings. One application of sentiment analysis involves following financial chat groups to track attitudes towards particular stocks.
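To make the idea concrete, here is a minimal sketch of sentence-level sentiment scoring in Python. It is an illustration only, not any NRC software: the word lists and the function names `sentiment_score` and `label` are assumptions invented for this example. It simply counts positive and negative words, which is the kind of single-word approach the researchers want to move beyond.

```python
# Hypothetical word lists, made up for illustration; a real system would
# learn word polarity from large amounts of text rather than hard-code it.
POSITIVE = {"good", "great", "strong", "gain", "buy", "up"}
NEGATIVE = {"bad", "weak", "loss", "sell", "down", "risky"}


def sentiment_score(sentence: str) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    words = [w.strip(".,!?\"'").lower() for w in sentence.split()]
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


def label(sentence: str) -> str:
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(sentence)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for post in ("Strong earnings, I would buy.",
                 "This stock looks weak and risky!"):
        print(label(post), "-", post)
```

Because the sketch looks at each word in isolation, it cannot handle negation ("not good") or word pairs whose meaning depends on how the two words relate.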
Turney's focus is on taking computer understanding of English to the two-word stage.
"I'm developing an algorithm that uses a huge quantity of text to figure out the relationship between any given pair of words," he says. An algorithm is a method of doing a computation and the basis for creating computer code and software.
Examples of SAT Multiple-Choice Word Analogies with Computer Generated Answers
1. ostrich is to bird as...
(a) lion is to cat
(b) goose is to flock
(c) ewe is to sheep
(d) cub is to bear
(e) primate is to monkey
Computer generated answer: (a) ostrich is to bird as lion is to cat
Computer generated explanation: "birds such as the ostrich" = "cats such as the lion"
2. traffic is to street as...
(a) ship is to gangplank
(b) crop is to harvest
(c) car is to garage
(d) pedestrians is to feet
(e) water is to riverbed
Computer generated answer: (e) traffic is to street as water is to riverbed
Computer generated explanation: "streets that carry traffic" = "riverbeds that carry water"
His word pairs of choice are noun modifiers — word tandems in which the noun is modified by a preceding term, for example, "laser printer" or "flu virus". Creating digital understanding of noun modifiers is a dizzying linguistic task. WordNet, a free online reference library of words used by researchers, lists about 26,000 noun modifier pairs. And the relationship between the two words falls into one of more than 50 categories. For example, the relationship could be causal (exam anxiety), temporal (daily exercise) or spatial (home town).
Turney says the ability to program a computer to understand noun modifiers doesn't rely on logic. Rather, it's based on the machine's ability to statistically calculate the probable relationship between two words based on prior experience with the words through Web mining, the analysis of huge quantities of text.
A search engine equipped with an understanding of noun modifiers would be a major advance. Today when we Google for two words, the search provides a wide array of examples of the two words in a variety of relationships. But if the search were delimited by a specific understanding of noun modifiers it could provide much more precise results.
Before sending his algorithm out into the world, Turney is giving it practice against U.S. college and university entrance exams. Some of these include multiple choice word analogy questions. For example, given the words "mason" and "stone" the student is asked to pick the most analogous pair from five additional pairs of words. In this case the answer is "carpenter" and "wood".
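The statistical idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Turney's actual algorithm: the seven-sentence `CORPUS` is made up and stands in for the huge quantities of Web text, and the function names `pattern_vector`, `cosine` and `best_analogy` are hypothetical. For each word pair, the sketch counts the phrases that join the two words in the text, then picks the answer whose joining phrases most resemble those of the question pair.

```python
import math
import re
from collections import Counter

# Tiny made-up corpus standing in for "huge quantities of text" (assumption).
CORPUS = [
    "the mason cuts stone for the wall",
    "a mason shapes stone by hand",
    "the carpenter cuts wood for the frame",
    "a carpenter shapes wood by hand",
    "the teacher works in the school",
    "the baker sells bread in the shop",
    "the pilot flies above the cloud",
]


def pattern_vector(first, second, corpus=CORPUS):
    """Count the word sequences that appear between `first` and `second`."""
    patterns = Counter()
    for sentence in corpus:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        if first in tokens and second in tokens:
            i, j = tokens.index(first), tokens.index(second)
            if i < j:
                patterns[" ".join(tokens[i + 1:j])] += 1
    return patterns


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def best_analogy(stem, choices):
    """Pick the choice whose joining patterns best match the stem pair's."""
    stem_vec = pattern_vector(*stem)
    return max(choices, key=lambda pair: cosine(stem_vec, pattern_vector(*pair)))


if __name__ == "__main__":
    choices = [("teacher", "school"), ("carpenter", "wood"),
               ("baker", "bread"), ("pilot", "cloud")]
    print(best_analogy(("mason", "stone"), choices))
```

In this toy corpus, "mason" and "stone" are joined by "cuts" and "shapes", and so are "carpenter" and "wood", which is why that pair is selected. No logical rule about trades or materials is involved, only counts of how the words co-occur.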
"Right now, my algorithm is scoring at an average human level on the SAT word analogy questions," says Turney.
Now he's training his program to move from state college to Ivy League performance. As part of this, rather than just choosing the right answer, he wants the software to be able to explain the decision.
Turney notes that there's enormous financial potential for applications of a noun modifier algorithm and other tools to help us quickly find exactly what we're looking for amidst a flood of electronic information. Companies like Google are snapping up computational linguists to apply their insights to creating a more word-wise search engine. Turney's patented key phrase extraction algorithm is already used by thousands of people every day for online searches as part of Quebec-based Copernic's search tools. NRC IIT colleague Alain Désilets is extending this technology to develop a tool for automatically extracting the core information from spoken "documents" such as videos or videoconferences to create written summaries.
In a related area, NRC IIT researchers Joel Martin and Berry de Bruijn are developing software to improve the info-mining of a particularly challenging area: scientific literature. Called Litminer, the software, presently in development, will help genomics and proteomics researchers comb more effectively through the tens of thousands of scholarly articles published each month and find the latest discoveries and technical advances most useful to them.
But Turney's also looking beyond the immediate benefits of more search-savvy computers. Inspired as a child by the talking robots in science fiction TV series such as Lost in Space and Star Trek, and the famous "HAL" in the movie 2001: A Space Odyssey, 45-year-old Turney is confident that within several decades computers will have grown way past the two-word comprehension stage. He thinks we'll be talking to machines — and not only will they understand, they'll be talking back.
Says Turney: "While there's still lots of debate on this among my colleagues in artificial intelligence, I fully expect that one day you'll be able to talk to a computer just as you would to another person."
Enquiries: Media relations
National Research Council of Canada