Canadian Indigenous languages technology project

Status: Active

Overview

We are working on speech- and text‑based technologies that aim to assist the revitalization and preservation of Indigenous languages by supporting Indigenous language educators and students, promoting the accessibility of audio recordings, and supporting Indigenous language translators, transcribers and other language professionals.

  • Language-independent technology (such as software) will be released to communities as open-source software.
  • We will be working under the direction and advice of an Advisory committee, and in close collaboration and partnership with Indigenous community organizations and Indigenous communities across Canada.
  • All research done within this project will be compliant with the Tri-Council Research Ethics Policy.
  • Budget 2017 invested $89.9M over three years to support Indigenous languages and cultures. We were granted $6M of this funding.
  • This project is managed by the NRC's Digital Technologies Research Centre.

Technologies

Speech-based technologies

The context

  • There are thousands of hours of recordings of Indigenous languages from across the country.
  • The recordings can be difficult for Indigenous communities to access and make use of because they are not always fully transcribed, and sometimes are missing metadata (information about what languages are being spoken, who is speaking, etc.).

Our aim

  • To create software that will automatically segment and label audio files while they're being recorded (or shortly afterwards).
  • To build and test audio-indexation software that makes it possible to search through existing recordings, including recordings made decades ago, to find key words or phrases.

Text-based technologies

The context

  • The complexity of words in Indigenous languages – in which single, long words made up of many small pieces known as morphemes can often express what other languages express with entire clauses – poses difficulties for software applications (including both educational and professional software) that lack language-specific word-handling capabilities.
  • Teaching how to form words is a central concern in Indigenous language education.
  • Word complexity, and, in some languages, the complexity of the writing systems, mean that writing in accordance with official community standards is difficult for many learners.

Our aim

  • To design, in collaboration with instructors, educational tools that support exploratory learning of word formation.
  • To develop tools for spell-checking and grammar-checking, for integration with desktop and mobile applications, to help language users at all levels to follow their community's writing standards.
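To illustrate why a plain word list falls short for morphologically complex languages, here is a minimal sketch of a checker that accepts a word only if it can be split into a prefix, a root and a suffix drawn from known inventories. The morphemes below are invented placeholders, not forms from any real language, and the function names `segment` and `is_well_formed` are hypothetical; the project's actual tools are expected to rely on much richer language-specific models.

```python
from typing import Optional, Tuple

# Invented placeholder morphemes (NOT taken from any real language).
PREFIXES = {"ka", "sa", "wa"}
ROOTS = {"tiro", "nemu", "polu"}
SUFFIXES = {"s", "ne", ""}  # the empty string means "no suffix"


def segment(word: str) -> Optional[Tuple[str, str, str]]:
    """Return (prefix, root, suffix) if the word is well formed, else None."""
    for i in range(1, len(word)):
        prefix = word[:i]
        if prefix not in PREFIXES:
            continue
        # Try every split of the remainder into root + suffix.
        for j in range(i + 1, len(word) + 1):
            root, suffix = word[i:j], word[j:]
            if root in ROOTS and suffix in SUFFIXES:
                return prefix, root, suffix
    return None


def is_well_formed(word: str) -> bool:
    """A word passes the check only if some segmentation exists."""
    return segment(word) is not None
```

Even this toy inventory of three prefixes, three roots and three suffix options licenses 27 distinct words, which hints at why listing every valid word quickly becomes impractical and why checking structure is the more workable design.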

Languages

We are taking a "first deep, then broad" approach. Each software tool we build will initially be specialized to one or two Indigenous languages in Canada, but built in a way that allows customization for additional languages.

We are currently working with:

  • Kanyen'kéha (Mohawk)
  • Inuktitut
  • Cree

Through thoughtful design, and subsequent testing, we will attempt to ensure that the tools we develop in this way will be adaptable to many different languages after this initial development period.

Collaborations

We are collaborating formally and informally with:

7000 Languages

Website: 7000 Languages

Project description: Initiative for Creating Online Indigenous Language Courses (COILC initiative)

The NRC has partnered with the experts at 7000 Languages, a non-profit, non-Indigenous organization based in the United States that creates courses for endangered languages around the world. The NRC will fund selected community teams who wish to create online courses for their languages. Find out more about COILC.

Alberta Language Technology Lab, University of Alberta

Website: Alberta Language Technology Lab, University of Alberta

Project description: Since 2013, the Alberta Language Technology Lab (ALTLab) at the University of Alberta, headed by Dr. Antti Arppe, has been combining research on language structure with the creation of computational tools for Indigenous languages, starting with Plains Cree. The lab has been building on earlier work by its Norwegian partners on Saami and other threatened Uralic languages of Northern Eurasia which resulted in the Giella linguistic software development infrastructure. This infrastructure allows for the straightforward, rapid creation of end-user applications for morphologically complex languages.

Another section of this webpage describes the NRC's collaboration with the Onkwawenna Kentyohkwa Mohawk-language immersion school to build an educational tool called Kawennón:nis. This tool – which is currently being extended to other Iroquoian languages – was built within the Giella infrastructure. It would have been much more difficult for the NRC team to create Kawennón:nis without the help of the ALTLab team's Giella expertise. An NRC software developer, Eddie Santos, is currently embedded in the ALTLab to enhance the synergy between the two teams.

Canadian Broadcasting Corporation (CBC)

Website: Canadian Broadcasting Corporation

Project description: CBC creates programming by and for Indigenous peoples, providing services in eight Indigenous/Inuit languages. CBC is providing the Computer Research Institute of Montreal (CRIM) with access to East James Bay Cree recordings, as part of the NRC’s Indigenous languages technology project, so that CRIM can develop audio segmentation and analysis tools suitable for indexing audio recordings in Indigenous languages. CBC has shared over 1,343 hours of radio programming originally broadcast by CBC North from January 2015 to December 2016. These 1,312 audio files, which contain studio/telephone quality speech as well as music, are highly appreciated by the NRC and CRIM project teams and will be critical to the success of the project.

Carleton University

Website: Professor Marie-Odile Junker of Carleton University and her team have developed several websites for languages of the Algonquian family, in partnership with Indigenous organizations.

Project description: Algonquian Dictionaries Project (East Cree and Innu)

The collaboration with the NRC is focused on updating online language lessons developed earlier by the Carleton team, in partnership with Cree Programs and Institut Tshakapesh, aimed at supporting East Cree (2006‑2011) and Innu (2009‑2012) literacy.

The online lessons/games/exercises platform supports the creation of multimedia interactive online lessons with auto‑generated exercises/games. In this platform, users are able to listen to a word or phrase in several dialects. They then play computer‑generated interactive activities that test and enhance their vocabulary, orthography and grammar acquisition. They can also engage in more advanced grammatical and textual activities. Teachers can go online to develop additional lesson plans, and track students' progress. Language experts can access an administrative interface to develop new content.

Unfortunately, the rapid pace of change in the software industry has stranded these educational tools: many of the key functionalities no longer work as intended. The collaboration is aimed at updating the platform to align with current technology. The platform update is also an opportunity to improve the experience of second language learners (these tools were originally developed with first language speakers in mind) and to carry out user testing of the lessons.

Computer Research Institute of Montreal (CRIM)

Website: Computer Research Institute of Montreal

Project description: News release about indexation of Indigenous language audio recordings to enable keyword search

The Computer Research Institute of Montréal (CRIM) is an applied research and expertise centre in information technology. Its speech and text team has a long and distinguished record of accomplishments in technologies related to speech recognition. Its audio content indexing technology indexes the spoken content of very large audio databases, making such content accessible through search engines. CRIM has applied this technology to the archives of the National Film Board (NFB) and to the collected testimonies of the Bastarache investigative commission. CRIM's speaker recognition technology, which identifies the person who generated a particular segment of speech, is world-class. It has consistently ranked among the top entries in international evaluations of speaker recognition systems, and is now used all over the world.

The NRC's collaboration with CRIM is focused on applying audio indexing and speaker recognition technologies to Indigenous languages. Over the years, hundreds of thousands of hours of speech have been recorded in various Indigenous languages. Unfortunately, these recordings are typically not annotated or indexed. Surprisingly, even speech data being collected now by Indigenous communities and linguists have this problem: because there is a lack of tools for segmenting speech data as they are being recorded, the stock of unannotated speech data in Indigenous languages is constantly growing.

We are tackling two aspects of this problem:

  • We are developing simple tools that will segment speech as it is being recorded. The tools will separate audio files into speech and non-speech data, and will label the speech segments by the identity of the current speaker. This should make annotation of speech currently being collected easier, for a variety of languages.
  • We also plan to build systems that will make it possible to search for particular words or phrases in audio recordings in some Indigenous languages. This will not be full speech recognition and we will not be creating systems that are able to produce high-quality transcriptions of everything that was said in a recording. Rather, the systems will enable audio keyword search, so that users will be able to search quickly through long audio recordings for particular words or topics. We are currently targeting Inuktut and Cree. The Pirurvik Centre is providing valuable assistance on the Inuktut part of this project.
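The first of these two aims, separating speech from non-speech, can be illustrated with a very simple energy-based sketch. This is an assumption made for illustration only, not the method used by the NRC or CRIM: the function names, the 20 ms frame length and the 0.1 threshold are all hypothetical, samples are assumed to be floats in the range -1.0 to 1.0, and speaker labelling is not attempted.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """Root-mean-square energy of each complete, non-overlapping frame."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def speech_segments(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 20,       # hypothetical default
    threshold: float = 0.1,   # hypothetical default
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans whose frame energy exceeds the threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_rms(samples, frame_len)
    segments = []
    start = None  # index of the first frame of the current loud run
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i
        elif energy <= threshold and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:  # the recording ended while still loud
        segments.append((start * frame_ms / 1000, len(energies) * frame_ms / 1000))
    return segments
```

Given a toy signal that is silent, then loud for 0.2 seconds, then silent again, `speech_segments` returns a single span covering the loud part. A fixed energy threshold also fires on music and background noise, which is one reason real recordings such as radio broadcasts call for more robust models.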

First Peoples' Cultural Council

Website: First Peoples' Cultural Council

Project description: News release about Upgrades to FPCC's FirstVoices Language Tutor software

Official Languages, Department of Culture and Heritage, Government of Nunavut

Website: Official Languages, Department of Culture and Heritage, Government of Nunavut

Project description: Coming soon

Onkwawenna Kentyohkwa Language School

Website: Onkwawenna Kentyohkwa Language School

Project description: Kawennón:nis verb conjugator

Onkwawenna Kentyohkwa is an immersion school for teaching Kanyen'kéha (the "Mohawk" language) to adult learners. It is located on the Six Nations of Grand River reserve in southwestern Ontario. Onkwawenna Kentyohkwa was established in 1999 by Owennatekha (Brian Maracle) and Onekiyohstha (Audrey Maracle). Owennatekha is the lead instructor at the school. Many of the school's 100 graduates have gone on to teach the Kanyen'kéha language at the pre-school, elementary, secondary, university or community level.

The focus of the NRC's collaboration with Onkwawenna Kentyohkwa is Kawennón:nis, meaning 'wordmaker' in Kanyen'kéha. Kawennón:nis is a verb conjugator meant to assist learners and educators at the school, as well as students of the language wherever they might be. The idea for the tool was suggested by Owennatekha. The creation and extension of this tool involves a number of researchers at the NRC, Owennatekha, and two other educators from Onkwawenna Kentyohkwa. The language model that powers Kawennón:nis is the first of its kind for any Iroquoian language. Kawennón:nis's user interface is closely linked to the school's curriculum, and is being designed collaboratively by students and educators there together with NRC researchers. Kawennón:nis will be hosted by the school online and on Android and iOS devices; language-independent technology developed for it will be released with an open-source licence.

Pirurvik Centre

Website: Pirurvik Centre

Project description: Pirurvik is a centre of excellence for Inuit language, culture and well-being. It was founded in the fall of 2003 and is based in Nunavut's capital, Iqaluit. The main focus of the NRC's collaboration with Pirurvik is the transcription into written form of audio recordings of spoken Inuktut. Materials will be selected for transcription if they are in original Inuktut with a depth of vocabulary, and do not reflect 'thinking in English' while speaking Inuktut.

The transcribed Inuktut speech data will subsequently be used by the NRC and one of its other partners, the Computer Research Institute of Montreal (CRIM), to develop speech recognition tools that will make it possible to search other Inuktut speech recordings using text queries. This will make it easier for people who speak Inuktut to access and navigate audiovisual documents in their language.
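As a rough picture of what searching recordings with text queries could look like, the sketch below builds a small inverted index from time-stamped word detections. It assumes that some upstream component has already produced tuples of (recording, start time in seconds, word); that component, the `Hit` type, the function names and the placeholder words "alpha" and "beta" are all hypothetical and are not part of the project's actual design.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    recording: str    # hypothetical recording identifier
    start_sec: float  # where the word begins in that recording


def build_index(detections: Iterable[Tuple[str, float, str]]) -> Dict[str, List[Hit]]:
    """Map each case-folded word to the places where it was detected."""
    index: Dict[str, List[Hit]] = defaultdict(list)
    for recording, start_sec, word in detections:
        index[word.casefold()].append(Hit(recording, start_sec))
    return index


def search(index: Dict[str, List[Hit]], query: str) -> List[Hit]:
    """Return all hits for a single-word query, ordered by recording and time."""
    return sorted(index.get(query.casefold(), []))


# Placeholder detections, invented for illustration only.
detections = [
    ("tape1", 12.5, "alpha"),
    ("tape2", 3.0, "beta"),
    ("tape1", 40.0, "Alpha"),
]
index = build_index(detections)
alpha_hits = search(index, "ALPHA")  # both "alpha" detections in tape1
```

Each hit gives a recording and a time offset, so a listener could jump straight to the relevant passage rather than play a long recording from the start.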

As the project proceeds, collaborations with other organizations will be developed and this list will be updated on a regular basis.

Publications

The following is a list of selected publications by the project team and their collaborators relating to research in Indigenous languages technology.

Our project team

Anna Kazantseva, PhD

Computational linguistics of literature (novels and stories); modeling discourse structure of long informal documents; computational linguistics of Iroquoian languages.

Roland Kuhn, PhD (project lead)

Automatic speech recognition; machine translation.

Patrick Littell, PhD (project advisor)

Computational linguistics of low-resource languages; he has worked with several Indigenous languages, including Kwak'wala/Bak'wamk'ala, Gitksan, and Nłeʔkepmxcín (Thompson River Salish).

Aidan Pine

Development of software for supporting Indigenous languages; he has developed tools in collaboration with Gitksan & Heiltsuk communities.

Eddie Antonio Santos

Software engineering; applied language modeling; Unicode wrangling.

Advisory committee

We are committed to developing technology in collaboration with Indigenous stakeholders, and are implementing an Indigenous-majority Advisory committee that will advise on collaborative methodologies and evaluate project implementations.

Contact

Roland Kuhn, PhD
Principal Research Officer, Project Lead

Telephone: 613-993-0821
Email: Roland.Kuhn@nrc-cnrc.gc.ca
LinkedIn: Roland Kuhn
