The Indian arm of tech giant Microsoft is granting researchers access to speech data in three local languages in its quest to build more robust speech-recognition systems.
The company said in a statement that the release comprises speech training and test data for Telugu, Tamil and Gujarati, including audio and the corresponding transcripts.
The Indian Language Speech Corpus is being provided through the Microsoft Research Open Data initiative, a collection of free datasets from Microsoft Research intended to advance state-of-the-art research in areas such as natural language processing, computer vision, and domain-specific sciences.
According to the company, this is the largest publicly available corpus of Indian-language speech data, which researchers and other members of the academic world can use to build Indian-language speech recognition for voice-based applications.
“We believe India’s increasing digital literacy needs to be supported by a multilingual digital world,” said Sundar Srinivasan, general manager of artificial intelligence & research at Microsoft India. “Using our technology expertise, we want to accelerate innovation in voice-based computing for India by supporting researchers and academia.”
Microsoft’s Indian Language Speech Corpus was tested at Interspeech 2018, which is touted as the world’s largest and most comprehensive conference on the science and technology of spoken language processing.
In a Low Resource Speech Recognition Challenge, participants used data from Microsoft’s Indian Language Speech Corpus to build automatic speech recognition (ASR) systems, and were able to create high-quality speech recognition models using this data.
Microsoft has been working with Indian languages since it launched Project Bhasha in 1998, which allows users to input localised text using the Indian Language Input tool.
Microsoft also recently announced support for email addresses in multiple Indian languages across most of its email apps and services.