Artpark-IISc and Google to bring innovation to India's diverse languages
Bangalore-based Artpark (AI & Robotics Technology Park), a non-profit aimed at promoting technology innovations in AI and robotics, set up by the Indian Institute of Science (IISc), teamed up with Google to unveil an all-India inclusive language data initiative for open-sourcing datasets.
The new initiative, called ‘Vaani’ and launched at the “Google for India 2022” event in Delhi, “brings together high-quality datasets that reflect the true diversity of natural spoken language and transcribed text from every district of India”.
With this launch, Vaani joins the Bhāshā AI umbrella of ARTPARK and IISc’s pan-India language initiatives, which include SYSPIN (Synthesising Speech in Indian languages) and RESPIN (Recognising Speech in Indian languages), which cover nine languages including Magadhi and Maithili.
“To propel research and innovation these datasets are being open-sourced via Vaani’s website (vaani.iisc.ac.in) and in the future may also be available through other platforms like ‘Bhashini’ of MeitY (Ministry of Electronics and Information Technology),” according to a statement.
Globally, there is a lot of hype about large language models like GPT-3. But they require huge text corpora and humongous computing power to train. As Prasanta Kumar Ghosh of IISc, who leads these initiatives, said, “In our work, we found at least 50 varieties of ‘Bengali’ and some that even I, as a native Bengali speaker, had difficulty understanding”. Even Hindi, with its more than four dozen dialectal variations, does not have nearly as much text data. “Machines have no hope! So, research and innovation for inclusive language AI require capturing this diversity in our datasets,” he said.
Also, the fact that Indians primarily communicate by speech warrants very different approaches and breakthroughs for machines to transcribe, understand, or translate, while also taking into account the language variations found every few kilometres. In such a context, technologies like automatic speech recognition (ASR) and natural language processing (NLP) can only be unleashed through open-source and mission-mode efforts.
Raghu Dharmaraju, President, ARTPARK, added, “Over the past decade, most apps for frontline health and agriculture workers have failed because digital interfaces feel alien to them. More than 1 billion Indians still cannot speak or type in English.
“So, if citizens can communicate with digital services in their mother tongue...over the next decade, that will be key to India’s economic growth and for a more equitable distribution of its benefits,” Dharmaraju said.
The initiative, currently focused on 80 districts in 10 states, will expand to every district over the next couple of years. Artpark and IISc will also launch challenges for researchers and startups to build applications in areas like health, agriculture, and financial inclusion using these datasets.