How AI research in India is improving Google’s global NLP Models

21 Sep, 2021

Tech giant Google’s products are ubiquitous in India. But while consumers usually talk about Android and its apps, a lot of the technology we see today comes from research work Google does in the background. 

Speaking at the Mint Digital Innovation Summit today, Manish Gupta, Director of Google Research India, discussed some of the work the company is doing in India.

“Everybody and their mother has heard of the term deep learning today,” said Gupta. 

“In deep learning, you have nothing but these artificial neural networks, which are very simple mathematical units performing very simple computations. But when organized in layers, they can accomplish these really complex tasks,” he added.
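Gupta's point about simple units in layers can be illustrated with a toy sketch. The code below is not from Google; the function names and the hand-picked weights are assumptions made purely for illustration. Each unit only computes a weighted sum and a threshold, and no single unit of this kind can compute XOR, yet two layers of them can.

```python
def unit(inputs, weights, bias):
    """One artificial neuron: a weighted sum followed by a threshold."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0


def xor_network(x1, x2):
    """Two layers of simple units that together compute XOR.

    The weights are hand-picked for illustration; a real network
    would learn them from data.
    """
    # Layer 1: one unit behaves like OR, the other like AND.
    h_or = unit([x1, x2], [1.0, 1.0], -0.5)
    h_and = unit([x1, x2], [1.0, 1.0], -1.5)
    # Layer 2: fires when OR is on and AND is off, which is XOR.
    return unit([h_or, h_and], [1.0, -1.0], -0.5)


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_network(a, b))
```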

According to Google, only 10.6% of Indians speak English. Gupta noted that BERT (Bidirectional Encoder Representations from Transformers), Google’s natural language processing model, doesn’t fully work for Indian languages. In English, BERT is shown a passage with some words blanked out and has to guess those words, which forces the artificial intelligence (AI) to understand both the “nuances as well as the context” of the language.
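The fill-in-the-blank setup can be sketched in a few lines. This is a simplified illustration, not BERT's actual training code: `mask_tokens` and the example sentence are hypothetical, and the positions are fixed by hand here, whereas BERT picks them at random (about 15% of tokens in the original paper). The `[MASK]` placeholder is the token BERT itself uses.

```python
MASK = "[MASK]"


def mask_tokens(tokens, positions):
    """Blank out the tokens at the given positions.

    Returns the masked sequence plus the hidden words, which are
    what the model is trained to guess from the surrounding context.
    """
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]
        masked[i] = MASK
    return masked, targets


sentence = "the price of this phone is high".split()
masked, targets = mask_tokens(sentence, [1, 4])
print(" ".join(masked))  # the [MASK] of this [MASK] is high
print(targets)           # {1: 'price', 4: 'phone'}
```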

“Techniques like BERT don’t work completely, and one of the reasons is that the corpus of formal data (used to train them), like Wikipedia etc., is not enough to capture the full universe of vocabulary that we’re dealing with,” said Gupta. 

Misspelled English words on menu cards are a good example of phonetic spelling in India: for instance, ‘pineapple’ spelled ‘painaple’, or ‘appetizer’ spelled ‘apaitaijar’.

“If you think about it, they do actually make sense. Because somebody thinks in Hindi, which is a very phonetically sound language, this spelling actually makes sense,” said Gupta.

Google recognized that its language models have to be robust enough to handle such queries, understanding what Indian users mean rather than expecting correct spelling and grammar every time. To that end, Google designed a project called Indo-Phoneme, which addresses both Indian spellings and phonetics.

“It is not just about spelling correction, but we also want to model the reason behind those spellings. And in that process improve our query processing systems as well as our speech recognition systems,” Gupta said. 

The company has been able to create databases of correctly spelled words and their derivatives that may be used in India.
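As a rough illustration of how such a database might be used, the sketch below maps known spelling variants back to their standard forms before a query is processed. The table contents come from the menu-card examples in this article; the names `VARIANT_TO_CANONICAL` and `normalize_query` are hypothetical, and nothing here reflects how Google's systems are actually built.

```python
# Hypothetical lookup table: phonetic variant -> standard spelling.
VARIANT_TO_CANONICAL = {
    "painaple": "pineapple",
    "apaitaijar": "appetizer",
}


def normalize_query(query):
    """Replace known variants with standard spellings; leave other words alone."""
    words = query.lower().split()
    return " ".join(VARIANT_TO_CANONICAL.get(word, word) for word in words)


print(normalize_query("Painaple juice"))  # pineapple juice
```

A plain lookup like this only covers variants that have already been collected, which is one reason Gupta stresses modeling the reasoning behind the spellings as well.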

Google is also adapting its language models to “code mixing”, in which users mix Hindi and English in the same sentence, as in ‘iska price kya hai’ (“what is its price”). Gupta said that Google started a project called Multilingual Representations for Indian Languages (MuRIL), which helped improve BERT’s accuracy on transliterated text by over 27%.