Digital news publishers globally and in India are moving to safeguard their content against powerful web crawlers like OpenAI's GPTBot, which collects data from websites to train the company's artificial intelligence (AI) models. According to a new report published Tuesday, nearly one-third of the world’s top 50 news sites, including several leading Indian sites, have blocked AI crawlers from accessing their content.
The data, published by benchmarking agency AltIndex.com, names leading news sites such as CNN, the New York Times, the Daily Mail, Reuters, and Bloomberg, all of which block at least one AI crawler, and the number continues to rise.
AI companies send crawlers to collect data to train their models and provide information for chatbots. However, because content is one of a publisher's core advantages, many of the world’s largest news websites have become extremely cautious, especially since they generally gain nothing by handing over their data to AI crawlers.
AltIndex.com researchers explained that after the ChatGPT launch in November last year, companies and consumers worldwide started using generative AI to automate tasks, write documents, do market research, or even do basic coding. However, the rise of large language models and generative AI has also put the spotlight on the concerns of news sites, publishers, and intellectual property holders who see their data being collected by AI crawlers. And while there are still no clear regulatory rules controlling AI’s use of copyrighted material, some of the world’s largest news websites have taken matters into their own hands.
The situation escalated last month after Microsoft-backed OpenAI launched its GPTBot crawler to collect data to enhance its language models. Although the AI research firm promised that paywalled content would be excluded from collection, several high-profile news sites blocked GPTBot. Their number continued to grow in the following weeks.
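Sites commonly block a crawler such as GPTBot through a rule in their robots.txt file. The sketch below shows one way to check whether a given robots.txt text shuts a crawler out of the whole site, using Python's standard urllib.robotparser module. The function name blocks_crawler and the sample rules are illustrative assumptions, not taken from the report or from any publisher's actual file.

```python
from urllib.robotparser import RobotFileParser


def blocks_crawler(robots_txt: str, user_agent: str = "GPTBot") -> bool:
    """Return True if the robots.txt text disallows user_agent from the site root.

    Hypothetical helper for illustration; it only inspects the text it is given
    and makes no network requests.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # can_fetch() is True when the agent may crawl the URL, so invert it.
    return not parser.can_fetch(user_agent, "/")


# Sample rules (assumed for illustration, not copied from any real site).
BLOCKING_RULES = "User-agent: GPTBot\nDisallow: /\n"
OPEN_RULES = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    print("blocking rules block GPTBot:", blocks_crawler(BLOCKING_RULES))
    print("open rules block GPTBot:", blocks_crawler(OPEN_RULES))
```

A check like this only reflects what a site declares in robots.txt; it is up to each crawler to honor those rules.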
According to Dymic, a global marketing agency, 28% of the top 50 news sites worldwide had blocked at least one AI crawler by the end of last month. The regional picture is a bit different. For example, 24%, or 12 of the 50 largest news sites in the United States, have blocked at least one AI crawler, a higher share than in the United Kingdom, where only three of 21 leading sites (about 14%) did the same. In India, the percentage of top news sites unwilling to hand over their data to AI companies is much higher, with one-third blocking at least one AI crawler, the report showed.
In India, members of the Digital News Publishers Association (DNPA) have already restricted OpenAI's access to their content. The DNPA represents leading news publishers in India such as India Today Group, HT Group, Times Group, DB Corp, Dainik Jagran, Amar Ujala, Zee Media, ABP Network, NDTV, New Indian Express, Mathrubhumi, Hindu, and Network18, to name a few.
That said, not all news sites have blocked crawlers, the study showed, and GPTBot remains the most commonly blocked crawler among those that have. Statistics show that OpenAI's crawler is blocked by 22% of the top 50 news sites, with Bloomberg, Reuters, Business Insider, the Washington Post, the New York Times, and CNN as the top names on this list.
The Ministry of Information and Broadcasting and the Ministry of Electronics and Information Technology (MeitY) are aware of the concerns. A DNPA official said that the Government is working on the issue and that the new Digital India Act should factor in all these changes, with ramifications for both the revenue and the copyright protection of publishers.
Other countries such as Australia have taken credible steps in this direction. The country reopened the Treasury Bill to incorporate technological advancements with respect to AI. Similarly, Canada and the EU have incorporated such changes.
Last week, following the G20 Summit’s showcase of digital infrastructure, the Union Minister of State for Electronics and Information Technology, Rajeev Chandrasekhar, stated that the draft of the Digital India Act is ready and will be released soon.