AI Bots Still Cite Blocked Sites Despite Robots.txt Restrictions

by Ken Doctor

Blocking AI bots in robots.txt does not guarantee that content stays out of AI-generated answers. New research shows most sites remain cited by AI, even after restricting access. Publishers face fresh challenges in controlling how their work is used, and the stakes for content creators are rising.

For publishers and content creators, the belief that a simple robots.txt block could shield their work from AI models is quickly unraveling. Recent findings reveal that even after taking steps to restrict OpenAI and Google AI bots, the majority of these sites still end up referenced in AI-generated content. This shift is forcing newsrooms and digital publishers to rethink how they protect their intellectual property in an era of relentless data harvesting.

According to Buzzstream's research, about three-quarters of websites that actively block OpenAI or Google AI bots are still being cited by AI-powered tools. The data shows that nearly 95% of the pages mentioned in AI citations had already implemented blocks against GPTBot or Google-Extended, the robots.txt tokens that govern whether content may be collected or used for training large language models. Even more striking, around 70% of sites cited by ChatGPT had blocked ChatGPT-User or OAI-SearchBot, both of which support ChatGPT's real-time retrieval and search features.
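For readers who want to see what such a block looks like in practice, the following is a minimal sketch, not Buzzstream's methodology. It uses Python's standard `urllib.robotparser` to report which of the four user-agent tokens named above a given robots.txt disallows. The sample file, the `blocked_agents` helper, and the `https://example.com/` URL are illustrative assumptions. Note that this only reads what the file declares; it says nothing about whether a given bot actually complies.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in the article.
AI_AGENTS = ["GPTBot", "Google-Extended", "ChatGPT-User", "OAI-SearchBot"]

# Hypothetical robots.txt, for illustration only: it blocks the two
# training-related tokens and leaves everything else open.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt: str, url: str = "https://example.com/") -> dict[str, bool]:
    """Return {agent: True if robots.txt disallows it from fetching url}."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in AI_AGENTS}


if __name__ == "__main__":
    for agent, is_blocked in blocked_agents(SAMPLE_ROBOTS_TXT).items():
        print(f"{agent}: {'blocked' if is_blocked else 'allowed'}")
```

With this sample file, a site would count as blocking the training tokens while still permitting the retrieval and search agents, which is one reason a "blocked" site can differ in what it has actually restricted.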

This disconnect between publisher intent and AI behavior highlights a growing tension in the digital content ecosystem. As AI models become more sophisticated, traditional methods of controlling access—like robots.txt—are proving less effective. The result is a landscape where content creators have less certainty about how and where their work will appear, raising urgent questions about attribution, consent, and the future of digital rights management.

Buzzstream's findings, shared by Ludkevich, underscore the need for new strategies as AI-driven platforms continue to reshape the boundaries of content ownership and visibility. For those in the business of news and information, the challenge is no longer just about being found—it's about maintaining control over how their work is used and credited in a rapidly evolving AI landscape.

Large language models such as those developed by OpenAI and Google rely on vast datasets scraped from the open web. While robots.txt was once a reliable tool for signaling which pages should remain off-limits, the explosive growth of AI has exposed its limitations. As more publishers confront the reality that their content may be cited or repurposed regardless of access controls, the industry is beginning to explore alternative solutions—from legal frameworks to technical innovations—to safeguard their work in the age of artificial intelligence.
