The rise of large language models has created a new class of web crawlers that operate at unprecedented scale. Unlike traditional search engine crawlers that index pages for search results, AI training crawlers such as GPTBot, CCBot, and Google-Extended scrape content to feed machine learning pipelines. For website operators, this means significantly more traffic, higher server costs, and the uncompensated use of original content.
The Scale of the Problem
AI training crawlers do not behave like traditional search bots. Google's search crawler throttles itself, fetching pages at a measured pace and focusing on indexing content for search. AI training crawlers, by contrast, often attempt to download entire sites as quickly as possible. They fetch every page, every asset, and every variant — because more data means better training outcomes for the model operators.
Server logs from sites that have not implemented crawler controls regularly show AI bots accounting for 20 to 40 percent of total traffic. For content-heavy sites — blogs, documentation, forums, news outlets — this can translate directly into higher bandwidth bills and degraded performance for real users.
Identifying AI Crawlers
The first step in controlling crawlers is knowing which ones are hitting your site. Check your server access logs for these common AI training User-Agent strings:
- GPTBot — OpenAI's web crawler used for training data collection
- ChatGPT-User — OpenAI's crawler for ChatGPT browsing features
- Google-Extended — Google's robots.txt token for AI training (separate from Googlebot search; it never appears as its own User-Agent in logs because the fetching is done by Google's regular crawlers, but it is the token to block in robots.txt)
- CCBot — Common Crawl's crawler, whose datasets are widely used for LLM training
- ClaudeBot / anthropic-ai / Claude-Web — Anthropic's crawlers for Claude training data (ClaudeBot is the current User-Agent; the other two are older tokens still worth listing in robots.txt)
- Bytespider — ByteDance's aggressive crawler used for TikTok and AI training
- FacebookBot — Meta's crawler used for AI model training
- cohere-ai — Cohere's crawler for LLM training data
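If you want a quick tally rather than eyeballing raw logs, a short script can count hits per crawler. The sketch below is a minimal example in Python, assuming a combined-format access log at /var/log/nginx/access.log; the path and the User-Agent list are placeholders to adjust for your own setup.

```python
# Count requests per known AI crawler in a combined-format access log.
# Log path and crawler list are placeholders; adjust for your setup.
from collections import Counter

AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "CCBot", "anthropic-ai",
               "Claude-Web", "ClaudeBot", "Bytespider", "FacebookBot",
               "cohere-ai"]

counts = Counter()
total = 0

with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        total += 1
        # In the combined log format the User-Agent is the last quoted field.
        ua = line.rsplit('"', 2)[-2] if line.count('"') >= 2 else ""
        for bot in AI_CRAWLERS:
            if bot.lower() in ua.lower():
                counts[bot] += 1
                break

for bot, hits in counts.most_common():
    print(f"{bot:15} {hits:8}  ({hits / total:.1%} of all requests)")
```

Run it over a day's log (or a rotated one) to see which bots hit you hardest and what share of total requests they represent.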
Layer 1: robots.txt
The robots.txt file is the first line of defense. It is a voluntary protocol — crawlers are expected to check and respect it, but compliance is not guaranteed. Despite this limitation, most major AI companies do respect robots.txt directives, making it an essential baseline control.
Add explicit disallow rules for each AI crawler you want to block. Do not rely on a generic wildcard rule, because that would also block legitimate search engine crawlers. Instead, create separate User-agent blocks for each AI bot. Allow Googlebot and Bingbot to continue indexing your site for search while blocking Google-Extended and GPTBot from scraping for AI training.
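A robots.txt along these lines puts that into practice. Treat the token list as a starting point rather than an exhaustive one, and extend it as new crawlers appear; Google-Extended is included because it is the token Google honors for AI training, even though it never shows up in logs as a User-Agent.

```
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: cohere-ai
Disallow: /
```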
Layer 2: Server-Level Blocking
For crawlers that ignore robots.txt, server-level controls provide enforcement. This can be implemented through your web server configuration (nginx, Apache), a CDN-level WAF (Cloudflare, Vercel), or application middleware.
- User-Agent header matching: Block requests from known AI crawler User-Agents at the web server or CDN level. This is fast and has minimal performance impact (a combined nginx sketch follows this list).
- Rate limiting: Implement per-IP rate limits that allow normal browsing patterns but throttle aggressive crawling. A limit of 60 requests per minute per IP is reasonable for most sites.
- IP range blocking: Some crawler operators publish their IP ranges. Blocking these at the firewall level stops requests before they reach your application server.
- CAPTCHA challenges: For high-value content, consider serving CAPTCHA challenges to suspected bot traffic. This adds friction for automated scrapers while allowing human visitors through.
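As a concrete illustration of the first two items, here is one possible nginx sketch. It assumes nginx sits in front of your application; the bot list, the example.com server name, the upstream address, and the roughly 60-requests-per-minute limit (1 request per second with a burst allowance) are placeholders to adapt. Apache or a CDN WAF would express the same rules differently.

```nginx
# /etc/nginx/conf.d/ai-crawlers.conf -- illustrative sketch, http context

# Flag known AI training crawlers by User-Agent (case-insensitive regex).
map $http_user_agent $is_ai_crawler {
    default            0;
    "~*GPTBot"         1;
    "~*ChatGPT-User"   1;
    "~*CCBot"          1;
    "~*anthropic-ai"   1;
    "~*Claude-Web"     1;
    "~*ClaudeBot"      1;
    "~*Bytespider"     1;
    "~*FacebookBot"    1;
    "~*cohere-ai"      1;
}

# Per-IP rate limit of roughly 60 requests per minute.
limit_req_zone $binary_remote_addr zone=perip:10m rate=1r/s;

server {
    listen 80;
    server_name example.com;              # placeholder

    # Enforcement: refuse flagged crawlers outright.
    if ($is_ai_crawler) {
        return 403;
    }

    location / {
        # Allow short bursts of normal browsing, throttle sustained crawling.
        limit_req zone=perip burst=20 nodelay;
        proxy_pass http://127.0.0.1:8080; # placeholder upstream
    }
}
```

The map keeps the bot list in one place and avoids chaining multiple if blocks, which nginx handles poorly; on a CDN, the equivalent is usually a single WAF rule matching the same User-Agent substrings.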
Layer 3: Monitoring and Alerting
Crawler behavior changes over time. New bots appear, existing bots change their User-Agent strings, and request patterns evolve. Set up monitoring to track bot traffic as a percentage of total requests, alert on sudden spikes in crawler activity, and regularly review access logs for unfamiliar User-Agent strings.
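A small scheduled job can provide that spike alerting. The sketch below is one way to do it, assuming the same combined-format access log as earlier; the 25 percent threshold is arbitrary, and the non-zero exit status is intended to be wired into whatever alerting you already run (cron mail, a monitoring agent, and so on).

```python
# Cron-friendly sketch: warn when known AI crawlers exceed a chosen share
# of traffic. The log path, bot list, and threshold are placeholders.
import sys

LOG_PATH = "/var/log/nginx/access.log"
THRESHOLD = 0.25  # alert when bots exceed 25% of requests

AI_CRAWLERS = ("GPTBot", "ChatGPT-User", "CCBot", "anthropic-ai",
               "Claude-Web", "ClaudeBot", "Bytespider", "FacebookBot",
               "cohere-ai")

total = bot_hits = 0
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        total += 1
        if any(bot.lower() in line.lower() for bot in AI_CRAWLERS):
            bot_hits += 1

share = bot_hits / total if total else 0.0
print(f"AI crawler share: {share:.1%} ({bot_hits} of {total} requests)")
if share > THRESHOLD:
    print("WARNING: AI crawler traffic above threshold", file=sys.stderr)
    sys.exit(1)
```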
If you notice a new crawler consuming significant resources, research its origin and decide whether to allow, throttle, or block it. The goal is to maintain control over who accesses your content and at what rate.
Legal and Ethical Considerations
The legal landscape around AI training data is evolving rapidly. Several high-profile lawsuits are challenging whether web scraping for AI training constitutes fair use. Regardless of how those cases resolve, website operators retain the technical means to control access to their own servers. Blocking crawlers is not anti-AI — it is exercising control over your own infrastructure and content.
Recommended Action Plan
- Audit your server logs to identify which AI crawlers are currently accessing your site and how much traffic they generate.
- Update your robots.txt to explicitly block AI training crawlers while preserving search engine access.
- Implement server-level or CDN-level User-Agent blocking as an enforcement layer for non-compliant crawlers.
- Set up rate limiting to protect against any single IP consuming excessive resources.
- Monitor bot traffic regularly and adjust your rules as the crawler landscape evolves.