Training bot vs. search bot: configure robots.txt to block AI training without losing citation visibility

Cloudflare’s new default block frames the choice as all-or-nothing. It isn’t. Here’s the surgical split that protects your content and keeps you citable in ChatGPT, Claude, and Perplexity.

By Alexander Kirsch-Clayton, Founder · June 15, 2026 · 11 min read

Cloudflare now asks every new domain a single question at sign-up: do you want to allow AI crawlers? Most site owners read that prompt as binary—allow all AI bots, or block all AI bots. That framing is wrong, and it’s quietly expensive.

As of mid-2025, Cloudflare flipped the default. In their words, every new domain will now be asked if they want to allow AI crawlers, so every new domain starts with the default of control. Good for publishers. But the prompt collapses two very different bots into one toggle, and that’s where the damage starts.

Here’s the thing the toggle hides: the bot that trains AI models and the bot that cites you in AI search are not the same bot. You can block one and keep the other. If you treat them as a single switch, you either hand your content to training crawlers for free or you delete yourself from ChatGPT Search without meaning to. This post is the clean answer to the question I keep getting from DACH B2B clients: can I stop AI companies from training on my content without disappearing from AI citations? Yes. Here’s exactly how.

The structural split: training crawlers ≠ search crawlers

Since 2025, the three dominant AI players (OpenAI, Anthropic, and Google) each expose two separate crawler identities, or in Google’s case two separate robots.txt tokens. One ingests your pages to train a foundation model. The other indexes or fetches your pages to answer a user’s question and cite the source. They use the same protocol, but they do completely different jobs, and you control them with separate rules.

Cloudflare’s own framing is blunt about why this matters. Training crawlers, they write, use the data they ingest from content sites to answer questions for their own customers directly, within their own apps. They typically send much less traffic back to the site they crawled. The numbers are not subtle. As of June 2025, Cloudflare measured Google’s search crawler at roughly 14 crawls per referral. OpenAI’s training crawl-to-referral ratio was 1,700:1. Anthropic’s was 73,000:1.

Read that again. A training crawler can hit your site tens of thousands of times for every visitor it sends back. That’s the case for blocking training. But the search crawler is the opposite trade—it reads you so an AI can name you in front of a buyer who’s actively asking. Blocking the training bot does not block the search bot. This is the single fact most WordPress site owners and most DACH agencies have not absorbed yet, and it’s the whole game.
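
To make those ratios concrete, here is a small sketch that converts a crawl-to-referral ratio into expected referral visits per million crawler requests. The three ratios are the Cloudflare figures quoted above; the function name and the one-million baseline are illustrative choices, not anything Cloudflare publishes.

```python
# Crawl-to-referral ratios as quoted above (Cloudflare, June 2025).
CRAWLS_PER_REFERRAL = {
    "Google (search)": 14,
    "OpenAI (training)": 1_700,
    "Anthropic (training)": 73_000,
}


def referrals_per_million_crawls(crawls_per_referral: int) -> float:
    """Expected referral visits for every 1,000,000 crawler requests."""
    return 1_000_000 / crawls_per_referral


for crawler, ratio in CRAWLS_PER_REFERRAL.items():
    visits = referrals_per_million_crawls(ratio)
    print(f"{crawler}: ~{visits:,.0f} referrals per 1M crawls")
```

At these ratios, a million crawls buys you roughly 71,000 visits from Google’s search crawler, under 600 from OpenAI’s training crawler, and about 14 from Anthropic’s.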

The full crawler taxonomy

Here’s the reference list I use on every audit. Each entry gives the training identity first, the search-and-citation identity second, and the recommended stance for a typical B2B site last. The recommended stance assumes you want IP control where it’s free and citation visibility everywhere it counts.

OpenAI: GPTBot (training) vs. OAI-SearchBot (search). GPTBot feeds OpenAI’s foundation models; disallowing it signals your content shouldn’t be used for training. OAI-SearchBot surfaces your site in ChatGPT’s search features. Cost of blocking the search bot, in OpenAI’s words: “Sites that are opted out of OAI-SearchBot will not be shown in ChatGPT search answers.” Stance: block GPTBot, allow OAI-SearchBot.
Anthropic: ClaudeBot (training) vs. Claude-SearchBot (search). ClaudeBot collects training data. Claude-SearchBot indexes for search citations. (A third, Claude-User, fetches a page live when a user asks about it.) Anthropic states that disabling Claude-SearchBot “prevents our system from indexing your content for search optimization, which may reduce your site’s visibility and accuracy in user search results.” Stance: block ClaudeBot, allow Claude-SearchBot.
Google: Google-Extended (training) vs. Googlebot (search + AI Overviews). Google-Extended is a product token, not a separate bot; it controls whether your content trains future Gemini models. Critically, it does not impact a site’s inclusion in Google Search nor is it used as a ranking signal. Googlebot still powers AI Overviews and AI Mode. Stance: optionally block Google-Extended; never block Googlebot.
Perplexity: PerplexityBot (unified, no split). PerplexityBot is designed to surface and link websites in search results on Perplexity. It is not used to crawl content for AI foundation models. There’s no separate training identity to block. Stance: allow it; blocking it only removes you from Perplexity citations.
Meta: Meta-ExternalAgent (training only). It crawls web content for training AI models and improving Meta’s products. No search or citation function. Stance: a Disallow here is pure IP protection with zero citation downside, so block it freely if you want.
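
The taxonomy above can also be kept as data and rendered into rules, which helps when you maintain more than one site. The following is a minimal sketch, not a tool the vendors provide: the `Vendor` class and `render_split_rules` function are hypothetical names, and the only inputs are the user-agent tokens listed above. It also emits an explicit Allow group for Googlebot, which is harmless but not strictly required.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vendor:
    name: str
    training_token: Optional[str]  # token that governs model training
    search_token: Optional[str]    # token that governs search and citations


# Tokens exactly as listed in the taxonomy above.
TAXONOMY = [
    Vendor("OpenAI", "GPTBot", "OAI-SearchBot"),
    Vendor("Anthropic", "ClaudeBot", "Claude-SearchBot"),
    Vendor("Google", "Google-Extended", "Googlebot"),
    Vendor("Perplexity", None, "PerplexityBot"),
    Vendor("Meta", "Meta-ExternalAgent", None),
]


def render_split_rules(vendors: list[Vendor]) -> str:
    """Render robots.txt groups: Disallow training tokens, Allow search tokens."""
    groups = []
    for vendor in vendors:
        if vendor.training_token:
            groups.append(f"User-agent: {vendor.training_token}\nDisallow: /")
    for vendor in vendors:
        if vendor.search_token:
            groups.append(f"User-agent: {vendor.search_token}\nAllow: /")
    return "\n\n".join(groups) + "\n"


print(render_split_rules(TAXONOMY))
```

Keeping the tokens in one list means a renamed or newly added crawler is a one-line change rather than a hunt through several robots.txt files.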

The robots.txt syntax—surgical allow/disallow rules

Three postures cover almost every B2B site. Copy the one that matches your decision, drop it into your robots.txt, and verify it (see the verification section below). One rule about precedence first, because it’s where people break their own config: crawlers process groups from top to bottom. A user agent can match only one rule set, which is the first, most specific group that matches. A named-bot group always wins over User-agent: * for that bot. So a blanket wildcard Disallow plus a named Allow does what you’d hope: the specific rule governs the named bot, and every bot without its own group falls back to the wildcard.
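
The precedence rule above is easy to check locally. This sketch uses Python’s standard-library `urllib.robotparser` as a stand-in for a real crawler; each vendor runs its own parser, so treat the result as a sanity check rather than a guarantee. The helper name `can_fetch` and the example.com URL are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, user_agent: str,
              url: str = "https://example.com/pricing") -> bool:
    """Parse robots.txt text and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


WILDCARD_ONLY = """\
User-agent: *
Disallow: /
"""

WILDCARD_PLUS_NAMED = """\
User-agent: *
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

# With only the wildcard group, every bot is blocked, search bots included.
print(can_fetch(WILDCARD_ONLY, "OAI-SearchBot"))        # False
# The named group governs OAI-SearchBot ...
print(can_fetch(WILDCARD_PLUS_NAMED, "OAI-SearchBot"))  # True
# ... while GPTBot has no group of its own and falls back to the wildcard.
print(can_fetch(WILDCARD_PLUS_NAMED, "GPTBot"))         # False
```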

Posture A: block all training, allow all search. This is the right default for most B2B sites that have content worth protecting. You stop feeding the training machines and stay fully citable.

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Keep AI search + citation crawlers allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Posture B—block all AI entirely. Defensible only if your content is genuinely proprietary and you don’t need AI search traffic. Understand the cost: you’ve just made yourself uncitable in ChatGPT Search, Claude, and Perplexity. For most businesses that’s self-harm.

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

Posture C—allow everything (our posture). Digital Domination runs no AI Disallow rules at all. As a GEO-native brand, every training pass is a chance for a model to learn who we are, and every search pass is a citation we want. That’s a deliberate editorial choice, not laziness—and it’s the right one when visibility is the product.

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

In the audits I run, I see one mistake more than any other—a single blanket Disallow: / under User-agent: * that silently kills citation eligibility while the owner thinks they’re just protecting their IP. They blocked everything, including the search bots that would have named them in front of a buyer. The wildcard is a sledgehammer. Use named rules.

WordPress implementation: three paths

WordPress doesn’t ship a physical robots.txt. If no file exists at the site root, it generates a virtual one on the fly; by default that contains User-agent: *, Disallow: /wp-admin/, and Allow: /wp-admin/admin-ajax.php (recent versions also append a Sitemap line). Your SEO plugin then layers its own rules on top, or replaces the file entirely. Three ways to edit it:

Yoast SEO. Go to Yoast SEO → Tools → File editor. If no physical file exists, Yoast offers to create one; once created, paste your named-bot rules and save. Yoast’s own docs confirm WordPress generates a virtual robots.txt file if the site root does not contain a physical file—creating one through Yoast overrides that virtual default.
Rank Math. Open Rank Math → General Settings → Edit robots.txt. Same idea, different module. Note: Rank Math won’t let you edit through the UI if a physical robots.txt already exists at the root—in that case you edit the file directly.
Raw file access. When a plugin, a caching layer, or a CDN has overwritten the virtual file, edit /robots.txt directly via your host’s file manager or SSH. A physical file at the web root always wins over the WordPress-generated one.

The trap: two plugins fighting over the same virtual file, or a CDN serving a cached copy that ignores your edits. Don’t trust the editor screen. Open an incognito tab and load https://yourdomain.com/robots.txt directly. Whatever the browser shows is what the crawlers see. That’s the only source of truth.
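
The same check can be scripted so it runs after every deploy. The sketch below fetches the robots.txt that is actually served and reports which user-agent groups differ from the file you intended to publish. Everything here is an assumption for illustration: the function names, the `robots-check/0.1` user-agent string, and the simplified group parser, which only understands User-agent, Allow, and Disallow lines.

```python
import urllib.request


def parse_groups(robots_txt: str) -> dict[str, list[str]]:
    """Map each user-agent token (lowercased) to its Allow/Disallow lines."""
    groups: dict[str, list[str]] = {}
    current_agents: list[str] = []
    collecting_agents = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not collecting_agents:
                current_agents = []  # a new group starts here
            collecting_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            collecting_agents = False
            for agent in current_agents:
                groups[agent].append(f"{field}: {value}")
    return groups


def fetch_served_robots(base_url: str, timeout: float = 10.0) -> str:
    """Fetch the robots.txt actually served at base_url (needs network access)."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/robots.txt",
        headers={"User-Agent": "robots-check/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def drifted_agents(intended: str, served: str) -> list[str]:
    """User-agent tokens whose served rules differ from the intended ones."""
    want, got = parse_groups(intended), parse_groups(served)
    return sorted(agent for agent in want if want[agent] != got.get(agent))


# Usage (the domain is a placeholder):
# served = fetch_served_robots("https://yourdomain.com")
# print(drifted_agents(open("robots.txt").read(), served))
```

An empty list means the served file carries every group you intended; any token in the list is a group that a plugin, cache, or CDN has dropped or altered.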

DACH B2B decision framework: when blocking training is worth it

Blocking training isn’t automatically worth it. It costs you nothing in citations if you keep the search bots open, but it also protects nothing if your content was never proprietary. Three questions decide it:

Is your content genuinely proprietary IP, or commodity educational material? A law firm’s annotated case methodology, a publisher’s archive, an agency’s framework documents—that’s IP worth keeping out of a training set. A blog post explaining what a GmbH is, for the hundredth time on the German internet, is not. Blocking training on commodity content protects nothing.
Do you sell to a buyer who might find you via ChatGPT Search or Perplexity? For most B2B KMU, the honest answer is increasingly yes. If a procurement lead asks ChatGPT for vendors in your category, you want to be in that answer. That requires the search bots open—full stop.
Do you already have earned-media coverage that seeds the models anyway? If trade press, directories, and partner sites already describe you, the models will learn you from those sources whether you allow GPTBot or not. Blocking training crawlers on your own site then costs you the marginal upside of accuracy while gaining you nothing.

For the typical DACH B2B with no third-party coverage and commodity content, blocking training gives up the chance for models to learn who you are, for zero IP upside. So don’t block the search bots, and don’t kid yourself that blocking GPTBot is protecting a secret. For law firms, publishers, and agencies sitting on proprietary methodology, the calculus flips: block the training crawlers, keep the search crawlers, and you get both protection and presence. That’s Posture A, and it’s why the split exists.
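
One way to read the framework is as a two-input decision. The sketch below is my simplification, not a formal rule: `recommend_posture` and the `Posture` enum are hypothetical names, and the earned-media question is left out because it changes how much blocking is worth rather than which posture you pick.

```python
from enum import Enum


class Posture(Enum):
    A = "block training crawlers, allow search crawlers"
    B = "block all AI crawlers"
    C = "allow everything"


def recommend_posture(proprietary_content: bool,
                      buyers_use_ai_search: bool) -> Posture:
    """Map the framework's first two questions to one of the three postures."""
    if not proprietary_content:
        # Commodity content: blocking training protects nothing.
        return Posture.C
    if buyers_use_ai_search:
        # Proprietary content, but you still want to be cited.
        return Posture.A
    # Proprietary content and no need for AI search traffic.
    return Posture.B


print(recommend_posture(proprietary_content=True, buyers_use_ai_search=True).value)
```

For a law firm with proprietary methodology and buyers who research vendors in ChatGPT, this returns Posture A.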

Verification: confirm your rules are respected

A rule you haven’t verified is a guess. Three checks, in order of effort:

Google crawlers: use the robots.txt report in Google Search Console to confirm Googlebot and Google-Extended are seeing what you intend. This is the fastest way to catch a wildcard rule that accidentally caught Googlebot.
OpenAI and Anthropic: there’s no first-party tester, so fetch /robots.txt yourself and cross-reference the exact user-agent strings against OpenAI’s and Anthropic’s published bot documentation. Spelling matters—OAI-SearchBot, not OpenAI-SearchBot; Claude-SearchBot, not ClaudeSearchBot. A typo’d user-agent is an ignored rule.
Real traffic: the ground truth is your access logs. In Cloudflare, open Analytics → Bots and watch which crawlers are actually hitting which paths. If you blocked GPTBot and still see it crawling everything, either your rule isn’t being served or the crawler isn’t honoring it. Go back to the /robots.txt check first.
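
The spelling check in the second item can be automated. This is a rough sketch using the standard library’s `difflib`: it flags User-agent values that are close to, but not exactly, one of the tokens named in this post. The token list, the function name, and the 0.8 similarity cutoff are my assumptions; always confirm the current spellings against the vendors’ own documentation.

```python
import difflib

# Tokens named in this post; check vendor docs for the current list.
KNOWN_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "Google-Extended", "Googlebot", "PerplexityBot", "Meta-ExternalAgent",
]


def suspect_user_agents(robots_txt: str, cutoff: float = 0.8) -> dict[str, str]:
    """Map each probably misspelled User-agent value to the token it resembles."""
    by_lower = {token.lower(): token for token in KNOWN_TOKENS}
    suspects: dict[str, str] = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() != "user-agent":
            continue
        agent = value.strip()
        if agent == "*" or agent.lower() in by_lower:
            continue  # wildcard or an exact, correctly spelled token
        close = difflib.get_close_matches(
            agent.lower(), list(by_lower), n=1, cutoff=cutoff
        )
        if close:
            suspects[agent] = by_lower[close[0]]
    return suspects


SAMPLE = """\
User-agent: OpenAI-SearchBot
Allow: /

User-agent: ClaudeSearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""

print(suspect_user_agents(SAMPLE))
```

On the sample, both misspellings from the list above are flagged and mapped to OAI-SearchBot and Claude-SearchBot, while the correctly spelled GPTBot passes silently.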

Logs beat documentation. A well-behaved crawler honors robots.txt, but you confirm it by watching, not by trusting.
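
If you have raw access logs rather than a dashboard, a few lines of Python give you the same view. The sketch assumes the common Combined Log Format and matches crawler tokens as substrings of the user-agent field; the sample lines are made up for illustration and the user-agent strings in them are simplified, not the vendors’ verbatim strings. Google-Extended is left out on purpose, because it is a robots.txt token only and does not show up as its own user agent in logs.

```python
import re
from collections import Counter

AI_CRAWLER_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot",
    "PerplexityBot", "Meta-ExternalAgent",
]

# Combined Log Format: ... "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def crawler_hits(log_lines) -> Counter:
    """Count page requests per AI crawler, ignoring fetches of /robots.txt itself."""
    hits: Counter = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match or match["path"] == "/robots.txt":
            continue
        user_agent = match["ua"].lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
    return hits


# Fabricated sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.7 - - [15/Jun/2026:10:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 312 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [15/Jun/2026:10:00:02 +0000] "GET /blog/post-a HTTP/1.1" 200 48211 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [15/Jun/2026:10:00:03 +0000] "GET /blog/post-b HTTP/1.1" 200 39120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.4 - - [15/Jun/2026:10:05:00 +0000] "GET /pricing HTTP/1.1" 200 20144 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
]

print(crawler_hits(SAMPLE_LOG))
```

A blocked crawler that only ever requests /robots.txt is behaving correctly; one that keeps collecting page hits after your rule went live is the case worth investigating.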

Citability check: does this config qualify you for AI citations?

Content quality is necessary but not sufficient. A site that blocks OAI-SearchBot cannot be cited by ChatGPT Search no matter how good the writing is—the crawler simply never reads the page it would have cited. So before you congratulate yourself on the content, run the access checklist:

OAI-SearchBot allowed—required to appear in ChatGPT search answers.
Claude-SearchBot allowed—required for Claude to index and cite you.
PerplexityBot allowed—required for Perplexity citations (no training split to worry about).
Googlebot allowed—required for Google AI Overviews and AI Mode, not just classic search.
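
The checklist above translates directly into a small test you can run against any robots.txt. As before, this is a sketch that leans on Python’s `urllib.robotparser`, which approximates but does not replicate each vendor’s parser; the function name and the example URL are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Search bot token -> the surface you lose if it is blocked (from the checklist).
REQUIRED_SEARCH_BOTS = {
    "OAI-SearchBot": "ChatGPT search answers",
    "Claude-SearchBot": "Claude indexing and citations",
    "PerplexityBot": "Perplexity citations",
    "Googlebot": "Google Search, AI Overviews and AI Mode",
}


def citation_access_failures(robots_txt: str, page_url: str) -> list[str]:
    """Return one message per search bot that robots.txt blocks from page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        f"{bot} is blocked: no {surface}"
        for bot, surface in REQUIRED_SEARCH_BOTS.items()
        if not parser.can_fetch(bot, page_url)
    ]


BLANKET_BLOCK = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

for failure in citation_access_failures(BLANKET_BLOCK, "https://example.com/services"):
    print(failure)
```

With the blanket wildcard block in the sample, Googlebot passes but the other three search bots fail, which is exactly the silent loss of citation eligibility described earlier.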

Robots.txt only governs whether you’re eligible to be cited. Whether you actually are cited—and how ChatGPT, Claude, Perplexity, and Google AI Overviews describe you versus your competitors—is a separate measurement problem. That’s exactly why we built Cited: it tracks how the assistants describe your brand in real time and flags drops in citation share. Get the robots.txt right first; it’s the gate. Then measure what comes through it.

If you’d rather have someone audit the whole stack—crawler access, schema, citation eligibility, and where you’re losing ground to competitors—that’s the AI Visibility Audit. Either way, fix the file this week.