A single line in your robots.txt can decide whether ChatGPT ever mentions your product. Most founders never check that line, and a surprising number are blocking the exact bots they want crawling their site.
If you've read anything about AI search lately, you've probably wondered which crawlers to let in. Setting up robots.txt for AI crawlers sounds like a two-minute job, but the rules changed fast, the bot names are confusing, and one wrong Disallow can quietly remove you from ChatGPT, Perplexity, and Google's AI Overviews. This guide fixes that. You'll get the current bot list, copy-paste robots.txt configs, and a clear rule for when to allow GPTBot and when to block it.
What is an AI crawler (and why it's not just Googlebot)
An AI crawler is a bot that fetches your pages on behalf of an AI company instead of a traditional search engine. Some collect text to train large language models. Others fetch pages in real time so an assistant can cite you in an answer. A few do both.
This matters because the old mental model, "let Googlebot in, block the scrapers", no longer covers reality. There are now two very different jobs an AI crawler can do, and robots.txt treats them as separate user-agents you control one by one.
- Training crawlers collect content to improve a model. Blocking them keeps your words out of future training data.
- Retrieval / search crawlers fetch pages so the assistant can link to and quote you in live answers. Blocking them removes you from AI citations.
Mix these up and you'll either leak content you wanted private or vanish from the AI answers you were trying to win. So before you touch robots.txt for AI crawlers, you need to know who's knocking.
The current AI crawler list (2026)
Here are the user-agents worth knowing, grouped by who runs them. These are the strings you'll actually write into robots.txt.
| Bot (user-agent) | Run by | Job |
|---|---|---|
| GPTBot | OpenAI | Trains OpenAI models |
| OAI-SearchBot | OpenAI | Powers ChatGPT search citations |
| ChatGPT-User | OpenAI | Real-time fetch when a user/agent browses |
| ClaudeBot | Anthropic | Crawls for Claude (training + retrieval) |
| PerplexityBot | Perplexity | Indexes pages for Perplexity citations |
| Perplexity-User | Perplexity | Real-time fetch for a live query |
| Google-Extended | Google | Opt-out token for Gemini AI training |
| CCBot | Common Crawl | Open dataset many models train on |
| Bytespider | ByteDance | Training data for TikTok's AI |
| Applebot-Extended | Apple | Opt-out token for Apple AI training |
A few of these aren't crawlers at all in the normal sense. Google-Extended and Applebot-Extended are control tokens: you list them in robots.txt to opt out of AI training, while the regular Googlebot and Applebot keep crawling for search. That distinction trips up almost everyone, and we'll come back to it.
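If you manage several sites, it can help to keep the table above as data. Here's a minimal Python sketch of that idea. The role labels ("training", "retrieval", "both", "token") and the helper name `agents_with_role` are made up for this example; the bot names and jobs come straight from the table.

```python
# Roles mirror the table in this guide. The label strings are illustrative,
# not official terms used by any of these companies.
AI_USER_AGENTS = {
    "GPTBot": "training",
    "OAI-SearchBot": "retrieval",
    "ChatGPT-User": "retrieval",
    "ClaudeBot": "both",
    "PerplexityBot": "retrieval",
    "Perplexity-User": "retrieval",
    "Google-Extended": "token",
    "CCBot": "training",
    "Bytespider": "training",
    "Applebot-Extended": "token",
}


def agents_with_role(role: str) -> list[str]:
    """Return the user-agent names that carry the given role, sorted by name."""
    return sorted(name for name, r in AI_USER_AGENTS.items() if r == role)


if __name__ == "__main__":
    for role in ("training", "retrieval", "both", "token"):
        print(role, agents_with_role(role))
```

Keeping the "token" entries separate makes it harder to treat Google-Extended or Applebot-Extended as if they were ordinary crawlers.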
GPTBot: OpenAI's training crawler
GPTBot is the one everyone asks about. It's OpenAI's crawler for collecting content to train models. With around 2,000 global searches a month for the term alone, it's the bot founders look up first, usually right after reading that AI might be using their content.
The honest answer to "should I block GPTBot?" is: it depends on what you're optimizing for. If you sell content, run a paywalled archive, or simply don't want your work in training data, blocking GPTBot is reasonable. OpenAI's GPTBot documentation specifies a simple robots.txt block for this:
User-agent: GPTBot
Disallow: /
But here's the trap. Blocking GPTBot does not remove you from ChatGPT search. Training and search are different bots. If your goal is AI visibility (being the source ChatGPT recommends), blocking GPTBot costs you almost nothing on the visibility side, while blocking the next bot in this guide would cost you everything.
OAI-SearchBot: the one that actually affects ChatGPT citations
This is the most important section in this guide. OAI-SearchBot is the crawler that lets ChatGPT find, link to, and cite your site in search-style answers. If you want to show up when someone asks ChatGPT "what's the best tool for X," this is the bot that has to reach you.
Many sites copy a "block all AI bots" snippet they found online and accidentally disallow OAI-SearchBot along with GPTBot. The result: they stop AI training (fine) and delete themselves from ChatGPT's answers (not fine, if you're trying to get found).
For most businesses chasing answer engine optimization, the right move is to allow OAI-SearchBot and ChatGPT-User, then decide on GPTBot separately:
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: GPTBot
Disallow: /
That config keeps you citable in ChatGPT while keeping your content out of training, which is the configuration most founders actually want once they understand the difference.
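One way to sanity-check this split before you deploy it is Python's standard-library `urllib.robotparser`. The sketch below pastes the config above into a string and asks the parser about each bot. The helper name `is_allowed` is an assumption for this example, and Python's parser is not the parser any of these bots actually run, so treat the result as a smoke test rather than proof.

```python
from urllib.robotparser import RobotFileParser

# The config from above, pasted as a string.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Return True if user_agent may fetch path under robots_txt (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for bot in ("OAI-SearchBot", "ChatGPT-User", "GPTBot"):
        print(bot, "allowed" if is_allowed(ROBOTS_TXT, bot) else "blocked")
```

If GPTBot comes back allowed or OAI-SearchBot comes back blocked, the file is not doing what this section describes.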
ClaudeBot: the high-volume crawler everyone's Googling
ClaudeBot, Anthropic's crawler, generates roughly 14,000 global searches a month, far more than GPTBot. The reason is simple: site owners noticed it hitting their servers hard and went looking for "what is ClaudeBot" and "how do I block it."
If ClaudeBot is straining your server, you can slow or stop it. Per Anthropic's robots.txt guidance for the ClaudeBot user-agent, the block follows the familiar pattern:
User-agent: ClaudeBot
Disallow: /
Before you do, weigh the trade-off. ClaudeBot increasingly supports retrieval, not just training, meaning it can help Claude cite your site. If server load is the problem, a Crawl-delay directive or a block on only the heavy directories (like faceted search or infinite archives) usually beats a blanket ban:
User-agent: ClaudeBot
Disallow: /search
Disallow: /tag/
Crawl-delay: 10
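To see what that partial block does path by path, here is a small sketch using Python's `urllib.robotparser`. The example paths (`/search?q=pricing`, `/tag/news`, `/blog/launch-post`) are invented for illustration. Keep in mind that Crawl-delay is a non-standard extension, so whether a bot honors it is up to the bot; this only shows how the directive is parsed.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /search
Disallow: /tag/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Heavy directories are off-limits to ClaudeBot (rules are prefix matches)...
search_ok = parser.can_fetch("ClaudeBot", "/search?q=pricing")
tag_ok = parser.can_fetch("ClaudeBot", "/tag/news")
# ...but ordinary pages stay crawlable.
blog_ok = parser.can_fetch("ClaudeBot", "/blog/launch-post")
# crawl_delay() returns the parsed Crawl-delay value, or None if there is none.
delay = parser.crawl_delay("ClaudeBot")

if __name__ == "__main__":
    print("search:", search_ok, "tag:", tag_ok, "blog:", blog_ok, "delay:", delay)
```

Because rules match by prefix, `Disallow: /search` also covers a path like `/searchable`. Add a trailing slash if you only mean the directory.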
PerplexityBot and Google-Extended
PerplexityBot indexes pages so Perplexity can cite them, and Perplexity-User fetches in real time for a live question. If Perplexity citations matter to you, and for many B2B niches they convert well, allow both. The PerplexityBot user-agent robots.txt setup is straightforward:
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
Google-Extended is the odd one. It's not a crawler; it's an opt-out token for Google's Gemini AI training. Adding it to robots.txt tells Google not to use your content for AI model training. According to the Google-Extended robots.txt documentation, the syntax looks like this:
User-agent: Google-Extended
Disallow: /
Critical nuance: blocking Google-Extended does not remove you from Google's AI Overviews. AI Overviews pull from the normal search index that Googlebot builds. So if you block Google-Extended hoping to disappear from AI Overviews, nothing happens: you've only opted out of Gemini training. To affect AI Overviews, you'd have to block Googlebot itself, which would also kill your normal search rankings. Almost nobody should do that.
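At the robots.txt level, the separation between the token and the crawler is easy to demonstrate. The sketch below uses Python's `urllib.robotparser` as a stand-in parser (an assumption; Google's own matching may differ in details) to show that a Google-Extended rule says nothing about Googlebot.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The opt-out token is disallowed...
google_extended_ok = parser.can_fetch("Google-Extended", "/")
# ...while Googlebot has no matching group, so it falls through to "allowed".
googlebot_ok = parser.can_fetch("Googlebot", "/")

if __name__ == "__main__":
    print("Google-Extended:", google_extended_ok, "| Googlebot:", googlebot_ok)
```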
The "block all" config, and why it usually backfires
The phrase "robots.txt block all" gets about 600 global searches a month, and it's tempting. The nuclear option looks like this:
User-agent: *
Disallow: /
Don't ship that unless you truly want zero crawlers, search engines included. It blocks Googlebot and Bingbot too, so you'd lose normal search and AI search with a single rule.
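A quick way to see the blast radius is to ask a parser which bots the wildcard rule catches. This is a sketch with Python's `urllib.robotparser`; the `blocked_agents` helper and the `AGENTS` sample list are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# A sample of search and AI crawlers to test against.
AGENTS = ["Googlebot", "Bingbot", "OAI-SearchBot", "PerplexityBot", "GPTBot"]


def blocked_agents(robots_txt: str, agents: list[str], path: str = "/") -> list[str]:
    """Return the agents that may NOT fetch path under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


if __name__ == "__main__":
    print(blocked_agents(BLOCK_ALL, AGENTS))
```

Every name in the list comes back blocked, search engines included.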
If your real goal is to block AI crawlers used for training while staying visible in search and AI answers, be surgical instead of reaching for a blanket block. Here's a balanced starting point that opts out of training, blocks the aggressive scrapers, and keeps the citation bots in:
# Allow search + AI citation crawlers
User-agent: Googlebot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
# Opt out of AI training
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
This is the kind of configuration we set up for early-stage AI startups at avinashvagh.com: visible where it counts, protected where it matters.
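If you maintain more than one site, the balanced config above could also be generated from a small policy table instead of edited by hand. The following is a minimal sketch; `POLICY` and `render_robots_txt` are hypothetical names, and it only handles the all-or-nothing `Allow: /` and `Disallow: /` cases used in this guide.

```python
# Hypothetical policy: True = allow everything, False = disallow everything.
POLICY = {
    "Googlebot": True,
    "OAI-SearchBot": True,
    "PerplexityBot": True,
    "GPTBot": False,
    "Google-Extended": False,
    "CCBot": False,
    "Bytespider": False,
}


def render_robots_txt(policy: dict[str, bool]) -> str:
    """Render one User-agent group per bot, separated by blank lines."""
    groups = []
    for agent, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}")
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(render_robots_txt(POLICY), end="")
```

Generating the file from one table keeps the allow and block decisions in a single place, which makes a copy-paste mistake in a user-agent name less likely.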
How to allow AI crawlers but block scrapers
The pattern to remember: allow the bots that send you traffic or citations, block the bots that only take. Retrieval crawlers (OAI-SearchBot, Perplexity-User, ChatGPT-User) can drive referral visits and AI mentions. Pure training and dataset crawlers (CCBot, Bytespider) give you nothing back, so blocking them is low-risk.
One more guardrail: robots.txt is a polite request, not a wall. Well-behaved bots from OpenAI, Anthropic, Google, and Perplexity respect it. Bad actors ignore it. For genuinely sensitive content, use authentication or a server-level block (such as a firewall rule), not just a Disallow. An X-Robots-Tag header is not a substitute here, because it is also a directive that only cooperative crawlers honor.
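As one possible shape for a server-level block, here is a sketch written as Python WSGI middleware that returns a 403 to listed user-agents. Everything named here (`BLOCKED_AGENT_TOKENS`, `block_agents`, `demo_app`, `status_for`) is an assumption for illustration; a real deployment would more likely put the equivalent rule in the web server, CDN, or firewall.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical deny list, matched as case-insensitive substrings of User-Agent.
BLOCKED_AGENT_TOKENS = ("ccbot", "bytespider")


def block_agents(app, tokens=BLOCKED_AGENT_TOKENS):
    """Wrap a WSGI app so requests from listed user-agents get a 403."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(user_agent: str) -> str:
    """Send one fake request through the wrapped app and return its status line."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    seen = []
    body = block_agents(demo_app)(environ, lambda status, headers: seen.append(status))
    list(body)  # consume the response body
    return seen[0]


if __name__ == "__main__":
    print(status_for("Mozilla/5.0 (compatible; Bytespider)"))
    print(status_for("Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"))
```

A user-agent header is trivial to spoof, so this only stops bots that identify themselves honestly. Content that must stay private still needs authentication.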
After you edit robots.txt for AI crawlers, test it. A typo in a user-agent string fails silently: the bot simply ignores a rule it doesn't match, and you won't get an error. Confirm every directive is being read the way you intended before you assume you're protected or visible.
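Since a misspelled user-agent produces no error, a tiny lint step can catch it. The sketch below compares every `User-agent` value against a known list and suggests the closest match using Python's `difflib`. The `KNOWN_AGENTS` list (the bots from this guide plus Googlebot, Bingbot, and `*`) and the function name `find_suspect_agents` are assumptions; an unrecognized name is not always a typo, so read the output as a prompt to double-check.

```python
import difflib

# Names from the bot table in this guide, plus two search crawlers and "*".
KNOWN_AGENTS = [
    "*", "Googlebot", "Bingbot", "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "PerplexityBot", "Perplexity-User", "Google-Extended",
    "CCBot", "Bytespider", "Applebot-Extended",
]


def find_suspect_agents(robots_txt, known=KNOWN_AGENTS):
    """Return (line_number, name, suggestion) for unrecognized User-agent values."""
    known_lower = {name.lower(): name for name in known}
    suspects = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() != "user-agent":
            continue
        name = value.strip()
        if name.lower() in known_lower:
            continue
        close = difflib.get_close_matches(name.lower(), list(known_lower), n=1)
        suggestion = known_lower[close[0]] if close else None
        suspects.append((number, name, suggestion))
    return suspects


if __name__ == "__main__":
    sample = "User-agent: GTPBot\nDisallow: /\n"
    for number, name, suggestion in find_suspect_agents(sample):
        print(f"line {number}: unknown user-agent {name!r}, did you mean {suggestion!r}?")
```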
Verify your robots.txt in seconds
Once your file is live, check it against the real, current list of AI bots, including the ones added this year that most "block AI" snippets miss. Our free AI Crawler Checker scans your site and shows exactly which AI crawlers you're allowing and blocking right now, so you can confirm GPTBot, OAI-SearchBot, ClaudeBot, and the rest are set the way you want.
If you'd rather understand the bigger picture first, our guide to answer engine optimization explains how crawler access fits into getting cited by AI, and the llms.txt explainer covers the newer standard that works alongside robots.txt.
Conclusion: visibility first, protection second
Getting robots.txt for AI crawlers right comes down to one decision repeated per bot: does this crawler help me get found, or does it only take my content? Allow the ones that cite you: OAI-SearchBot, PerplexityBot, and ChatGPT-User. Decide on training crawlers like GPTBot and Google-Extended based on whether you want your content in AI models. And never reach for "block all" unless you mean it.
If you're building an AI startup and want to be the answer ChatGPT and Perplexity recommend, crawler access is step one of a much bigger answer-engine strategy. That's the work we do at avinashvagh.com: making sure the AI engines can find you, then making sure they pick you.
Ready to check your own setup? Run your domain through the free AI Crawler Checker and see exactly which AI crawlers you're allowing today.
FAQs
Should I block GPTBot?
Only if you don't want your content used to train OpenAI's models. Blocking GPTBot does not remove you from ChatGPT search; that's handled by a separate bot, OAI-SearchBot. If AI visibility is your goal, you can block GPTBot and still stay citable.
Does blocking AI crawlers hurt my SEO?
Blocking AI training crawlers (GPTBot, CCBot, Google-Extended) has no effect on traditional Google rankings, because those are controlled by Googlebot. Blocking retrieval crawlers like OAI-SearchBot or PerplexityBot won't hurt Google either, but it will remove you from AI answers in those tools.
What's the difference between GPTBot and OAI-SearchBot?
GPTBot collects content to train OpenAI's models. OAI-SearchBot crawls so ChatGPT can link to and cite your site in search-style answers. Allow OAI-SearchBot for visibility; block GPTBot only if you want to opt out of training.
Why is ClaudeBot crawling my site so much?
ClaudeBot is Anthropic's crawler and can be active on large or frequently updated sites. If it's straining your server, add a Crawl-delay or block heavy directories rather than banning it entirely, since ClaudeBot can also help Claude cite your content.
How do I allow AI crawlers but block scrapers?
Allow citation crawlers (OAI-SearchBot, PerplexityBot, ChatGPT-User) and block dataset crawlers (CCBot, Bytespider). For sensitive content, use authentication or a server-level block such as a firewall rule, because robots.txt only stops well-behaved bots.
What happens if I block ChatGPT from my site?
If you block OAI-SearchBot and ChatGPT-User, ChatGPT can no longer fetch or cite your pages in its answers, so you lose AI search visibility. Blocking only GPTBot keeps you in ChatGPT search while opting out of training.
Does blocking Google-Extended remove me from AI Overviews?
No. Google-Extended only opts you out of Gemini AI training. AI Overviews use the normal search index built by Googlebot, so blocking Google-Extended changes nothing there.