AI Crawler Directory & robots.txt
Know every AI crawler—and control which ones access your content
"You can't optimize for crawlers you don't know exist."
In this lesson, you'll learn every major AI crawler, what they do, and how to control their access to your content.
Understanding AI crawlers and embeddings explains how AI systems can access your content. This lesson focuses on how you interact with those crawlers—not how you control AI answers.
Not all AI crawlers are the same. Some collect data to train models. Others build search indexes. Some only activate when users request live information. Understanding who's crawling your site—and why—is essential for making informed decisions about access.
An AI crawler directory is about awareness, not control. Knowing which crawlers exist helps you decide who gets access to your content. It doesn't give you control over what appears in AI-generated answers.
🧠 Mental Model: Understanding what you can and can't control:
• Crawlers = Access (who can read your content)
• Embeddings = Memory (how AI stores meaning)
• Retrieval = Usage (what appears in answers)
robots.txt controls access only. It doesn't control memory or usage.
What Are the Three Types of AI Crawlers?
| Type | Purpose | When Active | Respects robots.txt |
|---|---|---|---|
| Training | Collect data to train AI models | Continuously crawling | ✅ Yes |
| Indexing | Build search indexes for AI search | Continuously crawling | ✅ Yes |
| On-Demand | Fetch content for user requests | Only when user asks | ⚠️ May bypass |
Different crawlers = different strategies. You might block training bots (to prevent your content from training competitors' AI) while allowing on-demand fetchers (so AI can cite you when answering questions). This isn't all-or-nothing.
Which AI Crawlers Exist and Who Operates Them?
| Provider | Crawler | Type | User-Agent | robots.txt |
|---|---|---|---|---|
| OpenAI | GPTBot | Training | GPTBot/1.0 | ✅ Respects |
| OpenAI | OAI-SearchBot | Indexing | OAI-SearchBot/1.0 | ✅ Respects |
| OpenAI | ChatGPT-User | On-Demand | ChatGPT-User/1.0 | ⚠️ May bypass |
| Anthropic | ClaudeBot | Training | ClaudeBot/1.0 | ✅ Respects |
| Anthropic | Claude-User | On-Demand | Claude-User | ⚠️ May bypass |
| Google | Google-Extended | Training | Google-Extended | ✅ Respects |
| Microsoft | Bingbot | Indexing | bingbot/2.0 | ✅ Respects |
| Perplexity | PerplexityBot | Indexing | PerplexityBot/1.0 | ✅ Respects |
| Perplexity | Perplexity-User | On-Demand | Perplexity-User/1.0 | ❌ No |
| Meta | Meta-ExternalAgent | Training | meta-externalagent/1.1 | ✅ Respects |
| Apple | Applebot-Extended | Training | Applebot-Extended | ✅ Respects |
| Common Crawl | CCBot | Training | CCBot/2.0 | ✅ Respects |
💡 Note on Common Crawl: CCBot powers the Common Crawl dataset, which is used to train many AI models including some open-source LLMs. Blocking CCBot has a wide impact on AI training data.
- GPTBot is operated by OpenAI
- ClaudeBot is operated by Anthropic
- Google-Extended trains Gemini
- Bingbot powers Microsoft Copilot
- robots.txt controls access for AI crawlers
- Training bots from reputable providers respect robots.txt
- On-demand fetchers may bypass robots.txt
How Do You Configure robots.txt to Control AI Crawlers?
Block All AI Training
Prevent your content from training any AI models:
```
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Allow traditional search engines
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /
```
Allow Specific Providers Only
Allow only the AI systems you want to work with:
```
# Allow OpenAI crawlers
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Allow Perplexity
User-agent: PerplexityBot
Allow: /

# Block others
User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```
Block AI from Specific Sections
Allow AI crawlers generally but protect certain content:
```
# Allow GPTBot generally
User-agent: GPTBot
Allow: /
Disallow: /premium/
Disallow: /members-only/
Disallow: /proprietary-research/
```
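Before deploying rules like these, you can sanity-check how a compliant crawler would interpret them. Here's a minimal sketch using Python's standard urllib.robotparser (the policy and URLs are hypothetical; note that Python's parser applies rules in file order, first match wins, which can differ from longest-match handling used by some crawlers):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block OpenAI's training bot, allow Perplexity's indexer.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Simulate how a compliant crawler would interpret the rules.
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))         # False
print(rp.can_fetch("PerplexityBot", "https://example.com/blog/post"))  # True
```

Remember this only predicts the behavior of crawlers that choose to comply; it says nothing about on-demand fetchers that bypass robots.txt.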
⚠️ Important: robots.txt is voluntary. It's a request, not a command. Training bots from reputable companies respect it, but on-demand fetchers may bypass it when users explicitly ask for your content.
📌 Critical Understanding: Robots.txt can limit access, but it cannot guarantee exclusion from AI-generated answers. Your content may still appear in AI responses through other sources, citations, or cached data.
Blocking AI crawlers doesn't remove your content from the AI ecosystem—it only limits direct access. Your content may already be in training data, cited by other sources, or accessible through on-demand fetchers that bypass robots.txt.
Allowing or blocking crawlers affects access—not whether your content is ultimately useful to AI systems.
Should You Block or Allow AI Crawlers?
✅ Allow AI Crawlers If:
- You want maximum AI visibility
- You want to be cited in AI answers
- Your content is public/marketing-focused
- Brand awareness is a priority
- You're building thought leadership
❌ Block AI Crawlers If:
- You have proprietary/premium content
- You don't want to train competitors' AI
- Content is behind a paywall
- You have legal/compliance concerns
- You prefer users visit your site directly
The Middle Path (Most Common)
Many publishers block training bots but allow on-demand fetchers and indexing bots:
- Block: GPTBot, ClaudeBot, CCBot, Google-Extended (no training)
- Allow: PerplexityBot, OAI-SearchBot (search indexing)
- Accept: ChatGPT-User, Perplexity-User (will fetch anyway when users ask)
This approach protects your content from being baked into model weights while still allowing real-time citation in AI answers.
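As an illustrative robots.txt for this middle path (crawler names are from the directory above; adapt the list to your own policy):

```
# Middle path: block model training, allow AI search indexing
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```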
Blocking all AI crawlers = zero AI visibility. If your goal is GEO, you need AI systems to access your content somehow. The question isn't if, but which ones and how much.
How Do You Know Which AI Crawlers Are Visiting Your Site?
Look for these User-Agent patterns in your logs:
- GPTBot/1.0 — OpenAI training
- ChatGPT-User — OpenAI on-demand
- ClaudeBot/1.0 — Anthropic training
- PerplexityBot — Perplexity indexing
- Google-Extended — Gemini training
💡 Pro Tip: If you see heavy crawling from a specific bot, that's a signal your content is being actively collected. Decide if you want that—and adjust robots.txt accordingly.
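To put that tip into practice, here's a minimal log-scanning sketch (the sample log lines and crawler list are illustrative; any log format that embeds the User-Agent string works with substring matching):

```python
from collections import Counter

# AI crawler tokens to look for in the User-Agent field
# (a subset of the directory above).
AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "Google-Extended"]

def count_ai_hits(log_lines):
    """Tally requests per AI crawler by substring match on each log line."""
    counts = Counter()
    for line in log_lines:
        for bot in AI_CRAWLERS:
            if bot in line:
                counts[bot] += 1
    return counts

# Two hypothetical access-log lines:
sample = [
    '203.0.113.5 - - "GET /blog HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - "GET /docs HTTP/1.1" 200 "-" "PerplexityBot/1.0"',
]
print(count_ai_hits(sample))  # Counter({'GPTBot': 1, 'PerplexityBot': 1})
```

Run something like this over a week of access logs to see which crawlers are actually collecting your content before you decide what to block.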
Key Takeaways
- A crawler directory is about awareness, not control. Knowing who crawls your site helps you make access decisions—it doesn't control what appears in AI answers.
- Three crawler types serve different purposes. Training bots collect data for models, indexing bots power search, on-demand fetchers respond to user requests.
- robots.txt controls access only. It can limit who reads your content, but it cannot guarantee exclusion from AI-generated answers.
- Blocking doesn't mean disappearing. Your content may already be in training data, cited by others, or fetched on-demand despite blocking.
- The decision is strategic, not technical. Blocking everything = zero AI visibility. Allowing everything = maximum exposure. Most choose a middle path.
- Monitor your logs. Know which crawlers visit your site and how often before making access decisions.
If blocking crawlers isn't enough to control AI answers, what actually increases the chances of your content being used? That's what we'll explore in the GEO Audit.
- Lesson 1: Introduction to GEO
- Lesson 2: How AI Crawlers Work
- Lesson 3: Embeddings & Vector Search
- Lesson 4: AI Crawler Directory & robots.txt ← You are here
- Lesson 5: GEO Audit
- Lesson 6: GEO Metrics & Measurement