AI Crawler Directory & robots.txt
Know every AI crawler—and control which ones access your content
"You can't optimize for crawlers you don't know exist."
In this lesson, you'll learn every major AI crawler, what they do, and how to control their access to your content.
Understanding AI crawlers and embeddings explains how AI systems can access your content. This lesson focuses on how you interact with those crawlers—not how you control AI answers.
Not all AI crawlers are the same. Some collect data to train models. Others build search indexes. Some only activate when users request live information. Understanding who's crawling your site—and why—is essential for making informed decisions about access.
An AI crawler directory is about awareness, not control. Knowing which crawlers exist helps you decide who gets access to your content. It doesn't give you control over what appears in AI-generated answers.
🧠 Mental Model: Understanding what you can and can't control:
• Crawlers = Access (who can read your content)
• Embeddings = Memory (how AI stores meaning)
• Retrieval = Usage (what appears in answers)
robots.txt controls access only. It doesn't control memory or usage.
What Are the Three Types of AI Crawlers?
| Type | Purpose | When Active | Respects robots.txt |
|---|---|---|---|
| Training | Collect data to train AI models | Continuously crawling | ✅ Yes |
| Indexing | Build search indexes for AI search | Continuously crawling | ✅ Yes |
| On-Demand | Fetch content for user requests | Only when user asks | ⚠️ May bypass |
Different crawlers = different strategies. You might block training bots (to prevent your content from training competitors' AI) while allowing on-demand fetchers (so AI can cite you when answering questions). This isn't all-or-nothing.
Which AI Crawlers Exist and Who Operates Them?
| Provider | Crawler | Type | User-Agent | robots.txt |
|---|---|---|---|---|
| OpenAI | GPTBot | Training | GPTBot/1.0 | ✅ Respects |
| OpenAI | OAI-SearchBot | Indexing | OAI-SearchBot/1.0 | ✅ Respects |
| OpenAI | ChatGPT-User | On-Demand | ChatGPT-User/1.0 | ⚠️ May bypass |
| Anthropic | ClaudeBot | Training | ClaudeBot/1.0 | ✅ Respects |
| Anthropic | Claude-User | On-Demand | Claude-User | ⚠️ May bypass |
| Google | Google-Extended | Training | Google-Extended | ✅ Respects |
| Microsoft | Bingbot | Indexing | bingbot/2.0 | ✅ Respects |
| Perplexity | PerplexityBot | Indexing | PerplexityBot/1.0 | ✅ Respects |
| Perplexity | Perplexity-User | On-Demand | Perplexity-User/1.0 | ❌ No |
| Meta | Meta-ExternalAgent | Training | meta-externalagent/1.1 | ✅ Respects |
| Apple | Applebot-Extended | Training | Applebot-Extended | ✅ Respects |
| Common Crawl | CCBot | Training | CCBot/2.0 | ✅ Respects |
💡 Note on Common Crawl: CCBot powers the Common Crawl dataset, which is used to train many AI models including some open-source LLMs. Blocking CCBot has a wide impact on AI training data.
- GPTBot is operated by OpenAI
- ClaudeBot is operated by Anthropic
- Google-Extended trains Gemini
- Bingbot powers Microsoft Copilot
- robots.txt controls access for AI crawlers
- Training bots from reputable providers respect robots.txt
- On-demand fetchers may bypass robots.txt
How Do You Configure robots.txt to Control AI Crawlers?
Block All AI Training
Prevent your content from training any AI models:
```
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Allow traditional search engines
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /
```
Allow Specific Providers Only
Allow only the AI systems you want to work with:
```
# Allow OpenAI crawlers
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Allow Perplexity
User-agent: PerplexityBot
Allow: /

# Block others
User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```
Block AI from Specific Sections
Allow AI crawlers generally but protect certain content:
```
# Allow GPTBot generally
User-agent: GPTBot
Allow: /
Disallow: /premium/
Disallow: /members-only/
Disallow: /proprietary-research/
```
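Before deploying rules like these, you can sanity-check how a compliant crawler would interpret them. Here's a minimal sketch using Python's standard urllib.robotparser (the policy and URLs are hypothetical; note that Python's parser applies rules in file order, first match wins, which can differ from longest-match handling used by some crawlers):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block OpenAI's training bot, allow Perplexity's indexer.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Simulate how a compliant crawler would interpret the rules.
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))         # False
print(rp.can_fetch("PerplexityBot", "https://example.com/blog/post"))  # True
```

Remember this only predicts the behavior of crawlers that choose to comply; it says nothing about on-demand fetchers that bypass robots.txt.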
⚠️ Important: robots.txt is voluntary. It's a request, not a command. Training bots from reputable companies respect it, but on-demand fetchers may bypass it when users explicitly ask for your content.
📌 Critical Understanding: Robots.txt can limit access, but it cannot guarantee exclusion from AI-generated answers. Your content may still appear in AI responses through other sources, citations, or cached data.
Blocking AI crawlers doesn't remove your content from the AI ecosystem—it only limits direct access. Your content may already be in training data, cited by other sources, or accessible through on-demand fetchers that bypass robots.txt.
Allowing or blocking crawlers affects access—not whether your content is ultimately useful to AI systems.
Should You Block or Allow AI Crawlers?
✅ Allow AI Crawlers If:
- You want maximum AI visibility
- You want to be cited in AI answers
- Your content is public/marketing-focused
- Brand awareness is a priority
- You're building thought leadership
❌ Block AI Crawlers If:
- You have proprietary/premium content
- You don't want to train competitors' AI
- Content is behind a paywall
- You have legal/compliance concerns
- You prefer users visit your site directly
The Middle Path (Most Common)
Many publishers block training bots but allow on-demand fetchers and indexing bots:
- Block: GPTBot, ClaudeBot, CCBot, Google-Extended (no training)
- Allow: PerplexityBot, OAI-SearchBot (search indexing)
- Accept: ChatGPT-User, Perplexity-User (will fetch anyway when users ask)
This approach protects your content from being baked into model weights while still allowing real-time citation in AI answers.
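As an illustrative robots.txt for this middle path (crawler names are from the directory above; adapt the list to your own policy):

```
# Middle path: block model training, allow AI search indexing
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```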
Blocking all AI crawlers = zero AI visibility. If your goal is GEO, you need AI systems to access your content somehow. The question isn't if, but which ones and how much.
How Do You Know Which AI Crawlers Are Visiting Your Site?
Look for these User-Agent patterns in your logs:
- GPTBot/1.0 — OpenAI training
- ChatGPT-User — OpenAI on-demand
- ClaudeBot/1.0 — Anthropic training
- PerplexityBot — Perplexity indexing
- Google-Extended — Gemini training
💡 Pro Tip: If you see heavy crawling from a specific bot, that's a signal your content is being actively collected. Decide if you want that—and adjust robots.txt accordingly.
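To put that tip into practice, here's a minimal log-scanning sketch (the sample log lines and crawler list are illustrative; any log format that embeds the User-Agent string works with substring matching):

```python
from collections import Counter

# AI crawler tokens to look for in the User-Agent field
# (a subset of the directory above).
AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "Google-Extended"]

def count_ai_hits(log_lines):
    """Tally requests per AI crawler by substring match on each log line."""
    counts = Counter()
    for line in log_lines:
        for bot in AI_CRAWLERS:
            if bot in line:
                counts[bot] += 1
    return counts

# Two hypothetical access-log lines:
sample = [
    '203.0.113.5 - - "GET /blog HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - "GET /docs HTTP/1.1" 200 "-" "PerplexityBot/1.0"',
]
print(count_ai_hits(sample))  # Counter({'GPTBot': 1, 'PerplexityBot': 1})
```

Run something like this over a week of access logs to see which crawlers are actually collecting your content before you decide what to block.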
Key Takeaways
- A crawler directory is about awareness, not control. Knowing who crawls your site helps you make access decisions—it doesn't control what appears in AI answers.
- Three crawler types serve different purposes. Training bots collect data for models, indexing bots power search, on-demand fetchers respond to user requests.
- robots.txt controls access only. It can limit who reads your content, but it cannot guarantee exclusion from AI-generated answers.
- Blocking doesn't mean disappearing. Your content may already be in training data, cited by others, or fetched on-demand despite blocking.
- The decision is strategic, not technical. Blocking everything = zero AI visibility. Allowing everything = maximum exposure. Most choose a middle path.
- Monitor your logs. Know which crawlers visit your site and how often before making access decisions.
If blocking crawlers isn't enough to control AI answers, what actually increases the chances of your content being used? That's what we'll explore in the GEO Audit.
- Lesson 1: Introduction to GEO
- Lesson 2: How AI Crawlers Work
- Lesson 3: Embeddings & Vector Search
- Lesson 4: AI Crawler Directory & robots.txt ← You are here
- Lesson 5: GEO Audit
- Lesson 6: GEO Metrics & Measurement