How to Configure llms.txt and robots.txt for AI Crawlers in 2026

AI crawlers do not read your site the way Googlebot does — and most robots.txt files still have no idea GPTBot, ClaudeBot or PerplexityBot exist. This is a hands-on technical guide to configuring both files correctly, with Nevatrix's own live setup as a real example.

robots.txt tells crawlers which parts of your site they may access. llms.txt is a newer, separate file that tells AI systems which pages are worth reading and how to understand your business at a glance. In 2026, both matter for GEO: if AI crawlers cannot access your site, or cannot quickly understand what it is about, you are far less likely to be cited in ChatGPT, Perplexity or Google AI Overviews, no matter how good your content is.

Allow AI crawlers explicitly in robots.txt (GPTBot, ClaudeBot, PerplexityBot, Applebot-Extended, Google-Extended and others), and publish a structured llms.txt file at yourdomain.com/llms.txt summarising your business, services, pricing and key pages. Many robots.txt files were written before these bots existed, so whatever policy applies to them is often accidental rather than chosen.

What Is llms.txt and Why Does It Exist?

llms.txt is a proposed plain-text standard, placed at yourdomain.com/llms.txt, that gives AI systems a clean, structured summary of your site — company facts, services, pricing, FAQs and key page links — instead of forcing them to parse and interpret your full HTML. Think of it as a business card for AI crawlers: robots.txt controls access, llms.txt improves comprehension.

What Is robots.txt and How AI Crawlers Use It Differently Than Google

robots.txt is the decades-old standard (formalised as RFC 9309 in 2022) that tells any crawler which paths it may or may not access, using User-agent and Disallow/Allow rules. Googlebot has followed robots.txt for 25+ years, but AI crawlers are newer and each has its own user-agent name. If you have never explicitly addressed them, they fall back to whatever your User-agent: * group says. A broad Disallow there blocks them by accident, and if no rule matches at all they are allowed by default. Either way, the outcome is something that happened to you rather than a decision you made.
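As a rough illustration of how a crawler without its own group is evaluated, the sketch below uses Python's standard-library urllib.robotparser on two sample files. The file contents and the example.com URLs are made up for this example, and Python's parser is only an approximation of how any particular AI crawler matches rules.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical file with no AI-specific groups and a narrow wildcard rule.
NARROW = """\
User-agent: *
Disallow: /admin/
"""

# Hypothetical file whose wildcard group blocks everything.
BROAD = """\
User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    for label, text in (("narrow", NARROW), ("broad", BROAD)):
        for path in ("/blog/", "/admin/login"):
            verdict = allowed(text, "GPTBot", f"https://example.com{path}")
            print(f"{label:6} GPTBot {path:14} allowed={verdict}")
```

In both files GPTBot has no group of its own, so the wildcard group decides: the narrow file lets it read the blog, while the broad file blocks it everywhere.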

The Major AI Crawlers You Need to Know in 2026

Crawler	Operated By	Purpose
GPTBot	OpenAI	Trains ChatGPT's models on public web content
ChatGPT-User	OpenAI	Fetches live pages when a user asks ChatGPT to browse
OAI-SearchBot	OpenAI	Powers ChatGPT Search result indexing
ClaudeBot / anthropic-ai	Anthropic	Crawls and trains Claude's models
Claude-Web	Anthropic	Fetches live pages for Claude's browsing feature
PerplexityBot / Perplexity-User	Perplexity AI	Indexes and fetches pages cited in Perplexity answers
Google-Extended	Google	Control token (not a separate crawler) for use of your content in Gemini training and grounding; separate from Googlebot and Google Search
Applebot-Extended	Apple	Controls use of your content for Apple Intelligence features
Amazonbot	Amazon	Crawls for Alexa and Amazon AI features
CCBot	Common Crawl	Public dataset used to train many third-party AI models

Why this table matters

Blocking Googlebot has always meant losing Google Search traffic. Blocking Google-Extended only opts your content out of Gemini training and grounding; it does not affect normal Google rankings, and AI Overviews, as a Search feature, follow your Googlebot rules instead
OpenAI, Anthropic and Perplexity each run separate user agents for bulk crawling (training or indexing) and for live, user-triggered fetches, so you can allow one and block the other if you want citations without training use
A default, unedited robots.txt typically has no explicit rule for most of these — meaning your access policy is accidental, not intentional

How to Configure robots.txt to Allow AI Crawlers

Add an explicit User-agent block for each crawler you want to allow. A minimal AI-friendly addition looks like this:

Minimal robots.txt block to allow AI crawlers

# One group: several User-agent lines share the rule that follows them
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: Applebot-Extended
Allow: /
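One detail worth checking before copying a block like this: under the Robots Exclusion Protocol a crawler obeys only the group that matches it most specifically, so a bot with its own group does not inherit rules from User-agent: *. The sketch below shows the effect with Python's standard-library urllib.robotparser. The /private/ path, the SomeOtherBot name and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, agent: str, path: str) -> bool:
    """Evaluate one agent and path against a robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, f"https://example.com{path}")


# GPTBot gets its own group, so the wildcard Disallow no longer applies to it.
LEAKY = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /private/
"""

# Repeating the Disallow inside the AI group restores the restriction.
# Disallow is listed first because Python's parser uses the first matching rule.
FIXED = """\
User-agent: GPTBot
Disallow: /private/
Allow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for label, text in (("leaky", LEAKY), ("fixed", FIXED)):
        for agent in ("GPTBot", "SomeOtherBot"):
            verdict = can_fetch(text, agent, "/private/report")
            print(f"{label} {agent:12} /private/report allowed={verdict}")
```

If you keep private paths under the wildcard group, repeat those Disallow lines in every AI-specific group you add.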

If you want citations and visibility but do not want your content used for model training, allow the "live fetch" bots (ChatGPT-User, Claude-Web, Perplexity-User) while disallowing the "training" bots (GPTBot, anthropic-ai, CCBot). This is a legitimate, increasingly common configuration. The trade-off is that you may be cited less often, since some tools answer from training data rather than live fetches. User-agent names change over time, so confirm the current ones in each vendor's crawler documentation before relying on this list.
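A minimal sketch of that split policy, generated and checked with Python's standard library. The two agent lists are copied from the paragraph above; the helper names (render_group, render_split_policy, is_allowed) and the example.com URL are assumptions for illustration, and real crawlers may match user agents differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# Agent names as listed in this guide; verify them against vendor docs.
LIVE_FETCH = ["ChatGPT-User", "Claude-Web", "Perplexity-User"]
TRAINING = ["GPTBot", "anthropic-ai", "CCBot"]


def render_group(agents: list[str], rule: str) -> str:
    """Render one robots.txt group: shared User-agent lines, then one rule."""
    return "\n".join([f"User-agent: {a}" for a in agents] + [rule]) + "\n"


def render_split_policy() -> str:
    """Allow live-fetch agents everywhere and disallow training agents."""
    allow = render_group(LIVE_FETCH, "Allow: /")
    block = render_group(TRAINING, "Disallow: /")
    return allow + "\n" + block


def is_allowed(robots_txt: str, agent: str) -> bool:
    """Check whether `agent` may fetch the (hypothetical) site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "https://example.com/")


if __name__ == "__main__":
    policy = render_split_policy()
    print(policy)
    for agent in LIVE_FETCH + TRAINING:
        print(agent, "allowed" if is_allowed(policy, agent) else "blocked")
```

Generating the file from two named lists keeps the policy readable: moving a bot from one list to the other is the whole change.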

How to Write an llms.txt File

Place a plain Markdown file at yourdomain.com/llms.txt. Structure it with a one-line company summary, key company facts, your services with pricing ranges, your team/author credentials, and a curated list of your most important pages — organised so an AI system can extract facts in seconds rather than crawling your entire site.
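To make that structure concrete, here is a small sketch that renders an llms.txt file from structured data. It loosely follows the layout in the public llms.txt proposal (a title heading, a one-line blockquote summary, then sections of annotated links). The company "Example Co", its facts, prices and URLs are all invented placeholders, and the function and section names are assumptions rather than anything the proposal requires.

```python
from dataclasses import dataclass


@dataclass
class Link:
    title: str
    url: str
    note: str = ""  # short extractable fact, e.g. a price range


def render_llms_txt(name: str, summary: str, facts: dict[str, str],
                    sections: dict[str, list[Link]]) -> str:
    """Render a Markdown llms.txt: title, summary, key facts, link sections."""
    lines = [f"# {name}", "", f"> {summary}", ""]
    if facts:
        lines += [f"- {key}: {value}" for key, value in facts.items()]
        lines.append("")
    for heading, links in sections.items():
        lines += [f"## {heading}", ""]
        for link in links:
            suffix = f": {link.note}" if link.note else ""
            lines.append(f"- [{link.title}]({link.url}){suffix}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    text = render_llms_txt(
        name="Example Co",
        summary="Example Co is a hypothetical web agency in Springfield "
                "offering SEO audits and web development.",
        facts={"Founded": "2020", "Location": "Springfield"},
        sections={
            "Services": [Link("SEO audit",
                              "https://example.com/services/seo-audit",
                              "from $500 per audit")],
            "Key pages": [Link("Blog", "https://example.com/blog")],
        },
    )
    print(text)
```

Keeping the source data in one structure and regenerating the file whenever services or prices change is one way to stop llms.txt from going stale.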

Real Example: Nevatrix's Own robots.txt and llms.txt

We do not just recommend this setup — it is what runs on nevatrix.com. Our robots.txt explicitly allows every major AI crawler (GPTBot, ChatGPT-User, OAI-SearchBot, Google-Extended, anthropic-ai, Claude-Web, ClaudeBot, PerplexityBot, Perplexity-User, Applebot, Applebot-Extended, Amazonbot, CCBot and more), while explicitly blocking known scraper bots that offer no citation value (DotBot, BLEXBot, PetalBot, Bytespider). Our llms.txt lists company facts, every service with real pricing, author credentials with LinkedIn links, and a curated map of our blog content by topic cluster — structured exactly the way this guide recommends.

Common Mistakes That Block AI Visibility

Using a wildcard "Disallow: /" for User-agent: * without adding explicit Allow rules for each AI crawler you want to permit — this silently blocks all of them
Never updating robots.txt since it was first generated, missing every AI crawler that has launched since (most robots.txt files predate GPTBot)
Publishing llms.txt with marketing fluff instead of extractable facts — AI systems cite specifics (prices, credentials, data), not adjectives
Forgetting to keep llms.txt updated as services, pricing or team members change — a stale llms.txt can actively mislead AI citations
Blocking AI crawlers at the CDN/firewall level (e.g. via a security plugin) even after correctly configuring robots.txt — check both layers
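The first mistake in this list is easy to test for. The sketch below takes the text of a robots.txt file and reports which AI user agents it blocks from the site root, using Python's standard-library urllib.robotparser. The sample file, the agent list and the example.com URL are assumptions for illustration. It only checks the robots.txt layer, so a CDN or firewall block would not show up here.

```python
from urllib.robotparser import RobotFileParser

# Agents named in this guide; extend the list as new crawlers appear.
AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "PerplexityBot", "Google-Extended", "Applebot-Extended",
]


def blocked_agents(robots_txt: str, agents: list[str] = AI_AGENTS,
                   url: str = "https://example.com/") -> list[str]:
    """Return the agents that may NOT fetch `url` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical file: blanket wildcard block with only two bots allowed back in.
SAMPLE = """\
User-agent: *
Disallow: /

User-agent: GPTBot
User-agent: ClaudeBot
Allow: /
"""

if __name__ == "__main__":
    for agent in blocked_agents(SAMPLE):
        print(f"blocked: {agent}")
```

To audit a live site, pass in the contents of your own /robots.txt instead of SAMPLE, then separately request a page with each bot's user-agent string to confirm the CDN or security plugin is not rejecting it.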


Frequently Asked Questions

What is the difference between robots.txt and llms.txt?

robots.txt controls which crawlers can access which parts of your site: it is a permissions file. llms.txt is a content summary file that helps AI systems understand your business quickly, structured for extraction rather than access control. You need both: robots.txt to allow AI crawlers in, and llms.txt to help them understand your site once they are in.

Does allowing or blocking AI crawlers affect my Google Search rankings?

No. AI user agents such as Google-Extended, GPTBot and ClaudeBot are handled separately from Googlebot, which continues to crawl and rank your site for traditional search regardless of your AI crawler settings. Allowing or blocking AI crawlers has no direct effect on your Google Search rankings.

Can I block GPTBot so my content is not used to train ChatGPT?

You can, but doing so may also reduce your chances of being cited in ChatGPT answers, since training data influences what ChatGPT knows about your business. A common middle ground is blocking GPTBot (training) while allowing ChatGPT-User (live browsing), which still lets ChatGPT fetch and cite your current pages when a user asks it to look something up.

Where should I put my llms.txt file?

At the root of your domain, yourdomain.com/llms.txt, following the same convention as robots.txt. It should be a plain text file (Markdown formatting is standard and widely used) accessible without authentication, so any AI crawler can fetch it directly.

Does llms.txt guarantee that AI tools will cite my site?

No file guarantees citation. llms.txt improves the odds by making your facts easy for AI systems to extract accurately, but citation ultimately depends on content quality, topical authority, how often your brand is mentioned elsewhere on the web, and the specific query being asked.

How do I check whether my robots.txt is blocking AI crawlers?

Open yourdomain.com/robots.txt in a browser and search for the bot names: GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended. If a bot has no group of its own and your file has a broad "Disallow: /" under User-agent: *, that bot is blocked by the wildcard rule.

Do I need to update robots.txt regularly?

Yes. New AI crawlers have launched regularly since 2023, and a robots.txt file set up before a crawler existed has no rule for it. That crawler falls back to your User-agent: * group, or is allowed by default if nothing matches, which may not be the policy you want. Review your robots.txt at least twice a year, or whenever a major new AI product launches.

Can I allow AI crawlers on some parts of my site and block them on others?

Yes. robots.txt supports path-specific rules. For example, you can allow crawlers on your blog and service pages while disallowing an internal admin or account area, using the same Allow/Disallow path syntax you would use for any other crawler.

Is llms.txt an official standard?

Not yet. It is not formally standardised by a body like the W3C or IETF. It is a community proposal from 2024 that a growing number of websites and SEO tools have adopted as a convention. How consistently AI crawlers actually read it is not publicly documented, so treat it as a low-cost addition rather than a guarantee.

What is the most important part of an llms.txt file?

The one-line summary: a single, dense sentence describing what your company does, its location and its core offering. This is often the exact sentence an AI system will paraphrase when describing your business, so write it the way you would want your business summarised in an AI answer.

About the Author

Priya Reddy 7+ years experience

Priya Reddy is the Digital Marketing Lead at Nevatrix Technologies, Warangal. With 7+ years in SEO, content marketing, Google Ads and social media marketing, she has helped 50+ businesses across India grow their organic traffic and online revenue.

