
What Is robots.txt?

Neil Ruaro·Founder, Conbersa
robots-txt · web-crawlers · technical-seo · ai-crawlers

robots.txt is a plain text file placed at the root of a website that tells web crawlers - the automated programs that scan and index web content - which pages or sections of the site they may access. It follows a standard called the Robots Exclusion Protocol, first introduced in 1994 and formally standardized as RFC 9309 in 2022.

Every major search engine and AI crawler checks for a robots.txt file before crawling a site. If you are building a startup and want your content to appear in both traditional search results and AI-generated responses, understanding robots.txt is essential.

How Does robots.txt Work?

When a web crawler visits your site, the first thing it does is request https://yourdomain.com/robots.txt. This file contains rules that specify which crawlers (called "user-agents") can access which parts of your site.

A basic robots.txt file looks like this:

User-agent: *
Allow: /
Disallow: /admin/

This tells all crawlers (* is a wildcard) that they can access the entire site except the /admin/ directory. You can create rules for specific crawlers by replacing the wildcard with a crawler name like Googlebot, GPTBot, or PerplexityBot.
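You can simulate how a crawler applies these rules with Python's standard-library urllib.robotparser. This is a sketch: Python's parser uses first-match ordering rather than RFC 9309's longest-match rule, so the redundant Allow: / line is omitted here, and example.com is a placeholder domain.

```python
import urllib.robotparser

# Parse a robots.txt body directly; a real crawler would first fetch
# https://yourdomain.com/robots.txt and parse the response.
rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /admin/
""".splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))   # True
print(rp.can_fetch("Googlebot", "https://example.com/admin/users")) # False
```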

The file uses three main directives:

  • User-agent - identifies which crawler the following rules apply to
  • Allow - explicitly permits access to specified paths
  • Disallow - blocks access to specified paths

Rules are processed top to bottom, and more specific path patterns take precedence over general ones.
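That precedence rule can be sketched as a small function. This is a simplified illustration of RFC 9309 longest-match semantics, not a full implementation: it ignores the * and $ wildcards that the spec also supports.

```python
def is_allowed(rules, path):
    """rules: list of (directive, path_prefix) pairs for one user-agent group,
    where directive is "allow" or "disallow"."""
    best = None  # (match_length, allowed) of the most specific matching rule
    for directive, prefix in rules:
        if path.startswith(prefix):
            length, allowed = len(prefix), directive == "allow"
            # Longer (more specific) prefixes win; on a tie, Allow wins.
            if best is None or (length, allowed) > best:
                best = (length, allowed)
    return True if best is None else best[1]  # no matching rule means allowed

rules = [("allow", "/"), ("disallow", "/admin/")]
print(is_allowed(rules, "/blog/post"))    # True: only Allow / matches
print(is_allowed(rules, "/admin/users"))  # False: /admin/ is more specific than /
```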

Why Does robots.txt Matter for AI Search?

With AI search engines like ChatGPT, Perplexity, and Gemini relying on web crawlers to find and index content, robots.txt directly controls whether your content can appear in AI-generated responses.

Each AI platform uses its own crawler:

| AI Platform | Crawler Name | Purpose |
| --- | --- | --- |
| OpenAI (ChatGPT) | GPTBot | Web browsing and content retrieval |
| OpenAI (ChatGPT) | OAI-SearchBot | Real-time search results |
| Perplexity | PerplexityBot | Answer engine content retrieval |
| Anthropic (Claude) | ClaudeBot | Web content access |
| Google (Gemini) | Google-Extended | AI training control (a token read by Googlebot, not a separate crawler) |

If your robots.txt blocks any of these crawlers, your content will not appear in that platform's responses. According to Originality.ai's research, over 35% of the top 1,000 websites now block at least one AI crawler. For startups seeking AI visibility, this is a competitive opportunity - your content can fill the gaps left by sites that block AI crawlers.
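A quick way to see the effect: the hypothetical robots.txt below blocks GPTBot while allowing everyone else, and Python's standard-library urllib.robotparser reports which AI crawlers can still fetch a page (example.com is a placeholder domain).

```python
import urllib.robotparser

AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "PerplexityBot",
               "ClaudeBot", "Google-Extended"]

# Hypothetical robots.txt that singles out one AI crawler.
rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines())

for bot in AI_CRAWLERS:
    allowed = rp.can_fetch(bot, "https://example.com/blog/post")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
# GPTBot is blocked; every other crawler is allowed.
```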

What Should Startups Include in robots.txt?

For most startups focused on maximizing visibility, a permissive robots.txt works best:

User-agent: *
Allow: /

Sitemap: https://www.yourdomain.com/sitemap.xml

This allows all crawlers full access and points them to your sitemap for efficient discovery. You should block only pages that have no SEO or AI search value - admin pages, internal search results, staging environments, and duplicate content.

If you want more granular control, you can allow AI crawlers explicitly:

User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

How Do You Check Your robots.txt?

Visit https://yourdomain.com/robots.txt in a browser. If the file does not exist, crawlers will treat your entire site as fully accessible - which is fine for most startups. If it does exist, review the rules to make sure you are not accidentally blocking AI crawlers or important content pages.

Google Search Console's robots.txt report shows which robots.txt files Google has fetched, when it last crawled them, and any parsing errors or warnings. (The older standalone robots.txt Tester, which checked individual URLs against your rules, has been retired.) For URL-level checks against complex files with multiple user-agent blocks, use a local parser or a third-party robots.txt validator.
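You can also check URLs locally by feeding your rules into Python's standard-library urllib.robotparser and probing specific crawler/URL pairs; the rules and domain below are hypothetical.

```python
import urllib.robotparser

rules = """
User-agent: Googlebot
Disallow: /drafts/

User-agent: GPTBot
Disallow: /internal/

User-agent: *
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A crawler follows only its own matching group; Googlebot ignores
# the * group here, so /admin/ is open to it.
checks = [("Googlebot", "/drafts/post"), ("Googlebot", "/admin/"),
          ("GPTBot", "/internal/notes"), ("PerplexityBot", "/admin/")]
for agent, path in checks:
    ok = rp.can_fetch(agent, "https://example.com" + path)
    print(f"{agent} -> {path}: {'allowed' if ok else 'blocked'}")
```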

What Are Common robots.txt Mistakes?

Blocking AI crawlers unintentionally. Some CMS platforms or hosting providers include default robots.txt rules that block non-Google crawlers. Check for broad Disallow rules that might catch AI crawlers.

Using robots.txt instead of noindex. robots.txt prevents crawling, not indexing. If a blocked page has external links pointing to it, search engines may still list its URL. Use a noindex meta tag (<meta name="robots" content="noindex">) for pages you want completely excluded from search results - and remember that a crawler has to fetch the page to see that tag, so do not block the same page in robots.txt.

Blocking CSS and JavaScript. Modern crawlers need access to CSS and JavaScript to render pages properly. Blocking these resources can prevent crawlers from seeing your content, especially on JavaScript-heavy sites.

Forgetting the sitemap reference. Adding a Sitemap: directive to your robots.txt helps crawlers discover your content efficiently. This is a simple addition that improves crawl budget allocation.

Your robots.txt file is one of the simplest but most consequential technical configurations on your site. Get it right, and every AI crawler can find and index your content. Get it wrong, and you are invisible to AI search entirely. Review it today - it takes five minutes and the impact on your AI search optimization can be immediate.
