
What Is robots.txt?

Neil Ruaro·Founder, Conbersa
robots-txt · web-crawlers · technical-seo · ai-crawlers

robots.txt is a plain text file placed at the root of a website that tells web crawlers - the automated programs that scan and index web content - which pages or sections of the site they may access. It follows a standard called the Robots Exclusion Protocol, first introduced in 1994 and formally standardized as RFC 9309 in 2022.

Every major search engine and AI crawler checks for a robots.txt file before crawling a site. If you are building a startup and want your content to appear in both traditional search results and AI-generated responses, understanding robots.txt is essential.

How Does robots.txt Work?

When a web crawler visits your site, the first thing it does is request https://yourdomain.com/robots.txt. This file contains rules that specify which crawlers (called "user-agents") can access which parts of your site.

A basic robots.txt file looks like this:

User-agent: *
Allow: /
Disallow: /admin/

This tells all crawlers (* is a wildcard) that they can access the entire site except the /admin/ directory. You can create rules for specific crawlers by replacing the wildcard with a crawler name like Googlebot, GPTBot, or PerplexityBot.
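You can simulate how a crawler applies these rules with Python's standard-library urllib.robotparser. This is a sketch: Python's parser uses first-match ordering rather than RFC 9309's longest-match rule, so the redundant Allow: / line is omitted here, and example.com is a placeholder domain.

```python
import urllib.robotparser

# Parse a robots.txt body directly; a real crawler would first fetch
# https://yourdomain.com/robots.txt and parse the response.
rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /admin/
""".splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))   # True
print(rp.can_fetch("Googlebot", "https://example.com/admin/users")) # False
```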

The file uses three main directives:

  • User-agent - identifies which crawler the following rules apply to
  • Allow - explicitly permits access to specified paths
  • Disallow - blocks access to specified paths

Rules are processed top to bottom, and more specific path patterns take precedence over general ones.
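That precedence rule can be sketched as a small function. This is a simplified illustration of RFC 9309 longest-match semantics, not a full implementation: it ignores the * and $ wildcards that the spec also supports.

```python
def is_allowed(rules, path):
    """rules: list of (directive, path_prefix) pairs for one user-agent group,
    where directive is "allow" or "disallow"."""
    best = None  # (match_length, allowed) of the most specific matching rule
    for directive, prefix in rules:
        if path.startswith(prefix):
            length, allowed = len(prefix), directive == "allow"
            # Longer (more specific) prefixes win; on a tie, Allow wins.
            if best is None or (length, allowed) > best:
                best = (length, allowed)
    return True if best is None else best[1]  # no matching rule means allowed

rules = [("allow", "/"), ("disallow", "/admin/")]
print(is_allowed(rules, "/blog/post"))    # True: only Allow / matches
print(is_allowed(rules, "/admin/users"))  # False: /admin/ is more specific than /
```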

Why Does robots.txt Matter for AI Search?

With AI search engines like ChatGPT, Perplexity, and Gemini relying on web crawlers to find and index content, robots.txt directly controls whether your content can appear in AI-generated responses.

Each AI platform uses its own crawler:

| AI Platform | Crawler Name | Purpose |
| --- | --- | --- |
| OpenAI (ChatGPT) | GPTBot | Web browsing and content retrieval |
| OpenAI (ChatGPT) | OAI-SearchBot | Real-time search results |
| Perplexity | PerplexityBot | Answer engine content retrieval |
| Anthropic (Claude) | ClaudeBot | Web content access |
| Google (Gemini) | Google-Extended | AI training control (a token read by Googlebot, not a separate crawler) |

If your robots.txt blocks any of these crawlers, your content will not appear in that platform's responses. According to Originality.ai's research, over 35% of the top 1,000 websites now block at least one AI crawler. For startups seeking AI visibility, this is a competitive opportunity - your content can fill the gaps left by sites that block AI crawlers.
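A quick way to see the effect: the hypothetical robots.txt below blocks GPTBot while allowing everyone else, and Python's standard-library urllib.robotparser reports which AI crawlers can still fetch a page (example.com is a placeholder domain).

```python
import urllib.robotparser

AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "PerplexityBot",
               "ClaudeBot", "Google-Extended"]

# Hypothetical robots.txt that singles out one AI crawler.
rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines())

for bot in AI_CRAWLERS:
    allowed = rp.can_fetch(bot, "https://example.com/blog/post")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
# GPTBot is blocked; every other crawler is allowed.
```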

What Should Startups Include in robots.txt?

For most startups focused on maximizing visibility, a permissive robots.txt works best:

User-agent: *
Allow: /

Sitemap: https://www.yourdomain.com/sitemap.xml

This allows all crawlers full access and points them to your sitemap for efficient discovery. You should block only pages that have no SEO or AI search value - admin pages, internal search results, staging environments, and duplicate content.

If you want more granular control, you can allow AI crawlers explicitly:

User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

How Do You Check Your robots.txt?

Visit https://yourdomain.com/robots.txt in a browser. If the file does not exist, crawlers will treat your entire site as fully accessible - which is fine for most startups. If it does exist, review the rules to make sure you are not accidentally blocking AI crawlers or important content pages.

Google Search Console's robots.txt report shows which robots.txt files Google has fetched, when it last crawled them, and any parsing errors or warnings. (The older standalone robots.txt Tester, which checked individual URLs against your rules, has been retired.) For URL-level checks against complex files with multiple user-agent blocks, use a local parser or a third-party robots.txt validator.
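You can also check URLs locally by feeding your rules into Python's standard-library urllib.robotparser and probing specific crawler/URL pairs; the rules and domain below are hypothetical.

```python
import urllib.robotparser

rules = """
User-agent: Googlebot
Disallow: /drafts/

User-agent: GPTBot
Disallow: /internal/

User-agent: *
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A crawler follows only its own matching group; Googlebot ignores
# the * group here, so /admin/ is open to it.
checks = [("Googlebot", "/drafts/post"), ("Googlebot", "/admin/"),
          ("GPTBot", "/internal/notes"), ("PerplexityBot", "/admin/")]
for agent, path in checks:
    ok = rp.can_fetch(agent, "https://example.com" + path)
    print(f"{agent} -> {path}: {'allowed' if ok else 'blocked'}")
```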

What Are Common robots.txt Mistakes?

Blocking AI crawlers unintentionally. Some CMS platforms or hosting providers include default robots.txt rules that block non-Google crawlers. Check for broad Disallow rules that might catch AI crawlers.

Using robots.txt instead of noindex. robots.txt prevents crawling, not indexing. If a blocked page has external links pointing to it, search engines may still list its URL. Use a noindex meta tag (<meta name="robots" content="noindex">) for pages you want completely excluded from search results - and remember that a crawler has to fetch the page to see that tag, so do not block the same page in robots.txt.

Blocking CSS and JavaScript. Modern crawlers need access to CSS and JavaScript to render pages properly. Blocking these resources can prevent crawlers from seeing your content, especially on JavaScript-heavy sites.

Forgetting the sitemap reference. Adding a Sitemap: directive to your robots.txt helps crawlers discover your content efficiently. This is a simple addition that improves crawl budget allocation.

Your robots.txt file is one of the simplest but most consequential technical configurations on your site. Get it right, and every AI crawler can find and index your content. Get it wrong, and you are invisible to AI search entirely. Review it today - it takes five minutes and the impact on your AI search optimization can be immediate.
