What is robots.txt and how does it affect SEO?

As Head of SEO at TAL, I often describe robots.txt as one of those technical SEO elements that gets overlooked until something goes wrong. And when it does go wrong, it can go very wrong. A misconfigured robots.txt file can prevent Google from crawling your most important pages, tank your organic visibility overnight and take weeks to diagnose and recover from.

Understanding what robots.txt is, what it actually controls and where it is commonly misused is an important part of keeping any site technically healthy.

What is robots.txt?

Robots.txt is a plain text file that sits in the root directory of your website, typically accessible at yourdomain.com/robots.txt. It tells search engine crawlers which pages or sections of your site they are and are not allowed to access.

When Googlebot or any other search engine crawler arrives at your site, robots.txt is the file it checks before crawling anything else. The instructions inside it, known as directives, tell the crawler which paths it can follow and which it should skip.

A basic robots.txt file might look something like this:

    User-agent: *
    Disallow: /admin/
    Disallow: /checkout/
    Sitemap: https://www.yourdomain.com/sitemap.xml

The “User-agent: *” line means the rules apply to all crawlers. The Disallow lines tell them not to access the admin and checkout sections of the site. The Sitemap line points crawlers to the XML sitemap, which helps them find and index the right pages.
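
To see how a crawler might interpret those rules, here is a minimal sketch using Python's standard-library urllib.robotparser module. The three test paths are made up for illustration. Python's parser is also simpler than Googlebot's (it does not understand wildcards, for example), so treat the results as an approximation rather than a guarantee of how Google will behave.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, with the directives on consecutive lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Sitemap: https://www.yourdomain.com/sitemap.xml
"""


def build_parser(robots_txt):
    """Parse robots.txt text held in memory (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical paths, chosen only to exercise each rule.
for path in ("/admin/users", "/checkout/basket", "/products/blue-widget"):
    url = "https://www.yourdomain.com" + path
    allowed = parser.can_fetch("Googlebot", url)
    print(path, "-> crawlable" if allowed else "-> blocked")

# site_maps() requires Python 3.8 or later.
print(parser.site_maps())
```

The two disallowed sections come back as blocked, while the product page, which matches no rule, stays crawlable.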

What robots.txt does and does not do

This is where most misunderstandings arise, and it is worth being clear.

Robots.txt controls crawling. It tells search engines which pages they are permitted to access and read. It does not control indexing. A page can be blocked in robots.txt and still appear in Google’s search results if Google has discovered the URL through an external link. In that case, Google knows the page exists but cannot access its content, so it may display the URL in search results with no description, which looks broken and unprofessional and is unlikely to earn any clicks.

Google’s own documentation is direct on this point: robots.txt is used to manage crawler traffic, not to keep pages out of search results. If you want a page excluded from Google’s index, a NOINDEX meta tag is the right tool, not a robots.txt disallow rule.

This distinction matters enormously in practice. Blocking a page in robots.txt while also expecting Google to drop it from the index is one of the most common and consequential technical SEO errors. If a page is blocked in robots.txt, Googlebot cannot access it and will therefore never see a NOINDEX tag placed within it. The result is a page that stays in the index indefinitely, which is often the opposite of what was intended.
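
The interaction between the two directives can be sketched as a small audit. This is an illustration under stated assumptions rather than a description of Google's internals: the page inventory, the /old-offers/ rule and the outcome labels are all hypothetical, and crawl access is approximated with Python's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /old-offers/
"""

# Hypothetical page inventory: URL -> whether the page carries a NOINDEX tag.
PAGES = {
    "https://www.yourdomain.com/old-offers/summer": True,
    "https://www.yourdomain.com/thank-you": True,
    "https://www.yourdomain.com/services": False,
}


def indexing_outcome(crawl_allowed, has_noindex):
    """Describe the likely result of combining crawl access with NOINDEX."""
    if not crawl_allowed:
        # The crawler never fetches the page, so a NOINDEX tag on it is never read.
        return "blocked: NOINDEX cannot be seen, URL may still be indexed"
    if has_noindex:
        return "crawlable with NOINDEX: can be dropped from the index"
    return "crawlable: eligible for indexing"


def audit(robots_txt, pages, user_agent="Googlebot"):
    """Return a URL -> outcome mapping for every page in the inventory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        url: indexing_outcome(parser.can_fetch(user_agent, url), has_noindex)
        for url, has_noindex in pages.items()
    }


for url, outcome in audit(ROBOTS_TXT, PAGES).items():
    print(url, "->", outcome)
```

The page under /old-offers/ is the problem case: it carries a NOINDEX tag, but the disallow rule means the tag is never read.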

How robots.txt affects SEO

Used correctly, robots.txt contributes positively to a site’s technical health in several ways.

Crawl budget management is one of the most significant. Google allocates a finite amount of resources to crawling each website. For large sites with thousands of pages, this means not every page will be crawled on every visit. By using robots.txt to direct crawlers away from low-value URLs, such as internal search result pages, session ID parameters, thank you pages and admin areas, you help ensure that Google spends its crawl budget on the pages that actually matter for your rankings.
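
As a rough sketch of how such rules might be put together, the snippet below builds a robots.txt file from a list of low-value path prefixes and then checks the result. The prefixes /search/, /thank-you/ and /admin/ and the product URL are assumptions for illustration, as your own URL structure will differ. Session ID parameters are left out because they usually need wildcard patterns, which Python's built-in parser does not evaluate.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_prefixes, sitemap_url=None, user_agent="*"):
    """Build robots.txt text with one Disallow rule per path prefix."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {prefix}" for prefix in disallowed_prefixes]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


# Hypothetical low-value sections of the site.
LOW_VALUE_PREFIXES = ["/search/", "/thank-you/", "/admin/"]

robots_txt = build_robots_txt(
    LOW_VALUE_PREFIXES,
    sitemap_url="https://www.yourdomain.com/sitemap.xml",
)
print(robots_txt)

# Sanity check: low-value paths are blocked, a revenue page is not.
checker = RobotFileParser()
checker.parse(robots_txt.splitlines())
print(checker.can_fetch("Googlebot", "https://www.yourdomain.com/search/shoes"))
print(checker.can_fetch("Googlebot", "https://www.yourdomain.com/products/shoes"))
```

Generating the file from a reviewed list, rather than editing it by hand, makes it easier to see exactly which sections are being kept away from crawlers.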

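Keeping crawlers out of pages that were never meant for search is another key use. Login screens, checkout flows, duplicate filtered pages and staging environments should not appear in search results. Robots.txt stops crawlers fetching them, but as covered above it does not guarantee they stay out of the index. Where a page must be excluded from the index, use a NOINDEX tag and leave the page crawlable so the tag can be read, rather than applying both to the same URL.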

Robots.txt is also commonly used to declare the location of your XML sitemap, as shown in the example above. This helps crawlers find your sitemap quickly, which supports faster discovery and indexing of your important pages.

The most common robots.txt mistakes

A surprisingly high proportion of websites contain robots.txt errors that actively harm their search visibility. Some of the most frequently encountered ones are worth knowing.

Accidentally blocking important pages is the most damaging. A single misplaced Disallow rule can block an entire section of the site from being crawled. This can happen through overly broad rules, wildcard misuse or, perhaps most dangerously, a staging site robots.txt file being deployed to the live environment. Staging sites are often configured to block all crawlers with “Disallow: /”, and if that configuration carries over to the production site at launch, the entire website becomes inaccessible to search engines. This is a scenario the team at TAL and I check for on every technical SEO audit.

Blocking CSS and JavaScript files is another significant error. Some site owners mistakenly restrict access to these resources thinking they are protecting code or reducing server load. In practice, blocking CSS and JavaScript prevents Google from rendering your pages properly, which can affect how they are evaluated and ranked.

Using robots.txt to try to remove pages from the index rather than using NOINDEX tags is a persistent misconception. As outlined above, disallowing a URL in robots.txt does not reliably remove it from search results and can make things worse by preventing the NOINDEX tag from being read.

Over-blocking on large sites is also common. E-commerce sites in particular often end up with robots.txt rules that block valuable category or filter pages that should be indexed, leading to lost organic visibility that can be difficult to trace back to the root cause without a thorough audit.
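
Several of these mistakes can be caught before release with a simple check against a list of URLs that must stay crawlable. The sketch below is one possible approach, not a complete audit: the must-crawl list, the /assets/ and /category/ paths and both sample files are hypothetical, and Python's urllib.robotparser only approximates how Googlebot matches rules.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, must_crawl_urls, user_agent="Googlebot"):
    """Return the must-crawl URLs that the given robots.txt would block."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in must_crawl_urls if not parser.can_fetch(user_agent, url)]


# A typical staging configuration that blocks everything.
STAGING_ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

# A live configuration that accidentally blocks the CSS and JavaScript folder.
LIVE_ROBOTS_TXT = "User-agent: *\nDisallow: /admin/\nDisallow: /assets/\n"

# Hypothetical URLs that should always be reachable by crawlers.
MUST_CRAWL = [
    "https://www.yourdomain.com/",
    "https://www.yourdomain.com/category/shoes/",
    "https://www.yourdomain.com/assets/site.css",
    "https://www.yourdomain.com/assets/app.js",
]

for name, text in (("staging", STAGING_ROBOTS_TXT), ("live", LIVE_ROBOTS_TXT)):
    problems = blocked_urls(text, MUST_CRAWL)
    print(name, "blocks", len(problems), "of", len(MUST_CRAWL), "must-crawl URLs")
    for url in problems:
        print("  ", url)
```

Run as part of a deployment pipeline, a check like this would flag a staging file that has reached production, because every URL on the list comes back as blocked.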

Robots.txt and AI crawlers

One area that has become increasingly relevant is the use of robots.txt to manage AI crawlers. Tools like ChatGPT, Google’s AI systems and other large language model platforms use their own crawlers to gather training data and power AI search responses.

Site owners can add specific User-agent rules to either allow or block these crawlers from accessing their content. Whether to do so is a strategic decision that depends on how much a business values visibility in AI-generated answers versus protecting its content from being used in AI training. This is a developing area of AI search optimisation and one worth factoring into any technical SEO review.
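
As an illustration of what crawler-specific rules look like, the sketch below blocks two AI-related user agents while leaving ordinary search crawling unchanged. GPTBot is OpenAI's crawler token and Google-Extended is the token Google provides for controlling AI training use. Blocking them here is purely an example, not a recommendation, and the /blog/ and /admin/ paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Example only: which crawlers to block is a business decision.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

ARTICLE_URL = "https://www.yourdomain.com/blog/what-is-robots-txt"

# Compare how each crawler is treated for the same article URL.
for agent in ("GPTBot", "Google-Extended", "Googlebot"):
    allowed = parser.can_fetch(agent, ARTICLE_URL)
    print(agent, "-> allowed" if allowed else "-> blocked")
```

Each named group applies only to the crawler it names, so Googlebot falls through to the general rules and can still reach the article. Bear in mind that robots.txt is a request, not an enforcement mechanism, so it only affects crawlers that choose to honour it.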

How to check your robots.txt file

Your robots.txt file is publicly accessible. You can view it by typing your domain followed by /robots.txt directly into a browser.

Google Search Console also includes a robots.txt report, which replaced the older robots.txt testing tool. It shows which robots.txt files Google has found for your site, when they were last fetched and any warnings or errors in them. To check whether a specific URL is being blocked, use the URL Inspection tool. Running these checks regularly, and especially after any site migration, CMS update or development release, is a sensible precaution.

For a full picture of how robots.txt is interacting with your site’s crawlability and indexation, a dedicated technical SEO audit will surface issues that a quick visual check can miss, particularly on larger sites where the cumulative effect of multiple rules can be difficult to assess manually.

Getting robots.txt right

Robots.txt is a small file with a disproportionately large impact on SEO. It does not directly influence rankings, but it controls whether Google can access the pages you want indexed in the first place. An error here undermines everything else in your SEO strategy, regardless of how strong your content or backlink profile might be.

The file itself is simple. Getting the configuration right requires understanding what it does, what it cannot do and how it interacts with other technical directives like NOINDEX tags and canonical URLs.

If you want to make sure your robots.txt file is set up correctly, or if you suspect technical issues may be holding your site back, get in touch with the team at TAL. A technical SEO audit will give you a clear picture of where things stand and what needs to be addressed.
