How to create a robots.txt file for your website

Key takeaways

  • A robots.txt file tells crawlers which URLs they can request from your site.
  • To create one, use a plain text file named exactly robots.txt and upload it to your site root.
  • User-agent, Disallow, Allow, and Sitemap rules help control how crawlers move through your site.
  • Robots.txt helps manage crawling, but it doesn’t secure private content or reliably remove pages from search results.

A robots.txt file is simple to create, but the setup details are important. The filename, location, syntax, and rules all affect how crawlers interact with your site.

This guide walks through how to create a robots.txt file, add crawler rules, upload it to the right place, and test it before relying on it.

Ready to get started?

Get website hosting built to help you win.

What is a robots.txt file?

A robots.txt file is a plain text file at the root of your site that follows the Robots Exclusion Protocol (standardized as RFC 9309). For example, www.yourdomain.com would serve its robots.txt file at www.yourdomain.com/robots.txt.

The file includes rules that tell crawlers which paths they can or can’t request.

A robots.txt file can help you:

  • Guide crawlers toward the right parts of your site
  • Block crawling of duplicate, internal, or low-value URLs
  • Point crawlers to your XML sitemap
  • Reduce unnecessary crawler traffic on resource-heavy areas

Robots.txt is crawl guidance, not a privacy or security tool. If content needs to stay private, use password protection, authentication, or server-level access controls.

One more thing to know up front: if a site has no robots.txt file at all (the URL returns a 404), crawlers treat that as permission to crawl everything. A robots.txt file only adds restrictions; it never expands access beyond what’s already public.

Before you create a robots.txt file

Before creating the file, decide what crawlers should and shouldn’t access.

Review:

  • Which pages should stay crawlable
  • Which folders, URLs, or parameters should not be crawled
  • Where your XML sitemap is located
  • Whether a robots.txt file already exists at https://example.com/robots.txt
  • Whether your site uses subdomains (each needs its own file)
  • Whether you serve both http:// and https://, or www and non-www (each is a separate origin and needs its own file)
  • Whether a CDN or caching layer may need to be cleared after changes

Use a plain text editor. Avoid Microsoft Word, Google Docs, or rich text editors because they can add hidden formatting that may break crawler instructions.

How to create a robots.txt file

Step 1: Create a plain text file

You must have access to the root of your domain. Your web hosting provider can confirm whether you have the appropriate access.

Create a new plain text file and name it exactly:

robots.txt

Use all lowercase, and save the file as UTF-8. Google and other major crawlers may ignore characters outside the UTF-8 range, which can make the surrounding rules invalid. (A leading UTF-8 byte order mark is tolerated by Google, but it’s cleanest to save without one.)
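If you would rather generate the file from a script than from a text editor, a minimal Python sketch might look like the following. The write_robots_txt helper and the rule text are illustrative assumptions, not part of any tool; the points to notice are the explicit UTF-8 encoding (which writes no byte order mark) and the exact lowercase filename.

```python
from pathlib import Path

# Illustrative rules only; replace with the rules for your own site.
RULES = "User-agent: *\nDisallow:\n"


def write_robots_txt(directory: str, rules: str = RULES) -> Path:
    """Write `rules` to <directory>/robots.txt as UTF-8 without a BOM."""
    target = Path(directory) / "robots.txt"  # exact, all-lowercase filename
    # encoding="utf-8" (not "utf-8-sig") avoids a leading byte order mark;
    # newline="\n" keeps line endings identical on every platform.
    with target.open("w", encoding="utf-8", newline="\n") as handle:
        handle.write(rules)
    return target
```

Calling write_robots_txt(".") would create the file in the current directory, ready to upload.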

Step 2: Add a user-agent rule

Use User-agent to tell crawlers which bot the rules apply to.

To apply rules to all crawlers that follow robots.txt, use:

User-agent: *

The pound sign (#) denotes the beginning of a comment:

# Rules for all crawlers
User-agent: *

A few things that trip people up:

  • User-agent names are matched case-insensitively, so Googlebot and googlebot are equivalent.
  • Each crawler obeys only one group of rules — the most specific User-agent group that matches its name. If you have both a User-agent: * group and a User-agent: Googlebot group, Googlebot follows only its own group and ignores the * rules. The groups don’t merge, so any rule you want to apply to a named bot must be repeated in its group.
  • You can list multiple User-agent lines above a single set of rules to apply the same rules to several bots.
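The "one group only" behavior described above can be sketched in a few lines of Python. This is a simplified model, not any crawler's real implementation: it assumes a named group matches only on an exact, case-insensitive name, and the /drafts/ path is a made-up example.

```python
def select_group(groups: dict[str, list[str]], crawler: str) -> list[str]:
    """Return the rules from the single group that applies to `crawler`.

    `groups` maps a User-agent value to that group's rule lines.
    Simplification: a named group matches only on an exact,
    case-insensitive name; anything else falls back to "*".
    """
    by_name = {name.lower(): rules for name, rules in groups.items()}
    # A named group wins outright; the "*" rules are NOT merged in.
    if crawler.lower() in by_name:
        return by_name[crawler.lower()]
    return by_name.get("*", [])


groups = {
    "*": ["Disallow: /private-folder/"],
    "Googlebot": ["Disallow: /drafts/"],  # hypothetical path
}

print(select_group(groups, "googlebot"))  # only the Googlebot group
print(select_group(groups, "Bingbot"))    # falls back to the * group
```

In this model Googlebot never sees the /private-folder/ rule, which is why a rule meant for every bot has to be repeated inside each named group.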

Step 3: Add allow and disallow rules

Use Disallow to tell crawlers not to request a path:

User-agent: *
Disallow: /private-folder/

Use Allow to make an exception inside a blocked path:

User-agent: *
Disallow: /private-folder/
Allow: /private-folder/public-file.html

How conflicts are resolved. When both an Allow and a Disallow match the same URL, Google and other major search engines follow the most specific rule, the one with the longest path match, regardless of the order they appear in the file. If two rules are equally specific, the less restrictive Allow wins. This is why the example above works: /private-folder/public-file.html is longer and more specific than /private-folder/, so the Allow takes precedence for that one file. (Note: the original 1994 standard used first-match ordering, and a few older or simpler bots still behave that way.)
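To make the precedence rule concrete, here is a rough Python sketch of longest-match resolution for plain prefix rules. It is an illustration under simplifying assumptions (no wildcards, one group, rules already parsed into pairs), and the is_allowed name is invented for this example.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Longest-match precedence for plain prefix rules (no wildcards).

    `rules` holds ("allow" | "disallow", path_prefix) pairs from one group.
    """
    best_length = -1
    allowed = True  # no matching rule means the path may be crawled
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty rules and non-matching prefixes are skipped
        is_allow = kind == "allow"
        # A longer prefix is more specific; on a tie, Allow wins.
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed


rules = [
    ("disallow", "/private-folder/"),
    ("allow", "/private-folder/public-file.html"),
]

print(is_allowed(rules, "/private-folder/public-file.html"))  # True
print(is_allowed(rules, "/private-folder/other.html"))        # False
```

Reversing the order of the two rules gives the same answers, which is the practical difference from first-match parsers.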

Matching is prefix-based. A rule matches the start of the URL path:

  • Disallow: /private blocks /private, /private/, and /private-report.html (anything that begins with /private).
  • Disallow: /private/ is scoped to the folder and blocks only paths inside /private/.

Paths are case-sensitive. /Private/ and /private/ are different paths, so match the exact casing used in your URLs.

Wildcards. Google, Bing, and most major crawlers support two special characters in paths:

  • * matches any sequence of characters.
  • $ anchors the match to the end of the URL.

For example, Disallow: /*.pdf$ blocks every URL ending in .pdf, and Disallow: /*? blocks any URL containing a query string.
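One way to reason about these two characters is to translate a pattern into a regular expression. The Python sketch below does that; it is an approximation for experimenting with patterns, not the matcher any search engine actually runs, and the function names are invented for this example.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with ".*" so that
    # each "*" in the pattern matches any sequence of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # A trailing "$" pins the match to the end of the URL.
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern: str, url_path: str) -> bool:
    """True if the pattern matches from the start of the path (plus query)."""
    return pattern_to_regex(pattern).match(url_path) is not None


print(matches("/*.pdf$", "/docs/report.pdf"))             # True
print(matches("/*.pdf$", "/docs/report.pdf?download=1"))  # False
print(matches("/*?", "/shop?color=red"))                  # True
```

Because matching always starts at the beginning of the path, a pattern without wildcards behaves exactly like the prefix rules described earlier.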

The two cases to watch most closely:

Disallow:

and:

Disallow: /

Disallow: with nothing after it allows crawling (it’s an empty rule). Disallow: / blocks crawling for the entire site — a single misplaced slash is one of the most common and damaging robots.txt mistakes.
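You can see the difference between the two forms with Python's standard-library parser, urllib.robotparser. The can_crawl wrapper and the /products/ URL below are illustrative assumptions; the parser is a convenient stand-in for a compliant crawler, not a copy of any search engine's logic.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


allow_everything = "User-agent: *\nDisallow:"
block_everything = "User-agent: *\nDisallow: /"

print(can_crawl(allow_everything, "https://example.com/products/"))  # True
print(can_crawl(block_everything, "https://example.com/products/"))  # False
```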

Step 4: Add your sitemap URL

The Sitemap directive is optional and gives the location of your sitemap. The only requirement is that it must be a fully qualified (absolute) URL.

Sitemap: https://example.com/sitemap.xml

If your site uses a sitemap index, point to that instead:

Sitemap: https://example.com/sitemap_index.xml

You can include more than one Sitemap: line if your site publishes multiple sitemaps. Unlike Disallow/Allow, Sitemap is independent of any User-agent group.
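A quick way to catch a relative sitemap path before publishing is to pull out every Sitemap line and check that each value is an absolute URL. The following Python sketch shows one possible check; the helper names and the second, deliberately broken sitemap line are assumptions for illustration.

```python
from urllib.parse import urlsplit


def is_absolute_url(value: str) -> bool:
    """True for fully qualified http(s) URLs such as https://example.com/x."""
    parts = urlsplit(value.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def sitemap_urls(robots_txt: str) -> list[str]:
    """Collect the values of every Sitemap: line, regardless of group."""
    urls = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


robots_txt = """User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
Sitemap: /sitemap2.xml
"""

for url in sitemap_urls(robots_txt):
    print(url, is_absolute_url(url))
```

Here the second entry is flagged because it is a relative path rather than a full URL.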

Step 5: Upload robots.txt to the site root

Websites don’t automatically come with a robots.txt file — it isn’t required. Once you decide to create one, upload the file to your website’s root directory.

Your robots.txt file should be available here:

https://example.com/robots.txt

Don’t place robots.txt in a subdirectory of your domain. Crawlers only look in the root, so a file at www.yourdomain.com/page/robots.txt will be ignored.

If your site uses subdomains, each subdomain needs its own robots.txt file:

https://blog.example.com/robots.txt

https://shop.example.com/robots.txt
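If you are unsure which robots.txt file governs a given page, you can derive it from the page URL: keep the scheme and host, and replace everything else with /robots.txt. This short Python sketch shows the idea; robots_txt_url is a name invented for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to `page_url`."""
    parts = urlsplit(page_url)
    # Scheme and host identify the origin; path, query and fragment are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://blog.example.com/2024/post?x=1#top"))
# https://blog.example.com/robots.txt
print(robots_txt_url("http://example.com/page/"))
# http://example.com/robots.txt
```

Two URLs that differ in scheme or subdomain produce different results, which is why each one needs its own file.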

Step 6: Verify the file is live

Open the file directly in your browser:

https://example.com/robots.txt

Confirm that:

  • The file loads with a 200 OK status
  • The filename and casing are correct
  • The rules display as plain text
  • The sitemap URL works
  • The file appears in the site root

A note on status codes, because crawlers treat the response differently:

  • 2xx — the rules are read and applied.
  • 404 (not found) — treated as “no restrictions,” so the whole site is crawlable.
  • 5xx (server error) or a failed fetch — Google treats this as “disallow everything” while the error persists, and a prolonged outage can eventually be treated like a 404. A robots.txt file that returns 500 can therefore quietly stall crawling of your entire site, so make sure it serves reliably.
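The three cases above can be summarized as a small lookup. The Python sketch below encodes only what this list states; other status codes are deliberately left unclassified, the time-based handling of long outages is ignored, and the policy labels are invented for this example.

```python
def crawl_policy(status: int) -> str:
    """Map a robots.txt HTTP status to the crawl behavior described above."""
    if 200 <= status < 300:
        return "apply-rules"    # file is read and its rules are applied
    if status == 404:
        return "allow-all"      # no file means no restrictions
    if 500 <= status < 600:
        return "disallow-all"   # server errors pause crawling of the site
    return "not-covered"        # anything else is outside this sketch


for code in (200, 404, 500, 503):
    print(code, crawl_policy(code))
```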

Then test the file with a robots.txt testing tool to check for syntax or logic issues. Helpful options include:

  • The robots.txt report in Google Search Console (Settings → robots.txt), which shows the last-fetched file, its status, and any parsing issues. (Note: Google retired the older interactive “robots.txt Tester” tool, so the report is now the in-Console option.)
  • Google’s open-source robots.txt parser (github.com/google/robotstxt) if you want to test matches against the exact library Google uses.
  • The robots.txt Validator and Testing Tool from Merkle, Inc.

After testing, check important URLs to make sure you aren’t accidentally blocking pages, images, CSS, JavaScript, or other files search engines need to crawl and render your site.

Basic robots.txt template

Use this simple template when you want to allow crawlers to access the site and find the sitemap. This setup doesn’t block any crawlers; it simply gives them the sitemap location.

# Allow all crawlers
User-agent: *
Disallow:

# Sitemap location
Sitemap: https://example.com/sitemap.xml

WordPress robots.txt example

WordPress generates a virtual robots.txt file by default. That virtual file is only served when no physical robots.txt exists in the root — as soon as you upload a real file, it overrides the virtual one. You can also edit robots.txt with plugins like Yoast SEO or Rank Math, or upload the file manually through FTP, SFTP, SSH, or your hosting file manager.

If you use the Yoast SEO plugin, you’ll find a section in the admin area to create and edit a robots.txt file.

A common WordPress example looks like this:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap_index.xml

This discourages crawling of the WordPress admin area while allowing access to admin-ajax.php, which some themes and plugins need. The Allow works because its path is more specific (longer) than the Disallow — see the precedence note in Step 3.

Ecommerce robots.txt example

Ecommerce sites often have URLs that search engines don’t need to crawl, such as cart, checkout, account, and add-to-cart URLs.

Add-to-cart links deserve special attention. Those URLs aren’t cached, so repeated crawler hits increase your server’s CPU and memory usage.

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /*add-to-cart=

Sitemap: https://example.com/sitemap.xml

The Disallow: /*add-to-cart= rule uses the * wildcard to match any URL containing the add-to-cart= parameter, wherever it appears in the URL (including the query string).

Keep important product, category, and content pages crawlable unless you have a clear reason to block them.

Staging site robots.txt example

For a staging site, you may want to discourage crawling:

User-agent: *
Disallow: /

This can help keep compliant crawlers away, but it doesn’t make the staging site private, and a disallowed URL can still get indexed if it’s linked from elsewhere. Use password protection, IP restrictions, or authentication for staging environments that should not be public.

Common robots.txt directives

Directive | What it does | Example
User-agent | Selects the crawler the rules apply to | User-agent: *
Disallow | Tells crawlers not to request a path | Disallow: /checkout/
Allow | Creates an exception inside a blocked path | Allow: /wp-admin/admin-ajax.php
Sitemap | Points crawlers to the XML sitemap | Sitemap: https://example.com/sitemap.xml
Crawl-delay | Requests a delay between crawler requests, but support varies | Crawl-delay: 10

Google does not support Crawl-delay, so don’t rely on it to manage Googlebot’s crawl rate. Bing honors it; support among other crawlers varies, so check each crawler’s documentation before depending on it.

Also worth knowing: Google does not support a noindex directive in robots.txt (support was dropped in 2019). To keep a page out of search results, use a noindex meta tag or X-Robots-Tag HTTP header on the page itself — not robots.txt.
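For reference, here is what the two noindex mechanisms look like when a page is served. The Python sketch below is a minimal WSGI application written for illustration only; noindex_app and the page body are assumptions, and on a real site you would usually set the header in your server or CMS configuration instead.

```python
def noindex_app(environ, start_response):
    """Minimal WSGI app that serves a crawlable page marked noindex."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("X-Robots-Tag", "noindex"),  # header form of the noindex rule
    ]
    start_response("200 OK", headers)
    # The meta tag is the in-page equivalent; either form is sufficient.
    body = (
        "<!doctype html>"
        '<meta name="robots" content="noindex">'
        "<p>Kept out of search results</p>"
    )
    return [body.encode("utf-8")]
```

To try it locally, the standard library's wsgiref.simple_server.make_server can serve this app on a port of your choice. Note that the page answers with 200 OK and is not blocked in robots.txt, so crawlers can fetch it and see the rule.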

Robots.txt vs noindex vs password protection

Robots.txt, noindex, and password protection solve different problems.

Goal | Use
Manage crawler access to URLs | robots.txt
Keep a page out of search results | noindex (meta tag or X-Robots-Tag header)
Protect private or sensitive content | Password protection, authentication, or server-level access controls
Remove an already indexed URL | Removal tools plus noindex, redirects, or proper status codes depending on the situation

Use robots.txt for crawl control. Use noindex when you want a page kept out of search results. Use authentication when content should not be public.

Important interaction: don’t use robots.txt and noindex together on the same page. If a page is blocked in robots.txt, crawlers can’t fetch it, so they never see the noindex tag — and the URL can still appear in search results as a bare listing if other sites link to it. For noindex to work, the page must remain crawlable.

Robots.txt and AI crawlers

Some sites add robots.txt rules for specific crawlers, including AI crawlers.

User-agent: GPTBot
Disallow: /

You can add separate User-agent groups for different crawlers. Common ones site owners ask about include:

  • GPTBot (OpenAI)
  • Google-Extended (controls use of your content for Google’s AI/Gemini training without affecting how Googlebot crawls or ranks your pages for Search)
  • ClaudeBot / anthropic-ai (Anthropic)
  • CCBot (Common Crawl)
  • PerplexityBot (Perplexity)
  • Bytespider (ByteDance)

Just remember that robots.txt depends on crawler compliance. Reputable crawlers usually follow it, but bad bots may ignore it entirely — so it isn’t a substitute for access controls.
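If you maintain rules for several of these crawlers, generating the group from a list can keep the file consistent. The Python sketch below stacks multiple User-agent lines above one shared rule; the build_block_group helper and the particular selection of bots are illustrative assumptions, so check each vendor's documentation for its current user-agent token.

```python
def build_block_group(user_agents: list[str], path: str = "/") -> str:
    """Build one robots.txt group that applies the same Disallow to every agent."""
    lines = [f"User-agent: {name}" for name in user_agents]
    lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"


# Example selection taken from the list above; adjust to your own policy.
AI_CRAWLERS = ["GPTBot", "Google-Extended", "ClaudeBot", "CCBot"]

print(build_block_group(AI_CRAWLERS), end="")
```

The output is a single group with four User-agent lines followed by Disallow: /, which you can paste into your robots.txt file.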

How to test a robots.txt file

Testing should be part of the setup process, not an afterthought.

Before you rely on the file, review this checklist:

  • Open /robots.txt in a browser and confirm it returns 200 OK
  • Confirm the file is in the root directory
  • Check filename and casing
  • Validate the syntax with a testing tool or the Search Console robots.txt report
  • Test important URLs against the rules
  • Confirm important CSS and JavaScript are not blocked
  • Confirm product, service, or article pages are crawlable
  • Confirm blocked paths are intentional and the casing matches your URLs
  • Confirm the sitemap URL works
  • Purge CDN or site cache if old rules still appear
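Part of this checklist can be automated. The Python sketch below compares a list of URLs against the result you expect and reports any surprises. It uses the standard-library urllib.robotparser, which is not Google's matcher, so results for wildcard rules or overlapping Allow and Disallow rules may differ from what search engines do. The find_surprises name and the /products/widget URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def find_surprises(robots_txt: str, expectations: dict[str, bool],
                   agent: str = "*") -> list[str]:
    """Return URLs whose crawlability differs from what was expected.

    `expectations` maps each URL to True (should be crawlable) or False.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url for url, expected in expectations.items()
        if parser.can_fetch(agent, url) != expected
    ]


rules = "User-agent: *\nDisallow: /cart/\nDisallow: /checkout/\n"
expectations = {
    "https://example.com/products/widget": True,   # hypothetical product page
    "https://example.com/checkout/": False,
}

print(find_surprises(rules, expectations))  # [] means nothing unexpected
```

Running the same check against a file that contains Disallow: / would list the product page, which is the kind of accidental block worth catching before the file goes live.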

A small robots.txt mistake can create SEO problems, so test before and after publishing changes.

Common robots.txt mistakes

Avoid these common mistakes before publishing your file:

  • Naming the file robot.txt or Robots.txt instead of robots.txt
  • Uploading the file to the wrong folder instead of the site root
  • Blocking the whole site with Disallow: /
  • Letting robots.txt return a 5xx error, which can stall crawling site-wide
  • Using robots.txt to hide private content
  • Adding a Noindex: line to robots.txt (Google ignores it)
  • Blocking a page in robots.txt and expecting a noindex tag on it to work
  • Blocking CSS or JavaScript needed to render pages
  • Blocking important product, category, article, or landing pages
  • Forgetting that subdomains (and http/https, www/non-www) need their own files
  • Forgetting to update the sitemap URL
  • Forgetting to purge CDN or site cache
  • Assuming all bots will follow the rules

Robots.txt FAQs

Create a plain text file named robots.txt, add User-agent and Disallow rules, include your sitemap URL, upload it to the site root, and test it at https://example.com/robots.txt.

Place robots.txt in the root directory of the domain, such as https://example.com/robots.txt. Each subdomain needs its own robots.txt file.

A sitemap and robots.txt serve different purposes. A sitemap helps crawlers find important URLs, while robots.txt tells crawlers which URLs they can request. You don’t strictly need a robots.txt file (without one, your site is fully crawlable), but it’s useful for guiding crawl behavior and pointing to your sitemap.

Robots.txt is part of the Robots Exclusion Protocol (RFC 9309), but it is not a legal access control system. It provides crawler instructions that compliant bots are expected to follow.

Robots.txt does not reliably keep pages out of search results. It controls crawling, not indexing, so a disallowed URL can still appear in search results if it’s linked elsewhere. Use noindex (on a crawlable page) to keep pages out of search results, or password protection to protect private content.

Getting started with a robots.txt file

Creating a robots.txt file is simple, but setup details matter. Use the exact filename, write valid rules, place it in the root directory, add the sitemap, and test the file before relying on it.

Start by reviewing the paths you want crawled or blocked. Then create a plain text file named robots.txt, upload it to your site root, and verify it in a browser and in Search Console.

Robots.txt works best when your site is easy to manage, test, and update from the hosting environment. Liquid Web hosting gives teams the control, performance, and support they need to manage technical SEO files, site updates, caching, and server behavior with confidence. Explore Liquid Web hosting solutions to find the right fit.
