What Is A Robots.txt File? Best Practices For Robots.txt Syntax
Updated by Jo Cameron November 7, 2024
What Is a Robots.txt File and Why It Matters for SEO
Robots.txt is a plain-text file found in the root of a domain, available for anyone to access at yourwebsite.com/robots.txt. The robots.txt file is stored on the web server just like any other file and plays a crucial role in guiding web crawlers on how to interact with the site. It is a widely acknowledged standard that lets webmasters set rules for automated consumption of their sites, not just by search engines.
Most sites will have a robots.txt either by default or created by webmasters to instruct web robots on how to crawl pages on their website. The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users. The REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (such as “follow” or “nofollow”).
In practice, robots.txt files indicate whether certain user agents (web-crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents.
Basic robots.txt format:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
Here it is in practice:
User-agent: Googlebot
Disallow: /example-subfolder/
Together, these two lines are considered a complete robots.txt file — though one robots file can contain multiple lines of user agents and directives (i.e., disallows, allows, crawl-delays, etc.).
Within a robots.txt file, each set of user-agent directives appears as a discrete group, separated by a line break.
In a robots.txt file with multiple user-agent directives, each disallow or allow rule applies only to the user-agent(s) specified in that particular line break-separated group. If the file contains rules that could apply to a crawler under more than one group, the crawler will only pay attention to (and follow the directives in) the most specific group of instructions.
Some bots only pay attention to the directives in their own section of the robots.txt file. If this is the case, you will need to call them out specifically. All other user-agents will follow the directives in the User-agent: * group.
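For instance, in the hypothetical file below (all paths are illustrative), Googlebot follows only the directives in its own group, while every other bot follows the User-agent: * group:

```text
User-agent: Googlebot
Disallow: /not-for-google/

User-agent: *
Disallow: /private/
```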
Example robots.txt:
Here are a few examples of robots.txt in action for a www.example.com site:
By using specific directives, you can control which parts of your site appear in Google search results, optimizing your content for better visibility.
Robots.txt file URL: **www.example.com/robots.txt**
Blocking all web crawlers from all content
User-agent: *
Disallow: /
Using this syntax in a robots.txt file would tell all web crawlers not to crawl any pages on www.example.com, including the homepage.
Allowing all web crawlers access to all content
User-agent: *
Disallow:
Using this syntax in a robots.txt file tells web crawlers to crawl all pages on www.example.com, including the homepage.
Blocking a specific user agent from a specific folder
User-agent: Googlebot
Disallow: /example-subfolder/
This syntax tells only Google's crawler (user-agent name Googlebot) not to crawl any pages whose URLs begin with www.example.com/example-subfolder/.
Blocking a specific web crawler from a specific web page
User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
This syntax tells only Bing's crawler (user-agent name Bingbot) to avoid crawling the specific page at www.example.com/example-subfolder/blocked-page.html.
How does robots.txt work?
Search engines have two main jobs:
- Crawling the web to discover content;
- Indexing that content so that it can be served up to searchers who are looking for information.
To crawl sites, search engines follow links to get from one site to another — ultimately crawling across many billions of links and websites. This crawling behavior is sometimes known as “spidering.”
After arriving at a website but before spidering it, the search crawler will look for a robots.txt file; it is one of the first files a crawler requests when it arrives at a site. If it finds one, the crawler will read that file before continuing through the site. Because the robots.txt file contains information about how the search engine should crawl, the directives found there instruct further crawler action on that particular site. If the robots.txt file does not contain any directives that disallow a user-agent's activity (or if the site doesn't have a robots.txt file), the crawler will proceed to crawl the rest of the site.
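The lookup-and-obey behavior described above can be simulated with Python's standard-library robots.txt parser. This is a minimal sketch; the rules and URLs below are illustrative, not taken from a real site:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: Googlebot is barred from one subfolder,
# every other crawler may fetch everything.
rules = """\
User-agent: Googlebot
Disallow: /example-subfolder/

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot must skip the disallowed subfolder...
print(rp.can_fetch("Googlebot", "https://www.example.com/example-subfolder/page.html"))  # False
# ...but may crawl the rest of the site.
print(rp.can_fetch("Googlebot", "https://www.example.com/blog/"))  # True
# Other user agents fall under the catch-all group and may crawl anything.
print(rp.can_fetch("Bingbot", "https://www.example.com/example-subfolder/page.html"))  # True
```

In production, you would point the parser at a live file with `rp.set_url(...)` followed by `rp.read()` instead of parsing an inline string.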
User-agent and crawler management
User agent and crawler management is a crucial aspect of maintaining a healthy and optimized website. A user agent is a software program that acts on behalf of a user, such as a web browser or a search engine crawler. Web crawlers, also known as spiders or bots, are automated programs that systematically browse and index web pages to gather data for search engines.
To manage user agents and crawlers effectively, you need to understand how they interact with your website. Here are some key points to consider:
- User agent identification: Each user agent has a unique identifier, known as a user agent string, which can be used to identify the type of browser or crawler. Recognizing these strings helps you tailor your robots.txt file to manage specific user agents.
- Crawler behavior: Web crawlers can be configured to crawl your website at different frequencies, and some may be more aggressive than others. Understanding the behavior of different crawlers allows you to set appropriate crawl delays and disallow rules.
- Robots.txt file: A well-structured robots.txt file can help you manage crawler behavior and prevent unwanted crawling. By specifying which user agents can access certain parts of your site, you can protect sensitive areas and optimize crawl efficiency.
- XML Sitemap: An XML sitemap can help search engines understand the structure of your website and improve crawling efficiency. Including the location of your XML sitemap in your robots.txt file ensures that search engines can easily find and index your important pages.
By understanding user agent and crawler behavior, you can optimize your website for better search engine crawling and indexing. This not only improves your site’s visibility in search results but also ensures that your web pages are indexed accurately and efficiently.
Other quick robots.txt must-knows
There are a few important technical specifics to understand when you’re working with robots.txt files. While the files may seem trivial, they can affect how your whole site is crawled and indexed, so it’s important to know how they work.
- In order to be found, a robots.txt file must be placed in a website's top-level directory; web crawlers will not look for it anywhere else.
- Robots.txt is case-sensitive: the file must be named “robots.txt” (not Robots.txt, robots.TXT, or anything else).
- Some user agents (robots) may choose to ignore your robots.txt file. This is especially common with more nefarious crawlers, such as malware robots or email address scrapers.
- The /robots.txt file is publicly available: just add /robots.txt to the end of any root domain to see that website's directives (if the site has a robots.txt file!). This means that anyone can see which pages you do or don't want crawled, so don't use the file to hide private user information.
- Each subdomain on a root domain uses separate robots.txt files. This means that both blog.example.com and example.com should have their own robots.txt files (at blog.example.com/robots.txt and example.com/robots.txt).
- It’s generally a best practice to indicate the location of any sitemaps associated with this domain at the bottom of the robots.txt file.
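For example, a minimal file with the sitemap location declared at the bottom (the sitemap URL is a placeholder) might look like this:

```text
User-agent: *
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
```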
Identify critical robots.txt warnings using Moz Pro
Moz Pro's Site Crawl feature audits your site for issues and highlights urgent errors that could keep you from showing up on Google. Take a 30-day free trial on us and see what you can achieve.
Technical robots.txt syntax
Robots.txt syntax can be thought of as the “language” of robots.txt files. There are five common terms you're likely to come across in a robots file. They include:
- User-agent: The specific web crawler to which you're giving crawl instructions (usually a search engine). A list of most user agents can be found here.
- Disallow: The command used to tell a user-agent not to crawl a particular URL. Only one "Disallow:" line is allowed for each URL.
- Allow: The command to tell a user-agent it can access a page or subfolder even though its parent page or subfolder may be disallowed. Historically documented as Googlebot-only, Allow is now part of the robots exclusion standard and honored by other major crawlers as well.
- Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command, but the crawl rate can be set in Google Search Console.
- Sitemap: Used to call out the location of any XML sitemap(s) associated with this URL. Note this command is only supported by Google, Ask, Bing, and Yahoo.
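As a sketch, here is a hypothetical file that uses all five terms together (paths and the sitemap URL are placeholders):

```text
User-agent: Googlebot
Disallow: /private/
Allow: /private/public-page.html

User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow: /search/

Sitemap: https://www.example.com/sitemap.xml
```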
Pattern-matching
When it comes to the actual URLs to block or allow, robots.txt files can get fairly complex, as they allow the use of pattern-matching to cover a range of possible URL options. Google and Bing both honor two pattern-matching characters that can be used to identify pages or subfolders that an SEO wants excluded. These two characters are the asterisk (*) and the dollar sign ($).
- * is a wildcard that represents any sequence of characters
- $ matches the end of the URL
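For example, a hypothetical file using both characters (comments in robots.txt start with #):

```text
User-agent: *
# Block any URL containing a question mark (query strings)
Disallow: /*?
# Block any URL that ends in .pdf
Disallow: /*.pdf$
```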
Google offers a great list of possible pattern-matching syntax and examples here.
Can I block AI bots with robots.txt?
Robots.txt can be used to exclude AI bots like ClaudeBot, GPTBot, and PerplexityBot. A number of news and publication websites have blocked AI bots, and according to Tom Capper's research, GPTBot is the most frequently blocked. Whether blocking AI bots is the right move for you, and whether all AI bots will honor the directives, is still up for discussion.
How to block AI bots with robots.txt
To block AI bots, enter their unique user-agent and the areas of your site you would like to exclude them from. For example:
User-agent: GPTBot
Disallow: /blog
Disallow: /learn/seo
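To block several AI bots from an entire site, give each its own group. The user-agent names below are those the vendors publicly document; check each bot's documentation for the current string:

```text
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```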
Where does robots.txt go on a site?
Whenever they come to a site, search engines and other web-crawling robots (like Facebook's crawler, Facebot) know to look for a robots.txt file. But they'll only look for that file in one specific place: the main directory (typically your root domain). If a user agent visits www.example.com/robots.txt and does not find a robots file there, it will assume the site does not have one and proceed with crawling everything on the site. Even if a robots.txt page did exist at, say, example.com/index/robots.txt or www.example.com/homepage/robots.txt, it would not be discovered by user agents, and thus the site would be treated as if it had no robots file at all.
Always include your robots.txt file in your main directory or root domain to ensure it is found.
Why do you need robots.txt?
Robots.txt files control crawler access to certain areas of your site. While this can be very dangerous if you accidentally disallow Googlebot from crawling your entire site (!!), there are some situations in which a robots.txt file can be very handy.
Some common use cases include:
- Preventing duplicate content from appearing in SERPs (note that meta robots is often a better choice for this)
- Keeping entire sections of a website private (for instance, your engineering team’s staging site)
- Keeping internal search results pages from showing up in Google search results
- Specifying the location of sitemap(s)
- Preventing search engines from indexing certain files on your website (images, PDFs, etc.)
- Specifying a crawl delay in order to prevent your servers from being overloaded when crawlers load multiple pieces of content at once
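For instance, a hypothetical file covering two of these use cases might look like this:

```text
User-agent: *
# Keep the engineering team's staging area private
Disallow: /staging/
# Keep internal search results pages out of crawls
Disallow: /search/
```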
If there are no areas on your site to which you want to control user-agent access, you may not need a robots.txt file at all.
Checking if you have a robots.txt file
Not sure if you have a robots.txt file? Simply type in your root domain, then add /robots.txt to the end of the URL. For instance, Moz's robots file is located at moz.com/robots.txt.
If no .txt file appears, you do not currently have a (live) robots.txt file.
How to create a robots.txt file
If you found you didn't have a robots.txt file, or you want to alter yours, creating one is a simple process. This article from Google walks through the robots.txt file creation process, and this tool allows you to test whether your file is set up correctly. You can also use a tool like Robots.txt Parser to parse your robots.txt file the way Google does.
Common mistakes to avoid
When creating and managing a robots.txt file, there are several common mistakes to avoid:
- Incorrect file location: Make sure your robots.txt file is located in the root directory of your website (e.g., www.example.com/robots.txt). Placing it elsewhere will result in it not being found by user agents.
- Incorrect file format: Use a plain text file with UTF-8 encoding, and avoid using word processors that can add unexpected characters. This ensures that the file is readable by all user agents.
- Overly restrictive rules: Avoid disallowing entire directories or files that may be relevant to your website’s content. Overly restrictive rules can prevent important pages from being indexed, negatively impacting your SEO.
- Insufficient testing: Test your robots.txt file regularly to ensure it is working correctly and not blocking important pages or resources. Tools like Google’s robots.txt Tester can help you verify your file’s functionality.
- Ignoring crawler behavior: Understand how different crawlers behave and adjust your robots.txt file accordingly. Some crawlers may not respect certain directives, so it’s important to tailor your rules to the behavior of specific user agents.
- Not updating the file: Regularly update your robots.txt file to reflect changes to your website’s structure or content. As your site evolves, so should your robots.txt file to ensure it continues to serve its purpose effectively.
By avoiding these common mistakes, you can ensure your robots.txt file is working effectively to manage crawler behavior and improve your website’s search engine optimization. Proper management of your robots.txt file helps maintain a well-optimized site that is easily navigable by search engines, ultimately enhancing your visibility in search results.
SEO best practices for Robots.txt
- Make sure you’re not blocking any content or sections of your website you want crawled.
- Links on pages blocked by robots.txt will not be followed. This means: 1) unless they're also linked from other search engine-accessible pages (i.e., pages not blocked via robots.txt, meta robots, or otherwise), the linked resources will not be crawled and may not be indexed; 2) no link equity can be passed from the blocked page to the link destination. If you have pages to which you want equity passed, use a blocking mechanism other than robots.txt.
- Do not use robots.txt to prevent sensitive data (like private user information) from appearing in SERP results. Because other pages may link directly to the page containing private information (thus bypassing the robots.txt directives on your root domain or homepage), it may still get indexed. If you want to block your page from search results, use a different method like password protection or the noindex meta directive.
- Some search engines have multiple user-agents. For instance, Google uses Googlebot for organic search and Googlebot-Image for image search. Most user agents from the same search engine follow the same rules so there’s no need to specify directives for each of a search engine’s multiple crawlers, but having the ability to do so does allow you to fine-tune how your site content is crawled.
- A search engine will cache the robots.txt contents, but usually updates the cached contents at least once a day. If you change the file and want it picked up more quickly, you can submit your robots.txt URL to Google.
Robots.txt vs meta robots vs x-robots
So many robots! What's the difference between these three types of robot instructions? First off, robots.txt is an actual text file, whereas the meta tags and x-robots are meta directives. Beyond what they actually are, the three all serve different functions. Robots.txt dictates site or directory-wide crawl behavior, whereas meta and x-robots can dictate indexation behavior at the individual page (or page element) level.
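As a concrete illustration, the same noindex, nofollow instruction can be expressed at the page level with a meta robots tag:

```html
<meta name="robots" content="noindex, nofollow">
```

or at the HTTP level with an x-robots header, set in your server configuration (which also works for non-HTML files like PDFs and images):

```text
X-Robots-Tag: noindex, nofollow
```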
How do I view another site's robots.txt file?
Robots.txt is one of the more accessible areas of SEO since you can access any site's robots.txt.
http://www.example.com/robots.txt
You can view any site's robots.txt file by entering the URL above directly into the browser. Replace www.example.com with your chosen website domain.