What is robots.txt?
The robots.txt file is used to tell search engine robots and other crawlers what they should and should not do on a website. Directives are expressed using the Robots Exclusion Protocol standard, although some search engines declare that they also honor non-standard entries. The most basic entries state which parts of the site robots should not read, although the robots.txt file has more possible uses.
A bit of history
The Robots Exclusion Protocol was created a quarter of a century ago, in February 1994, and has not changed much since then, apart from the non-standard entries mentioned above. Because in its "youth" there were many search engines on the market (AltaVista, HotBot, Lycos and InfoSeek, to name just a few from a much longer list), it quickly became an unofficial standard. It is an unusual standard, however, because it was and still is only a suggestion, which bots often ignore or respect only partially.
Interestingly, in July 2019 Google - whose bots also do not always fully comply with the directives in robots.txt files - proposed that the Robots Exclusion Protocol be recognized as an official standard. Does this change anything in the way robots.txt is used today? In theory, no. However, it may spark discussion about introducing new entries that would allow more effective "control" of search engine robots.
Which robots read the robots.txt file?
The robots.txt file is intended for all automated systems that visit the site, not only the search engine robots that are most obvious from the SEO point of view. Its directives are also addressed to automatic archiving services (such as Web Archive), programs that download a site to a local drive (e.g. HTTrack Website Copier), website analysis tools (including SEO tools such as Xenu, as well as the Majestic SEO and Ahrefs bots), etc.
Of course, it is easy to guess that in many cases the creators of such tools do not bother with the directives. On the other hand, some robots let their users choose whether to comply with the directives they detect.
Why use a robots.txt file?
This is a basic question worth asking, especially given the point made several times above that bots do not always respect the entries in the robots.txt file. The answer is simple: a little control over robots is always better than none. And what can you gain from it? First of all, you can keep automated systems out of those sections of the website that they should not visit for various reasons, and point them to the places where visits are most welcome.
Blocking specific areas of the site can be important for a variety of reasons:
- Security issues - perhaps you simply don't want robots (or accidental users who later use resources crawled by robots) to reach too easily the sections they shouldn't have access to.
- Protection against duplicate content - if the site contains a large amount of internally duplicated content and the URL scheme allows it to be clearly identified, you can use the robots.txt file to signal to search engines that this part of the site should not be crawled.
- Saving bandwidth - with robots.txt entries you can try to remove entire subdirectories or specific types of files from the paths that robots travel, for example a folder containing graphics or their high-resolution versions. For some websites, the bandwidth savings can be significant.
- Protecting content against "leaking" outside - note that the protection suggested above for a folder with high-resolution graphics can also be used so that only smaller versions appear in image search. This can be important for photo banks (but not only for them).
- Crawl budget optimization - although I mention it at the end of the list, it is definitely not a trivial matter. The larger the website, the more emphasis should be placed on optimizing the paths along which search engine indexing bots move. By blocking pages that are irrelevant to SEO in robots.txt, you simply increase the likelihood that robots will go where they should.
Basic directives in robots.txt: user-agent, allow and disallow
Let's get to the heart of the matter: what the robots.txt file should look like. By definition, it should be a text file placed in the root directory of the website it relates to. Its main, most common directives are user-agent, allow and disallow. The first one determines which bots a given rule is addressed to. The other two indicate which areas the robot may access and where it is not welcome.
It is worth remembering that the robots.txt file supports a wildcard in the form of an asterisk (*), and that the path a directive applies to should always be filled in with something, even just a slash (/). A directive with an empty value is ignored.
An example of a correctly filled-in entry may be the following:
User-agent: *
Allow: /
- that is, declaring that all bots can index the entire site. Similarly:
User-agent: *
Disallow: /img/
- means denying access to the /img/ directory.
On the other hand:
User-agent: *
Disallow:
- does not mean anything, due to the lack of a declared path after the disallow directive.
Of course, there may be more allow and disallow fields in one robots.txt file. Example? Here you go:
User-agent: *
Allow: /
Disallow: /img/
Disallow: /panel/
- that is, permission for the robots to visit the entire site, except the /img/ and /panel/ folders.
It should be added that the directives themselves may apply not only to entire directories, but also to individual files.
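One way to check how such rules are read is Python's standard-library urllib.robotparser module. The sketch below is only an illustration, not a statement of how any particular search engine behaves: the bot name ExampleBot, the example.com domain and the file paths are assumptions made up for the example. Note that this parser applies the first rule whose path matches, so only disallow lines are used here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block two directories for every bot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /img/
Disallow: /panel/
"""


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" and the URLs are placeholders, not real crawler data.
for url in (
    "https://www.example.com/",
    "https://www.example.com/img/logo.png",
    "https://www.example.com/panel/login.php",
):
    print(url, "->", parser.can_fetch("ExampleBot", url))
```

The same can_fetch call works for a single file path, since the rules are matched as path prefixes.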
The order of allow and disallow directives in robots.txt
If the allow and disallow directives could be interpreted ambiguously, for example when you want to prohibit robots from accessing a directory but make an exception for a specific subdirectory, remember that the permitting directives should be placed above the prohibiting ones.
Separate rules can also be addressed to selected bots: a group naming the Ahrefs and Majestic SEO robots with a disallow for the whole site is a way to "request" that they do not crawl it at all.
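The effect of rule order can be sketched with Python's standard-library urllib.robotparser, which applies the first matching rule (other parsers may resolve conflicts differently, for example by choosing the most specific path). The /files/ paths and the ExampleBot name are hypothetical; AhrefsBot and MJ12bot are used here as the user-agent tokens of the Ahrefs and Majestic crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the exception (Allow) is listed above the general ban,
# and two SEO crawlers get their own groups that block the whole site.
ALLOW_FIRST = """\
User-agent: *
Allow: /files/public/
Disallow: /files/

User-agent: AhrefsBot
Disallow: /

User-agent: MJ12bot
Disallow: /
"""

# The same general rules with the order reversed.
DISALLOW_FIRST = """\
User-agent: *
Disallow: /files/
Allow: /files/public/
"""


def can_fetch(robots_txt, agent, path):
    """Return True if this parser lets `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


PUBLIC = "/files/public/report.pdf"
print(can_fetch(ALLOW_FIRST, "ExampleBot", PUBLIC))     # exception is matched first
print(can_fetch(DISALLOW_FIRST, "ExampleBot", PUBLIC))  # ban is matched first
print(can_fetch(ALLOW_FIRST, "AhrefsBot", PUBLIC))      # bot-specific group applies
```

With a first-match parser, placing the exception below the ban makes it unreachable, which is why the permitting line goes first.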
In addition to "invitations" and suggestions to skip directories, the robots.txt file can also be used to show robots the location of the sitemap. The sitemap directive is used for this, followed by the full URL of the map, for example Sitemap: https://www.example.com/sitemap.xml (a placeholder address).
Of course, you can indicate more maps, which can be useful for very complex websites.
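As a rough illustration, the sitemap entries can be read back with Python's standard-library urllib.robotparser (the site_maps method is available from Python 3.8). The two sitemap URLs below are placeholders invented for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that lists two sitemaps on a placeholder domain.
ROBOTS_TXT = """\
User-agent: *
Disallow: /panel/

Sitemap: https://www.example.com/sitemap-pages.xml
Sitemap: https://www.example.com/sitemap-images.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed URLs, or None when no Sitemap line exists.
sitemaps = parser.site_maps() or []
for sitemap_url in sitemaps:
    print(sitemap_url)
```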
For very large websites, a dilemma often arises: on the one hand, their owners may want the whole site indexed; on the other, excessive activity by search engine bots can consume a lot of bandwidth and keep the server constantly loaded with new requests. The non-standard crawl-delay directive was introduced to solve this problem.
It is used to inform robots that they should not download new files more often than every x seconds, which spreads the robot's work over time. For example:
User-agent: *
Crawl-delay: 2
- that is, downloading subsequent documents no more often than every two seconds.
It is worth remembering that most search engines treat this directive quite loosely, often simply ignoring it. Google signalled for some time that the directive was irrelevant to it, and finally, in July 2019, officially announced that it would not support it. Bing declares that the record is read by BingBot and that its value should be between 1 and 30. The directive is also theoretically supported by Yandex, although practice shows otherwise.
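For a crawler that does choose to honor the directive, the logic might look roughly like the sketch below. It uses the crawl_delay method of Python's standard-library urllib.robotparser; the bot name, the paths and the helper functions crawl_delay_for and fetch_schedule are assumptions made for this example, and no requests are actually sent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every bot to wait two seconds between requests.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
"""


def crawl_delay_for(robots_txt, agent, default=0.0):
    """Return the declared crawl delay in seconds, or a default if absent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else default


def fetch_schedule(paths, delay):
    """Pair each path with the offset (in seconds) at which to request it."""
    return [(index * delay, path) for index, path in enumerate(paths)]


delay = crawl_delay_for(ROBOTS_TXT, "ExampleBot")
for offset, path in fetch_schedule(["/a.html", "/b.html", "/c.html"], delay):
    print(f"t+{offset:.0f}s fetch {path}")
```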
Interestingly, the Czech search engine Seznam suggests using another directive, namely request-rate, whose value consists of the number of documents, a slash and a time given as a number and a unit (s for seconds, m for minutes, h for hours and d for days, in each case without a space after the number). For illustration, a value of 1/10s would mean one document every ten seconds.
Seznam declares that the directive should not demand indexing slower than one document every 10 seconds.
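The value format described above can be parsed with a few lines of Python. This is only a sketch of the syntax as described in this article (documents, a slash, a number and a unit); the function names are hypothetical and Seznam's own parser may accept additional forms.

```python
import re

# Seconds per unit for the suffixes described above (s, m, h, d).
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
RATE_PATTERN = re.compile(r"^(\d+)/(\d+)([smhd])$")


def parse_request_rate(value):
    """Return (documents, period_in_seconds) for a value such as '1/10s'."""
    match = RATE_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported request-rate value: {value!r}")
    documents, amount, unit = match.groups()
    if int(documents) == 0:
        raise ValueError("the number of documents must be positive")
    return int(documents), int(amount) * UNIT_SECONDS[unit]


def seconds_per_document(value):
    """Average pause between documents implied by a request-rate value."""
    documents, period = parse_request_rate(value)
    return period / documents


print(parse_request_rate("1/10s"))
print(seconds_per_document("100/15m"))
```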
Clean-param is a very interesting directive. Unfortunately, it is not a general standard. This directive is read by the Yandex search engine bots and allows them to ignore specific parameters attached to addresses in specified paths.
How does it work in practice? Let's assume your site contains several addresses that differ only in the value of a parameter called "tlo" (background).
What happens if the "tlo" parameter only modifies the appearance of a page whose content stays the same all the time? In such cases, Yandex suggests using the clean-param directive. The record names the parameter to be ignored and, optionally, the path it applies to,
- which means that all the addresses differing only in that parameter will be read as a single address without it.
As you can see, this directive is convenient because it can be limited to specific directories.
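The effect of such a record can be approximated in Python as follows. This is a sketch of the idea only, not Yandex's implementation: the apply_clean_param helper, the /catalog/ path and the three example addresses are assumptions invented for the illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def apply_clean_param(url, ignored_params, path_prefix="/"):
    """Drop the ignored query parameters from URLs under path_prefix."""
    parts = urlsplit(url)
    if not parts.path.startswith(path_prefix):
        return url  # the rule does not apply to this path
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored_params
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical addresses that differ only in the "tlo" (background) parameter.
urls = [
    "https://www.example.com/catalog/item.html?tlo=1&id=5",
    "https://www.example.com/catalog/item.html?tlo=2&id=5",
    "https://www.example.com/catalog/item.html?tlo=3&id=5",
]
canonical = {apply_clean_param(u, {"tlo"}, "/catalog/") for u in urls}
print(canonical)
```

Addresses outside the given path prefix are returned unchanged, which mirrors the way the directive can be limited to specific directories.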
The list of non-standard robots.txt directives also includes the host command. Ignored by most search engines, it was described on the Yandex help pages for some time, though that description has since disappeared.
The host command is used (or perhaps was used) to indicate the preferred domain if you have several mirrors located at different addresses. Importantly, there should be at most one host directive in a robots.txt file (if more are present, the later ones are ignored), and the domain entry after the host command must not contain errors or port numbers.
Unfortunately, I do not know whether the command still works, but knowing the ingenuity of SEO specialists, I can assume that it mainly tempted people to experiment with placing it on domains other than the ones where it belonged. As a consolation for "ordinary webmasters", it should be mentioned that Yandex, when it described the directive on its pages, presented it as a "suggestion" for robots, so it did not treat it as binding.
Errors in robots.txt and their consequences
Although the contents of the robots.txt file can be used to "get along" with search engine robots, they can just as well harm your site. How? By excluding from search the content that should appear in the index. The result can be a significant loss of visibility in search results. Especially with extensive robots.txt files that contain many entries for various subdirectories, it is easy to make a mistake somewhere along the way and exclude too many sections of the site.
The second major mistake is to cover all images, CSS styles and JavaScript files with the disallow directive. It may seem like a good move, but the reality is slightly different, for two reasons. First of all, in many cases it is a good thing if your pages can be found in image search results (although of course you can forbid access to e.g. high-resolution versions, as mentioned earlier).
The second reason is more important, however: the rendering of the site by Googlebot. If you do not allow the bot to access files that are important for the final appearance of the page, it will render the page without them, which in some cases may make the page incomplete from its point of view - and this may affect the ranking.
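A simple safeguard against both mistakes is to keep a list of addresses that must stay crawlable and check it against the file before publishing. The sketch below uses Python's standard-library urllib.robotparser, which is not Googlebot's own parser (it matches plain path prefixes and applies the first matching rule), so treat the result as a rough check. The rules, the paths and the find_blocked helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that accidentally blocks the CSS and JS directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /assets/
Disallow: /panel/
"""

# Paths that should remain reachable so pages can be indexed and rendered.
MUST_BE_CRAWLABLE = [
    "/",
    "/assets/site.css",
    "/assets/app.js",
    "/blog/",
]


def find_blocked(robots_txt, agent, paths):
    """Return the paths from `paths` that the rules forbid for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(agent, path)]


blocked = find_blocked(ROBOTS_TXT, "Googlebot", MUST_BE_CRAWLABLE)
for path in blocked:
    print("blocked but needed:", path)
```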
When creating a robots.txt file, should you pay attention to its size?
One Googler, John Mueller, once stated on his Google+ profile that the maximum size of a robots.txt file is 500 KB. It can therefore be concluded that the problem is abstract, because a list of directives that long would be absurd. However, it is worth making sure that even a short robots.txt file does not grow excessively and simply remains readable for someone who will have to look at it and possibly supplement or modify it.
In addition, you must also remember that this is only the value accepted by Googlebot - for other search engines, the robots.txt file size limit may be different.
Is blocking a page in robots.txt sufficient?
Unfortunately, no. First of all, the main search engine robots don't always respect the bans (not to mention how some tools approach them). Secondly, even after reading the ban, Google may still add the page to the index, taking into account only its title and URL, and sometimes adding the following statement: "For this page information is not available."
So it is still possible to reach such a page from the search engine, although this is unlikely. What's more, bots may still pass through such pages by following links, even though the pages no longer pass on link juice and their ranking does not take into account data resulting from their content.
What else besides the robots.txt file?
If you want to exclude specific parts of the page from search engine indexes, you can always additionally use the robots meta tag placed in the <HEAD> section of individual subpages:
<meta name="robots" content="noindex, nofollow" />
- this is still not a method that guarantees 100% success (and it is also less convenient), but it is an additional signal for bots.
But what if you want to completely block access for bots and random people? In this situation, instead of passive methods that rely on no one finding a given place, it is much better to simply protect the given section of the site with a password (even through htaccess).
Theoretically, you can also reach for half measures, for example blocking requests from specific IP addresses and ranges (those used by search engine bots), but in practice it would be enough to miss some addresses and the problem would still exist - and this leads us once again to the conclusion that requiring authorization is what ensures full security.
Finally, we can return to the consequences of possible errors in the robots.txt file. Whatever you type in there, it's worth remembering what the effects can be and ... what it's meant to do. When you want to exclude something from indexing, check whether this will lead to side effects (see the example of Googlebot having difficulty rendering the page). In turn, if you care about security, remember that excluding a page from Google's index still does not block access for automated scraping systems.
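To verify that a given subpage actually carries the tag, its HTML can be inspected with Python's standard-library html.parser. This is a minimal sketch: the RobotsMetaParser class, the robots_directives helper and the sample page are made up for the illustration, and the page is supplied as a string rather than downloaded.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the set of robots meta directives present in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page source containing the tag shown above.
PAGE = (
    "<html><head>"
    '<meta name="robots" content="noindex, nofollow" />'
    "</head><body></body></html>"
)
print(sorted(robots_directives(PAGE)))
```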
If you still feel like you could use some more knowledge ... you can read the other articles regarding positioning of pages in our Knowledge Base.