What Are Web Crawlers and robots.txt?
Web crawlers are the driving force behind populating search engine databases. These crawlers are programs that search engines write to view, store, and index your website's content. Once your site has been crawled, these bots will return periodically, looking for new or updated content. Beyond search engines, web crawlers are also used to scan for copyright infringement or viruses, or to archive content for projects such as https://archive.org/.
While these web crawlers are (usually) well intentioned, they don't all crawl your site in the same way or at the same pace. These differences are where the robots.txt file comes into play, giving you more control over how these bots behave on your website. The robots.txt file serves as a guide that tells bots which areas of your site to skip and how quickly they should move through the rest. Not everyone needs a robots.txt file, but if bots or web crawlers are becoming problematic, read on!
Using the robots.txt file
The robots.txt file should exist only directly within the document root of your site. For example, Liquid Web’s robots.txt file lives here: https://www.liquidweb.com/robots.txt. A crawler only needs to read the robots.txt file once, at the start of each crawling session of your site.
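As a quick illustration, here is a minimal (and purely hypothetical) robots.txt, with example.com standing in for your own domain and the path in the comment reflecting a typical cPanel document root:

```
# Saved as /home/username/public_html/robots.txt (the document root),
# so it is served at https://example.com/robots.txt
User-agent: *
Disallow:
```

An empty Disallow value like this allows crawling of everything; the directives below are what let you restrict or slow that crawling down.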
There are only a few directives to use within the robots.txt file. Each directive can take one value, or you can use an asterisk (*) to mean “all”. A combined example follows the list below.
- User-agent: <crawler-name> The user-agent section can be used to set robots.txt directives that apply only to a certain crawler. The crawler’s user-agent can be found within the access logs for your web server, or within most analytics tools, such as Google Analytics, AWStats, or Webalizer.
- Disallow: </location/> This directive states where a crawler should not crawl. Typical disallowed areas include:
- Administrative logins and panels
- Areas still in development
- Areas you’d just prefer the crawlers avoid
- Crawl-delay: <number> A crawl delay tells web crawlers to slow down the rate at which they go through your website. This can be particularly useful if web crawlers are causing your server’s load to increase, which can then lead to slower loading times on your website. Keep in mind that different crawlers interpret the Crawl-delay directive in different ways, which can be a bit confusing.
- Bing, for example, treats the number as a factor by which it slows down its crawling, where 1 is a little slower than usual and 10 is significantly slower than usual. Most other crawlers and bots treat the number as the number of seconds to wait between each request they make to your site. Slowing down the rate at which these crawlers go over your site reduces the amount of work your server has to do at once, and can help keep more of your server’s resources free for other tasks.
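Putting these directives together, a robots.txt combining them might look like the sketch below. The paths and the “BadBot” user-agent are placeholders for illustration, not values you need to copy. Note that a crawler with its own User-agent group (like Bingbot here) follows only that group, so any Disallow rules it should obey must be repeated there:

```
# Applies to every crawler without a more specific group below
User-agent: *
# Hypothetical administrative panel and in-development area
Disallow: /admin/
Disallow: /dev/
# Most bots: wait 10 seconds between requests
Crawl-delay: 10

# Bing reads Crawl-delay as a slow-down factor rather than seconds,
# so it gets its own group with the Disallow rules repeated
User-agent: Bingbot
Disallow: /admin/
Disallow: /dev/
Crawl-delay: 5

# A hypothetical misbehaving bot you want to keep off the site entirely
User-agent: BadBot
Disallow: /
```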
robots.txt and Best Practices
Keep in mind, robots.txt is a set of guidelines, not something that all crawlers are forced to follow. While it is best practice for these bots to follow your rules, some do not. Googlebot, for example, honors Disallow rules in robots.txt but ignores the Crawl-delay directive; its crawl rate has to be adjusted through the Google Webmasters (now Google Search Console) dashboard instead. Less reputable bots, on the other hand, will sometimes simply disregard the robots.txt file because they can.
In cases where bots are causing a problem for your site and aren’t following your directives in the robots.txt file, you always have the option to block them in your server’s firewall, which will stop them from accessing your server entirely. As a “smaller hammer” alternative, you could also block the bot’s IP via cPanel, which will only block it from accessing a specific website.