What is robots.txt and when to use it?


Robots.txt is a file placed in a website's root directory that contains directives, or instructions, for well-known bots (search engine and other crawlers). These instructions tell them not to scan certain parts of the website. Webmasters use it to keep bots away from duplicate content, sections behind a login, shopping carts, and other links that are of no use to robots. It is more of an informal guideline than a requirement for bots, much like a "Do Not Track" request. Spam bots, website scrapers, and malware can choose to ignore these directives with no penalties.

When to use robots.txt?

Well-known search engines (Google, Bing, Yandex) and other crawlers (SemRush, Ahrefs, Telegram) respect these guidelines. This means you can stop them from crawling your website too often and taking up resources. Webmasters can also outright ban these bots from crawling their websites.

As a web developer, you may have already come across this file. Before you ban bots from crawling your website, you'll need to analyze not only your bot traffic but also your digital marketing and business strategy. Some bots gather data entirely for themselves, with no benefit to the website owner. Webmasters can safely disallow them until they are needed.

The Robots.txt Standard

Robots.txt should follow the robots exclusion standard. All well-known crawlers understand the standard directives, but there are also some non-standard directives followed by only a few.

How to create a directive?

To compose a robots.txt directive, you need two parts: the identifier and the rule.

The Identifier

Most of the time, the identifier is the user-agent of the bot, unless you want a directive that applies to every bot. To target a specific bot, the User-agent line should contain its exact name; to match multiple bots, list them one per line. The lines that follow contain the rule or rules those bots should obey.

The * character is a wildcard match. It represents every bot that obeys the robots exclusion standard. One or more rules follow this line.
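For example, the first group below applies only to a single named bot, while the second uses the wildcard to address every compliant bot (the bot name and paths are placeholders; the Disallow rule itself is explained in the next section):

# Applies only to Bing's crawler
User-agent: bingbot
Disallow: /drafts/

# Applies to every bot that follows the standard
User-agent: *
Disallow: /tmp/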

The Rules

The standard and most widely used rule is the Disallow rule. The rule name is followed by a colon : and a path on the website. It tells the bot to ignore that page or set of pages.
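A minimal sketch of the Disallow rule in use, telling every compliant bot to skip one page and one directory (the paths are placeholders):

User-agent: *
Disallow: /checkout.html
Disallow: /members/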

Non-standard directives

These are directives followed by only a few bots. Including them does not invalidate a robots.txt file. Whether to use them depends on whether you want to optimize your website for these bots.

The Sitemap directive provides the absolute URLs of the sitemap files on the website.

Sitemap: http://www.example.com/sitemap.xml

The Allow directive does exactly what it sounds like: it allows a bot to crawl the given path. It is useful when you want to permit crawling of a single file inside an otherwise disallowed directory.

Allow: /comics/marvel.html
Disallow: /comics/

The Crawl-delay directive sets the time (in seconds) a bot should wait between two crawls. Google ignores this directive.

User-agent: bingbot
Allow: /
Crawl-delay: 10

Types of robots

Search Engine Bots

Search engine bots are important, as they help you get found through search queries. But you do not have to let every search engine bot into your website. If your website doesn't contain content or offer services for a Chinese audience, blocking the Baidu search engine crawler is a good option. The same goes for Yandex, as it mostly caters to a Russian-speaking audience.
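A sketch of what that could look like is below; the user-agent tokens (Baiduspider, Yandex) are the ones these engines commonly document, so verify them against the official documentation before relying on this:

# Block Baidu's crawler from the whole site
User-agent: Baiduspider
Disallow: /

# Block Yandex robots from the whole site
User-agent: Yandex
Disallow: /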

Google and Bing are the only major search engines with the infrastructure to crawl the massive number of websites on the internet. If you're looking for organic search traffic, these two are the ones that matter most in Search Engine Optimization. Other meta search engines like DuckDuckGo, Qwant, and Ecosia depend on the Bing search index with additional algorithms of their own.

Major search engines obey robots.txt directives and will act accordingly. Google also has a robots.txt tester for users of Google Webmasters, where you can check whether you configured your file correctly for the Google crawler. Google may ignore rules like Crawl-delay, which Bing obeys.

Search Engine Marketing Robots

This section covers bots like SemrushBot, AhrefsBot, Moz, and others. These bots crawl websites, gathering a vast trove of information from website content for their analytics products. If you do not subscribe to their products, these services provide you no benefit.

Note: If you don’t use them and are not interested in participating in their data collection, you can block these bots with no issues. You might even save a little bandwidth.
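For example, here is a minimal sketch that blocks two of these crawlers entirely; SemrushBot and AhrefsBot are the user-agent names these services publish, but double-check them before deploying:

User-agent: SemrushBot
Disallow: /

User-agent: AhrefsBot
Disallow: /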

Social Media and Chat Robots

Social media and chat robots gather information about the links shared on their platforms. These include FacebookBot, TelegramBot, TwitterBot, and others. They grab the meta tags from the links to display a title, an image, and sometimes a description.

Scrapers and other bots

Anyone can build their own bots and scrapers in many programming languages. There are also tools like curl that can send HTTP requests and receive responses with great flexibility. These one-off tools and bots do not check robots.txt before navigating your website. They can use different user-agents to disguise themselves as browsers or as other bots.

These tools are hard to block and are used by malicious programs to enumerate a website looking for vulnerabilities and exploits, or simply to scrape content.

The best way to prevent unwanted bots from taking up valuable processing power is to use server configuration files like .htaccess (for Apache) to deny access to certain bots. If you’re not technically inclined, you can use a service like CloudFlare to block them with Firewall Rules.
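As a rough sketch of the .htaccess approach, the mod_rewrite rules below return a 403 Forbidden response to any request whose User-Agent contains either of two illustrative bot names; match the names to what you actually see in your access logs:

# Requires Apache's mod_rewrite; bot names are illustrative
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (SemrushBot|AhrefsBot) [NC]
RewriteRule .* - [F,L]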

Conclusion

Organizing robots.txt should be one of your main priorities when setting up your website. Once a search engine indexes your links, you may find it hard to remove them from search results. Blocking search engine marketing robots is a safe bet if you don’t use their services and want nothing to do with them. You can always allow these robots to crawl your website later, when and if you want to try them out.

There are also some common mistakes that you might want to avoid. These include:

  1. Duplicate directives for the same user-agent can result in only one of them being followed.
  2. Paths are case-sensitive.
  3. Make sure only an authorized person can change the content of the robots.txt file.

This should cover all the basics of robots.txt, and how and when to use it for the best results.
