
What is robots.txt? : Explaining its purpose and how to write it

Have you, as a website manager, heard of robots.txt? 

This file instructs search engine crawlers which pages to crawl and which to avoid. 

Properly configuring robots.txt supports search engine optimization (SEO) and helps search engines index your website efficiently.

In this guide, we will explain the purpose of setting up robots.txt and provide a detailed guide on how to write it.


What is robots.txt?

Robots.txt is a file used to deny crawlers access to specific content. Crawling is the process by which search engine crawlers (robots) traverse websites to gather information about them; it is the essential step through which search engines collect the data they store in their indexes.

A website typically contains both important and less important content. Robots.txt lets you block certain crawls so that crawling is concentrated on the important content.

History of robots.txt

Robots.txt was devised in 1994 by Martijn Koster. It was initially provided as a means for website operators to control crawler access.

In 1997, the Robots Exclusion Standard (also known as Robots Exclusion Protocol) was established, making robots.txt more commonly used.

Now, based on this standard, website operators can instruct crawlers on which pages to deny crawling, making it an indispensable tool for website management.

Reference page: Wikipedia

Difference between robots.txt and noindex

A setting that is easily confused with robots.txt is noindex. Noindex is a setting that prevents search engines from indexing a page (storing its information in their database), and it is implemented through a meta tag in the HTML code or an HTTP header.

Hence, there are significant differences in purpose and setup method as outlined below:

Item | robots.txt | noindex

Format | Text file | Meta element or HTTP header

Scope | Can be set for the entire site | Set for individual pages

Purpose | Denies crawling | Denies indexing

With noindex, indexing is denied, so the content will not appear in search results. With robots.txt, on the other hand, only crawling is denied, so the content may still appear in search results (for example, if other sites link to it).
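
To make the difference concrete, here is a minimal, hedged illustration. A page can deny indexing with a meta tag in its HTML head (or with an equivalent X-Robots-Tag HTTP header):

<meta name="robots" content="noindex">

Robots.txt, by contrast, denies crawling with a rule such as the following, where /sample/ is a placeholder directory:

User-agent: *
Disallow: /sample/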

The purpose of robots.txt

The primary purpose of robots.txt is to deny crawling, but it also plays other roles, such as optimizing crawlability and pointing crawlers to an XML sitemap. It is also an indispensable alternative to noindex, which can only be applied to HTML pages.

Here, we’ll explain in detail the purpose of robots.txt:

Prevent crawling of specific pages

The main purpose of robots.txt is to deny crawling of specific content, which can be set at various levels, such as by page or directory.

For example, it can be used for content such as:

-Incomplete pages

-Pages that require login

-Pages only available to members

Such pages are often not intended to appear in search results, or SEO is simply not a consideration for them. Letting crawlers spend time on unnecessary content like this can be counterproductive for SEO.

By using robots.txt to deny crawling, you can prevent situations that could undesirably lower the site’s rating.

Preventing crawling of image and video files

While using images and videos is common in website management, these files are not HTML, and therefore cannot be set with noindex. However, robots.txt allows you to deny crawling for non-HTML files, making it often used as an alternative when noindex cannot be applied.
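
As a hedged sketch, the example below blocks Google's image crawler from the entire site and blocks all crawlers from a hypothetical /videos/ directory (both the scope and the directory name are placeholders, not a recommended configuration):

User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow: /videos/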

Note, however, that some sites automatically generate an HTML page for each image (or such a page can be created separately); for those pages, noindex can be used just as with regular pages.

Optimizing crawlability

Robots.txt can direct crawling to important content, thus contributing to optimized crawlability.

For sites with a small amount of content, this isn’t an issue, but for sites like e-commerce sites with a large number of pages, crawlers may not be able to crawl every page.

Some important pages might not get crawled due to the sheer number of pages.

If such pages are critical for attracting visitors or generating inquiries, missing them can be a significant loss for the site.

Therefore, using robots.txt to deny crawling of unnecessary pages and ensure important pages are crawled can be effective. Moreover, optimizing crawling with robots.txt can increase the overall crawl frequency of the site and the amount of crawled content.
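
For illustration only, a large site might keep crawlers away from low-value pages such as internal search results or parameter-generated URLs so that crawling is spent on important content. The paths below are hypothetical, and the * wildcard shown is an extension supported by major search engines such as Google:

User-agent: *
Disallow: /search/
Disallow: /*?sessionid=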

Presenting an XML Sitemap

Robots.txt can include an XML sitemap to guide search engines in crawling your website. While you can submit sitemaps via Google Search Console or Bing Webmaster Tools, some search engines lack such tools. 

Therefore, robots.txt becomes a handy method to efficiently communicate your site’s structure and crawling preferences.

Writing robots.txt

When configuring robots.txt, you follow a fixed format, entering values for a small set of directives. There are typically four main directives to consider. For specific sample code, refer to "Google Search Central".

This section will explain how to write a robots.txt file.

User-Agent

User-Agent is used to specify which crawlers you want to control.

Here are the details:

-For all crawlers: * (asterisk)

-For Google's crawler: Googlebot

-For smartphone crawlers: Googlebot-Mobile

-For AdSense's crawler: Mediapartners-Google

-For Google image search's crawler: Googlebot-Image

The basic notation is to use ‘*’ to target all crawlers. If you want to block crawling from Google, you would specify the Google crawler as ‘Googlebot’.
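
For example, the following hedged sketch allows all crawlers everywhere but blocks Googlebot from a placeholder directory (an empty Disallow value means nothing is blocked for that group):

User-agent: *
Disallow:

User-agent: Googlebot
Disallow: /abc9999/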

Disallow

Disallow is used to specify which pages or directories should not be crawled. By entering the URL path, you can selectively prevent crawling.

-For the entire site: 'Disallow: /'

-To specify a directory: 'Disallow: /abc9999/'

-To specify a page: 'Disallow: /abc9999.html'

Replace ‘abc9999’ with the actual URL path.

Disallow is the most commonly used directive, so it is worth remembering how it works.
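
Putting these together, a hedged sample that blocks one placeholder directory and one placeholder page for all crawlers might look like this:

User-agent: *
Disallow: /abc9999/
Disallow: /abc9999.html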

Allow

Allow is used to permit crawling, serving the opposite function of Disallow. Typically, crawling is allowed by default, so the use of the Allow directive is less common.

Essentially, you use Allow when you want to permit crawling of specific pages or directories despite having a Disallow directive in place.

For example:

User-agent: *

Disallow: /sample/

Allow: /sample/abc9999.html

In the above case, while the ‘sample’ directory is disallowed for crawling, the page ‘abc9999.html’ within that directory is explicitly allowed for crawling.

Sitemap

The Sitemap directive, as the name suggests, is used to tell crawlers where your XML sitemap is located.

Including the Sitemap is optional, but doing so can increase the speed of crawling, which is beneficial for improving crawlability.

Here’s how you specify it:

Sitemap: http://abc9999.com/sitemap.xml

Replace 'abc9999.com' with your actual domain and sitemap path.

If you have multiple sitemap paths, enter each on a new line.
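
For instance, a site with separate sitemaps for pages and for images (the URLs here are placeholders) would list them on separate lines:

Sitemap: https://abc9999.com/sitemap.xml
Sitemap: https://abc9999.com/sitemap-images.xml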

Related Article: Does a Sitemap Help SEO? The Surprising Risks of Sitemaps

Setting up robots.txt

There are two main ways to set up robots.txt:

-Use a plugin

-Upload the file directly

For WordPress sites, a plugin such as "All in One SEO Pack" makes setting up robots.txt straightforward. Both methods are explained below.

Using a Plugin

If you’re using a WordPress site, you can easily set up your robots.txt file using the “All in One SEO Pack” plugin.

Here’s how you can configure it:

-Download and activate the “All in One SEO Pack”.

-From the WordPress dashboard, open the plugin's "Robots.txt" settings.

-Activate all features in the “Features” section.

After these steps, the following will appear at the bottom of the “Create Robots.txt file” section:

User-agent: *

Disallow: /wp/wp-admin/

Allow: /wp/wp-admin/admin-ajax.php

Sitemap: https://sample.com/sitemap.xml

Edit as necessary, referencing the previously mentioned guidelines to complete the setup.

Direct Upload

A method that works for any site is to upload the robots.txt file directly to the site's root directory.

Specific requirements are:

-File format: Plain text encoded in UTF-8

-File size: Maximum 500KB

A robots.txt file placed at the root of a subdomain works, but a file placed in a subdirectory will not be detected.
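
As an illustration (example.com is a placeholder domain):

https://example.com/robots.txt (valid: root of the domain)
https://sub.example.com/robots.txt (valid: applies only to that subdomain)
https://example.com/blog/robots.txt (not detected: placed in a subdirectory)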

Checking robots.txt

While you can manually check the robots.txt file, using a tool is recommended to avoid oversights or errors. Google’s “robots.txt Tester” is a free tool that allows easy error checking by entering the URL.

Here, we explain how to use the “robots.txt Tester” to verify your robots.txt.

Reference page: Search Console Help

Syntax Checking

Syntax checking refers to verifying whether the contents of a robots.txt file are grammatically correct. You can perform a syntax check using the following method:

-Access the “robots.txt Tester”.

-Enter the relevant URL path in the URL input field at the bottom of the screen and click “Test”.

-The test results will be displayed.

Before testing, make sure your site is correctly linked to Search Console. If your robots.txt does not show up in the tool, the file has not been installed properly; in that case, re-upload the robots.txt file and run the test again.

Syntax Correction

After checking the test results with the “robots.txt Tester,” ensure there are no errors.

If there are errors, first correct them within the "robots.txt Tester": click on the location of the error and edit the text directly to adjust the syntax.

If errors persist, continue modifying the content until there are no more errors.

Note that changes made in the “robots.txt Tester” do not alter the actual robots.txt file. Therefore, after identifying the errors, you should correct the actual file.

Repeat the test process to ensure no errors occur, completing the verification.

Notes on Setting up robots.txt

Setting up robots.txt is relatively straightforward, as it involves entering data according to specified fields. However, certain aspects can be easily misunderstood, so it’s crucial to be attentive to ensure the use aligns with the intended purpose:

-Do not use robots.txt to deny indexing.

-Do not use it for handling duplicate content.

-It does not restrict user access.

-Keep robots.txt up to date.

Here’s an explanation for each point:

Do not use robots.txt to deny indexing

A common mistake is using robots.txt with the intention of denying indexing.

Robots.txt is strictly for denying crawl access; to deny indexing, noindex must be used. While the two may seem to produce similar results, incorrect usage can lead to problems such as a page appearing in search results without any description, so it is vital to use the right method to avoid negatively impacting the site's overall evaluation.

Do Not Use for Duplicate Content Prevention

For the same reasons as “index denial” mentioned earlier, do not use robots.txt for duplicate content prevention.

As the volume of content on a site grows, duplicate content becomes more likely. It may seem logical to use robots.txt to block crawling and thereby eliminate the duplication, but content that is already indexed is still recognized as duplicate by search engines.

Since robots.txt cannot fully address this, use “noindex” or “URL canonicalization” to manage duplicate content.
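
For reference, URL canonicalization is declared with a canonical link element in the HTML head of the duplicate page, pointing to the preferred URL (a placeholder here):

<link rel="canonical" href="https://abc9999.com/original-page.html">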

Not for User Access Restriction

A common misuse of robots.txt is attempting to restrict user access. This is a misconception, akin to thinking it completely excludes content from search engines.

However, robots.txt does not restrict user access. Even if crawling is denied, anyone who knows the URL, or finds it linked somewhere on the web, can still open the page.

To implement access restrictions, separate settings are required, so be careful not to misunderstand the effects that can be achieved.

Updating robots.txt

When a website is redesigned or its URL structure changes, the robots.txt file must be updated, both to give search engines accurate information and to protect the site's SEO. Specifically, if a page indexed under an old URL is moved to a new one, set up a 301 redirect and reflect the new URL in the robots.txt file. This ensures search engines correctly crawl the new page and prevents negative impacts on the site's SEO.

Deleting or adding pages should also prompt a review of the robots.txt settings.

FAQs on robots.txt

Here we have compiled some frequently asked questions about robots.txt.

Q: Where should the robots.txt file be placed?

A: The robots.txt file should be placed in the root directory of the website, which is the highest level directory. While it’s technically possible to place robots.txt elsewhere, it’s not recommended because search engines might not crawl it correctly. For more details on structuring directories effectively for SEO, refer to the following page. 

Reference page: Guides on creating strong directory structures for SEO

While it’s technically possible to place the robots.txt file outside the root directory, it’s not recommended because it might prevent search engines from correctly crawling the site.

Q: What problems can occur from incorrectly setting up robots.txt?

A: Incorrect configuration of robots.txt can prevent search engines from crawling pages of the website. For example, specifying wrong directives can lead to search engines being instructed not to crawl, potentially causing a drop in SEO rankings.

Q: Does robots.txt affect SEO?

A: Properly configured robots.txt can positively impact SEO by directing search engines to ignore unnecessary pages. Conversely, improper configuration can lead to lower SEO rankings.

Q: Is robots.txt used to hide pages?

A: Robots.txt is used to specify pages or directories that search engines should not crawl, not to hide pages from view. To keep pages out of search results or away from users, mechanisms other than robots.txt, such as a noindex meta tag or password protection, should be used.

Summary 

This article has explained the basics, writing methods, and specific settings for robots.txt. It’s crucial to understand the difference between noindex and crawl prevention to avoid losing potential SEO benefits. As setting up robots.txt can be straightforward, consider following the guidelines and methods discussed in this article for effective implementation.

Author Profile

SEO Consultant

Mr. Takeshi Amano, CEO of Admano Co., Ltd.

Mr. Takeshi Amano is a graduate of the Faculty of Law at Nihon University. After 12 years in the advertising agency industry, he discovered SEO and began researching it in its early days, teaching himself through experiments and verifications on over 100 websites. Using this expertise, he founded Admano Co., Ltd., now in its 11th year of operation. Mr. Amano handles sales, SEO consulting, web analytics (he holds the Google Analytics Individual Qualification), coding, and website development. The company has successfully managed SEO strategies for over 2000 websites to date.
