What is robots.txt? Explaining Its Purpose and How to Write It
If you’re a web administrator, you’ve likely heard of robots.txt. This file instructs search engine crawlers which pages to crawl and which to avoid. Properly setting up robots.txt can aid in search engine optimization (SEO) and effectively manage the indexing of your website.
In this article, we’ll explain the purpose of robots.txt and how to write it.
What is robots.txt?
Robots.txt is a file used to deny crawlers access to specific content. Crawling, performed by search engine robots, is the process of traversing websites to gather information, an essential mechanism for collecting the data that is later stored in the search engine's index.
Typically, a site contains both important and less important content. Robots.txt allows you to control some of this crawling, focusing it on the more critical content.
History of robots.txt
Robots.txt was conceived in 1994 by Martijn Koster, creator of the early search engine ALIWEB. Initially, it was provided as a means for website operators to control crawler access.
In 1997, the Robots Exclusion Standard (also known as Robots Exclusion Protocol) was established, making robots.txt more commonly used. Today, based on this standard, website operators can instruct crawlers on which pages to avoid, making robots.txt an indispensable tool.
Reference page: Wikipedia
Difference Between noindex and robots.txt
A setting often confused with robots.txt is noindex. Noindex is a setting that prevents search engines from indexing a page (storing it in their database), implemented through a meta tag in the HTML code or an HTTP header.
Thus, there are significant differences in purpose and setting methods between the two:
| Item | robots.txt | noindex |
| --- | --- | --- |
| Format | Text file | Meta element or HTTP header |
| Applicability | Can be set for the entire site | Set for individual pages |
| Purpose | Deny crawling | Deny indexing |
Note that noindex prevents indexing, so pages won’t appear in search results. Robots.txt, by contrast, only denies crawling, so a blocked page can still appear in search results (for example, if other sites link to it).
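To make the difference concrete, here is a minimal sketch (the “/private/” path is a placeholder): crawl denial is written in robots.txt, while index denial lives in the page itself.

```
# In robots.txt – denies crawling of a directory
User-agent: *
Disallow: /private/
```

By contrast, noindex is declared inside the page, as a meta tag such as <meta name="robots" content="noindex"> in the HTML head, or sent as an X-Robots-Tag HTTP header for non-HTML files.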
Purpose of robots.txt
The primary purpose of robots.txt is to deny crawling, but it also plays various roles, such as optimizing crawlability and submitting XML sitemaps. As an alternative to noindex, which can’t be used for non-HTML content, robots.txt is essential.
Here, we delve into the purposes of robots.txt:
Prevent Crawling of Specific Pages
The main goal of robots.txt is to deny crawling of certain content, applicable at various levels, like page or directory.
It’s useful for pages like unfinished ones, login-required pages, or member-exclusive content. Preventing unnecessary content from being crawled can avert potential negative SEO effects.
Deny Crawling of Image and Video Files
While images and videos are frequently used, a noindex meta tag cannot be added to them because they are not HTML. Robots.txt, however, lets you deny crawling of non-HTML files, so it is often used as an alternative where a noindex meta tag isn’t applicable.
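For example, a sketch like the following (the directory name and file extension are placeholders; Google supports the “*” and “$” wildcards in paths) keeps Google’s image crawler out of an image directory and blocks crawling of MP4 video files across the site:

```
# Keep Google Image Search's crawler out of an image directory
User-agent: Googlebot-Image
Disallow: /images/

# Block all crawlers from MP4 video files anywhere on the site
User-agent: *
Disallow: /*.mp4$
```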
Optimize Crawlability
Robots.txt can direct crawlers to important content, optimizing crawlability. For sites with many pages, like e-commerce sites, not all pages may be crawled. By using robots.txt to deny crawling of less critical pages, you ensure important pages are crawled, potentially increasing the site’s overall crawl frequency and volume.
Submit XML Sitemaps
Robots.txt can include XML sitemaps, informing search engines about the sitemap. While you can submit sitemaps through tools like Google Search Console or Bing Webmaster Tools, robots.txt is a convenient method when such tools aren’t available.
How to Write robots.txt
When setting up robots.txt, you fill in a small set of directives. There are four main items to describe; for specific sample code, please refer to “Google Search Central.”
Here, we explain how to write robots.txt:
User-Agent
User-Agent is used to specify the crawler you want to control.
Content to write includes:
- All crawlers: * (asterisk)
- Google’s crawler: Googlebot
- Google’s smartphone crawler: Googlebot (it uses the same token)
- AdSense’s crawler: Mediapartners-Google
- Google Image Search’s crawler: Googlebot-Image
The basic method is to enter ‘*’ to target all crawlers. If you want to control only Google’s crawler, specify ‘Googlebot’ instead.
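As a minimal sketch (the directory names are placeholders), rules grouped under “User-agent: *” apply to every crawler, while a group naming “Googlebot” applies only to Google’s crawler:

```
# Applies to all crawlers
User-agent: *
Disallow: /test/

# Applies only to Google's crawler
User-agent: Googlebot
Disallow: /drafts/
```

Note that a crawler follows the most specific group that matches it, so in this sketch Googlebot would obey only the second group.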
Disallow
Disallow is used to specify pages or directories to deny crawling. By entering a URL path, you can deny crawling of just that part of the site.
- Entire site: “Disallow: /”
- Specify a directory: “Disallow: /abc9999/”
- Specify a page: “Disallow: /abc9999.html”
- Replace “abc9999” with the actual URL path.
Remember the content for Disallow as it’s a frequently used item.
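Putting these patterns together in one file (keeping the placeholder path “abc9999”), a sketch might look like this:

```
User-agent: *
# Deny crawling of one directory
Disallow: /abc9999/
# Deny crawling of one page
Disallow: /abc9999.html
# Disallow: /   (uncommenting this line would deny crawling of the entire site)
```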
Allow
Allow is used to permit crawling, the opposite of Disallow. However, crawling is allowed by default even without an Allow line, so it is used less frequently.
Basically, you use Allow when you have entered Disallow but want to permit crawling for specific pages or directories.
For example:
- User-agent: *
- Disallow: /sample/
- Allow: /sample/abc9999.html
In the above case, crawling is denied for the entire ‘sample’ directory, but the ‘abc9999.html’ page inside it is still allowed to be crawled.
Sitemap
Sitemap, as the name suggests, is used to submit sitemaps.
Inputting a Sitemap is optional, but doing so tends to increase crawl speed. Therefore, it’s recommended to input it if you want to improve crawlability.
Content to write:
- Sitemap: http://abc9999.com/sitemap.xml
- Replace “abc9999.com” with your own domain and sitemap path. If there are multiple sitemaps, list each one on its own line.
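Bringing the four items together, a complete robots.txt might look like the following sketch (the domain, directory, and file names are placeholders):

```
User-agent: *
Disallow: /sample/
Allow: /sample/abc9999.html

Sitemap: https://abc9999.com/sitemap.xml
Sitemap: https://abc9999.com/sitemap-news.xml
```

Sitemap lines are independent of the User-agent groups, so they can be placed anywhere in the file.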
Setting Up robots.txt
To set up robots.txt, use one of the following methods:
- Using plugins
- Direct upload
For WordPress sites, using a plugin that makes the settings easy is recommended. Below, we explain each method.
Using Plugins
For WordPress sites, you can easily set up robots.txt using the “All in One SEO Pack” plugin.
The setting can be done as follows:
- Download and activate “All in One SEO Pack.”
- Open the “Robots.txt” settings screen from the plugin’s menu in the WordPress admin panel.
- Activate the robots.txt feature in the plugin’s feature settings.
After these settings, the following will be written at the bottom of “Create Robots.txt File”:
- User-agent: *
- Disallow: /wp/wp-admin/
- Allow: /wp/wp-admin/admin-ajax.php
- Sitemap: https://sample.com/sitemap.xml
Then, edit according to the previously mentioned “How to Write” section.
Direct Upload
A method that works for all sites is to upload the file directly to the site’s root (top-level) directory.
Specific conditions:
- File format: plain text encoded in UTF-8
- File size: Maximum 500KB
A robots.txt file placed at the root of a subdomain works, but note that a file placed in a subdirectory will not be detected.
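Concretely, crawlers only look for the file at the root of each host (“example.com” here is a placeholder domain):

```
# Read by crawlers – the file sits at the root of the host
https://example.com/robots.txt
https://blog.example.com/robots.txt

# Not read by crawlers – the file sits in a subdirectory
https://example.com/blog/robots.txt
```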
How to Verify robots.txt
While you can check the robots.txt file by eye, using a tool is recommended so that mistakes aren’t overlooked. “robots.txt Tester” is a free tool provided by Google that lets you check for errors simply by entering a URL.
Here’s how to use “robots.txt Tester” to verify your robots.txt:
Syntax Verification
Syntax verification ensures that the content of the robots.txt file is grammatically correct. You can verify the syntax as follows:
- Access “robots.txt Tester.”
- Enter the relevant URL path in the input field at the bottom of the screen and click “Test.”
- The test results will be displayed.
Before testing, make sure the correct site is selected and linked to the tool. If your robots.txt is not reflected, the file has not been placed correctly; re-upload it and then run the test again.
Syntax Correction
After checking the test results in “robots.txt Tester,” verify if there are any errors.
If errors are found, first make corrections within “robots.txt Tester.” Click on the error location and directly input the text to change the syntax. Keep modifying the content until there are no more errors.
However, note that making corrections in “robots.txt Tester” does not change the actual robots.txt file. After identifying the errors, you need to modify the actual file.
Repeat the testing process mentioned above, and if no errors occur, the verification is complete.
Points to Note When Setting Up robots.txt
While setting up robots.txt is relatively simple, as it involves inputting items according to specific criteria, there are some easily mistaken aspects to be cautious about:
- Do not use it with the intention of denying indexing.
- Do not use it as a strategy for handling duplicate content.
- It does not restrict user access.
- Keep robots.txt up to date.
Let’s explain each point:
Not for Index Denial
A common mistake is using robots.txt with the intention of denying indexing. Remember, robots.txt is for denying crawling, not indexing. For index denial, use noindex. Misuse can lead to a page appearing in search results with no description, because the crawler was blocked from reading it.
Not for Duplicate Content Strategy
Similarly, do not use robots.txt as a strategy for handling duplicate content. Pages blocked from crawling can still be indexed, and search engines may continue to treat them as duplicates. For duplicate content, use “noindex” or URL canonicalization (for example, a canonical tag) instead.
Does Not Restrict User Access
Using robots.txt for access restriction is a misunderstanding of its purpose: it does not prevent users from accessing a page. If the URL is reachable online, anyone can open it even when crawling is denied. For access restriction, separate measures such as password protection are required.
Updating robots.txt
When renewing your website or changing page URLs, it’s necessary to update robots.txt appropriately. This affects both the provision of accurate information to search engines and your website’s SEO.
For instance, if pages are still indexed under their old URLs, set up 301 redirects to the new URLs and update robots.txt accordingly (taking care not to block the old URLs, or crawlers will be unable to follow the redirects). This ensures search engines correctly crawl the new pages without negatively impacting your website’s SEO.
Also, modify the robots.txt settings when deleting or adding pages.
Frequently Asked Questions About robots.txt
Here are some common questions and answers about robots.txt:
Q: Where should robots.txt be placed?
A: Place robots.txt in the root directory of your website. The root directory refers to the top-level directory of a website. For more details, refer to the page on creating an SEO-friendly directory structure. Avoid placing robots.txt outside the root directory, as it may not be correctly crawled by search engines.
Q: What problems can arise from incorrect robots.txt settings?
A: Incorrect robots.txt settings can prevent search engines from crawling your website’s pages. For example, specifying the wrong directives can block crawlers from pages you actually want crawled, which can hurt your SEO rankings.
Q: Does robots.txt affect SEO?
A: Properly setting up robots.txt can positively impact SEO by specifying pages that don’t need to be crawled. Conversely, improper settings can lead to lower SEO rankings.
Q: Is robots.txt used to hide pages?
A: robots.txt is used to specify pages or directories to be excluded from crawling; it is not intended to hide pages. To keep a page out of search results, use a noindex meta tag, and to keep it away from users, use access restrictions such as password protection, not robots.txt.
Summary
This article has covered the basics of robots.txt, from its purpose and writing method to specific settings. While it’s easy to confuse robots.txt with noindex, their effects differ significantly. Misuse can lead to losing potential benefits, so it’s crucial to use robots.txt according to its intended purpose. Since setting up robots.txt is relatively straightforward, refer to this article for guidance and implement it in your practices.