How can you control Google’s crawling, and what are the key crawlers?
Contents
- 1 How to Request a Re-crawl from Google
- 2 Reducing Googlebot’s Crawl Frequency
- 3 Reference: List of Key Crawlers
- 4 Verifying a Crawler as Googlebot
- 5 Summary
Generally, Google crawls websites automatically, so web administrators do not need to control its crawling. However, there may be times when you want to request a re-crawl of updated pages, or reduce the crawl frequency to lessen the load on your infrastructure. In such cases, you need to take control of crawling yourself.
In this article, an SEO consultant explains how to request a crawl from Google and how to reduce crawl frequency. Apart from submitting a sitemap, however, these methods should not be routine practice. Reducing crawl frequency in particular has disadvantages, so verify that control is truly necessary and decide carefully before acting.
How to Request a Re-crawl from Google
If you’ve added or updated a page on your site, Google will crawl it automatically, but you can also request a re-crawl from Google. Even so, it may take days to weeks for the crawl to occur, and the change may not be reflected in search results immediately.
Absence from search results can also be because Google’s systems prioritize high-quality, useful content. The content may appear in search results over time, or after you send another request later.
Using the URL Inspection Tool
For a small number of URLs, you can request indexing with the URL Inspection Tool in Google Search Console. However, there’s a limit to the number of URLs you can submit, so it isn’t suitable for large batches. You may also have to wait one to two minutes per request, which isn’t efficient.
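If you want to check a page’s indexing status programmatically before deciding whether to request indexing in the UI, the Search Console URL Inspection API can be called directly. The sketch below is a minimal example under these assumptions: an OAuth 2.0 access token with the Search Console scope has already been obtained, and the property (the example.com URLs are placeholders) is verified in Search Console. Note that this API only reports index status; requesting indexing itself is still done from the URL Inspection Tool.

```python
import json
import urllib.request

# Placeholders: an OAuth 2.0 access token with the Search Console scope,
# and a property that is already verified in Search Console.
ACCESS_TOKEN = "ya29.your-access-token"
ENDPOINT = "https://searchconsole.googleapis.com/v1/urlInspection/index:inspect"

body = json.dumps({
    "inspectionUrl": "https://www.example.com/news/new-page",  # page to inspect
    "siteUrl": "https://www.example.com/",                     # Search Console property
}).encode("utf-8")

request = urllib.request.Request(
    ENDPOINT,
    data=body,
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Content-Type": "application/json",
    },
)

# The response describes the current index coverage of the inspected URL.
with urllib.request.urlopen(request) as response:
    result = json.load(response)
print(json.dumps(result, indent=2))
```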
When submitting a large number of URLs, it’s necessary to submit an XML sitemap.
Submitting an XML Sitemap
Submitting an XML sitemap through Google Search Console is effective when you have a large number of URLs to request. The sitemap is a crucial tool for helping Google understand your site’s URLs. It is particularly useful when Google has few other ways to discover your pages, such as when a site is new and has no links from other sites, or immediately after a site migration.
However, submitting a sitemap doesn’t guarantee crawling or appearance in search results. It’s important to include all URLs you want crawled in your sitemap submission.
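As a concrete illustration, the sketch below generates a minimal sitemap in the standard sitemaps.org format using Python’s standard library; the URLs and dates are placeholders for your own pages. Upload the resulting sitemap.xml to your site, then submit its URL in Google Search Console (or reference it from robots.txt with a Sitemap line).

```python
from xml.sax.saxutils import escape

# Placeholder URLs and last-modified dates; replace with your own pages.
pages = [
    ("https://www.example.com/", "2024-01-15"),
    ("https://www.example.com/news/new-page", "2024-01-20"),
]

lines = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
]
for loc, lastmod in pages:
    lines.append(f"  <url><loc>{escape(loc)}</loc><lastmod>{lastmod}</lastmod></url>")
lines.append("</urlset>")

# Write sitemap.xml so it can be served from the site and submitted in Search Console.
with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```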
Reducing Googlebot’s Crawl Frequency
Google uses sophisticated algorithms to optimize how often it crawls a site, crawling pages efficiently within a single visit while minimizing the load on servers. However, there may be situations where Google’s crawling places an undue load on your servers or incurs unnecessary costs during service downtime. In such cases, you can reduce Google’s crawl frequency.
However, reducing Google’s crawl frequency can impact the entire site. It may result in fewer detectable pages within the site or updates not being recognized. There’s also a risk that deleted pages may remain indexed for an extended period. Therefore, consider carefully whether to reduce the crawl frequency.
To reduce the crawl frequency of Googlebot, you can use the following methods.
- Reduce the crawl rate through Google Search Console settings
- Use status codes to reduce the crawl frequency
*Normally, there’s no need to reduce Google’s crawl frequency. From an SEO perspective, it’s preferable to increase crawl frequency to boost the number of pages appearing in search results. The following are exceptional methods for managing situations where you want to reduce crawl frequency. Note that these methods should not be used regularly.
Reducing Crawl Frequency Through Google Search Console Settings
You can reduce the crawl frequency by selecting “Limit Google’s maximum crawl rate” in the Google Search Console settings. However, this feature is discontinued as of January 8, 2024, so there is no need to adjust crawl frequency as part of routine management.
The discontinuation is explained in Google’s announcement of the end of support for the Search Console crawl rate limiter tool: automatic crawl-rate adjustment has improved, and crawl frequency is now reduced automatically in response to certain status codes.
In urgent cases where you need to reduce crawl frequency, use your website’s logs or the crawl statistics report to identify any Google crawlers that are crawling too frequently. Then, block agents like Googlebot or AdsBot using robots.txt.
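As a sketch of what such an emergency block might look like, the rules below disallow Googlebot and AdsBot-Google site-wide, and Python’s urllib.robotparser is used to confirm how the rules would be interpreted (the example.com URL is a placeholder). Keep in mind that AdsBot crawlers ignore the global `*` group in robots.txt and must be named explicitly, and that a robots.txt block stops crawling entirely rather than merely slowing it, so remove the rules as soon as the emergency is over.

```python
import urllib.robotparser

# Emergency-only rules: block Googlebot and AdsBot-Google from the whole site.
rules = """\
User-agent: Googlebot
Disallow: /

User-agent: AdsBot-Google
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Both crawlers are now disallowed from fetching any page on the (placeholder) site.
print(parser.can_fetch("Googlebot", "https://www.example.com/any-page"))      # False
print(parser.can_fetch("AdsBot-Google", "https://www.example.com/any-page"))  # False
```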
Using Status Codes to Reduce Crawl Frequency
If you need to reduce crawl frequency for a limited time of several hours or a few days, consider returning an HTTP status code such as 500, 503, or 429 instead of content. Googlebot automatically reduces its crawl frequency when it encounters these status codes.
However, returning 500, 503, or 429 can affect crawling of the entire website, and if the errors persist for a long time, pages may be removed from Google’s index. Crawl frequency recovers automatically as the number of errors decreases, but pages removed from the index will not appear in search results until they are re-crawled.
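As a minimal sketch of this technique, the standard-library server below answers every request with 503 Service Unavailable and a Retry-After hint, which is the kind of response Googlebot slows down for. In practice you would configure this in your real web server or CDN rather than run a separate Python process; the port and retry interval here are arbitrary placeholders.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class MaintenanceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 503 tells crawlers the outage is temporary; Retry-After hints when to come back.
        self.send_response(503)
        self.send_header("Retry-After", "3600")
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"Service temporarily unavailable\n")

    def do_HEAD(self):
        # Answer HEAD requests the same way, without a body.
        self.send_response(503)
        self.send_header("Retry-After", "3600")
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), MaintenanceHandler).serve_forever()
```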
Reference: List of Key Crawlers
User-agent tokens are used in the User-agent line of robots.txt to write crawl rules for a site. The complete user-agent string is a fuller description of the crawler and appears in HTTP requests and web logs.
Googlebot for Smartphones
- User-agent Token: Googlebot
- Complete User-agent String: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/WXYZ Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot for Computers
- User-agent Token: Googlebot
- Complete User-agent String: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/WXYZ Safari/537.36
Googlebot for Images
- User-agent Tokens: Googlebot-Image, Googlebot
- Complete User-agent String: Googlebot-Image/1.0
Googlebot for News
- User-agent Tokens: Googlebot-News, Googlebot
Googlebot for Videos
- User-agent Tokens: Googlebot-Video, Googlebot
- Complete User-agent String: Googlebot-Video/1.0
Google StoreBot
- User-agent Token: Storebot-Google
- Complete User-agent String:
- Desktop Agent: Mozilla/5.0 (X11; Linux x86_64; Storebot-Google/1.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36
- Mobile Agent: Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012; Storebot-Google/1.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Mobile Safari/537.36
Google-InspectionTool
- User-agent Tokens: Google-InspectionTool, Googlebot
- Complete User-agent String:
- Mobile: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/WXYZ Mobile Safari/537.36 (compatible; Google-InspectionTool/1.0;)
- Desktop: Mozilla/5.0 (compatible; Google-InspectionTool/1.0;)
GoogleOther
- User-agent Token: GoogleOther
- Complete User-agent String: GoogleOther
Google-Extended
- User-agent Token: Google-Extended
APIs-Google
- User-agent Token: APIs-Google
- Complete User-agent String: APIs-Google (+https://developers.google.com/webmasters/APIs-Google.html)
Verifying a Crawler as Googlebot
There are two ways to confirm whether an access that claims to be Googlebot is genuine or a spam source masquerading as Googlebot. Normally you can verify with command-line tools, but for verification at scale you may need an automated solution that matches the crawler’s IP address against Google’s publicly available lists of Googlebot IP addresses.
Using Command-Line Tools
- Run the host command on the web server against the accessing IP address (a reverse DNS lookup) and verify that the resulting domain name belongs to googlebot.com, google.com, or googleusercontent.com.
- Run the host command on the domain name obtained in the first step (a forward DNS lookup) and check that it resolves back to the IP address recorded in the logs.
However, the host command is not available in the standard Windows command prompt. Since command-line tools are aimed at engineers and server administrators, it is safer to have a server administrator rather than a web manager perform this verification.
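The same two-step check can also be scripted. The sketch below uses Python’s socket module to do the reverse lookup and then the confirming forward lookup; it covers IPv4 only, and the sample address is just an illustration taken from Googlebot’s typical range.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")

def is_genuine_googlebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-confirm it (IPv4 only)."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # e.g. crawl-66-249-66-1.googlebot.com
    except OSError:
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # IPs the host name resolves to
    except OSError:
        return False
    return ip in forward_ips

# Sample address for illustration; check the addresses in your own logs instead.
print(is_genuine_googlebot("66.249.66.1"))
```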
Using an Automated Solution
Match the crawler’s IP address against the published lists of IP address ranges for Google’s crawlers and fetchers to identify Googlebot by IP address. Lists are available for:
- Googlebot
- Specialized crawlers like AdsBot
- User-triggered fetchers
Google may also access sites from IP addresses not on these lists; in that case, match the accessing IP address against Google’s general list of IP addresses.
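A sketch of such an automated check, using the Googlebot list, is shown below. It downloads the range list that Google publishes as JSON (the URL and JSON field names reflect that published file and may change over time), then tests an address with Python’s ipaddress module; the sample address is only an illustration.

```python
import ipaddress
import json
import urllib.request

# Googlebot ranges published by Google; similar files exist for special-case
# crawlers and user-triggered fetchers. URL and field names may change over time.
GOOGLEBOT_RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

def load_googlebot_networks():
    with urllib.request.urlopen(GOOGLEBOT_RANGES_URL) as response:
        data = json.load(response)
    networks = []
    for prefix in data.get("prefixes", []):
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def is_googlebot_ip(ip: str, networks) -> bool:
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)

networks = load_googlebot_networks()
print(is_googlebot_ip("66.249.66.1", networks))  # sample address for illustration
```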
Summary
While requesting crawls through the URL Inspection Tool and XML sitemaps was highlighted here, these features are not meant for routine use. They are intended for situations such as launching a new site or completing a site migration, when external links are not yet established and crawling needs a nudge. That said, problems such as content updates not appearing in search results for an extended period can occur, and in those cases a request through the URL Inspection Tool is worthwhile. Google does not guarantee indexing after crawling; it assesses the quality and usefulness of content before indexing it. Weigh the disadvantages of controlling crawling, treat content as the primary focus, and make requests only after any content issues have been resolved. Also remember that reducing crawl frequency can shrink the number of indexed pages on your site. Emergency use of HTTP status codes to slow crawling is possible, but for long-term reductions in crawl frequency, other methods should be considered.