Abstract
This document details how Google handles the robots.txt file that allows you to control how Google's website crawlers crawl and index publicly accessible websites.
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Basic definitions
- crawler: A crawler is a service or agent that crawls websites. Generally speaking, a crawler automatically and recursively accesses known URLs of a host that exposes content which can be accessed with standard web-browsers. As new URLs are found (through various means, such as from links on existing, crawled pages or from Sitemap files), these are also crawled in the same way.
- user-agent: a means of identifying a specific crawler or set of crawlers.
- directives: the list of applicable guidelines for a crawler or group of crawlers set forth in the robots.txt file.
- URL: Uniform Resource Locators as defined in RFC 1738.
- Google-specific: These elements are specific to Google's implementation of robots.txt and may not be relevant for other parties.
Applicability
The guidelines set forth in this document are followed by all automated crawlers at Google. When an agent accesses URLs on behalf of a user (for example, for translation, manually subscribed feeds, or malware analysis), these guidelines do not need to apply.
File location & range of validity
The robots.txt file must be in the top-level directory of the host, accessible through the appropriate protocol and port number. Generally accepted protocols for robots.txt (and crawling of websites) are "http" and "https". On http and https, the robots.txt file is fetched using an HTTP non-conditional GET request.
Google-specific: Google also accepts and follows robots.txt files for FTP sites. FTP-based robots.txt files are accessed via the FTP protocol, using an anonymous login.
The directives listed in the robots.txt file apply only to the host, protocol and port number where the file is hosted.
Note: the URL for the robots.txt file is, like other URLs, case-sensitive.
Examples of valid robots.txt URLs:
| Robots.txt URL | Valid for | Not valid for | Comments |
|---|---|---|---|
| http://example.com/robots.txt | http://example.com/, http://example.com/folder/file | http://other.example.com/, https://example.com/, http://example.com:8181/ | This is the general case. It is not valid for other subdomains, protocols or port numbers. It is valid for all files in all subdirectories on the same host, protocol and port number. |
| http://www.example.com/robots.txt | http://www.example.com/ | http://example.com/, http://shop.www.example.com/, http://www.shop.example.com/ | A robots.txt on a subdomain is only valid for that subdomain. |
| http://example.com/folder/robots.txt | not a valid robots.txt file! | | Crawlers will not check for robots.txt files in subdirectories. |
| http://www.müller.eu/robots.txt | http://www.müller.eu/, http://www.xn--mller-kva.eu/ | http://www.muller.eu/ | IDNs are equivalent to their punycode versions. See also RFC 3492. |
| ftp://example.com/robots.txt | ftp://example.com/ | http://example.com/ | Google-specific: We use the robots.txt for FTP resources. |
| http://212.96.82.21/robots.txt | http://212.96.82.21/ | http://example.com/ (even if hosted on 212.96.82.21) | A robots.txt with an IP-address as host name will only be valid for crawling of that IP-address as host name. It will not automatically be valid for all websites hosted on that IP-address (though it is possible that the robots.txt file is shared, in which case it would also be available under the shared host name). |
| http://example.com:80/robots.txt | http://example.com:80/, http://example.com/ | http://example.com:81/ | Standard port numbers (80 for http, 443 for https, 21 for ftp) are equivalent to their default host names. See also [portnumbers]. |
| http://example.com:8181/robots.txt | http://example.com:8181/ | http://example.com/ | Robots.txt files on non-standard port numbers are only valid for content made available through those port numbers. |
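To make the scope rules above concrete, here is a small Python sketch (not part of the specification; the helper name is an assumption, and the default-port handling simply mirrors the equivalence shown in the table) that derives the governing robots.txt URL for a given page URL:

```python
from urllib.parse import urlsplit

# Default ports, so that http://example.com:80/ and http://example.com/
# resolve to the same robots.txt URL, as in the table above.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs crawling of page_url."""
    parts = urlsplit(page_url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""
    port = parts.port
    # Omit the port when it is the default for the scheme.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        authority = host
    else:
        authority = f"{host}:{port}"
    return f"{scheme}://{authority}/robots.txt"

print(robots_txt_url("http://example.com/folder/file"))   # http://example.com/robots.txt
print(robots_txt_url("http://example.com:8181/page"))     # http://example.com:8181/robots.txt
print(robots_txt_url("ftp://example.com/pub/file.txt"))   # ftp://example.com/robots.txt
```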
Handling HTTP result codes
There are generally three different outcomes when robots.txt files are fetched:
- full allow: All content may be crawled.
- full disallow: No content may be crawled.
- conditional allow: The directives in the robots.txt determine the ability to crawl certain content.
- 2xx (successful): HTTP result codes that signal success result in a "conditional allow" of crawling.
- 3xx (redirection): Redirects will generally be followed until a valid result can be found (or a loop is recognized). We will follow a limited number of redirect hops (RFC 1945 for HTTP/1.0 allows up to 5 hops) and then stop and treat it as a 404. Handling of robots.txt redirects to disallowed URLs is undefined and discouraged. Handling of logical redirects for the robots.txt file based on HTML content that returns 2xx (frames, JavaScript, or meta refresh-type redirects) is undefined and discouraged.
- 4xx (client errors): Google treats all 4xx errors in the same way and assumes that no valid robots.txt file exists. It is assumed that there are no restrictions. This is a "full allow" for crawling. Note: this includes 401 "Unauthorized" and 403 "Forbidden" HTTP result codes.
- 5xx (server errors): Server errors are seen as temporary errors that result in a "full disallow" of crawling. The request is retried until a non-server-error HTTP result code is obtained. A 503 (Service Unavailable) error will result in fairly frequent retrying. To temporarily suspend crawling, it is recommended to serve a 503 HTTP result code. Handling of a permanent server error is undefined. Google-specific: If we are able to determine that a site is incorrectly configured to return 5xx instead of a 404 for missing pages, we will treat a 5xx error from that site as a 404. (A sketch mapping these status codes to crawl outcomes follows this list.)
- Unsuccessful requests or incomplete data: Handling of a robots.txt file which cannot be fetched due to DNS or networking issues such as timeouts, invalid responses, reset or hung-up connections, HTTP chunking errors, etc. is undefined.
- Caching: A robots.txt request is generally cached for up to one day, but may be cached longer in situations where refreshing the cached version is not possible (for example, due to timeouts or 5xx errors). The cached response may be shared by different crawlers. Google may increase or decrease the cache lifetime based on max-age Cache-Control HTTP headers.
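The status-code handling above can be summarized in a short Python sketch. It is a simplification (redirect following, retry scheduling, caching and the Google-specific 5xx-as-404 heuristic are not modelled), and the names used are assumptions rather than anything defined by the specification:

```python
import enum

class CrawlOutcome(enum.Enum):
    FULL_ALLOW = "full allow"                 # all content may be crawled
    FULL_DISALLOW = "full disallow"           # no content may be crawled
    CONDITIONAL_ALLOW = "conditional allow"   # obey the parsed directives

def outcome_for_status(status: int) -> CrawlOutcome:
    """Map the HTTP status code of a robots.txt fetch to a crawl outcome."""
    if 200 <= status < 300:
        # Parse the body; its directives decide what may be crawled.
        return CrawlOutcome.CONDITIONAL_ALLOW
    if 400 <= status < 500:
        # Includes 401 and 403: treated as if no robots.txt file exists.
        return CrawlOutcome.FULL_ALLOW
    if 500 <= status < 600:
        # Temporary error: do not crawl, retry the fetch later.
        return CrawlOutcome.FULL_DISALLOW
    # 3xx is handled by following the redirect before calling this function;
    # anything else is left undefined by the specification.
    raise ValueError(f"unhandled status code: {status}")
```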
File format
The expected file format is plain text encoded in UTF-8. The file consists of records (lines) separated by CR, CR/LF or LF.
Only valid records will be considered; all other content will be ignored. For example, if the resulting document is an HTML page, only valid text lines will be taken into account and the rest will be discarded without warning or error.
If a character encoding is used that results in characters being used which are not a subset of UTF-8, this may result in the contents of the file being parsed incorrectly.
An optional Unicode BOM (byte order mark) at the beginning of the robots.txt file is ignored.
Each record consists of a field, a colon, and a value. Spaces are optional (but recommended to improve readability). Comments can be included at any location in the file using the "#" character; all content after the start of a comment until the end of the record is treated as a comment and ignored. The general format is "<field>:<value><#optional-comment>". Whitespace at the beginning and at the end of the record is ignored.
The <field> element is case-insensitive. The <value> element may be case-sensitive, depending on the <field> element.
Handling of <field> elements with simple errors or typos (e.g. "useragent" instead of "user-agent") is undefined and may be interpreted as correct directives by some user-agents.
A maximum file size may be enforced per crawler. Content beyond the maximum file size may be ignored. Google currently enforces a size limit of 500 kilobytes (KB).
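As a rough illustration of these parsing rules (a sketch, not Google's actual parser; the function and constant names are assumptions), the following Python splits a robots.txt body into field/value records:

```python
import codecs
from typing import Iterator, Tuple

MAX_SIZE = 500 * 1024  # example size limit, in bytes

def parse_records(raw: bytes) -> Iterator[Tuple[str, str]]:
    """Yield (field, value) pairs from a robots.txt body.

    Simplified sketch: invalid lines are silently skipped, mirroring the
    rule that only valid records are considered.
    """
    raw = raw[:MAX_SIZE]
    # An optional UTF-8 BOM at the start of the file is ignored.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    text = raw.decode("utf-8", errors="replace")
    # Records are separated by CR, CR/LF or LF.
    for line in text.splitlines():
        # Everything after "#" is a comment and is ignored.
        line = line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue  # not a valid record
        field, value = line.split(":", 1)
        # The <field> element is case-insensitive; the value may not be.
        yield field.strip().lower(), value.strip()
```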
Formal syntax / definition
This is a Backus-Naur Form (BNF)-like description, using the conventions of RFC 822, except that "|" is used to designate alternatives. Literals are quoted with "", parentheses "(" and ")" are used to group elements, optional elements are enclosed in [brackets], and elements may be preceded with <n>* to designate n or more repetitions of the following element; n defaults to 0.
```
robotstxt = *entries
entries = *( ( <1>*startgroupline
               *(groupmemberline | nongroupline | comment)
             | nongroupline
             | comment) )
startgroupline = [LWS] "user-agent" [LWS] ":" [LWS] agentvalue [comment] EOL
groupmemberline = [LWS] (
                  pathmemberfield [LWS] ":" [LWS] pathvalue
                  | othermemberfield [LWS] ":" [LWS] textvalue) [comment] EOL
nongroupline = [LWS] (
               urlnongroupfield [LWS] ":" [LWS] urlvalue
               | othernongroupfield [LWS] ":" [LWS] textvalue) [comment] EOL
comment = [LWS] "#" *anychar
agentvalue = textvalue
pathmemberfield = "disallow" | "allow"
othermemberfield = ()
urlnongroupfield = "sitemap"
othernongroupfield = ()
pathvalue = "/" path
urlvalue = absoluteURI
textvalue = *(valuechar | SP)
valuechar = <any UTF-8 character except ("#" CTL)>
anychar = <any UTF-8 character except CTL>
EOL = CR | LF | (CR LF)
```
The syntax for "absoluteURI", "CTL", "CR", "LF", "LWS" are defined in RFC 1945. The syntax for "path" is defined in RFC 1808.
Grouping of records
Records are categorized into different types based on the type of <field> element:
- start-of-group
- group-member
- non-group
All group-member records after a start-of-group record up to the next start-of-group record are treated as a group of records. The only start-of-group field element is user-agent. Multiple start-of-group lines directly after each other will follow the group-member records following the final start-of-group line. Any group-member records without a preceding start-of-group record are ignored. All non-group records are valid independently of all groups.
Valid <field> elements, which will be individually detailed further on in this document, are:
- user-agent (start of group)
- disallow (only valid as a group-member record)
- allow (only valid as a group-member record)
- sitemap (non-group record)
All other <field> elements may be ignored.
The start-of-group element user-agent is used to specify
for which crawler the group is valid. Only one group of records is valid
for a particular crawler. We will cover order of precedence later in this
document.
Example groups:
```
user-agent: a
disallow: /c

user-agent: b
disallow: /d

user-agent: e
user-agent: f
disallow: /g
```
There are three distinct groups specified, one for "a" and one for "b" as well as one for both "e" and "f". Each group has its own group-member record. Note the optional use of white-space (an empty line) to improve readability.
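A minimal sketch of this grouping step, building on the hypothetical parse_records helper above (the data layout and the handling of non-group records are assumptions):

```python
from typing import Dict, List, Tuple

START_OF_GROUP = "user-agent"
GROUP_MEMBER_FIELDS = {"allow", "disallow"}

def build_groups(records):
    """Group (field, value) records into per-user-agent rule lists.

    Consecutive user-agent lines share the group-member records that
    follow the last of them; group-member records with no preceding
    user-agent line are ignored; non-group records (such as sitemap)
    are collected separately.
    """
    groups: Dict[str, List[Tuple[str, str]]] = {}
    non_group: List[Tuple[str, str]] = []
    current_agents: List[str] = []
    expecting_agents = True  # still collecting consecutive user-agent lines?

    for field, value in records:
        if field == START_OF_GROUP:
            if not expecting_agents:
                current_agents = []      # a new group starts here
            current_agents.append(value.lower())
            expecting_agents = True
        elif field in GROUP_MEMBER_FIELDS:
            expecting_agents = False
            for agent in current_agents:  # empty list -> record is ignored
                groups.setdefault(agent, []).append((field, value))
        else:
            non_group.append((field, value))
    return groups, non_group
```

Applied to the example above, this yields one rule list for "a", one for "b", and a shared rule list for both "e" and "f".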
Order of precedence for user-agents
Only one group of group-member records is valid for a particular crawler.
The crawler must determine the correct group of records by finding the
group with the most specific user-agent that still matches. All other
groups of records are ignored by the crawler. The user-agent is not case-sensitive. All non-matching text is ignored (for example, both
googlebot/1.2 and googlebot* are
equivalent to googlebot). The
order of the groups within the robots.txt file is irrelevant.
Example:
Assuming the following robots.txt file:
```
user-agent: googlebot-news
(group 1)

user-agent: *
(group 2)

user-agent: googlebot
(group 3)
```
This is how the crawlers would choose the relevant group:
| Name of crawler | Record group followed | Comments |
|---|---|---|
| Googlebot News | (group 1) | Only the most specific group is followed, all others are ignored. |
| Googlebot (web) | (group 3) | |
| Googlebot Images | (group 3) | There is no specific googlebot-images group, so the more generic group is followed. |
| Googlebot News (when crawling images) | (group 1) | These images are crawled for and by Googlebot News, therefore only the Googlebot News group is followed. |
| Otherbot (web) | (group 2) | |
| Otherbot (News) | (group 2) | Even if there is an entry for a related crawler, it is only valid if it is specifically matching. |
Also see Google's crawlers and user-agent strings.
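A minimal sketch of this selection step in Python, assuming the group mapping produced by the grouping sketch above (the simplified product-token matching is an assumption, not Google's implementation):

```python
def select_group(crawler_name: str, groups):
    """Pick the single most specific group of rules for a crawler.

    Matching is not case-sensitive and only the leading product token is
    used, so "googlebot/1.2" and "googlebot*" both reduce to "googlebot".
    """
    name = crawler_name.lower()
    best_len = -1
    best_rules = None
    for agent, rules in groups.items():
        # Keep only the leading token; version suffixes and "*" are ignored.
        token = agent.split("/")[0].rstrip("*").strip()
        if token and name.startswith(token) and len(token) > best_len:
            best_len, best_rules = len(token), rules
    if best_rules is not None:
        return best_rules
    # No specific match: fall back to the wildcard group, else no rules at all.
    return groups.get("*", [])

# "googlebot-news" picks the googlebot-news group (most specific match);
# "googlebot-images" has no group of its own and falls back to "googlebot";
# any other crawler gets the "*" group, matching the table above.
```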
Group-member records
Only general and Google-specific group-member record types are covered in this section. These record types are also called "directives" for the crawlers. These directives are specified in the form of "directive: [path]" where [path] is optional. By default, there are no restrictions for crawling for the designated crawlers. Directives without a [path] are ignored.
The [path] value, if specified, is to be interpreted relative to the root of the website for which the robots.txt file was fetched (using the same protocol, port number, host and domain names). The path value must start with "/" to designate the root. If a path without a leading slash is found, the slash may be assumed to be there. The path is case-sensitive. More information can be found in the section "URL matching based on path values" below.
disallow
The disallow directive specifies paths that must not be
accessed by the designated crawlers. When no path is specified, the
directive is ignored.
Usage:
disallow: [path]
allow
The allow directive specifies paths that may be accessed by the
designated crawlers. When no path is specified, the directive is
ignored.
Usage:
allow: [path]
URL matching based on path values
The path value is used as a basis to determine whether or not a rule applies to a specific URL on a site. With the exception of wildcards, the path is used to match the beginning of a URL (and any valid URLs that start with the same path). Non-7-bit ASCII characters in a path may be included as UTF-8 characters or as percent-escaped UTF-8 encoded characters per RFC 3986.
Note: "AJAX-Crawling" URLs must be specified in their crawled versions.
Google, Bing, Yahoo, and Ask support a limited form of "wildcards" for path values. These are:
- * designates 0 or more instances of any valid character
- $ designates the end of the URL
Example path matches
| [path] | Matches | Does not match | Comments |
|---|---|---|---|
| / | any valid URL | | Matches the root and any lower level URL. |
| /* | equivalent to / | equivalent to / | Equivalent to "/" -- the trailing wildcard is ignored. |
| /fish | /fish, /fish.html, /fish/salmon.html, /fishheads, /fishheads/yummy.html, /fish.php?id=anything | /Fish.asp, /catfish, /?id=fish | Note the case-sensitive matching. |
| /fish* | /fish, /fish.html, /fish/salmon.html, /fishheads, /fishheads/yummy.html, /fish.php?id=anything | /Fish.asp, /catfish, /?id=fish | Equivalent to "/fish" -- the trailing wildcard is ignored. |
| /fish/ | /fish/, /fish/?id=anything, /fish/salmon.htm | /fish, /fish.html, /Fish/Salmon.asp | The trailing slash means this matches anything in this folder. |
| fish/ | equivalent to /fish/ | equivalent to /fish/ | Equivalent to "/fish/". |
| /*.php | /filename.php, /folder/filename.php, /folder/filename.php?parameters, /folder/any.php.file.html, /filename.php/ | / (even if it maps to /index.php), /windows.PHP | |
| /*.php$ | /filename.php, /folder/filename.php | /filename.php?parameters, /filename.php/, /filename.php5, /windows.PHP | |
| /fish*.php | /fish.php, /fishheads/catfish.php?parameters | /Fish.PHP | |
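A minimal Python sketch of this wildcard matching, translating a [path] value into a regular expression (the helper name and the URL normalization are assumptions; it is not Google's implementation). The assertions reproduce a few rows from the table above:

```python
import re
from urllib.parse import urlsplit

def path_matches(pattern: str, url: str) -> bool:
    """Return True if the rule's [path] value matches the URL's path.

    "*" matches zero or more characters and "$" anchors the end of the
    URL; everything else is matched literally and case-sensitively.
    """
    if not pattern.startswith("/"):
        pattern = "/" + pattern          # a missing leading slash is assumed
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"                  # end-of-URL anchor
        else:
            regex += re.escape(ch)
    return re.match(regex, target) is not None

assert path_matches("/fish", "http://example.com/fishheads/yummy.html")
assert not path_matches("/fish", "http://example.com/Fish.asp")
assert path_matches("/*.php$", "http://example.com/folder/filename.php")
assert not path_matches("/*.php$", "http://example.com/filename.php?parameters")
```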
Google-supported non-group-member records
sitemap
Supported by Google, Ask, Bing, Yahoo; defined on sitemaps.org.
Usage:
sitemap: [absoluteURL]
[absoluteURL] points to a Sitemap, Sitemap Index file or equivalent URL.
The URL does not have to be on the same host as the robots.txt file.
Multiple sitemap entries may exist. As non-group-member
records, these are not tied to any specific user-agents and may be followed
by all crawlers, provided it is not disallowed.
Order of precedence for group-member records
At a group-member level, in particular for allow and
disallow directives, the most specific rule based on the
length of the [path] entry will trump the less specific (shorter) rule.
The order of precedence for rules with wildcards is undefined.
Sample situations:
| URL | allow: | disallow: | Verdict | Comments |
|---|---|---|---|---|
| http://example.com/page | /p | / | allow | |
| http://example.com/folder/page | /folder/ | /folder | allow | |
| http://example.com/page.htm | /page | /*.htm | undefined | |
| http://example.com/ | /$ | / | allow | |
| http://example.com/page.htm | /$ | / | disallow | |
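A minimal sketch of this precedence rule, reusing the hypothetical path_matches helper from the previous sketch. The allow tie-break and the use of path length for wildcard rules are assumptions, since the specification leaves the wildcard case undefined:

```python
def is_allowed(url: str, rules) -> bool:
    """Decide whether a URL may be crawled under a group's allow/disallow rules.

    The most specific (longest [path]) matching rule wins; on a tie this
    sketch prefers allow. Precedence among wildcard rules is undefined in
    the specification, so the length heuristic is only a guess there.
    """
    best_len = -1
    allowed = True  # no matching rule means no restriction
    for field, path in rules:
        if not path:
            continue  # directives without a [path] are ignored
        if path_matches(path, url):
            if len(path) > best_len or (len(path) == best_len and field == "allow"):
                best_len = len(path)
                allowed = (field == "allow")
    return allowed

rules = [("allow", "/p"), ("disallow", "/")]
print(is_allowed("http://example.com/page", rules))   # True, matching the first row above
```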