Abstract
This document details how Google handles the robots.txt file that allows you to control how Google's website crawlers crawl and index publicly accessible websites.
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Basic definitions
- crawler: A crawler is a service or agent that crawls websites. Generally speaking, a crawler automatically and recursively accesses known URLs of a host that exposes content which can be accessed with standard web-browsers. As new URLs are found (through various means, such as from links on existing, crawled pages or from Sitemap files), these are also crawled in the same way.
- user-agent: a means of identifying a specific crawler or set of crawlers.
- directives: the list of applicable guidelines for a crawler or group of crawlers set forth in the robots.txt file.
- URL: Uniform Resource Locators as defined in RFC 1738.
- Google-specific: These elements are specific to Google's implementation of robots.txt and may not be relevant for other parties.
Applicability
The guidelines set forth in this document are followed by all automated crawlers at Google. When an agent accesses URLs on behalf of a user (for example, for translation, manually subscribed feeds, or malware analysis), these guidelines do not need to apply.
File location & range of validity
The robots.txt file must be in the top-level directory of the host, accessible through the appropriate protocol and port number. Generally accepted protocols for robots.txt (and crawling of websites) are "http" and "https". On http and https, the robots.txt file is fetched using an HTTP non-conditional GET request.
Google-specific: Google also accepts and follows robots.txt files for FTP sites. FTP-based robots.txt files are accessed via the FTP protocol, using an anonymous login.
The directives listed in the robots.txt file apply only to the host, protocol and port number where the file is hosted.
Note: the URL for the robots.txt file is, like other URLs, case-sensitive.
Examples of valid robots.txt URLs:
| Robots.txt URL | Valid for | Not valid for | Comments |
|---|---|---|---|
| http://example.com/robots.txt | http://example.com/, http://example.com/folder/file | http://other.example.com/, https://example.com/, http://example.com:8181/ | This is the general case. It is not valid for other subdomains, protocols or port numbers. It is valid for all files in all subdirectories on the same host, protocol and port number. |
| http://www.example.com/robots.txt | http://www.example.com/ | http://example.com/, http://shop.www.example.com/, http://www.shop.example.com/ | A robots.txt on a subdomain is only valid for that subdomain. |
| http://example.com/folder/robots.txt | not a valid robots.txt file! | | Crawlers will not check for robots.txt files in subdirectories. |
| http://www.müller.eu/robots.txt | http://www.müller.eu/, http://www.xn--mller-kva.eu/ | http://www.muller.eu/ | IDNs are equivalent to their punycode versions. See also RFC 3492. |
| ftp://example.com/robots.txt | ftp://example.com/ | http://example.com/ | Google-specific: We use the robots.txt for FTP resources. |
| http://212.96.82.21/robots.txt | http://212.96.82.21/ | http://example.com/ (even if hosted on 212.96.82.21) | A robots.txt with an IP-address as host name will only be valid for crawling of that IP-address as host name. It will not automatically be valid for all websites hosted on that IP-address (though it is possible that the robots.txt file is shared, in which case it would also be available under the shared host name). |
| http://example.com:80/robots.txt | http://example.com:80/, http://example.com/ | http://example.com:81/ | Standard port numbers (80 for http, 443 for https, 21 for ftp) are equivalent to their default host names. See also [portnumbers]. |
| http://example.com:8181/robots.txt | http://example.com:8181/ | http://example.com/ | Robots.txt files on non-standard port numbers are only valid for content made available through those port numbers. |
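To make the scope rules above concrete, here is a small Python sketch (not part of the specification; the helper name is an assumption, and the default-port handling simply mirrors the equivalence shown in the table) that derives the governing robots.txt URL for a given page URL:

```python
from urllib.parse import urlsplit

# Default ports, so that http://example.com:80/ and http://example.com/
# resolve to the same robots.txt URL, as in the table above.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs crawling of page_url."""
    parts = urlsplit(page_url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""
    port = parts.port
    # Omit the port when it is the default for the scheme.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        authority = host
    else:
        authority = f"{host}:{port}"
    return f"{scheme}://{authority}/robots.txt"

print(robots_txt_url("http://example.com/folder/file"))   # http://example.com/robots.txt
print(robots_txt_url("http://example.com:8181/page"))     # http://example.com:8181/robots.txt
print(robots_txt_url("ftp://example.com/pub/file.txt"))   # ftp://example.com/robots.txt
```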
Handling HTTP result codes
There are generally three different outcomes when robots.txt files are fetched:
- full allow: All content may be crawled.
- full disallow: No content may be crawled.
- conditional allow: The directives in the robots.txt determine the ability to crawl certain content.
- 2xx (successful): HTTP result codes that signal success result in a "conditional allow" of crawling.
- 3xx (redirection): Redirects will generally be followed until a valid result can be found (or a loop is recognized). We will follow a limited number of redirect hops (RFC 1945 for HTTP/1.0 allows up to 5 hops) and then stop and treat it as a 404. Handling of robots.txt redirects to disallowed URLs is undefined and discouraged. Handling of logical redirects for the robots.txt file based on HTML content that returns 2xx (frames, JavaScript, or meta refresh-type redirects) is undefined and discouraged.
- 4xx (client errors): Google treats all 4xx errors in the same way and assumes that no valid robots.txt file exists. It is assumed that there are no restrictions. This is a "full allow" for crawling. Note: this includes 401 "Unauthorized" and 403 "Forbidden" HTTP result codes.
- 5xx (server errors): Server errors are seen as temporary errors that result in a "full disallow" of crawling. The request is retried until a non-server-error HTTP result code is obtained. A 503 (Service Unavailable) error will result in fairly frequent retrying. To temporarily suspend crawling, it is recommended to serve a 503 HTTP result code. Handling of a permanent server error is undefined. Google-specific: If we are able to determine that a site is incorrectly configured to return 5xx instead of a 404 for missing pages, we will treat a 5xx error from that site as a 404. (A sketch mapping these status codes to crawl outcomes follows this list.)
- Unsuccessful requests or incomplete data: Handling of a robots.txt file which cannot be fetched due to DNS or networking issues such as timeouts, invalid responses, reset or hung-up connections, HTTP chunking errors, etc. is undefined.
- Caching: A robots.txt request is generally cached for up to one day, but may be cached longer in situations where refreshing the cached version is not possible (for example, due to timeouts or 5xx errors). The cached response may be shared by different crawlers. Google may increase or decrease the cache lifetime based on max-age Cache-Control HTTP headers.
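The status-code handling above can be summarized in a short Python sketch. It is a simplification (redirect following, retry scheduling, caching and the Google-specific 5xx-as-404 heuristic are not modelled), and the names used are assumptions rather than anything defined by the specification:

```python
import enum

class CrawlOutcome(enum.Enum):
    FULL_ALLOW = "full allow"                 # all content may be crawled
    FULL_DISALLOW = "full disallow"           # no content may be crawled
    CONDITIONAL_ALLOW = "conditional allow"   # obey the parsed directives

def outcome_for_status(status: int) -> CrawlOutcome:
    """Map the HTTP status code of a robots.txt fetch to a crawl outcome."""
    if 200 <= status < 300:
        # Parse the body; its directives decide what may be crawled.
        return CrawlOutcome.CONDITIONAL_ALLOW
    if 400 <= status < 500:
        # Includes 401 and 403: treated as if no robots.txt file exists.
        return CrawlOutcome.FULL_ALLOW
    if 500 <= status < 600:
        # Temporary error: do not crawl, retry the fetch later.
        return CrawlOutcome.FULL_DISALLOW
    # 3xx is handled by following the redirect before calling this function;
    # anything else is left undefined by the specification.
    raise ValueError(f"unhandled status code: {status}")
```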
File format
The expected file format is plain text encoded in UTF-8. The file consists of records (lines) separated by CR, CR/LF or LF.
Only valid records will be considered; all other content will be ignored. For example, if the resulting document is an HTML page, only valid text lines will be taken into account and the rest will be discarded without warning or error.
If a character encoding is used that results in characters being used which are not a subset of UTF-8, this may result in the contents of the file being parsed incorrectly.
An optional Unicode BOM (byte order mark) at the beginning of the robots.txt file is ignored.
Each record consists of a field, a colon, and a value. Spaces are optional (but recommended to improve readability). Comments can be included at any location in the file using the "#" character; all content after the start of a comment until the end of the record is treated as a comment and ignored. The general format is "<field>:<value><#optional-comment>". Whitespace at the beginning and at the end of the record is ignored.
The <field> element is case-insensitive. The <value> element may be case-sensitive, depending on the <field> element.
Handling of <field> elements with simple errors or typos (e.g. "useragent" instead of "user-agent") is undefined and may be interpreted as correct directives by some user-agents.
A maximum file size may be enforced per crawler. Content beyond the maximum file size may be ignored. Google currently enforces a size limit of 500 kilobytes (KB).
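As a rough illustration of these parsing rules (a sketch, not Google's actual parser; the function and constant names are assumptions), the following Python splits a robots.txt body into field/value records:

```python
import codecs
from typing import Iterator, Tuple

MAX_SIZE = 500 * 1024  # example size limit, in bytes

def parse_records(raw: bytes) -> Iterator[Tuple[str, str]]:
    """Yield (field, value) pairs from a robots.txt body.

    Simplified sketch: invalid lines are silently skipped, mirroring the
    rule that only valid records are considered.
    """
    raw = raw[:MAX_SIZE]
    # An optional UTF-8 BOM at the start of the file is ignored.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    text = raw.decode("utf-8", errors="replace")
    # Records are separated by CR, CR/LF or LF.
    for line in text.splitlines():
        # Everything after "#" is a comment and is ignored.
        line = line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue  # not a valid record
        field, value = line.split(":", 1)
        # The <field> element is case-insensitive; the value may not be.
        yield field.strip().lower(), value.strip()
```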
Formal syntax / definition
This is a Backus-Naur Form (BNF)-like description, using the conventions of RFC 822, except that "|" is used to designate alternatives. Literals are quoted with "", parentheses "(" and ")" are used to group elements, optional elements are enclosed in [brackets], and elements may be preceded with <n>* to designate n or more repetitions of the following element; n defaults to 0.
```
robotstxt = *entries
entries = *( ( <1>*startgroupline
               *(groupmemberline | nongroupline | comment)
             | nongroupline
             | comment) )
startgroupline = [LWS] "user-agent" [LWS] ":" [LWS] agentvalue [comment] EOL
groupmemberline = [LWS] (
                  pathmemberfield [LWS] ":" [LWS] pathvalue
                  | othermemberfield [LWS] ":" [LWS] textvalue) [comment] EOL
nongroupline = [LWS] (
               urlnongroupfield [LWS] ":" [LWS] urlvalue
               | othernongroupfield [LWS] ":" [LWS] textvalue) [comment] EOL
comment = [LWS] "#" *anychar
agentvalue = textvalue
pathmemberfield = "disallow" | "allow"
othermemberfield = ()
urlnongroupfield = "sitemap"
othernongroupfield = ()
pathvalue = "/" path
urlvalue = absoluteURI
textvalue = *(valuechar | SP)
valuechar = <any UTF-8 character except ("#" CTL)>
anychar = <any UTF-8 character except CTL>
EOL = CR | LF | (CR LF)
```
The syntax for "absoluteURI", "CTL", "CR", "LF", "LWS" are defined in RFC 1945. The syntax for "path" is defined in RFC 1808.
Grouping of records
Records are categorized into different types based on the type of <field> element:
- start-of-group
- group-member
- non-group
All group-member records after a start-of-group record up to the next start-of-group record are treated as a group of records. The only start-of-group field element is user-agent. Multiple start-of-group lines directly after each other will follow the group-member records following the final start-of-group line. Any group-member records without a preceding start-of-group record are ignored. All non-group records are valid independently of all groups.
Valid <field> elements, which will be individually detailed further on in this document, are:
- user-agent (start of group)
- disallow (only valid as a group-member record)
- allow (only valid as a group-member record)
- sitemap (non-group record)
All other <field> elements may be ignored.
The start-of-group element user-agent is used to specify
for which crawler the group is valid. Only one group of records is valid
for a particular crawler. We will cover order of precedence later in this
document.
Example groups:
```
user-agent: a
disallow: /c

user-agent: b
disallow: /d

user-agent: e
user-agent: f
disallow: /g
```
There are three distinct groups specified, one for "a" and one for "b" as well as one for both "e" and "f". Each group has its own group-member record. Note the optional use of white-space (an empty line) to improve readability.
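A minimal sketch of this grouping step, building on the hypothetical parse_records helper above (the data layout and the handling of non-group records are assumptions):

```python
from typing import Dict, List, Tuple

START_OF_GROUP = "user-agent"
GROUP_MEMBER_FIELDS = {"allow", "disallow"}

def build_groups(records):
    """Group (field, value) records into per-user-agent rule lists.

    Consecutive user-agent lines share the group-member records that
    follow the last of them; group-member records with no preceding
    user-agent line are ignored; non-group records (such as sitemap)
    are collected separately.
    """
    groups: Dict[str, List[Tuple[str, str]]] = {}
    non_group: List[Tuple[str, str]] = []
    current_agents: List[str] = []
    expecting_agents = True  # still collecting consecutive user-agent lines?

    for field, value in records:
        if field == START_OF_GROUP:
            if not expecting_agents:
                current_agents = []      # a new group starts here
            current_agents.append(value.lower())
            expecting_agents = True
        elif field in GROUP_MEMBER_FIELDS:
            expecting_agents = False
            for agent in current_agents:  # empty list -> record is ignored
                groups.setdefault(agent, []).append((field, value))
        else:
            non_group.append((field, value))
    return groups, non_group
```

Applied to the example above, this yields one rule list for "a", one for "b", and a shared rule list for both "e" and "f".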
Order of precedence for user-agents
Only one group of group-member records is valid for a particular crawler.
The crawler must determine the correct group of records by finding the
group with the most specific user-agent that still matches. All other
groups of records are ignored by the crawler. The user-agent is not case-sensitive. All non-matching text is ignored (for example, both
googlebot/1.2 and googlebot* are
equivalent to googlebot). The
order of the groups within the robots.txt file is irrelevant.
Example:
Assuming the following robots.txt file:
```
user-agent: googlebot-news
(group 1)

user-agent: *
(group 2)

user-agent: googlebot
(group 3)
```
This is how the crawlers would choose the relevant group:
| Name of crawler | Record group followed | Comments |
|---|---|---|
| Googlebot News | (group 1) | Only the most specific group is followed, all others are ignored. |
| Googlebot (web) | (group 3) | |
| Googlebot Images | (group 3) | There is no specific googlebot-images group, so the more generic group is followed. |
| Googlebot News (when crawling images) | (group 1) | These images are crawled for and by Googlebot News, therefore only the Googlebot News group is followed. |
| Otherbot (web) | (group 2) | |
| Otherbot (News) | (group 2) | Even if there is an entry for a related crawler, it is only valid if it is specifically matching. |
Also see Google's crawlers and user-agent strings.
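A minimal sketch of this selection step in Python, assuming the group mapping produced by the grouping sketch above (the simplified product-token matching is an assumption, not Google's implementation):

```python
def select_group(crawler_name: str, groups):
    """Pick the single most specific group of rules for a crawler.

    Matching is not case-sensitive and only the leading product token is
    used, so "googlebot/1.2" and "googlebot*" both reduce to "googlebot".
    """
    name = crawler_name.lower()
    best_len = -1
    best_rules = None
    for agent, rules in groups.items():
        # Keep only the leading token; version suffixes and "*" are ignored.
        token = agent.split("/")[0].rstrip("*").strip()
        if token and name.startswith(token) and len(token) > best_len:
            best_len, best_rules = len(token), rules
    if best_rules is not None:
        return best_rules
    # No specific match: fall back to the wildcard group, else no rules at all.
    return groups.get("*", [])

# "googlebot-news" picks the googlebot-news group (most specific match);
# "googlebot-images" has no group of its own and falls back to "googlebot";
# any other crawler gets the "*" group, matching the table above.
```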
Group-member records
Only general and Google-specific group-member record types are covered in this section. These record types are also called "directives" for the crawlers. These directives are specified in the form of "directive: [path]" where [path] is optional. By default, there are no restrictions for crawling for the designated crawlers. Directives without a [path] are ignored.
The [path] value, if specified, is to be interpreted relative to the root of the website for which the robots.txt file was fetched (using the same protocol, port number, host and domain names). The path value must start with "/" to designate the root. If a path without a leading slash is found, the slash may be assumed to be there. The path is case-sensitive. More information can be found in the section "URL matching based on path values" below.
disallow
The disallow directive specifies paths that must not be
accessed by the designated crawlers. When no path is specified, the
directive is ignored.
Usage:
disallow: [path]
allow
The allow directive specifies paths that may be accessed by the
designated crawlers. When no path is specified, the directive is
ignored.
Usage:
allow: [path]
URL matching based on path values
The path value is used as a basis to determine whether or not a rule applies to a specific URL on a site. With the exception of wildcards, the path is used to match the beginning of a URL (and any valid URLs that start with the same path). Non-7-bit ASCII characters in a path may be included as UTF-8 characters or as percent-escaped UTF-8 encoded characters per RFC 3986.
Note: "AJAX-Crawling" URLs must be specified in their crawled versions.
Google, Bing, Yahoo, and Ask support a limited form of "wildcards" for path values. These are:
- * designates 0 or more instances of any valid character
- $ designates the end of the URL
Example path matches
| [path] | Matches | Does not match | Comments |
|---|---|---|---|
| / | any valid URL | | Matches the root and any lower level URL. |
| /* | equivalent to / | equivalent to / | Equivalent to "/" -- the trailing wildcard is ignored. |
| /fish | /fish, /fish.html, /fish/salmon.html, /fishheads, /fishheads/yummy.html, /fish.php?id=anything | /Fish.asp, /catfish, /?id=fish | Note the case-sensitive matching. |
| /fish* | /fish, /fish.html, /fish/salmon.html, /fishheads, /fishheads/yummy.html, /fish.php?id=anything | /Fish.asp, /catfish, /?id=fish | Equivalent to "/fish" -- the trailing wildcard is ignored. |
| /fish/ | /fish/, /fish/?id=anything, /fish/salmon.htm | /fish, /fish.html, /Fish/Salmon.asp | The trailing slash means this matches anything in this folder. |
| fish/ | equivalent to /fish/ | equivalent to /fish/ | Equivalent to "/fish/". |
| /*.php | /filename.php, /folder/filename.php, /folder/filename.php?parameters, /folder/any.php.file.html, /filename.php/ | / (even if it maps to /index.php), /windows.PHP | |
| /*.php$ | /filename.php, /folder/filename.php | /filename.php?parameters, /filename.php/, /filename.php5, /windows.PHP | |
| /fish*.php | /fish.php, /fishheads/catfish.php?parameters | /Fish.PHP | |
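A minimal Python sketch of this wildcard matching, translating a [path] value into a regular expression (the helper name and the URL normalization are assumptions; it is not Google's implementation). The assertions reproduce a few rows from the table above:

```python
import re
from urllib.parse import urlsplit

def path_matches(pattern: str, url: str) -> bool:
    """Return True if the rule's [path] value matches the URL's path.

    "*" matches zero or more characters and "$" anchors the end of the
    URL; everything else is matched literally and case-sensitively.
    """
    if not pattern.startswith("/"):
        pattern = "/" + pattern          # a missing leading slash is assumed
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"                  # end-of-URL anchor
        else:
            regex += re.escape(ch)
    return re.match(regex, target) is not None

assert path_matches("/fish", "http://example.com/fishheads/yummy.html")
assert not path_matches("/fish", "http://example.com/Fish.asp")
assert path_matches("/*.php$", "http://example.com/folder/filename.php")
assert not path_matches("/*.php$", "http://example.com/filename.php?parameters")
```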
Google-supported non-group-member records
sitemap
Supported by Google, Ask, Bing, Yahoo; defined on sitemaps.org.
Usage:
sitemap: [absoluteURL]
[absoluteURL] points to a Sitemap, Sitemap Index file or equivalent URL.
The URL does not have to be on the same host as the robots.txt file.
Multiple sitemap entries may exist. As non-group-member
records, these are not tied to any specific user-agents and may be followed
by all crawlers, provided it is not disallowed.
Order of precedence for group-member records
At a group-member level, in particular for allow and
disallow directives, the most specific rule based on the
length of the [path] entry will trump the less specific (shorter) rule.
The order of precedence for rules with wildcards is undefined.
Sample situations:
| URL | allow: | disallow: | Verdict | Comments |
|---|---|---|---|---|
| http://example.com/page | /p | / | allow | |
| http://example.com/folder/page | /folder/ | /folder | allow | |
| http://example.com/page.htm | /page | /*.htm | undefined | |
| http://example.com/ | /$ | / | allow | |
| http://example.com/page.htm | /$ | / | disallow | |
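A minimal sketch of this precedence rule, reusing the hypothetical path_matches helper from the previous sketch. The allow tie-break and the use of path length for wildcard rules are assumptions, since the specification leaves the wildcard case undefined:

```python
def is_allowed(url: str, rules) -> bool:
    """Decide whether a URL may be crawled under a group's allow/disallow rules.

    The most specific (longest [path]) matching rule wins; on a tie this
    sketch prefers allow. Precedence among wildcard rules is undefined in
    the specification, so the length heuristic is only a guess there.
    """
    best_len = -1
    allowed = True  # no matching rule means no restriction
    for field, path in rules:
        if not path:
            continue  # directives without a [path] are ignored
        if path_matches(path, url):
            if len(path) > best_len or (len(path) == best_len and field == "allow"):
                best_len = len(path)
                allowed = (field == "allow")
    return allowed

rules = [("allow", "/p"), ("disallow", "/")]
print(is_allowed("http://example.com/page", rules))   # True, matching the first row above
```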