| draft-fielding-uri-rfc2396bis | rfc3986.txt | |||
|---|---|---|---|---|
| Network Working Group T. Berners-Lee | Network Working Group T. Berners-Lee | |||
| Internet-Draft W3C/MIT | Request for Comments: 3986 W3C/MIT | |||
| Updates: 1738 (if approved) R. Fielding | STD: 66 R. Fielding | |||
| Obsoletes: 2732, 2396, 1808 (if approved) Day Software | Updates: 1738 Day Software | |||
| L. Masinter | Obsoletes: 2732, 2396, 1808 L. Masinter | |||
| Expires: March 26, 2005 Adobe | Category: Standards Track Adobe Systems | |||
| September 25, 2004 | January 2005 | |||
| Uniform Resource Identifier (URI): Generic Syntax | Uniform Resource Identifier (URI): Generic Syntax | |||
| draft-fielding-uri-rfc2396bis-07 | ||||
| Status of this Memo | ||||
| This document is an Internet-Draft and is subject to all provisions | ||||
| of section 3 of RFC 3667. By submitting this Internet-Draft, each | ||||
| author represents that any applicable patent or other IPR claims of | ||||
| which he or she is aware have been or will be disclosed, and any of | ||||
| which he or she become aware will be disclosed, in accordance with | ||||
| RFC 3668. | ||||
| Internet-Drafts are working documents of the Internet Engineering | Status of This Memo | |||
| Task Force (IETF), its areas, and its working groups. Note that | ||||
| other groups may also distribute working documents as | ||||
| Internet-Drafts. | ||||
| Internet-Drafts are draft documents valid for a maximum of six months | ||||
| and may be updated, replaced, or obsoleted by other documents at any | ||||
| time. It is inappropriate to use Internet-Drafts as reference | ||||
| material or to cite them other than as "work in progress." | ||||
| The list of current Internet-Drafts can be accessed at | ||||
| <http://www.ietf.org/ietf/1id-abstracts.txt>. | ||||
| The list of Internet-Draft Shadow Directories can be accessed at | This document specifies an Internet standards track protocol for the | |||
| <http://www.ietf.org/shadow.html>. | Internet community, and requests discussion and suggestions for | |||
| improvements. Please refer to the current edition of the "Internet | ||||
| Official Protocol Standards" (STD 1) for the standardization state | ||||
| and status of this protocol. Distribution of this memo is unlimited. | ||||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (2004). | Copyright (C) The Internet Society (2005). | |||
| Abstract | Abstract | |||
| A Uniform Resource Identifier (URI) is a compact sequence of | A Uniform Resource Identifier (URI) is a compact sequence of | |||
| characters for identifying an abstract or physical resource. This | characters that identifies an abstract or physical resource. This | |||
| specification defines the generic URI syntax and a process for | specification defines the generic URI syntax and a process for | |||
| resolving URI references that might be in relative form, along with | resolving URI references that might be in relative form, along with | |||
| guidelines and security considerations for the use of URIs on the | guidelines and security considerations for the use of URIs on the | |||
| Internet. The URI syntax defines a grammar that is a superset of all | Internet. The URI syntax defines a grammar that is a superset of all | |||
| valid URIs, such that an implementation can parse the common | valid URIs, allowing an implementation to parse the common components | |||
| components of a URI reference without knowing the scheme-specific | of a URI reference without knowing the scheme-specific requirements | |||
| requirements of every possible identifier. This specification does | of every possible identifier. This specification does not define a | |||
| not define a generative grammar for URIs; that task is performed by | generative grammar for URIs; that task is performed by the individual | |||
| the individual specifications of each URI scheme. | specifications of each URI scheme. | |||
| Editorial Note | ||||
| Discussion of this draft and comments to the editors should be sent | ||||
| to the [email protected] mailing list. An issues list and version history | ||||
| is available at <http://gbiv.com/protocols/uri/rev-2002/issues.html>. | ||||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.1 Overview of URIs . . . . . . . . . . . . . . . . . . . . . 4 | 1.1. Overview of URIs . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.1.1 Generic Syntax . . . . . . . . . . . . . . . . . . . . 6 | 1.1.1. Generic Syntax . . . . . . . . . . . . . . . . . 6 | |||
| 1.1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . 7 | 1.1.2. Examples . . . . . . . . . . . . . . . . . . . . 7 | |||
| 1.1.3 URI, URL, and URN . . . . . . . . . . . . . . . . . . 7 | 1.1.3. URI, URL, and URN . . . . . . . . . . . . . . . 7 | |||
| 1.2 Design Considerations . . . . . . . . . . . . . . . . . . 7 | 1.2. Design Considerations . . . . . . . . . . . . . . . . . 8 | |||
| 1.2.1 Transcription . . . . . . . . . . . . . . . . . . . . 7 | 1.2.1. Transcription . . . . . . . . . . . . . . . . . 8 | |||
| 1.2.2 Separating Identification from Interaction . . . . . . 9 | 1.2.2. Separating Identification from Interaction . . . 9 | |||
| 1.2.3 Hierarchical Identifiers . . . . . . . . . . . . . . . 10 | 1.2.3. Hierarchical Identifiers . . . . . . . . . . . . 10 | |||
| 1.3 Syntax Notation . . . . . . . . . . . . . . . . . . . . . 11 | 1.3. Syntax Notation . . . . . . . . . . . . . . . . . . . . 11 | |||
| 2. Characters . . . . . . . . . . . . . . . . . . . . . . . . . . 11 | 2. Characters . . . . . . . . . . . . . . . . . . . . . . . . . . 11 | |||
| 2.1 Percent-Encoding . . . . . . . . . . . . . . . . . . . . . 12 | 2.1. Percent-Encoding . . . . . . . . . . . . . . . . . . . . 12 | |||
| 2.2 Reserved Characters . . . . . . . . . . . . . . . . . . . 12 | 2.2. Reserved Characters . . . . . . . . . . . . . . . . . . 12 | |||
| 2.3 Unreserved Characters . . . . . . . . . . . . . . . . . . 13 | 2.3. Unreserved Characters . . . . . . . . . . . . . . . . . 13 | |||
| 2.4 When to Encode or Decode . . . . . . . . . . . . . . . . . 13 | 2.4. When to Encode or Decode . . . . . . . . . . . . . . . . 14 | |||
| 2.5 Identifying Data . . . . . . . . . . . . . . . . . . . . . 14 | 2.5. Identifying Data . . . . . . . . . . . . . . . . . . . . 14 | |||
| 3. Syntax Components . . . . . . . . . . . . . . . . . . . . . . 16 | 3. Syntax Components . . . . . . . . . . . . . . . . . . . . . . 16 | |||
| 3.1 Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . 16 | 3.1. Scheme . . . . . . . . . . . . . . . . . . . . . . . . . 17 | |||
| 3.2 Authority . . . . . . . . . . . . . . . . . . . . . . . . 17 | 3.2. Authority . . . . . . . . . . . . . . . . . . . . . . . 17 | |||
| 3.2.1 User Information . . . . . . . . . . . . . . . . . . . 18 | 3.2.1. User Information . . . . . . . . . . . . . . . . 18 | |||
| 3.2.2 Host . . . . . . . . . . . . . . . . . . . . . . . . . 18 | 3.2.2. Host . . . . . . . . . . . . . . . . . . . . . . 18 | |||
| 3.2.3 Port . . . . . . . . . . . . . . . . . . . . . . . . . 21 | 3.2.3. Port . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 3.3 Path . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 | 3.3. Path . . . . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 3.4 Query . . . . . . . . . . . . . . . . . . . . . . . . . . 23 | 3.4. Query . . . . . . . . . . . . . . . . . . . . . . . . . 23 | |||
| 3.5 Fragment . . . . . . . . . . . . . . . . . . . . . . . . . 24 | 3.5. Fragment . . . . . . . . . . . . . . . . . . . . . . . . 24 | |||
| 4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 | 4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 | |||
| 4.1 URI Reference . . . . . . . . . . . . . . . . . . . . . . 25 | 4.1. URI Reference . . . . . . . . . . . . . . . . . . . . . 25 | |||
| 4.2 Relative Reference . . . . . . . . . . . . . . . . . . . . 26 | 4.2. Relative Reference . . . . . . . . . . . . . . . . . . . 26 | |||
| 4.3 Absolute URI . . . . . . . . . . . . . . . . . . . . . . . 26 | 4.3. Absolute URI . . . . . . . . . . . . . . . . . . . . . . 27 | |||
| 4.4 Same-document Reference . . . . . . . . . . . . . . . . . 27 | 4.4. Same-Document Reference . . . . . . . . . . . . . . . . 27 | |||
| 4.5 Suffix Reference . . . . . . . . . . . . . . . . . . . . . 27 | 4.5. Suffix Reference . . . . . . . . . . . . . . . . . . . . 27 | |||
| 5. Reference Resolution . . . . . . . . . . . . . . . . . . . . . 28 | 5. Reference Resolution . . . . . . . . . . . . . . . . . . . . . 28 | |||
| 5.1 Establishing a Base URI . . . . . . . . . . . . . . . . . 28 | 5.1. Establishing a Base URI . . . . . . . . . . . . . . . . 28 | |||
| 5.1.1 Base URI Embedded in Content . . . . . . . . . . . . . 29 | 5.1.1. Base URI Embedded in Content . . . . . . . . . . 29 | |||
| 5.1.2 Base URI from the Encapsulating Entity . . . . . . . . 29 | 5.1.2. Base URI from the Encapsulating Entity . . . . . 29 | |||
| 5.1.3 Base URI from the Retrieval URI . . . . . . . . . . . 30 | 5.1.3. Base URI from the Retrieval URI . . . . . . . . 30 | |||
| 5.1.4 Default Base URI . . . . . . . . . . . . . . . . . . . 30 | 5.1.4. Default Base URI . . . . . . . . . . . . . . . . 30 | |||
| 5.2 Relative Resolution . . . . . . . . . . . . . . . . . . . 30 | 5.2. Relative Resolution . . . . . . . . . . . . . . . . . . 30 | |||
| 5.2.1 Pre-parse the Base URI . . . . . . . . . . . . . . . . 30 | 5.2.1. Pre-parse the Base URI . . . . . . . . . . . . . 31 | |||
| 5.2.2 Transform References . . . . . . . . . . . . . . . . . 31 | 5.2.2. Transform References . . . . . . . . . . . . . . 31 | |||
| 5.2.3 Merge Paths . . . . . . . . . . . . . . . . . . . . . 32 | 5.2.3. Merge Paths . . . . . . . . . . . . . . . . . . 32 | |||
| 5.2.4 Remove Dot Segments . . . . . . . . . . . . . . . . . 32 | 5.2.4. Remove Dot Segments . . . . . . . . . . . . . . 33 | |||
| 5.3 Component Recomposition . . . . . . . . . . . . . . . . . 34 | 5.3. Component Recomposition . . . . . . . . . . . . . . . . 35 | |||
| 5.4 Reference Resolution Examples . . . . . . . . . . . . . . 34 | 5.4. Reference Resolution Examples . . . . . . . . . . . . . 35 | |||
| 5.4.1 Normal Examples . . . . . . . . . . . . . . . . . . . 35 | 5.4.1. Normal Examples . . . . . . . . . . . . . . . . 36 | |||
| 5.4.2 Abnormal Examples . . . . . . . . . . . . . . . . . . 35 | 5.4.2. Abnormal Examples . . . . . . . . . . . . . . . 36 | |||
| 6. Normalization and Comparison . . . . . . . . . . . . . . . . . 36 | 6. Normalization and Comparison . . . . . . . . . . . . . . . . . 38 | |||
| 6.1 Equivalence . . . . . . . . . . . . . . . . . . . . . . . 37 | 6.1. Equivalence . . . . . . . . . . . . . . . . . . . . . . 38 | |||
| 6.2 Comparison Ladder . . . . . . . . . . . . . . . . . . . . 37 | 6.2. Comparison Ladder . . . . . . . . . . . . . . . . . . . 39 | |||
| 6.2.1 Simple String Comparison . . . . . . . . . . . . . . . 38 | 6.2.1. Simple String Comparison . . . . . . . . . . . . 39 | |||
| 6.2.2 Syntax-based Normalization . . . . . . . . . . . . . . 39 | 6.2.2. Syntax-Based Normalization . . . . . . . . . . . 40 | |||
| 6.2.3 Scheme-based Normalization . . . . . . . . . . . . . . 40 | 6.2.3. Scheme-Based Normalization . . . . . . . . . . . 41 | |||
| 6.2.4 Protocol-based Normalization . . . . . . . . . . . . . 41 | 6.2.4. Protocol-Based Normalization . . . . . . . . . . 42 | |||
| 7. Security Considerations . . . . . . . . . . . . . . . . . . . 41 | 7. Security Considerations . . . . . . . . . . . . . . . . . . . 43 | |||
| 7.1 Reliability and Consistency . . . . . . . . . . . . . . . 41 | 7.1. Reliability and Consistency . . . . . . . . . . . . . . 43 | |||
| 7.2 Malicious Construction . . . . . . . . . . . . . . . . . . 42 | 7.2. Malicious Construction . . . . . . . . . . . . . . . . . 43 | |||
| 7.3 Back-end Transcoding . . . . . . . . . . . . . . . . . . . 42 | 7.3. Back-End Transcoding . . . . . . . . . . . . . . . . . . 44 | |||
| 7.4 Rare IP Address Formats . . . . . . . . . . . . . . . . . 43 | 7.4. Rare IP Address Formats . . . . . . . . . . . . . . . . 45 | |||
| 7.5 Sensitive Information . . . . . . . . . . . . . . . . . . 44 | 7.5. Sensitive Information . . . . . . . . . . . . . . . . . 45 | |||
| 7.6 Semantic Attacks . . . . . . . . . . . . . . . . . . . . . 44 | 7.6. Semantic Attacks . . . . . . . . . . . . . . . . . . . . 45 | |||
| 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 45 | 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 46 | |||
| 9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 45 | 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 46 | |||
| 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 46 | 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 46 | |||
| 10.1 Normative References . . . . . . . . . . . . . . . . . . . . 46 | 10.1. Normative References . . . . . . . . . . . . . . . . . . 46 | |||
| 10.2 Informative References . . . . . . . . . . . . . . . . . . . 46 | 10.2. Informative References . . . . . . . . . . . . . . . . . 47 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 48 | ||||
| A. Collected ABNF for URI . . . . . . . . . . . . . . . . . . . . 49 | A. Collected ABNF for URI . . . . . . . . . . . . . . . . . . . . 49 | |||
| B. Parsing a URI Reference with a Regular Expression . . . . . . 51 | B. Parsing a URI Reference with a Regular Expression . . . . . . 50 | |||
| C. Delimiting a URI in Context . . . . . . . . . . . . . . . . . 52 | C. Delimiting a URI in Context . . . . . . . . . . . . . . . . . 51 | |||
| D. Changes from RFC 2396 . . . . . . . . . . . . . . . . . . . . 53 | D. Changes from RFC 2396 . . . . . . . . . . . . . . . . . . . . 53 | |||
| D.1 Additions . . . . . . . . . . . . . . . . . . . . . . . . 53 | D.1. Additions . . . . . . . . . . . . . . . . . . . . . . . 53 | |||
| D.2 Modifications . . . . . . . . . . . . . . . . . . . . . . 54 | D.2. Modifications . . . . . . . . . . . . . . . . . . . . . 53 | |||
| E. Instructions to RFC Editor . . . . . . . . . . . . . . . . . . 56 | ||||
| Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 | Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 | |||
| | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 60 | |||
| Intellectual Property and Copyright Statements . . . . . . . . 61 | Full Copyright Statement . . . . . . . . . . . . . . . . . . . . . 61 | |||
| 1. Introduction | 1. Introduction | |||
| A Uniform Resource Identifier (URI) provides a simple and extensible | A Uniform Resource Identifier (URI) provides a simple and extensible | |||
| means for identifying a resource. This specification of URI syntax | means for identifying a resource. This specification of URI syntax | |||
| and semantics is derived from concepts introduced by the World Wide | and semantics is derived from concepts introduced by the World Wide | |||
| Web global information initiative, whose use of such identifiers | Web global information initiative, whose use of these identifiers | |||
| dates from 1990 and is described in "Universal Resource Identifiers | dates from 1990 and is described in "Universal Resource Identifiers | |||
| in WWW" [RFC1630], and is designed to meet the recommendations laid | in WWW" [RFC1630]. The syntax is designed to meet the | |||
| out in "Functional Recommendations for Internet Resource Locators" | recommendations laid out in "Functional Recommendations for Internet | |||
| [RFC1736] and "Functional Requirements for Uniform Resource Names" | Resource Locators" [RFC1736] and "Functional Requirements for Uniform | |||
| [RFC1737]. | Resource Names" [RFC1737]. | |||
| This document obsoletes [RFC2396], which merged "Uniform Resource | This document obsoletes [RFC2396], which merged "Uniform Resource | |||
| Locators" [RFC1738] and "Relative Uniform Resource Locators" | Locators" [RFC1738] and "Relative Uniform Resource Locators" | |||
| [RFC1808] in order to define a single, generic syntax for all URIs. | [RFC1808] in order to define a single, generic syntax for all URIs. | |||
| It contains the updates from, and obsoletes, [RFC2732], which | It obsoletes [RFC2732], which introduced syntax for an IPv6 address. | |||
| introduced syntax for IPv6 addresses. It excludes those portions of | It excludes portions of RFC 1738 that defined the specific syntax of | |||
| RFC 1738 that defined the specific syntax of individual URI schemes; | individual URI schemes; those portions will be updated as separate | |||
| those portions will be updated as separate documents. The process | documents. The process for registration of new URI schemes is | |||
| for registration of new URI schemes is defined separately by [BCP35]. | defined separately by [BCP35]. Advice for designers of new URI | |||
| Advice for designers of new URI schemes can be found in [RFC2718]. | schemes can be found in [RFC2718]. All significant changes from RFC | |||
| 2396 are noted in Appendix D. | ||||
| All significant changes from RFC 2396 are noted in Appendix D. | ||||
| This specification uses the terms "character" and "coded character | This specification uses the terms "character" and "coded character | |||
| set" in accordance with the definitions provided in [BCP19], and | set" in accordance with the definitions provided in [BCP19], and | |||
| "character encoding" in place of what [BCP19] refers to as a | "character encoding" in place of what [BCP19] refers to as a | |||
| "charset". | "charset". | |||
| 1.1 Overview of URIs | 1.1. Overview of URIs | |||
| URIs are characterized as follows: | URIs are characterized as follows: | |||
| Uniform | Uniform | |||
| Uniformity provides several benefits: it allows different types of | Uniformity provides several benefits. It allows different types | |||
| resource identifiers to be used in the same context, even when the | of resource identifiers to be used in the same context, even when | |||
| mechanisms used to access those resources may differ; it allows | the mechanisms used to access those resources may differ. It | |||
| uniform semantic interpretation of common syntactic conventions | allows uniform semantic interpretation of common syntactic | |||
| across different types of resource identifiers; it allows | conventions across different types of resource identifiers. It | |||
| introduction of new types of resource identifiers without | allows introduction of new types of resource identifiers without | |||
| interfering with the way that existing identifiers are used; and, | interfering with the way that existing identifiers are used. It | |||
| it allows the identifiers to be reused in many different contexts, | allows the identifiers to be reused in many different contexts, | |||
| thus permitting new applications or protocols to leverage a | thus permitting new applications or protocols to leverage a pre- | |||
| pre-existing, large, and widely-used set of resource identifiers. | existing, large, and widely used set of resource identifiers. | |||
| Resource | Resource | |||
| This specification does not limit the scope of what might be a | This specification does not limit the scope of what might be a | |||
| resource; rather, the term "resource" is used in a general sense | resource; rather, the term "resource" is used in a general sense | |||
| for whatever might be identified by a URI. Familiar examples | for whatever might be identified by a URI. Familiar examples | |||
| include an electronic document, an image, a source of information | include an electronic document, an image, a source of information | |||
| with consistent purpose (e.g., "today's weather report for Los | with a consistent purpose (e.g., "today's weather report for Los | |||
| Angeles"), a service (e.g., an HTTP to SMS gateway), a collection | Angeles"), a service (e.g., an HTTP-to-SMS gateway), and a | |||
| of other resources, and so on. A resource is not necessarily | collection of other resources. A resource is not necessarily | |||
| accessible via the Internet; e.g., human beings, corporations, and | accessible via the Internet; e.g., human beings, corporations, and | |||
| bound books in a library can also be resources. Likewise, | bound books in a library can also be resources. Likewise, | |||
| abstract concepts can be resources, such as the operators and | abstract concepts can be resources, such as the operators and | |||
| operands of a mathematical equation, the types of a relationship | operands of a mathematical equation, the types of a relationship | |||
| (e.g., "parent" or "employee"), or numeric values (e.g., zero, | (e.g., "parent" or "employee"), or numeric values (e.g., zero, | |||
| one, and infinity). | one, and infinity). | |||
| Identifier | Identifier | |||
| An identifier embodies the information required to distinguish | An identifier embodies the information required to distinguish | |||
| what is being identified from all other things within its scope of | what is being identified from all other things within its scope of | |||
| identification. Our use of the terms "identify" and "identifying" | identification. Our use of the terms "identify" and "identifying" | |||
| refer to this purpose of distinguishing one resource from all | refer to this purpose of distinguishing one resource from all | |||
| other resources, regardless of how that purpose is accomplished | other resources, regardless of how that purpose is accomplished | |||
| (e.g., by name, address, context, etc.). These terms should not | (e.g., by name, address, or context). These terms should not be | |||
| be mistaken as an assumption that an identifier defines or | mistaken as an assumption that an identifier defines or embodies | |||
| embodies the identity of what is referenced, though that may be | the identity of what is referenced, though that may be the case | |||
| the case for some identifiers. Nor should it be assumed that a | for some identifiers. Nor should it be assumed that a system | |||
| system using URIs will access the resource identified: in many | using URIs will access the resource identified: in many cases, | |||
| cases, URIs are used to denote resources without any intention | URIs are used to denote resources without any intention that they | |||
| that they be accessed. Likewise, the "one" resource identified | be accessed. Likewise, the "one" resource identified might not be | |||
| might not be singular in nature (e.g., a resource might be a named | singular in nature (e.g., a resource might be a named set or a | |||
| set or a mapping that varies over time). | mapping that varies over time). | |||
| A URI is an identifier, consisting of a sequence of characters | A URI is an identifier consisting of a sequence of characters | |||
| matching the syntax rule named <URI> in Section 3, that enables | matching the syntax rule named <URI> in Section 3. It enables | |||
| uniform identification of resources via a separately defined, | uniform identification of resources via a separately defined | |||
| extensible set of naming schemes (Section 3.1). How that | extensible set of naming schemes (Section 3.1). How that | |||
| identification is accomplished, assigned, or enabled is delegated to | identification is accomplished, assigned, or enabled is delegated to | |||
| each scheme specification. | each scheme specification. | |||
| This specification does not place any limits on the nature of a | This specification does not place any limits on the nature of a | |||
| resource, the reasons why an application might wish to refer to a | resource, the reasons why an application might seek to refer to a | |||
| resource, or the kinds of system that might use URIs for the sake of | resource, or the kinds of systems that might use URIs for the sake of | |||
| identifying resources. This specification does not require that a | identifying resources. This specification does not require that a | |||
| URI persists in identifying the same resource over all time, though | URI persists in identifying the same resource over time, though that | |||
| that is a common goal of all URI schemes. Nevertheless, nothing in | is a common goal of all URI schemes. Nevertheless, nothing in this | |||
| this specification prevents an application from limiting itself to | specification prevents an application from limiting itself to | |||
| particular types of resources, or to a subset of URIs that maintains | particular types of resources, or to a subset of URIs that maintains | |||
| characteristics desired by that application. | characteristics desired by that application. | |||
| URIs have a global scope and are interpreted consistently regardless | URIs have a global scope and are interpreted consistently regardless | |||
| of context, though the result of that interpretation may be in | of context, though the result of that interpretation may be in | |||
| relation to the end-user's context. For example, "http://localhost/" | relation to the end-user's context. For example, "http://localhost/" | |||
| has the same interpretation for every user of that reference, even | has the same interpretation for every user of that reference, even | |||
| though the network interface corresponding to "localhost" may be | though the network interface corresponding to "localhost" may be | |||
| different for each end-user: interpretation is independent of access. | different for each end-user: interpretation is independent of access. | |||
| However, an action made on the basis of that reference will take | However, an action made on the basis of that reference will take | |||
| place in relation to the end-user's context, which implies that an | place in relation to the end-user's context, which implies that an | |||
| action intended to refer to a single, globally unique thing must use | action intended to refer to a globally unique thing must use a URI | |||
| a URI that distinguishes that resource from all other things. URIs | that distinguishes that resource from all other things. URIs that | |||
| that identify in relation to the end-user's local context should only | identify in relation to the end-user's local context should only be | |||
| be used when the context itself is a defining aspect of the resource, | used when the context itself is a defining aspect of the resource, | |||
| such as when an on-line help manual refers to a file on the | such as when an on-line help manual refers to a file on the end- | |||
| end-user's filesystem (e.g., "file:///etc/hosts"). | user's file system (e.g., "file:///etc/hosts"). | |||
| 1.1.1 Generic Syntax | 1.1.1. Generic Syntax | |||
| Each URI begins with a scheme name, as defined in Section 3.1, that | Each URI begins with a scheme name, as defined in Section 3.1, that | |||
| refers to a specification for assigning identifiers within that | refers to a specification for assigning identifiers within that | |||
| scheme. As such, the URI syntax is a federated and extensible naming | scheme. As such, the URI syntax is a federated and extensible naming | |||
| system wherein each scheme's specification may further restrict the | system wherein each scheme's specification may further restrict the | |||
| syntax and semantics of identifiers using that scheme. | syntax and semantics of identifiers using that scheme. | |||
| This specification defines those elements of the URI syntax that are | This specification defines those elements of the URI syntax that are | |||
| required of all URI schemes or are common to many URI schemes. It | required of all URI schemes or are common to many URI schemes. It | |||
| thus defines the syntax and semantics that are needed to implement a | thus defines the syntax and semantics needed to implement a scheme- | |||
| scheme-independent parsing mechanism for URI references, such that | independent parsing mechanism for URI references, by which the | |||
| the scheme-dependent handling of a URI can be postponed until the | scheme-dependent handling of a URI can be postponed until the | |||
| scheme-dependent semantics are needed. Likewise, protocols and data | scheme-dependent semantics are needed. Likewise, protocols and data | |||
| formats that make use of URI references can refer to this | formats that make use of URI references can refer to this | |||
| specification as defining the range of syntax allowed for all URIs, | specification as a definition for the range of syntax allowed for all | |||
| including those schemes that have yet to be defined, thus decoupling | URIs, including those schemes that have yet to be defined. This | |||
| the evolution of identification schemes from the evolution of | decouples the evolution of identification schemes from the evolution | |||
| protocols, data formats, and implementations that make use of URIs. | of protocols, data formats, and implementations that make use of | |||
| URIs. | ||||
| A parser of the generic URI syntax is capable of parsing any URI | A parser of the generic URI syntax can parse any URI reference into | |||
| reference into its major components; once the scheme is determined, | its major components. Once the scheme is determined, further | |||
| further scheme-specific parsing can be performed on the components. | scheme-specific parsing can be performed on the components. In other | |||
| In other words, the URI generic syntax is a superset of the syntax of | words, the URI generic syntax is a superset of the syntax of all URI | |||
| all URI schemes. | schemes. | |||
| 1.1.2 Examples | 1.1.2. Examples | |||
| The following example URIs illustrate several URI schemes and | The following example URIs illustrate several URI schemes and | |||
| variations in their common syntax components: | variations in their common syntax components: | |||
| ftp://ftp.is.co.za/rfc/rfc1808.txt | ftp://ftp.is.co.za/rfc/rfc1808.txt | |||
| http://www.ietf.org/rfc/rfc2396.txt | http://www.ietf.org/rfc/rfc2396.txt | |||
| ldap://[2001:db8::7]/c=GB?objectClass?one | ldap://[2001:db8::7]/c=GB?objectClass?one | |||
| mailto:[email protected] | mailto:[email protected] | |||
| news:comp.infosystems.www.servers.unix | news:comp.infosystems.www.servers.unix | |||
| tel:+1-816-555-1212 | tel:+1-816-555-1212 | |||
| telnet://192.0.2.16:80/ | telnet://192.0.2.16:80/ | |||
| urn:oasis:names:specification:docbook:dtd:xml:4.1.2 | urn:oasis:names:specification:docbook:dtd:xml:4.1.2 | |||
| 1.1.3 URI, URL, and URN | 1.1.3. URI, URL, and URN | |||
| A URI can be further classified as a locator, a name, or both. The | A URI can be further classified as a locator, a name, or both. The | |||
| term "Uniform Resource Locator" (URL) refers to the subset of URIs | term "Uniform Resource Locator" (URL) refers to the subset of URIs | |||
| that, in addition to identifying a resource, provide a means of | that, in addition to identifying a resource, provide a means of | |||
| locating the resource by describing its primary access mechanism | locating the resource by describing its primary access mechanism | |||
| (e.g., its network "location"). The term "Uniform Resource Name" | (e.g., its network "location"). The term "Uniform Resource Name" | |||
| (URN) has been used historically to refer to both URIs under the | (URN) has been used historically to refer to both URIs under the | |||
| "urn" scheme [RFC2141], which are required to remain globally unique | "urn" scheme [RFC2141], which are required to remain globally unique | |||
| and persistent even when the resource ceases to exist or becomes | and persistent even when the resource ceases to exist or becomes | |||
| unavailable, and to any other URI with the properties of a name. | unavailable, and to any other URI with the properties of a name. | |||
| An individual scheme does not need to be classified as being just one | An individual scheme does not have to be classified as being just one | |||
| of "name" or "locator". Instances of URIs from any given scheme may | of "name" or "locator". Instances of URIs from any given scheme may | |||
| have the characteristics of names or locators or both, often | have the characteristics of names or locators or both, often | |||
| depending on the persistence and care in the assignment of | depending on the persistence and care in the assignment of | |||
| identifiers by the naming authority, rather than any quality of the | identifiers by the naming authority, rather than on any quality of | |||
| scheme. Future specifications and related documentation should use | the scheme. Future specifications and related documentation should | |||
| the general term "URI", rather than the more restrictive terms URL | use the general term "URI" rather than the more restrictive terms | |||
| and URN [RFC3305]. | "URL" and "URN" [RFC3305]. | |||
| 1.2 Design Considerations | 1.2. Design Considerations | |||
| 1.2.1 Transcription | 1.2.1. Transcription | |||
| The URI syntax has been designed with global transcription as one of | The URI syntax has been designed with global transcription as one of | |||
| its main considerations. A URI is a sequence of characters from a | its main considerations. A URI is a sequence of characters from a | |||
| very limited set: the letters of the basic Latin alphabet, digits, | very limited set: the letters of the basic Latin alphabet, digits, | |||
| and a few special characters. A URI may be represented in a variety | and a few special characters. A URI may be represented in a variety | |||
| of ways: e.g., ink on paper, pixels on a screen, or a sequence of | of ways; e.g., ink on paper, pixels on a screen, or a sequence of | |||
| character encoding octets. The interpretation of a URI depends only | character encoding octets. The interpretation of a URI depends only | |||
| on the characters used and not how those characters are represented | on the characters used and not on how those characters are | |||
| in a network protocol. | represented in a network protocol. | |||
| The goal of transcription can be described by a simple scenario. | The goal of transcription can be described by a simple scenario. | |||
| Imagine two colleagues, Sam and Kim, sitting in a pub at an | Imagine two colleagues, Sam and Kim, sitting in a pub at an | |||
| international conference and exchanging research ideas. Sam asks Kim | international conference and exchanging research ideas. Sam asks Kim | |||
| for a location to get more information, so Kim writes the URI for the | for a location to get more information, so Kim writes the URI for the | |||
| research site on a napkin. Upon returning home, Sam takes out the | research site on a napkin. Upon returning home, Sam takes out the | |||
| napkin and types the URI into a computer, which then retrieves the | napkin and types the URI into a computer, which then retrieves the | |||
| information to which Kim referred. | information to which Kim referred. | |||
| There are several design considerations revealed by the scenario: | There are several design considerations revealed by the scenario: | |||
| o A URI is a sequence of characters that is not always represented | o A URI is a sequence of characters that is not always represented | |||
| as a sequence of octets. | as a sequence of octets. | |||
| o A URI might be transcribed from a non-network source, and thus | o A URI might be transcribed from a non-network source and thus | |||
| should consist of characters that are most likely to be able to be | should consist of characters that are most likely able to be | |||
| entered into a computer, within the constraints imposed by | entered into a computer, within the constraints imposed by | |||
| keyboards (and related input devices) across languages and | keyboards (and related input devices) across languages and | |||
| locales. | locales. | |||
| o A URI often needs to be remembered by people, and it is easier for | o A URI often has to be remembered by people, and it is easier for | |||
| people to remember a URI when it consists of meaningful or | people to remember a URI when it consists of meaningful or | |||
| familiar components. | familiar components. | |||
| These design considerations are not always in alignment. For | These design considerations are not always in alignment. For | |||
| example, it is often the case that the most meaningful name for a URI | example, it is often the case that the most meaningful name for a URI | |||
| component would require characters that cannot be typed into some | component would require characters that cannot be typed into some | |||
| systems. The ability to transcribe a resource identifier from one | systems. The ability to transcribe a resource identifier from one | |||
| medium to another has been considered more important than having a | medium to another has been considered more important than having a | |||
| URI consist of the most meaningful of components. | URI consist of the most meaningful of components. | |||
| In local or regional contexts and with improving technology, users | In local or regional contexts and with improving technology, users | |||
| might benefit from being able to use a wider range of characters; | might benefit from being able to use a wider range of characters; | |||
| such use is not defined by this specification. Percent-encoded | such use is not defined by this specification. Percent-encoded | |||
| octets (Section 2.1) may be used within a URI to represent characters | octets (Section 2.1) may be used within a URI to represent characters | |||
| outside the range of the US-ASCII coded character set if such | outside the range of the US-ASCII coded character set if this | |||
| representation is allowed by the scheme or by the protocol element in | representation is allowed by the scheme or by the protocol element in | |||
| which the URI is referenced; such a definition should specify the | which the URI is referenced. Such a definition should specify the | |||
| character encoding used to map those characters to octets prior to | character encoding used to map those characters to octets prior to | |||
| being percent-encoded for the URI. | being percent-encoded for the URI. | |||
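The need to specify a character encoding before percent-encoding can be illustrated with a short sketch. It assumes Python's standard `urllib.parse.quote` and `unquote`; the helper name `pct_encode` and the sample string are illustrative only. The same character maps to different octets, and therefore to different percent-encoded forms, under different encodings.

```python
from urllib.parse import quote, unquote


def pct_encode(text, encoding):
    """Map characters to octets with `encoding`, then percent-encode them.

    Hypothetical helper; safe="" means no reserved character is left as-is.
    """
    return quote(text, safe="", encoding=encoding)


if __name__ == "__main__":
    word = "caf\u00e9"  # "cafe" with an acute accent on the final letter
    print(pct_encode(word, "utf-8"))    # two octets for the accented letter
    print(pct_encode(word, "latin-1"))  # one octet for the accented letter
    # Decoding with the wrong encoding does not recover the character.
    print(unquote("%E9", encoding="latin-1"))
    print(unquote("%E9", encoding="utf-8"))
```

A consumer that does not know which encoding the producer used cannot reliably recover the original characters from the percent-encoded octets.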
| 1.2.2 Separating Identification from Interaction | 1.2.2. Separating Identification from Interaction | |||
| A common misunderstanding of URIs is that they are only used to refer | A common misunderstanding of URIs is that they are only used to refer | |||
| to accessible resources. In fact, the URI alone only provides | to accessible resources. The URI itself only provides | |||
| identification; access to the resource is neither guaranteed nor | identification; access to the resource is neither guaranteed nor | |||
| implied by the presence of a URI. Instead, an operation (if any) | implied by the presence of a URI. Instead, any operation associated | |||
| associated with a URI reference is defined by the protocol element, | with a URI reference is defined by the protocol element, data format | |||
| data format attribute, or natural language text in which it appears. | attribute, or natural language text in which it appears. | |||
| Given a URI, a system may attempt to perform a variety of operations | Given a URI, a system may attempt to perform a variety of operations | |||
| on the resource, as might be characterized by such words as "access", | on the resource, as might be characterized by words such as "access", | |||
| "update", "replace", or "find attributes". Such operations are | "update", "replace", or "find attributes". Such operations are | |||
| defined by the protocols that make use of URIs, not by this | defined by the protocols that make use of URIs, not by this | |||
| specification. However, we do use a few general terms for describing | specification. However, we do use a few general terms for describing | |||
| common operations on URIs. URI "resolution" is the process of | common operations on URIs. URI "resolution" is the process of | |||
| determining an access mechanism and the appropriate parameters | determining an access mechanism and the appropriate parameters | |||
| necessary to dereference a URI; such resolution may require several | necessary to dereference a URI; this resolution may require several | |||
| iterations. To use that access mechanism to perform an action on the | iterations. To use that access mechanism to perform an action on the | |||
| URI's resource is to "dereference" the URI. | URI's resource is to "dereference" the URI. | |||
| When URIs are used within information retrieval systems to identify | When URIs are used within information retrieval systems to identify | |||
| sources of information, the most common form of URI dereference is | sources of information, the most common form of URI dereference is | |||
| "retrieval": making use of a URI in order to retrieve a | "retrieval": making use of a URI in order to retrieve a | |||
| representation of its associated resource. A "representation" is a | representation of its associated resource. A "representation" is a | |||
| sequence of octets, along with representation metadata describing | sequence of octets, along with representation metadata describing | |||
| those octets, that constitutes a record of the state of the resource | those octets, that constitutes a record of the state of the resource | |||
| at the time that the representation is generated. Retrieval is | at the time when the representation is generated. Retrieval is | |||
| achieved by a process that might include using the URI as a cache key | achieved by a process that might include using the URI as a cache key | |||
| to check for a locally cached representation, resolution of the URI | to check for a locally cached representation, resolution of the URI | |||
| to determine an appropriate access mechanism (if any), and | to determine an appropriate access mechanism (if any), and | |||
| dereference of the URI for the sake of applying a retrieval | dereference of the URI for the sake of applying a retrieval | |||
| operation. Depending on the protocols used to perform the retrieval, | operation. Depending on the protocols used to perform the retrieval, | |||
| additional information might be supplied about the resource (resource | additional information might be supplied about the resource (resource | |||
| metadata) and its relation to other resources. | metadata) and its relation to other resources. | |||
| URI references in information retrieval systems are designed to be | URI references in information retrieval systems are designed to be | |||
| late-binding: the result of an access is generally determined at the | late-binding: the result of an access is generally determined when it | |||
| time it is accessed and may vary over time or due to other aspects of | is accessed and may vary over time or due to other aspects of the | |||
| the interaction. Such references are created in order to be used in | interaction. These references are created in order to be used in the | |||
| the future: what is being identified is not some specific result that | future: what is being identified is not some specific result that was | |||
| was obtained in the past, but rather some characteristic that is | obtained in the past, but rather some characteristic that is expected | |||
| expected to be true for future results. In such cases, the resource | to be true for future results. In such cases, the resource referred | |||
| referred to by the URI is actually a sameness of characteristics as | to by the URI is actually a sameness of characteristics as observed | |||
| observed over time, perhaps elucidated by additional comments or | over time, perhaps elucidated by additional comments or assertions | |||
| assertions made by the resource provider. | made by the resource provider. | |||
| Although many URI schemes are named after protocols, this does not | Although many URI schemes are named after protocols, this does not | |||
| imply that use of such a URI will result in access to the resource | imply that use of these URIs will result in access to the resource | |||
| via the named protocol. URIs are often used simply for the sake of | via the named protocol. URIs are often used simply for the sake of | |||
| identification. Even when a URI is used to retrieve a representation | identification. Even when a URI is used to retrieve a representation | |||
| of a resource, that access might be through gateways, proxies, | of a resource, that access might be through gateways, proxies, | |||
| caches, and name resolution services that are independent of the | caches, and name resolution services that are independent of the | |||
| protocol associated with the scheme name, and the resolution of some | protocol associated with the scheme name. The resolution of some | |||
| URIs may require the use of more than one protocol (e.g., both DNS | URIs may require the use of more than one protocol (e.g., both DNS | |||
| and HTTP are typically used to access an "http" URI's origin server | and HTTP are typically used to access an "http" URI's origin server | |||
| when a representation isn't found in a local cache). | when a representation isn't found in a local cache). | |||
| 1.2.3 Hierarchical Identifiers | 1.2.3. Hierarchical Identifiers | |||
| The URI syntax is organized hierarchically, with components listed in | The URI syntax is organized hierarchically, with components listed in | |||
| order of decreasing significance from left to right. For some URI | order of decreasing significance from left to right. For some URI | |||
| schemes, the visible hierarchy is limited to the scheme itself: | schemes, the visible hierarchy is limited to the scheme itself: | |||
| everything after the scheme component delimiter (":") is considered | everything after the scheme component delimiter (":") is considered | |||
| opaque to URI processing. Other URI schemes make the hierarchy | opaque to URI processing. Other URI schemes make the hierarchy | |||
| explicit and visible to generic parsing algorithms. | explicit and visible to generic parsing algorithms. | |||
| The generic syntax uses the slash ("/"), question mark ("?"), and | The generic syntax uses the slash ("/"), question mark ("?"), and | |||
| number sign ("#") characters for the purpose of delimiting components | number sign ("#") characters to delimit components that are | |||
| that are significant to the generic parser's hierarchical | significant to the generic parser's hierarchical interpretation of an | |||
| interpretation of an identifier. In addition to aiding the | identifier. In addition to aiding the readability of such | |||
| readability of such identifiers through the consistent use of | identifiers through the consistent use of familiar syntax, this | |||
| familiar syntax, this uniform representation of hierarchy across | uniform representation of hierarchy across naming schemes allows | |||
| naming schemes allows scheme-independent references to be made | scheme-independent references to be made relative to that hierarchy. | |||
| relative to that hierarchy. | ||||
| It is often the case that a group or "tree" of documents has been | It is often the case that a group or "tree" of documents has been | |||
| constructed to serve a common purpose, wherein the vast majority of | constructed to serve a common purpose, wherein the vast majority of | |||
| URI references in these documents point to resources within the tree | URI references in these documents point to resources within the tree | |||
| rather than outside of it. Similarly, documents located at a | rather than outside it. Similarly, documents located at a particular | |||
| particular site are much more likely to refer to other resources at | site are much more likely to refer to other resources at that site | |||
| that site than to resources at remote sites. Relative referencing of | than to resources at remote sites. Relative referencing of URIs | |||
| URIs allows document trees to be partially independent of their | allows document trees to be partially independent of their location | |||
| location and access scheme. For instance, it is possible for a | and access scheme. For instance, it is possible for a single set of | |||
| single set of hypertext documents to be simultaneously accessible and | hypertext documents to be simultaneously accessible and traversable | |||
| traversable via each of the "file", "http", and "ftp" schemes if the | via each of the "file", "http", and "ftp" schemes if the documents | |||
| documents refer to each other using relative references. | refer to each other with relative references. Furthermore, such | |||
| Furthermore, such document trees can be moved, as a whole, without | document trees can be moved, as a whole, without changing any of the | |||
| changing any of the relative references. | relative references. | |||
| A relative reference (Section 4.2) refers to a resource by describing | A relative reference (Section 4.2) refers to a resource by describing | |||
| the difference within a hierarchical name space between the reference | the difference within a hierarchical name space between the reference | |||
| context and the target URI. The reference resolution algorithm, | context and the target URI. The reference resolution algorithm, | |||
| presented in Section 5, defines how such a reference is transformed | presented in Section 5, defines how such a reference is transformed | |||
| to the target URI. Since relative references can only be used within | to the target URI. As relative references can only be used within | |||
| the context of a hierarchical URI, designers of new URI schemes | the context of a hierarchical URI, designers of new URI schemes | |||
| should use a syntax consistent with the generic syntax's hierarchical | should use a syntax consistent with the generic syntax's hierarchical | |||
| components unless there are compelling reasons to forbid relative | components unless there are compelling reasons to forbid relative | |||
| referencing within that scheme. | referencing within that scheme. | |||
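As an illustration of scheme-independent relative references, the sketch below resolves one relative reference against base URIs in the "file", "http", and "ftp" schemes. It assumes Python's standard `urllib.parse.urljoin` as a stand-in for the resolution algorithm of Section 5; the base URIs, host names, and the helper `resolve_all` are made up for the example.

```python
from urllib.parse import urljoin

# Hypothetical locations of the same document tree under three schemes.
BASES = [
    "file:///srv/docs/a/index.html",
    "http://www.example.com/docs/a/index.html",
    "ftp://ftp.example.com/docs/a/index.html",
]


def resolve_all(reference):
    """Resolve one relative reference against every base URI in BASES."""
    return [urljoin(base, reference) for base in BASES]


if __name__ == "__main__":
    for target in resolve_all("../b/page.html"):
        print(target)
```

The reference "../b/page.html" never names a scheme or an authority, so the same document text works unchanged wherever the tree is placed.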
| NOTE: Previous specifications used the terms "partial URI" and | NOTE: Previous specifications used the terms "partial URI" and | |||
| "relative URI" to denote a relative reference to a URI. Since | "relative URI" to denote a relative reference to a URI. As some | |||
| some readers misunderstood those terms to mean that relative URIs | readers misunderstood those terms to mean that relative URIs are a | |||
| are a subset of URIs, rather than a method of referencing URIs, | subset of URIs rather than a method of referencing URIs, this | |||
| this specification simply refers to them as relative references. | specification simply refers to them as relative references. | |||
| All URI references are parsed by generic syntax parsers when used. | All URI references are parsed by generic syntax parsers when used. | |||
| However, since hierarchical processing has no effect on an absolute | However, because hierarchical processing has no effect on an absolute | |||
| URI used in a reference unless it contains one or more dot-segments | URI used in a reference unless it contains one or more dot-segments | |||
| (complete path segments of "." or "..", as described in Section 3.3), | (complete path segments of "." or "..", as described in Section 3.3), | |||
| URI scheme specifications can define opaque identifiers by | URI scheme specifications can define opaque identifiers by | |||
| disallowing use of slash characters, question mark characters, and | disallowing use of slash characters, question mark characters, and | |||
| the URIs "scheme:." and "scheme:..". | the URIs "scheme:." and "scheme:..". | |||
| 1.3 Syntax Notation | 1.3. Syntax Notation | |||
| This specification uses the Augmented Backus-Naur Form (ABNF) | This specification uses the Augmented Backus-Naur Form (ABNF) | |||
| notation of [RFC2234], including the following core ABNF syntax rules | notation of [RFC2234], including the following core ABNF syntax rules | |||
| defined by that specification: ALPHA (letters), CR (carriage return), | defined by that specification: ALPHA (letters), CR (carriage return), | |||
| DIGIT (decimal digits), DQUOTE (double quote), HEXDIG (hexadecimal | DIGIT (decimal digits), DQUOTE (double quote), HEXDIG (hexadecimal | |||
| digits), LF (line feed), and SP (space). The complete URI syntax is | digits), LF (line feed), and SP (space). The complete URI syntax is | |||
| collected in Appendix A. | collected in Appendix A. | |||
| 2. Characters | 2. Characters | |||
| The URI syntax provides a method of encoding data, presumably for the | The URI syntax provides a method of encoding data, presumably for the | |||
| sake of identifying a resource, as a sequence of characters. The URI | sake of identifying a resource, as a sequence of characters. The URI | |||
| characters are, in turn, frequently encoded as octets for transport | characters are, in turn, frequently encoded as octets for transport | |||
| or presentation. This specification does not mandate any particular | or presentation. This specification does not mandate any particular | |||
| character encoding for mapping between URI characters and the octets | character encoding for mapping between URI characters and the octets | |||
| used to store or transmit those characters. When a URI appears in a | used to store or transmit those characters. When a URI appears in a | |||
| protocol element, the character encoding is defined by that protocol; | protocol element, the character encoding is defined by that protocol; | |||
| absent such a definition, a URI is assumed to be in the same | without such a definition, a URI is assumed to be in the same | |||
| character encoding as the surrounding text. | character encoding as the surrounding text. | |||
| The ABNF notation defines its terminal values to be non-negative | The ABNF notation defines its terminal values to be non-negative | |||
| integers (codepoints) based on the US-ASCII coded character set | integers (codepoints) based on the US-ASCII coded character set | |||
| [ASCII]. Since a URI is a sequence of characters, we must invert | [ASCII]. Because a URI is a sequence of characters, we must invert | |||
| that relation in order to understand the URI syntax. Therefore, the | that relation in order to understand the URI syntax. Therefore, the | |||
| integer values used by the ABNF must be mapped back to their | integer values used by the ABNF must be mapped back to their | |||
| corresponding characters via US-ASCII in order to complete the syntax | corresponding characters via US-ASCII in order to complete the syntax | |||
| rules. | rules. | |||
| A URI is composed from a limited set of characters consisting of | A URI is composed from a limited set of characters consisting of | |||
| digits, letters, and a few graphic symbols. A reserved subset of | digits, letters, and a few graphic symbols. A reserved subset of | |||
| those characters may be used to delimit syntax components within a | those characters may be used to delimit syntax components within a | |||
| URI, while the remaining characters, including both the unreserved | URI while the remaining characters, including both the unreserved set | |||
| set and those reserved characters not acting as delimiters, define | and those reserved characters not acting as delimiters, define each | |||
| each component's identifying data. | component's identifying data. | |||
| 2.1 Percent-Encoding | 2.1. Percent-Encoding | |||
| A percent-encoding mechanism is used to represent a data octet in a | A percent-encoding mechanism is used to represent a data octet in a | |||
| component when that octet's corresponding character is outside the | component when that octet's corresponding character is outside the | |||
| allowed set or is being used as a delimiter of, or within, the | allowed set or is being used as a delimiter of, or within, the | |||
| component. A percent-encoded octet is encoded as a character | component. A percent-encoded octet is encoded as a character | |||
| triplet, consisting of the percent character "%" followed by the two | triplet, consisting of the percent character "%" followed by the two | |||
| hexadecimal digits representing that octet's numeric value. For | hexadecimal digits representing that octet's numeric value. For | |||
| example, "%20" is the percent-encoding for the binary octet | example, "%20" is the percent-encoding for the binary octet | |||
| "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space | "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space | |||
| character (SP). Section 2.4 describes when percent-encoding and | character (SP). Section 2.4 describes when percent-encoding and | |||
| decoding is applied. | decoding is applied. | |||
| pct-encoded = "%" HEXDIG HEXDIG | pct-encoded = "%" HEXDIG HEXDIG | |||
| The uppercase hexadecimal digits 'A' through 'F' are equivalent to | The uppercase hexadecimal digits 'A' through 'F' are equivalent to | |||
| the lowercase digits 'a' through 'f', respectively. Two URIs that | the lowercase digits 'a' through 'f', respectively. If two URIs | |||
| differ only in the case of hexadecimal digits used in percent-encoded | differ only in the case of hexadecimal digits used in percent-encoded | |||
| octets are equivalent. For consistency, URI producers and | octets, they are equivalent. For consistency, URI producers and | |||
| normalizers should use uppercase hexadecimal digits for all | normalizers should use uppercase hexadecimal digits for all percent- | |||
| percent-encodings. | encodings. | |||
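The triplet form and the case-insensitivity of its hexadecimal digits can be sketched as follows. Both function names are hypothetical; the code only assumes that a percent-encoded octet matches the pct-encoded rule above.

```python
import re

# Matches one pct-encoded triplet: "%" HEXDIG HEXDIG
_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def pct_encode_octet(octet):
    """Encode a single octet (0-255) as "%" plus two uppercase hex digits."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    return "%{:02X}".format(octet)


def uppercase_pct(uri):
    """Uppercase the hex digits of every percent-encoded triplet in a URI.

    Characters outside the triplets are left untouched.
    """
    return _PCT.sub(lambda m: m.group(0).upper(), uri)


if __name__ == "__main__":
    print(pct_encode_octet(0x20))
    print(uppercase_pct("a%2fb%3A"))
```

Two URIs that differ only in the case of these hexadecimal digits become identical strings after `uppercase_pct` is applied to both.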
| 2.2 Reserved Characters | 2.2. Reserved Characters | |||
| URIs include components and subcomponents that are delimited by | URIs include components and subcomponents that are delimited by | |||
| characters in the "reserved" set. These characters are called | characters in the "reserved" set. These characters are called | |||
| "reserved" because they may (or may not) be defined as delimiters by | "reserved" because they may (or may not) be defined as delimiters by | |||
| the generic syntax, by each scheme-specific syntax, or by the | the generic syntax, by each scheme-specific syntax, or by the | |||
| implementation-specific syntax of a URI's dereferencing algorithm. | implementation-specific syntax of a URI's dereferencing algorithm. | |||
| If data for a URI component would conflict with a reserved | If data for a URI component would conflict with a reserved | |||
| character's purpose as a delimiter, then the conflicting data must be | character's purpose as a delimiter, then the conflicting data must be | |||
| percent-encoded before forming the URI. | percent-encoded before the URI is formed. | |||
| reserved = gen-delims / sub-delims | reserved = gen-delims / sub-delims | |||
| gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" | gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" | |||
| sub-delims = "!" / "$" / "&" / "'" / "(" / ")" | sub-delims = "!" / "$" / "&" / "'" / "(" / ")" | |||
| / "*" / "+" / "," / ";" / "=" | / "*" / "+" / "," / ";" / "=" | |||
| The purpose of reserved characters is to provide a set of delimiting | The purpose of reserved characters is to provide a set of delimiting | |||
| characters that are distinguishable from other data within a URI. | characters that are distinguishable from other data within a URI. | |||
| URIs that differ in the replacement of a reserved character with its | URIs that differ in the replacement of a reserved character with its | |||
| corresponding percent-encoded octet are not equivalent. | corresponding percent-encoded octet are not equivalent. Percent- | |||
| Percent-encoding a reserved character, or decoding a percent-encoded | encoding a reserved character, or decoding a percent-encoded octet | |||
| octet that corresponds to a reserved character, will change how the | that corresponds to a reserved character, will change how the URI is | |||
| URI is interpreted by most applications. Thus, characters in the | interpreted by most applications. Thus, characters in the reserved | |||
| reserved set are protected from normalization and are therefore safe | set are protected from normalization and are therefore safe to be | |||
| to be used by scheme-specific and producer-specific algorithms for | used by scheme-specific and producer-specific algorithms for | |||
| delimiting data subcomponents within a URI. | delimiting data subcomponents within a URI. | |||
| A subset of the reserved characters (gen-delims) are used as | A subset of the reserved characters (gen-delims) is used as | |||
| delimiters of the generic URI components described in Section 3. A | delimiters of the generic URI components described in Section 3. A | |||
| component's ABNF syntax rule will not use the reserved or gen-delims | component's ABNF syntax rule will not use the reserved or gen-delims | |||
| rule names directly; instead, each syntax rule lists the characters | rule names directly; instead, each syntax rule lists the characters | |||
| allowed within that component (i.e., not delimiting it) and any of | allowed within that component (i.e., not delimiting it), and any of | |||
| those characters that are also in the reserved set are "reserved" for | those characters that are also in the reserved set are "reserved" for | |||
| use as subcomponent delimiters within the component. Only the most | use as subcomponent delimiters within the component. Only the most | |||
| common subcomponents are defined by this specification; other | common subcomponents are defined by this specification; other | |||
| subcomponents may be defined by a URI scheme's specification, or by | subcomponents may be defined by a URI scheme's specification, or by | |||
| the implementation-specific syntax of a URI's dereferencing | the implementation-specific syntax of a URI's dereferencing | |||
| algorithm, provided that such subcomponents are delimited by | algorithm, provided that such subcomponents are delimited by | |||
| characters in the reserved set allowed within that component. | characters in the reserved set allowed within that component. | |||
| URI producing applications should percent-encode data octets that | URI producing applications should percent-encode data octets that | |||
| correspond to characters in the reserved set. However, if a reserved | correspond to characters in the reserved set unless these characters | |||
| character is found in a URI component and no delimiting role is known | are specifically allowed by the URI scheme to represent data in that | |||
| for that character, then it should be interpreted as representing the | component. If a reserved character is found in a URI component and | |||
| data octet corresponding to that character's encoding in US-ASCII. | no delimiting role is known for that character, then it must be | |||
| interpreted as representing the data octet corresponding to that | ||||
| character's encoding in US-ASCII. | ||||
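A small sketch of why conflicting data is percent-encoded before the URI is formed. It assumes a producer-specific convention, common in practice but not defined by this specification, in which "&" separates fields and "=" separates a name from its value inside the query component. The helper names are hypothetical; `quote` and `unquote` come from Python's standard `urllib.parse`.

```python
from urllib.parse import quote, unquote


def build_query(pairs):
    """Join (name, value) pairs, percent-encoding data that would
    otherwise collide with the "&" and "=" delimiters (an assumed
    producer-specific convention)."""
    return "&".join(
        quote(name, safe="") + "=" + quote(value, safe="")
        for name, value in pairs
    )


def parse_query(query):
    """Split on the delimiters first, then decode each piece."""
    result = []
    for field in query.split("&"):
        name, _, value = field.partition("=")
        result.append((unquote(name), unquote(value)))
    return result


if __name__ == "__main__":
    query = build_query([("q", "a&b=c"), ("lang", "en")])
    print(query)
    print(parse_query(query))
```

Decoding "%26" back to "&" before splitting would change the number of fields, which is why a reserved character and its percent-encoded form are not treated as equivalent.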
| 2.3 Unreserved Characters | 2.3. Unreserved Characters | |||
| Characters that are allowed in a URI but do not have a reserved | Characters that are allowed in a URI but do not have a reserved | |||
| purpose are called unreserved. These include uppercase and lowercase | purpose are called unreserved. These include uppercase and lowercase | |||
| letters, decimal digits, hyphen, period, underscore, and tilde. | letters, decimal digits, hyphen, period, underscore, and tilde. | |||
| unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" | unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" | |||
| URIs that differ in the replacement of an unreserved character with | URIs that differ in the replacement of an unreserved character with | |||
| its corresponding percent-encoded US-ASCII octet are equivalent: they | its corresponding percent-encoded US-ASCII octet are equivalent: they | |||
| identify the same resource. However, URI comparison implementations | identify the same resource. However, URI comparison implementations | |||
| do not always perform normalization prior to comparison Section 6. | do not always perform normalization prior to comparison (see Section | |||
| For consistency, percent-encoded octets in the ranges of ALPHA | 6). For consistency, percent-encoded octets in the ranges of ALPHA | |||
| (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), | (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), | |||
| underscore (%5F), or tilde (%7E) should not be created by URI | underscore (%5F), or tilde (%7E) should not be created by URI | |||
| producers and, when found in a URI, should be decoded to their | producers and, when found in a URI, should be decoded to their | |||
| corresponding unreserved character by URI normalizers. | corresponding unreserved characters by URI normalizers. | |||
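
The normalization rule above can be sketched in Python as follows. This is only an illustration, not part of either document: the helper name `decode_unreserved` is hypothetical, and the sketch assumes the input is already a syntactically valid URI string.

```python
import re

# The unreserved set from the ABNF above: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# A percent-encoded octet: "%" followed by two hexadecimal digits.
PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri):
    """Decode only the percent-encoded octets that map to unreserved characters."""

    def replace(match):
        char = chr(int(match.group(1), 16))
        # Any other triplet is left untouched, since it may be protecting
        # a character that would otherwise act as a delimiter.
        return char if char in UNRESERVED else match.group(0)

    return PCT_TRIPLET.sub(replace, uri)
```

For instance, `decode_unreserved("http://example.com/%7Euser/%41%2Fb")` turns "%7E" and "%41" into "~" and "A" but keeps "%2F", because a slash is reserved.
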
| 2.4 When to Encode or Decode | 2.4. When to Encode or Decode | |||
| Under normal circumstances, the only time that octets within a URI | Under normal circumstances, the only time when octets within a URI | |||
| are percent-encoded is during the process of producing the URI from | are percent-encoded is during the process of producing the URI from | |||
| its component parts. It is during that process that an | its component parts. This is when an implementation determines which | |||
| implementation determines which of the reserved characters are to be | of the reserved characters are to be used as subcomponent delimiters | |||
| used as subcomponent delimiters and which can be safely used as data. | and which can be safely used as data. Once produced, a URI is always | |||
| Once produced, a URI is always in its percent-encoded form. | in its percent-encoded form. | |||
| When a URI is dereferenced, the components and subcomponents | When a URI is dereferenced, the components and subcomponents | |||
| significant to the scheme-specific dereferencing process (if any) | significant to the scheme-specific dereferencing process (if any) | |||
| must be parsed and separated before the percent-encoded octets within | must be parsed and separated before the percent-encoded octets within | |||
| those components can be safely decoded, since otherwise the data may | those components can be safely decoded, as otherwise the data may be | |||
| be mistaken for component delimiters. The only exception is for | mistaken for component delimiters. The only exception is for | |||
| percent-encoded octets corresponding to characters in the unreserved | percent-encoded octets corresponding to characters in the unreserved | |||
| set, which can be decoded at any time. For example, the octet | set, which can be decoded at any time. For example, the octet | |||
| corresponding to the tilde ("~") character is often encoded as "%7E" | corresponding to the tilde ("~") character is often encoded as "%7E" | |||
| by older URI processing implementations; the "%7E" can be replaced by | by older URI processing implementations; the "%7E" can be replaced by | |||
| "~" without changing its interpretation. | "~" without changing its interpretation. | |||
| Because the percent ("%") character serves as the indicator for | Because the percent ("%") character serves as the indicator for | |||
| percent-encoded octets, it must be percent-encoded as "%25" in order | percent-encoded octets, it must be percent-encoded as "%25" for that | |||
| for that octet to be used as data within a URI. Implementations must | octet to be used as data within a URI. Implementations must not | |||
| not percent-encode or decode the same string more than once, since | percent-encode or decode the same string more than once, as decoding | |||
| decoding an already decoded string might lead to misinterpreting a | an already decoded string might lead to misinterpreting a percent | |||
| percent data octet as the beginning of a percent-encoding, or vice | data octet as the beginning of a percent-encoding, or vice versa in | |||
| versa in the case of percent-encoding an already percent-encoded | the case of percent-encoding an already percent-encoded string. | |||
| string. | ||||
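
The hazard of encoding or decoding twice can be demonstrated with Python's standard `urllib.parse` functions. The sample data string is an assumption chosen for illustration: three literal characters that happen to look like a percent-encoding.

```python
from urllib.parse import quote, unquote

# Data that contains a literal percent sign followed by two hex digits.
data = "%41"

# Correct round trip: encode once, decode once.
encoded_once = quote(data, safe="")       # the "%" becomes "%25"
decoded_once = unquote(encoded_once)      # back to the original data

# Decoding a second time misreads the data "%41" as an encoding of "A".
decoded_twice = unquote(decoded_once)

# Encoding a second time escapes the "%" of the existing "%25" triplet,
# so a single decode no longer yields the original data.
encoded_twice = quote(encoded_once, safe="")
```

Here `encoded_once` is "%2541", `decoded_twice` is "A" rather than "%41", and `encoded_twice` is "%252541".
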
| 2.5 Identifying Data | 2.5. Identifying Data | |||
| URI characters provide identifying data for each of the URI | URI characters provide identifying data for each of the URI | |||
| components, serving as an external interface for identification | components, serving as an external interface for identification | |||
| between systems. Although the presence and nature of the URI | between systems. Although the presence and nature of the URI | |||
| production interface is hidden from clients that use its URIs, and | production interface is hidden from clients that use its URIs (and is | |||
| thus beyond the scope of the interoperability requirements defined by | thus beyond the scope of the interoperability requirements defined by | |||
| this specification, it is a frequent source of confusion and errors | this specification), it is a frequent source of confusion and errors | |||
| in the interpretation of URI character issues. Implementers need to | in the interpretation of URI character issues. Implementers have to | |||
| be aware that there are multiple character encodings involved in the | be aware that there are multiple character encodings involved in the | |||
| production and transmission of URIs: local name and data encoding, | production and transmission of URIs: local name and data encoding, | |||
| public interface encoding, URI character encoding, data format | public interface encoding, URI character encoding, data format | |||
| encoding, and protocol encoding. | encoding, and protocol encoding. | |||
| The first encoding of identifying data is the one in which the local | Local names, such as file system names, are stored with a local | |||
| names or data are stored. URI producing applications (a.k.a., origin | character encoding. URI producing applications (e.g., origin | |||
| servers) will typically use the local encoding as the basis for | servers) will typically use the local encoding as the basis for | |||
| producing meaningful names. The URI producer will transform the | producing meaningful names. The URI producer will transform the | |||
| local encoding to one that is suitable for a public interface, and | local encoding to one that is suitable for a public interface and | |||
| then transform the public interface encoding into the restricted set | then transform the public interface encoding into the restricted set | |||
| of URI characters (reserved, unreserved, and percent-encodings). | of URI characters (reserved, unreserved, and percent-encodings). | |||
| Those characters are, in turn, encoded as octets to be used as a | Those characters are, in turn, encoded as octets to be used as a | |||
| reference within a data format (e.g., a document charset), and such | reference within a data format (e.g., a document charset), and such | |||
| data formats are often subsequently encoded for transmission over | data formats are often subsequently encoded for transmission over | |||
| Internet protocols. | Internet protocols. | |||
| For most systems, an unreserved character appearing within a URI | For most systems, an unreserved character appearing within a URI | |||
| component is interpreted as representing the data octet corresponding | component is interpreted as representing the data octet corresponding | |||
| to that character's encoding in US-ASCII. Consumers of URIs assume | to that character's encoding in US-ASCII. Consumers of URIs assume | |||
| that the letter "X" corresponds to the octet "01011000", and there is | that the letter "X" corresponds to the octet "01011000", and even | |||
| no harm in making that assumption even when it is incorrect. A | when that assumption is incorrect, there is no harm in making it. A | |||
| system that internally provides identifiers in the form of a | system that internally provides identifiers in the form of a | |||
| different character encoding, such as EBCDIC, will generally perform | different character encoding, such as EBCDIC, will generally perform | |||
| character translation of textual identifiers to UTF-8 [STD63] (or | character translation of textual identifiers to UTF-8 [STD63] (or | |||
| some other superset of the US-ASCII character encoding) at an | some other superset of the US-ASCII character encoding) at an | |||
| internal interface, thereby providing more meaningful identifiers | internal interface, thereby providing more meaningful identifiers | |||
| than simply percent-encoding the original octets. | than those resulting from simply percent-encoding the original | |||
| octets. | ||||
| For example, consider an information service that provides data, | For example, consider an information service that provides data, | |||
| stored locally using an EBCDIC-based filesystem, to clients on the | stored locally using an EBCDIC-based file system, to clients on the | |||
| Internet through an HTTP server. When an author creates a file on | Internet through an HTTP server. When an author creates a file with | |||
| that filesystem with the name "Laguna Beach", their expectation is | the name "Laguna Beach" on that file system, the "http" URI | |||
| that the "http" URI corresponding to that resource would also contain | corresponding to that resource is expected to contain the meaningful | |||
| the meaningful string "Laguna%20Beach". If, however, that server | string "Laguna%20Beach". If, however, that server produces URIs by | |||
| produces URIs using an overly-simplistic raw octet mapping, then the | using an overly simplistic raw octet mapping, then the result would | |||
| result would be a URI containing | be a URI containing "%D3%81%87%A4%95%81@%C2%85%81%83%88". An | |||
| "%D3%81%87%A4%95%81@%C2%85%81%83%88". An internal transcoding | internal transcoding interface fixes this problem by transcoding the | |||
| interface fixes that problem by transcoding the local name to a | local name to a superset of US-ASCII prior to producing the URI. | |||
| superset of US-ASCII prior to producing the URI. Naturally, proper | Naturally, proper interpretation of an incoming URI on such an | |||
| interpretation of an incoming URI on such an interface requires that | interface requires that percent-encoded octets be decoded (e.g., | |||
| percent-encoded octets be decoded (e.g., "%20" to SP) before the | "%20" to SP) before the reverse transcoding is applied to obtain the | |||
| reverse transcoding is applied to obtain the local name. | local name. | |||
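
The "Laguna Beach" example can be reproduced with a short Python sketch. Two assumptions are made here that the text does not state: the EBCDIC code page is taken to be Python's `cp037` codec, and the "@" character (octet 0x40, the EBCDIC space) is left unencoded so that the output matches the string shown above.

```python
from urllib.parse import quote, unquote_to_bytes

local_name = "Laguna Beach"

# Octets as an EBCDIC file system might store them (cp037 is an assumption).
ebcdic_octets = local_name.encode("cp037")

# Overly simplistic mapping: percent-encode the raw local octets directly.
naive = quote(ebcdic_octets, safe="@")

# Transcoding interface: convert to a superset of US-ASCII (UTF-8 here)
# before percent-encoding.
transcoded = quote(local_name.encode("utf-8"), safe="")

# Incoming direction: decode the percent-encodings first, then apply the
# reverse transcoding to obtain the local name.
recovered = unquote_to_bytes(transcoded).decode("utf-8").encode("cp037")
```

With these assumptions, `naive` is "%D3%81%87%A4%95%81@%C2%85%81%83%88", `transcoded` is "Laguna%20Beach", and `recovered` equals the original EBCDIC octets.
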
| In some cases, the internal interface between a URI component and the | In some cases, the internal interface between a URI component and the | |||
| identifying data that it has been crafted to represent is much less | identifying data that it has been crafted to represent is much less | |||
| direct than a character encoding translation. For example, portions | direct than a character encoding translation. For example, portions | |||
| of a URI might reflect a query on non-ASCII data, numeric coordinates | of a URI might reflect a query on non-ASCII data, or numeric | |||
| on a map, etc. Likewise, a URI scheme may define components with | coordinates on a map. Likewise, a URI scheme may define components | |||
| additional encoding requirements that are applied prior to forming | with additional encoding requirements that are applied prior to | |||
| the component and producing the URI. | forming the component and producing the URI. | |||
| When a new URI scheme defines a component that represents textual | When a new URI scheme defines a component that represents textual | |||
| data consisting of characters from the Unicode character set [UCS], | data consisting of characters from the Universal Character Set [UCS], | |||
| the data should be encoded first as octets according to the UTF-8 | the data should first be encoded as octets according to the UTF-8 | |||
| character encoding [STD63], and then only those octets that do not | character encoding [STD63]; then only those octets that do not | |||
| correspond to characters in the unreserved set should be | correspond to characters in the unreserved set should be percent- | |||
| percent-encoded. For example, the character A would be represented | encoded. For example, the character A would be represented as "A", | |||
| as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be | the character LATIN CAPITAL LETTER A WITH GRAVE would be represented | |||
| represented as "%C3%80", and the character KATAKANA LETTER A would be | as "%C3%80", and the character KATAKANA LETTER A would be represented | |||
| represented as "%E3%82%A2". | as "%E3%82%A2". | |||
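
The two-step rule above (UTF-8 first, then percent-encode everything outside the unreserved set) can be sketched with `urllib.parse.quote`. The function name `encode_text` is hypothetical; passing `safe=""` is what restricts the unencoded characters to letters, digits, "-", ".", "_", and "~" (the tilde is treated this way by `quote` in Python 3.7 and later).

```python
from urllib.parse import quote


def encode_text(text):
    """Encode text as UTF-8 octets, then percent-encode all but unreserved ones."""
    return quote(text, safe="", encoding="utf-8")


examples = {
    "A": encode_text("A"),                # LATIN CAPITAL LETTER A
    "A WITH GRAVE": encode_text("\u00C0"),  # LATIN CAPITAL LETTER A WITH GRAVE
    "KATAKANA A": encode_text("\u30A2"),    # KATAKANA LETTER A
}
```

The three results are "A", "%C3%80", and "%E3%82%A2", matching the examples in the paragraph above.
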
| 3. Syntax Components | 3. Syntax Components | |||
| The generic URI syntax consists of a hierarchical sequence of | The generic URI syntax consists of a hierarchical sequence of | |||
| components referred to as the scheme, authority, path, query, and | components referred to as the scheme, authority, path, query, and | |||
| fragment. | fragment. | |||
| URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] | URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] | |||
| hier-part = "//" authority path-abempty | hier-part = "//" authority path-abempty | |||
| / path-absolute | / path-absolute | |||
| / path-rootless | / path-rootless | |||
| / path-empty | / path-empty | |||
| The scheme and path components are required, though path may be empty | The scheme and path components are required, though the path may be | |||
| (no characters). When authority is present, the path must either be | empty (no characters). When authority is present, the path must | |||
| empty or begin with a slash ("/") character. When authority is not | either be empty or begin with a slash ("/") character. When | |||
| present, the path cannot begin with two slash characters ("//"). | authority is not present, the path cannot begin with two slash | |||
| These restrictions result in five different ABNF rules for a path | characters ("//"). These restrictions result in five different ABNF | |||
| (Section 3.3), only one of which will match any given URI reference. | rules for a path (Section 3.3), only one of which will match any | |||
| given URI reference. | ||||
| The following are two example URIs and their component parts: | The following are two example URIs and their component parts: | |||
| foo://example.com:8042/over/there?name=ferret#nose | foo://example.com:8042/over/there?name=ferret#nose | |||
| \_/ \______________/\_________/ \_________/ \__/ | \_/ \______________/\_________/ \_________/ \__/ | |||
| | | | | | | | | | | | | |||
| scheme authority path query fragment | scheme authority path query fragment | |||
| | _____________________|__ | | _____________________|__ | |||
| / \ / \ | / \ / \ | |||
| urn:example:animal:ferret:nose | urn:example:animal:ferret:nose | |||
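
The decomposition of the two example URIs can be checked with a small Python sketch. It uses the non-validating regular expression given in Appendix B of this specification; the function name `split_uri` and the dictionary layout are assumptions made for illustration.

```python
import re

# Regular expression from Appendix B; every group is optional, so it
# matches any string and only separates the five major components.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    """Split a URI reference into its five generic components."""
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),   # None when no "//" is present
        "path": m.group(5),        # always defined, possibly empty
        "query": m.group(7),       # None when no "?" is present
        "fragment": m.group(9),    # None when no "#" is present
    }


first = split_uri("foo://example.com:8042/over/there?name=ferret#nose")
second = split_uri("urn:example:animal:ferret:nose")
```

For the "urn" example the authority, query, and fragment are undefined (`None`), and the whole remainder after the scheme is the path.
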
| 3.1 Scheme | 3.1. Scheme | |||
| Each URI begins with a scheme name that refers to a specification for | Each URI begins with a scheme name that refers to a specification for | |||
| assigning identifiers within that scheme. As such, the URI syntax is | assigning identifiers within that scheme. As such, the URI syntax is | |||
| a federated and extensible naming system wherein each scheme's | a federated and extensible naming system wherein each scheme's | |||
| specification may further restrict the syntax and semantics of | specification may further restrict the syntax and semantics of | |||
| identifiers using that scheme. | identifiers using that scheme. | |||
| Scheme names consist of a sequence of characters beginning with a | Scheme names consist of a sequence of characters beginning with a | |||
| letter and followed by any combination of letters, digits, plus | letter and followed by any combination of letters, digits, plus | |||
| ("+"), period ("."), or hyphen ("-"). Although scheme is | ("+"), period ("."), or hyphen ("-"). Although schemes are case- | |||
| case-insensitive, the canonical form is lowercase and documents that | insensitive, the canonical form is lowercase and documents that | |||
| specify schemes must do so using lowercase letters. An | specify schemes must do so with lowercase letters. An implementation | |||
| implementation should accept uppercase letters as equivalent to | should accept uppercase letters as equivalent to lowercase in scheme | |||
| lowercase in scheme names (e.g., allow "HTTP" as well as "http"), for | names (e.g., allow "HTTP" as well as "http") for the sake of | |||
| the sake of robustness, but should only produce lowercase scheme | robustness but should only produce lowercase scheme names for | |||
| names, for consistency. | consistency. | |||
| scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | |||
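
A minimal Python sketch of the scheme rule follows: accept either case on input, reject names that do not match the ABNF, and produce lowercase. The name `normalize_scheme` is hypothetical.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def normalize_scheme(scheme):
    """Validate a scheme name against the ABNF and return its lowercase form."""
    if not SCHEME_RE.fullmatch(scheme):
        raise ValueError("not a valid scheme name: %r" % scheme)
    # Uppercase is accepted for robustness; lowercase is produced for consistency.
    return scheme.lower()
```

For example, `normalize_scheme("HTTP")` returns "http", while `normalize_scheme("1http")` raises `ValueError` because a scheme must begin with a letter.
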
| Individual schemes are not specified by this document. The process | Individual schemes are not specified by this document. The process | |||
| for registration of new URI schemes is defined separately by [BCP35]. | for registration of new URI schemes is defined separately by [BCP35]. | |||
| The scheme registry maintains the mapping between scheme names and | The scheme registry maintains the mapping between scheme names and | |||
| their specifications. Advice for designers of new URI schemes can be | their specifications. Advice for designers of new URI schemes can be | |||
| found in [RFC2718]. URI scheme specifications must define their own | found in [RFC2718]. URI scheme specifications must define their own | |||
| syntax such that all strings matching their scheme-specific syntax | syntax so that all strings matching their scheme-specific syntax will | |||
| will also match the <absolute-URI> grammar, as described in | also match the <absolute-URI> grammar, as described in Section 4.3. | |||
| Section 4.3. | ||||
| When presented with a URI that violates one or more scheme-specific | When presented with a URI that violates one or more scheme-specific | |||
| restrictions, the scheme-specific resolution process should flag the | restrictions, the scheme-specific resolution process should flag the | |||
| reference as an error rather than ignore the unused parts; doing so | reference as an error rather than ignore the unused parts; doing so | |||
| reduces the number of equivalent URIs and helps detect abuses of the | reduces the number of equivalent URIs and helps detect abuses of the | |||
| generic syntax that might indicate the URI has been constructed to | generic syntax, which might indicate that the URI has been | |||
| mislead the user (Section 7.6). | constructed to mislead the user (Section 7.6). | |||
| 3.2 Authority | 3.2. Authority | |||
| Many URI schemes include a hierarchical element for a naming | Many URI schemes include a hierarchical element for a naming | |||
| authority, such that governance of the name space defined by the | authority so that governance of the name space defined by the | |||
| remainder of the URI is delegated to that authority (which may, in | remainder of the URI is delegated to that authority (which may, in | |||
| turn, delegate it further). The generic syntax provides a common | turn, delegate it further). The generic syntax provides a common | |||
| means for distinguishing an authority based on a registered name or | means for distinguishing an authority based on a registered name or | |||
| server address, along with optional port and user information. | server address, along with optional port and user information. | |||
| The authority component is preceded by a double slash ("//") and is | The authority component is preceded by a double slash ("//") and is | |||
| terminated by the next slash ("/"), question mark ("?"), or number | terminated by the next slash ("/"), question mark ("?"), or number | |||
| sign ("#") character, or by the end of the URI. | sign ("#") character, or by the end of the URI. | |||
| authority = [ userinfo "@" ] host [ ":" port ] | authority = [ userinfo "@" ] host [ ":" port ] | |||
| URI producers and normalizers should omit the ":" delimiter that | URI producers and normalizers should omit the ":" delimiter that | |||
| separates host from port if the port component is empty. Some | separates host from port if the port component is empty. Some | |||
| schemes do not allow the userinfo and/or port subcomponents. | schemes do not allow the userinfo and/or port subcomponents. | |||
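
The authority rule can be illustrated with a non-validating Python sketch that separates userinfo, host, and port, and that omits the ":" when rebuilding an authority whose port is empty. Both function names are hypothetical, and the subcomponents themselves are not checked against their ABNF rules.

```python
def split_authority(authority):
    """Split an authority string into (userinfo, host, port); absent parts are None."""
    userinfo, at, hostport = authority.rpartition("@")
    if not at:
        userinfo = None
    if hostport.startswith("["):
        # IP-literal: colons inside the brackets are not port delimiters.
        end = hostport.find("]")
        if end == -1:
            raise ValueError("unterminated IP-literal")
        host, rest = hostport[: end + 1], hostport[end + 1:]
        if rest and not rest.startswith(":"):
            raise ValueError("unexpected characters after IP-literal")
        port = rest[1:] if rest else None
    else:
        host, colon, port = hostport.partition(":")
        if not colon:
            port = None
    return userinfo, host, port


def join_authority(userinfo, host, port):
    """Rebuild an authority string, omitting ":" when the port is empty or absent."""
    result = host
    if userinfo is not None:
        result = userinfo + "@" + result
    if port:
        result += ":" + port
    return result
```

For example, `join_authority(*split_authority("example.com:"))` yields "example.com", dropping the dangling colon as the text recommends.
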
| If a URI contains an authority component, then the path component | If a URI contains an authority component, then the path component | |||
| must either be empty or begin with a slash ("/") character. | must either be empty or begin with a slash ("/") character. Non- | |||
| Non-validating parsers (those that merely separate a URI reference | validating parsers (those that merely separate a URI reference into | |||
| into its major components) will often ignore the subcomponent | its major components) will often ignore the subcomponent structure of | |||
| structure of authority, treating it as an opaque string from the | authority, treating it as an opaque string from the double-slash to | |||
| double-slash to the first terminating delimiter, until such time as | the first terminating delimiter, until such time as the URI is | |||
| the URI is dereferenced. | dereferenced. | |||
| 3.2.1 User Information | 3.2.1. User Information | |||
| The userinfo subcomponent may consist of a user name and, optionally, | The userinfo subcomponent may consist of a user name and, optionally, | |||
| scheme-specific information about how to gain authorization to access | scheme-specific information about how to gain authorization to access | |||
| the resource. The user information, if present, is followed by a | the resource. The user information, if present, is followed by a | |||
| commercial at-sign ("@") that delimits it from the host. | commercial at-sign ("@") that delimits it from the host. | |||
| userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) | userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) | |||
| Use of the format "user:password" in the userinfo field is | Use of the format "user:password" in the userinfo field is | |||
| deprecated. Applications should not render as clear text any data | deprecated. Applications should not render as clear text any data | |||
| after the first colon (":") character found within a userinfo | after the first colon (":") character found within a userinfo | |||
| subcomponent unless the data after the colon is the empty string | subcomponent unless the data after the colon is the empty string | |||
| (indicating no password). Applications may choose to ignore or | (indicating no password). Applications may choose to ignore or | |||
| reject such data when received as part of a reference, and should | reject such data when it is received as part of a reference and | |||
| reject the storage of such data in unencrypted form. The passing of | should reject the storage of such data in unencrypted form. The | |||
| authentication information in clear text has proven to be a security | passing of authentication information in clear text has proven to be | |||
| risk in almost every case where it has been used. | a security risk in almost every case where it has been used. | |||
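
One way an application might follow the rendering advice above is sketched below in Python. The function name `render_userinfo` and the choice of a fixed-width mask (rather than dropping the data entirely) are assumptions; the text only requires that data after the first colon not be shown as clear text unless it is empty.

```python
MASK = "********"  # fixed width, so the length of the hidden data is not revealed


def render_userinfo(userinfo):
    """Return a display form of userinfo that hides data after the first colon."""
    user, colon, secret = userinfo.partition(":")
    if colon and secret:
        return user + ":" + MASK
    # No colon, or an empty string after it (no password): nothing to hide.
    return userinfo
```

With this sketch, "alice:s3cret" is displayed as "alice:********", while "alice:" and "alice" are displayed unchanged.
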
| Applications that render a URI for the sake of user feedback, such as | Applications that render a URI for the sake of user feedback, such as | |||
| in graphical hypertext browsing, should render userinfo in a way that | in graphical hypertext browsing, should render userinfo in a way that | |||
| is distinguished from the rest of a URI, when feasible. Such | is distinguished from the rest of a URI, when feasible. Such | |||
| rendering will assist the user in cases where the userinfo has been | rendering will assist the user in cases where the userinfo has been | |||
| misleadingly crafted to look like a trusted domain name | misleadingly crafted to look like a trusted domain name | |||
| (Section 7.6). | (Section 7.6). | |||
| 3.2.2 Host | 3.2.2. Host | |||
| The host subcomponent of authority is identified by an IP literal | The host subcomponent of authority is identified by an IP literal | |||
| encapsulated within square brackets, an IPv4 address in | encapsulated within square brackets, an IPv4 address in dotted- | |||
| dotted-decimal form, or a registered name. The host subcomponent is | decimal form, or a registered name. The host subcomponent is case- | |||
| case-insensitive. The presence of a host subcomponent within a URI | insensitive. The presence of a host subcomponent within a URI does | |||
| does not imply that the scheme requires access to the given host on | not imply that the scheme requires access to the given host on the | |||
| the Internet. In many cases, the host syntax is used only for the | Internet. In many cases, the host syntax is used only for the sake | |||
| sake of reusing the existing registration process created and | of reusing the existing registration process created and deployed for | |||
| deployed for DNS, thus obtaining a globally unique name without the | DNS, thus obtaining a globally unique name without the cost of | |||
| cost of deploying another registry. However, such use comes with its | deploying another registry. However, such use comes with its own | |||
| own costs: domain name ownership may change over time for reasons not | costs: domain name ownership may change over time for reasons not | |||
| anticipated by the URI producer. In other cases, the data within the | anticipated by the URI producer. In other cases, the data within the | |||
| host component identifies a registered name that has nothing to do | host component identifies a registered name that has nothing to do | |||
| with an Internet host. We use the name "host" for the ABNF rule | with an Internet host. We use the name "host" for the ABNF rule | |||
| because that is its most common purpose, not its only purpose, and | because that is its most common purpose, not its only purpose. | |||
| thus should not be considered as semantically limiting the data | ||||
| within it. | ||||
| host = IP-literal / IPv4address / reg-name | host = IP-literal / IPv4address / reg-name | |||
| The syntax rule for host is ambiguous because it does not completely | The syntax rule for host is ambiguous because it does not completely | |||
| distinguish between an IPv4address and a reg-name. In order to | distinguish between an IPv4address and a reg-name. In order to | |||
| disambiguate the syntax, we apply the "first-match-wins" algorithm: | disambiguate the syntax, we apply the "first-match-wins" algorithm: | |||
| If host matches the rule for IPv4address, then it should be | If host matches the rule for IPv4address, then it should be | |||
| considered an IPv4 address literal and not a reg-name. Although host | considered an IPv4 address literal and not a reg-name. Although host | |||
| is case-insensitive, producers and normalizers should use lowercase | is case-insensitive, producers and normalizers should use lowercase | |||
| for registered names and hexadecimal addresses for the sake of | for registered names and hexadecimal addresses for the sake of | |||
| uniformity, while only using uppercase letters for percent-encodings. | uniformity, while only using uppercase letters for percent-encodings. | |||
| A host identified by an Internet Protocol literal address, version 6 | A host identified by an Internet Protocol literal address, version 6 | |||
| [RFC3513] or later, is distinguished by enclosing the IP literal | [RFC3513] or later, is distinguished by enclosing the IP literal | |||
| within square brackets ("[" and "]"). This is the only place where | within square brackets ("[" and "]"). This is the only place where | |||
| square bracket characters are allowed in the URI syntax. In | square bracket characters are allowed in the URI syntax. In | |||
| anticipation of future, as-yet-undefined IP literal address formats, | anticipation of future, as-yet-undefined IP literal address formats, | |||
| an optional version flag may be used to indicate such a format | an implementation may use an optional version flag to indicate such a | |||
| explicitly rather than relying on heuristic determination. | format explicitly rather than rely on heuristic determination. | |||
| IP-literal = "[" ( IPv6address / IPvFuture ) "]" | IP-literal = "[" ( IPv6address / IPvFuture ) "]" | |||
| IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) | IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) | |||
| The version flag does not indicate the IP version; rather, it | The version flag does not indicate the IP version; rather, it | |||
| indicates future versions of the literal format. As such, | indicates future versions of the literal format. As such, | |||
| implementations must not provide the version flag for existing IPv4 | implementations must not provide the version flag for the existing | |||
| and IPv6 literal addresses. If a URI containing an IP-literal that | IPv4 and IPv6 literal address forms described below. If a URI | |||
| starts with "v" (case-insensitive), indicating that the version flag | containing an IP-literal that starts with "v" (case-insensitive), | |||
| is present, is dereferenced by an application that does not know the | indicating that the version flag is present, is dereferenced by an | |||
| meaning of that version flag, then the application should return an | application that does not know the meaning of that version flag, then | |||
| appropriate error for "address mechanism not supported". | the application should return an appropriate error for "address | |||
| mechanism not supported". | ||||
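
The handling of the version flag can be sketched in Python as follows. The exception class, the function name `check_ip_literal`, and the `known_versions` parameter are all hypothetical. The character class is a translation of the IPvFuture rule, using the sub-delims set defined elsewhere in this specification. An IPv6 address is assumed rather than validated when no version flag is present.

```python
import re

# IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
IPVFUTURE_RE = re.compile(r"[vV]([0-9A-Fa-f]+)\.[A-Za-z0-9._~!$&'()*+,;=:-]+")


class AddressMechanismNotSupported(ValueError):
    """Raised when an IP-literal carries a version flag the application does not know."""


def check_ip_literal(host, known_versions=frozenset()):
    """Classify a bracketed IP-literal as "IPv6address" or "IPvFuture"."""
    if not (host.startswith("[") and host.endswith("]")):
        raise ValueError("not an IP-literal")
    literal = host[1:-1]
    if literal[:1] in ("v", "V"):
        match = IPVFUTURE_RE.fullmatch(literal)
        if match is None:
            raise ValueError("malformed IPvFuture literal")
        version = match.group(1).lower()
        if version not in known_versions:
            raise AddressMechanismNotSupported("address mechanism not supported")
        return "IPvFuture"
    # No version flag: treated as an IPv6 address (not validated in this sketch).
    return "IPv6address"
```

An application that knows no future formats would call `check_ip_literal("[v7.fe80::1]")` and receive the "address mechanism not supported" error, while "[2001:db8::7]" passes through as an IPv6 literal.
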
| A host identified by an IPv6 literal address is represented inside | A host identified by an IPv6 literal address is represented inside | |||
| the square brackets without a preceding version flag. The ABNF | the square brackets without a preceding version flag. The ABNF | |||
| provided here is a translation of the text definition of an IPv6 | provided here is a translation of the text definition of an IPv6 | |||
| literal address provided in [RFC3513]. A 128-bit IPv6 address is | literal address provided in [RFC3513]. This syntax does not support | |||
| divided into eight 16-bit pieces. Each piece is represented | IPv6 scoped addressing zone identifiers. | |||
| numerically in case-insensitive hexadecimal, using one to four | ||||
| hexadecimal digits (leading zeroes are permitted). The eight encoded | A 128-bit IPv6 address is divided into eight 16-bit pieces. Each | |||
| pieces are given most-significant first, separated by colon | piece is represented numerically in case-insensitive hexadecimal, | |||
| characters. Optionally, the least-significant two pieces may instead | using one to four hexadecimal digits (leading zeroes are permitted). | |||
| be represented in IPv4 address textual format. A sequence of one or | The eight encoded pieces are given most-significant first, separated | |||
| more consecutive zero-valued 16-bit pieces within the address may be | by colon characters. Optionally, the least-significant two pieces | |||
| elided, omitting all their digits and leaving exactly two consecutive | may instead be represented in IPv4 address textual format. A | |||
| colons in their place to mark the elision. | sequence of one or more consecutive zero-valued 16-bit pieces within | |||
| the address may be elided, omitting all their digits and leaving | ||||
| exactly two consecutive colons in their place to mark the elision. | ||||
| IPv6address = 6( h16 ":" ) ls32 | IPv6address = 6( h16 ":" ) ls32 | |||
| / "::" 5( h16 ":" ) ls32 | / "::" 5( h16 ":" ) ls32 | |||
| / [ h16 ] "::" 4( h16 ":" ) ls32 | / [ h16 ] "::" 4( h16 ":" ) ls32 | |||
| / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 | / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 | |||
| / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 | / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 | |||
| / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 | / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 | |||
| / [ *4( h16 ":" ) h16 ] "::" ls32 | / [ *4( h16 ":" ) h16 ] "::" ls32 | |||
| / [ *5( h16 ":" ) h16 ] "::" h16 | / [ *5( h16 ":" ) h16 ] "::" h16 | |||
| / [ *6( h16 ":" ) h16 ] "::" | / [ *6( h16 ":" ) h16 ] "::" | |||
| skipping to change at page 20, line 38 ¶ | skipping to change at page 20, line 48 ¶ | |||
| IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | |||
| dec-octet = DIGIT ; 0-9 | dec-octet = DIGIT ; 0-9 | |||
| / %x31-39 DIGIT ; 10-99 | / %x31-39 DIGIT ; 10-99 | |||
| / "1" 2DIGIT ; 100-199 | / "1" 2DIGIT ; 100-199 | |||
| / "2" %x30-34 DIGIT ; 200-249 | / "2" %x30-34 DIGIT ; 200-249 | |||
| / "25" %x30-35 ; 250-255 | / "25" %x30-35 ; 250-255 | |||
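As an illustration only, the dec-octet and IPv4address rules above can be transcribed into a regular expression. The following is a minimal sketch; the names DEC_OCTET, IPV4ADDRESS, and is_ipv4address are assumptions made for this example and do not appear in the specification.

```python
import re

# dec-octet alternatives from the ABNF, listed longest first so that the
# regex engine does not settle for a shorter alternative too early.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")


def is_ipv4address(text: str) -> bool:
    """Return True if the whole string matches the IPv4address rule."""
    return IPV4ADDRESS.fullmatch(text) is not None
```

Note that a leading zero such as "01" is rejected, because no dec-octet alternative allows a two-digit value that begins with "0".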
| A host identified by a registered name is a sequence of characters | A host identified by a registered name is a sequence of characters | |||
| that is usually intended for lookup within a locally-defined host or | usually intended for lookup within a locally defined host or service | |||
| service name registry, though the URI's scheme-specific semantics may | name registry, though the URI's scheme-specific semantics may require | |||
| require that a specific registry (or fixed name table) be used | that a specific registry (or fixed name table) be used instead. The | |||
| instead. The most common name registry mechanism is the Domain Name | most common name registry mechanism is the Domain Name System (DNS). | |||
| System (DNS). A registered name intended for lookup in the DNS uses | A registered name intended for lookup in the DNS uses the syntax | |||
| the syntax defined in Section 3.5 of [RFC1034] and Section 2.1 of | defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123]. | |||
| [RFC1123]. Such a name consists of a sequence of domain labels | Such a name consists of a sequence of domain labels separated by ".", | |||
| separated by ".", each domain label starting and ending with an | each domain label starting and ending with an alphanumeric character | |||
| alphanumeric character and possibly also containing "-" characters. | and possibly also containing "-" characters. The rightmost domain | |||
| The rightmost domain label of a fully qualified domain name in DNS | label of a fully qualified domain name in DNS may be followed by a | |||
| may be followed by a single "." and should be followed by one if it | single "." and should be if it is necessary to distinguish between | |||
| is necessary to distinguish between the complete domain name and some | the complete domain name and some local domain. | |||
| local domain. | ||||
| reg-name = *( unreserved / pct-encoded / sub-delims ) | reg-name = *( unreserved / pct-encoded / sub-delims ) | |||
| If the URI scheme defines a default for host, then that default | If the URI scheme defines a default for host, then that default | |||
| applies when the host subcomponent is undefined or when the | applies when the host subcomponent is undefined or when the | |||
| registered name is empty (zero length). For example, the "file" URI | registered name is empty (zero length). For example, the "file" URI | |||
| scheme is defined such that no authority, an empty host, and | scheme is defined so that no authority, an empty host, and | |||
| "localhost" all mean the end-user's machine, whereas the "http" | "localhost" all mean the end-user's machine, whereas the "http" | |||
| scheme considers a missing authority or empty host to be invalid. | scheme considers a missing authority or empty host invalid. | |||
| This specification does not mandate a particular registered name | This specification does not mandate a particular registered name | |||
| lookup technology and therefore does not restrict the syntax of | lookup technology and therefore does not restrict the syntax of reg- | |||
| reg-name beyond that necessary for interoperability. Instead, it | name beyond what is necessary for interoperability. Instead, it | |||
| delegates the issue of registered name syntax conformance to the | delegates the issue of registered name syntax conformance to the | |||
| operating system of each application performing URI resolution, and | operating system of each application performing URI resolution, and | |||
| that operating system decides what it will allow for the purpose of | that operating system decides what it will allow for the purpose of | |||
| host identification. A URI resolution implementation might use DNS, | host identification. A URI resolution implementation might use DNS, | |||
| host tables, yellow pages, NetInfo, WINS, or any other system for | host tables, yellow pages, NetInfo, WINS, or any other system for | |||
| lookup of registered names. However, a globally-scoped naming | lookup of registered names. However, a globally scoped naming | |||
| system, such as DNS fully-qualified domain names, is necessary for | system, such as DNS fully qualified domain names, is necessary for | |||
| URIs that are intended to have global scope. URI producers should | URIs intended to have global scope. URI producers should use names | |||
| use names that conform to the DNS syntax, even when use of DNS is not | that conform to the DNS syntax, even when use of DNS is not | |||
| immediately apparent, and should limit such names to no more than 255 | immediately apparent, and should limit these names to no more than | |||
| characters in length. | 255 characters in length. | |||
| The reg-name syntax allows percent-encoded octets in order to | The reg-name syntax allows percent-encoded octets in order to | |||
| represent non-ASCII registered names in a uniform way that is | represent non-ASCII registered names in a uniform way that is | |||
| independent of the underlying name resolution technology; such | independent of the underlying name resolution technology. Non-ASCII | |||
| non-ASCII characters must first be encoded according to UTF-8 [STD63] | characters must first be encoded according to UTF-8 [STD63], and then | |||
| and then each octet of the corresponding UTF-8 sequence must be | each octet of the corresponding UTF-8 sequence must be percent- | |||
| percent-encoded to be represented as URI characters. URI producing | encoded to be represented as URI characters. URI producing | |||
| applications must not use percent-encoding in host unless it is used | applications must not use percent-encoding in host unless it is used | |||
| to represent a UTF-8 character sequence. When a non-ASCII registered | to represent a UTF-8 character sequence. When a non-ASCII registered | |||
| name represents an internationalized domain name intended for | name represents an internationalized domain name intended for | |||
| resolution via the DNS, the name must be transformed to the IDNA | resolution via the DNS, the name must be transformed to the IDNA | |||
| encoding [RFC3490] prior to name lookup. URI producers should | encoding [RFC3490] prior to name lookup. URI producers should | |||
| provide such registered names in the IDNA encoding, rather than a | provide these registered names in the IDNA encoding, rather than a | |||
| percent-encoding, if they wish to maximize interoperability with | percent-encoding, if they wish to maximize interoperability with | |||
| legacy URI resolvers. | legacy URI resolvers. | |||
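The two representations of a non-ASCII registered name described above can be sketched as follows. This is a hedged example rather than a reference implementation: the function names are hypothetical, the sub-delims set is copied from the generic syntax, and the IDNA step relies on Python's built-in "idna" codec, which implements the RFC 3490 encoding.

```python
from urllib.parse import quote

# sub-delims are allowed unencoded in reg-name; unreserved characters
# (ALPHA, DIGIT, "-", ".", "_", "~") are never encoded by quote().
SUB_DELIMS = "!$&'()*+,;="


def percent_encode_reg_name(name: str) -> str:
    """Encode non-ASCII characters as UTF-8, then percent-encode each octet."""
    return quote(name, safe=SUB_DELIMS, encoding="utf-8")


def to_idna(name: str) -> str:
    """Transform a DNS name to its IDNA (RFC 3490) encoding, label by label."""
    return name.encode("idna").decode("ascii")
```

For the name "bücher.example", the first function yields "b%C3%BCcher.example" and the second yields "xn--bcher-kva.example". A producer that wishes to interoperate with legacy resolvers would emit the second form.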
| 3.2.3 Port | 3.2.3. Port | |||
| The port subcomponent of authority is designated by an optional port | The port subcomponent of authority is designated by an optional port | |||
| number in decimal following the host and delimited from it by a | number in decimal following the host and delimited from it by a | |||
| single colon (":") character. | single colon (":") character. | |||
| port = *DIGIT | port = *DIGIT | |||
| A scheme may define a default port. For example, the "http" scheme | A scheme may define a default port. For example, the "http" scheme | |||
| defines a default port of "80", corresponding to its reserved TCP | defines a default port of "80", corresponding to its reserved TCP | |||
| port number. The type of port designated by the port number (e.g., | port number. The type of port designated by the port number (e.g., | |||
| TCP, UDP, SCTP, etc.) is defined by the URI scheme. URI producers | TCP, UDP, SCTP) is defined by the URI scheme. URI producers and | |||
| and normalizers should omit the port component and its ":" delimiter | normalizers should omit the port component and its ":" delimiter if | |||
| if port is empty or its value would be the same as the scheme's | port is empty or if its value would be the same as that of the | |||
| default. | scheme's default. | |||
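The recommendation to omit an empty or default port might be applied as in the sketch below. The DEFAULT_PORTS table and the format_authority function are assumptions made for this example; only the "http" default of 80 is taken from the text, and a real normalizer would need the defaults of every scheme it handles.

```python
# Illustrative table only: the text names 80 as the default for "http".
DEFAULT_PORTS = {"http": 80}


def format_authority(scheme: str, host: str, port: str) -> str:
    """Join host and port, omitting the port if empty or the scheme default."""
    # port = *DIGIT, so anything other than ASCII digits is a syntax error.
    if any(ch not in "0123456789" for ch in port):
        raise ValueError(f"port must contain only digits: {port!r}")
    default = DEFAULT_PORTS.get(scheme.lower())
    if port == "" or (default is not None and int(port) == default):
        return host  # drop both the port and its ":" delimiter
    return f"{host}:{port}"
```

The comparison is numeric, so a port written as "080" is also treated as the "http" default in this sketch.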
| 3.3 Path | 3.3. Path | |||
| The path component contains data, usually organized in hierarchical | The path component contains data, usually organized in hierarchical | |||
| form, that, along with data in the non-hierarchical query component | form, that, along with data in the non-hierarchical query component | |||
| (Section 3.4), serves to identify a resource within the scope of the | (Section 3.4), serves to identify a resource within the scope of the | |||
| URI's scheme and naming authority (if any). The path is terminated | URI's scheme and naming authority (if any). The path is terminated | |||
| by the first question mark ("?") or number sign ("#") character, or | by the first question mark ("?") or number sign ("#") character, or | |||
| by the end of the URI. | by the end of the URI. | |||
| If a URI contains an authority component, then the path component | If a URI contains an authority component, then the path component | |||
| must either be empty or begin with a slash ("/") character. If a URI | must either be empty or begin with a slash ("/") character. If a URI | |||
| skipping to change at page 23, line 9 ¶ | skipping to change at page 23, line 22 ¶ | |||
| ("/") character. A path is always defined for a URI, though the | ("/") character. A path is always defined for a URI, though the | |||
| defined path may be empty (zero length). Use of the slash character | defined path may be empty (zero length). Use of the slash character | |||
| to indicate hierarchy is only required when a URI will be used as the | to indicate hierarchy is only required when a URI will be used as the | |||
| context for relative references. For example, the URI | context for relative references. For example, the URI | |||
| <mailto:fred@example.com> has a path of "fred@example.com", whereas | <mailto:fred@example.com> has a path of "fred@example.com", whereas | |||
| the URI <foo://info.example.com?fred> has an empty path. | the URI <foo://info.example.com?fred> has an empty path. | |||
| The path segments "." and "..", also known as dot-segments, are | The path segments "." and "..", also known as dot-segments, are | |||
| defined for relative reference within the path name hierarchy. They | defined for relative reference within the path name hierarchy. They | |||
| are intended for use at the beginning of a relative-path reference | are intended for use at the beginning of a relative-path reference | |||
| (Section 4.2) for indicating relative position within the | (Section 4.2) to indicate relative position within the hierarchical | |||
| hierarchical tree of names. This is similar to their role within | tree of names. This is similar to their role within some operating | |||
| some operating systems' file directory structure to indicate the | systems' file directory structures to indicate the current directory | |||
| current directory and parent directory, respectively. However, | and parent directory, respectively. However, unlike in a file | |||
| unlike a file system, these dot-segments are only interpreted within | system, these dot-segments are only interpreted within the URI path | |||
| the URI path hierarchy and are removed as part of the resolution | hierarchy and are removed as part of the resolution process (Section | |||
| process (Section 5.2). | 5.2). | |||
| Aside from dot-segments in hierarchical paths, a path segment is | Aside from dot-segments in hierarchical paths, a path segment is | |||
| considered opaque by the generic syntax. URI-producing applications | considered opaque by the generic syntax. URI producing applications | |||
| often use the reserved characters allowed in a segment for the | often use the reserved characters allowed in a segment to delimit | |||
| purpose of delimiting scheme-specific or dereference-handler-specific | scheme-specific or dereference-handler-specific subcomponents. For | |||
| subcomponents. For example, the semicolon (";") and equals ("=") | example, the semicolon (";") and equals ("=") reserved characters are | |||
| reserved characters are often used for delimiting parameters and | often used to delimit parameters and parameter values applicable to | |||
| parameter values applicable to that segment. The comma (",") | that segment. The comma (",") reserved character is often used for | |||
| reserved character is often used for similar purposes. For example, | similar purposes. For example, one URI producer might use a segment | |||
| one URI producer might use a segment like "name;v=1.1" to indicate a | such as "name;v=1.1" to indicate a reference to version 1.1 of | |||
| reference to version 1.1 of "name", whereas another might use a | "name", whereas another might use a segment such as "name,1.1" to | |||
| segment like "name,1.1" to indicate the same. Parameter types may be | indicate the same. Parameter types may be defined by scheme-specific | |||
| defined by scheme-specific semantics, but in most cases the syntax of | semantics, but in most cases the syntax of a parameter is specific to | |||
| a parameter is specific to the implementation of the URI's | the implementation of the URI's dereferencing algorithm. | |||
| dereferencing algorithm. | ||||
| 3.4 Query | 3.4. Query | |||
| The query component contains non-hierarchical data that, along with | The query component contains non-hierarchical data that, along with | |||
| data in the path component (Section 3.3), serves to identify a | data in the path component (Section 3.3), serves to identify a | |||
| resource within the scope of the URI's scheme and naming authority | resource within the scope of the URI's scheme and naming authority | |||
| (if any). The query component is indicated by the first question | (if any). The query component is indicated by the first question | |||
| mark ("?") character and terminated by a number sign ("#") character | mark ("?") character and terminated by a number sign ("#") character | |||
| or by the end of the URI. | or by the end of the URI. | |||
| query = *( pchar / "/" / "?" ) | query = *( pchar / "/" / "?" ) | |||
| The characters slash ("/") and question mark ("?") may represent data | The characters slash ("/") and question mark ("?") may represent data | |||
| within the query component. Beware that some older, erroneous | within the query component. Beware that some older, erroneous | |||
| implementations may not handle such data correctly when used as the | implementations may not handle such data correctly when it is used as | |||
| base URI for relative references (Section 5.1), apparently because | the base URI for relative references (Section 5.1), apparently | |||
| they fail to to distinguish query data from path data when looking | because they fail to distinguish query data from path data when | |||
| for hierarchical separators. However, since query components are | looking for hierarchical separators. However, as query components | |||
| often used to carry identifying information in the form of | are often used to carry identifying information in the form of | |||
| "key=value" pairs, and one frequently used value is a reference to | "key=value" pairs and one frequently used value is a reference to | |||
| another URI, it is sometimes better for usability to avoid | another URI, it is sometimes better for usability to avoid percent- | |||
| percent-encoding those characters. | encoding those characters. | |||
| 3.5 Fragment | 3.5. Fragment | |||
| The fragment identifier component of a URI allows indirect | The fragment identifier component of a URI allows indirect | |||
| identification of a secondary resource by reference to a primary | identification of a secondary resource by reference to a primary | |||
| resource and additional identifying information. The identified | resource and additional identifying information. The identified | |||
| secondary resource may be some portion or subset of the primary | secondary resource may be some portion or subset of the primary | |||
| resource, some view on representations of the primary resource, or | resource, some view on representations of the primary resource, or | |||
| some other resource defined or described by those representations. A | some other resource defined or described by those representations. A | |||
| fragment identifier component is indicated by the presence of a | fragment identifier component is indicated by the presence of a | |||
| number sign ("#") character and terminated by the end of the URI. | number sign ("#") character and terminated by the end of the URI. | |||
| fragment = *( pchar / "/" / "?" ) | fragment = *( pchar / "/" / "?" ) | |||
| The semantics of a fragment identifier are defined by the set of | The semantics of a fragment identifier are defined by the set of | |||
| representations that might result from a retrieval action on the | representations that might result from a retrieval action on the | |||
| primary resource. The fragment's format and resolution is therefore | primary resource. The fragment's format and resolution is therefore | |||
| dependent on the media type [RFC2046] of a potentially retrieved | dependent on the media type [RFC2046] of a potentially retrieved | |||
| representation, even though such a retrieval is only performed if the | representation, even though such a retrieval is only performed if the | |||
| URI is dereferenced. If no such representation exists, then the | URI is dereferenced. If no such representation exists, then the | |||
| semantics of the fragment are considered unknown and, effectively, | semantics of the fragment are considered unknown and are effectively | |||
| unconstrained. Fragment identifier semantics are independent of the | unconstrained. Fragment identifier semantics are independent of the | |||
| URI scheme and thus cannot be redefined by scheme specifications. | URI scheme and thus cannot be redefined by scheme specifications. | |||
| Individual media types may define their own restrictions on, or | Individual media types may define their own restrictions on or | |||
| structure within, the fragment identifier syntax for specifying | structures within the fragment identifier syntax for specifying | |||
| different types of subsets, views, or external references that are | different types of subsets, views, or external references that are | |||
| identifiable as secondary resources by that media type. If the | identifiable as secondary resources by that media type. If the | |||
| primary resource has multiple representations, as is often the case | primary resource has multiple representations, as is often the case | |||
| for resources whose representation is selected based on attributes of | for resources whose representation is selected based on attributes of | |||
| the retrieval request (a.k.a., content negotiation), then whatever is | the retrieval request (a.k.a., content negotiation), then whatever is | |||
| identified by the fragment should be consistent across all of those | identified by the fragment should be consistent across all of those | |||
| representations: each representation should either define the | representations. Each representation should either define the | |||
| fragment such that it corresponds to the same secondary resource, | fragment so that it corresponds to the same secondary resource, | |||
| regardless of how it is represented, or the fragment should be left | regardless of how it is represented, or should leave the fragment | |||
| undefined by the representation (i.e., not found). | undefined (i.e., not found). | |||
| As with any URI, use of a fragment identifier component does not | As with any URI, use of a fragment identifier component does not | |||
| imply that a retrieval action will take place. A URI with a fragment | imply that a retrieval action will take place. A URI with a fragment | |||
| identifier may be used to refer to the secondary resource without any | identifier may be used to refer to the secondary resource without any | |||
| implication that the primary resource is accessible or will ever be | implication that the primary resource is accessible or will ever be | |||
| accessed. | accessed. | |||
| Fragment identifiers have a special role in information retrieval | Fragment identifiers have a special role in information retrieval | |||
| systems as the primary form of client-side indirect referencing, | systems as the primary form of client-side indirect referencing, | |||
| allowing an author to specifically identify those aspects of an | allowing an author to specifically identify aspects of an existing | |||
| existing resource that are only indirectly provided by the resource | resource that are only indirectly provided by the resource owner. As | |||
| owner. As such, the fragment identifier is not used in the | such, the fragment identifier is not used in the scheme-specific | |||
| scheme-specific processing of a URI; instead, the fragment identifier | processing of a URI; instead, the fragment identifier is separated | |||
| is separated from the rest of the URI prior to a dereference, and | from the rest of the URI prior to a dereference, and thus the | |||
| thus the identifying information within the fragment itself is | identifying information within the fragment itself is dereferenced | |||
| dereferenced solely by the user agent and regardless of the URI | solely by the user agent, regardless of the URI scheme. Although | |||
| scheme. Although this separate handling is often perceived to be a | this separate handling is often perceived to be a loss of | |||
| loss of information, particularly in regards to accurate redirection | information, particularly for accurate redirection of references as | |||
| of references as resources move over time, it also serves to prevent | resources move over time, it also serves to prevent information | |||
| information providers from denying reference authors the right to | providers from denying reference authors the right to refer to | |||
| selectively refer to information within a resource. Indirect | information within a resource selectively. Indirect referencing also | |||
| referencing also provides additional flexibility and extensibility to | provides additional flexibility and extensibility to systems that use | |||
| systems that use URIs, since new media types are easier to define and | URIs, as new media types are easier to define and deploy than new | |||
| deploy than new schemes of identification. | schemes of identification. | |||
| The characters slash ("/") and question mark ("?") are allowed to | The characters slash ("/") and question mark ("?") are allowed to | |||
| represent data within the fragment identifier. Beware that some | represent data within the fragment identifier. Beware that some | |||
| older, erroneous implementations may not handle such data correctly | older, erroneous implementations may not handle this data correctly | |||
| when used as the base URI for relative references (Section 5.1). | when it is used as the base URI for relative references (Section | |||
| 5.1). | ||||
| 4. Usage | 4. Usage | |||
| When applications make reference to a URI, they do not always use the | When applications make reference to a URI, they do not always use the | |||
| full form of reference defined by the "URI" syntax rule. In order to | full form of reference defined by the "URI" syntax rule. To save | |||
| save space and take advantage of hierarchical locality, many Internet | space and take advantage of hierarchical locality, many Internet | |||
| protocol elements and media type formats allow an abbreviation of a | protocol elements and media type formats allow an abbreviation of a | |||
| URI, while others restrict the syntax to a particular form of URI. | URI, whereas others restrict the syntax to a particular form of URI. | |||
| We define the most common forms of reference syntax in this | We define the most common forms of reference syntax in this | |||
| specification because they impact and depend upon the design of the | specification because they impact and depend upon the design of the | |||
| generic syntax, requiring a uniform parsing algorithm in order to be | generic syntax, requiring a uniform parsing algorithm in order to be | |||
| interpreted consistently. | interpreted consistently. | |||
| 4.1 URI Reference | 4.1. URI Reference | |||
| URI-reference is used to denote the most common usage of a resource | URI-reference is used to denote the most common usage of a resource | |||
| identifier. | identifier. | |||
| URI-reference = URI / relative-ref | URI-reference = URI / relative-ref | |||
| A URI-reference is either a URI or a relative reference. If the | A URI-reference is either a URI or a relative reference. If the | |||
| URI-reference's prefix does not match the syntax of a scheme followed | URI-reference's prefix does not match the syntax of a scheme followed | |||
| by its colon separator, then the URI-reference is a relative | by its colon separator, then the URI-reference is a relative | |||
| reference. | reference. | |||
| A URI-reference is typically parsed first into the five URI | A URI-reference is typically parsed first into the five URI | |||
| components, in order to determine what components are present and | components, in order to determine what components are present and | |||
| whether or not the reference is relative, after which each component | whether the reference is relative. Then, each component is parsed | |||
| is parsed for its subparts and their validation. The ABNF of | for its subparts and their validation. The ABNF of URI-reference, | |||
| URI-reference, along with the "first-match-wins" disambiguation rule, | along with the "first-match-wins" disambiguation rule, is sufficient | |||
| is sufficient to define a validating parser for the generic syntax. | to define a validating parser for the generic syntax. Readers | |||
| Readers familiar with regular expressions should see Appendix B for | familiar with regular expressions should see Appendix B for an | |||
| an example of a non-validating URI-reference parser that will take | example of a non-validating URI-reference parser that will take any | |||
| any given string and extract the URI components. | given string and extract the URI components. | |||
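For illustration, the non-validating approach can be sketched with the regular expression given in Appendix B of the specification. The wrapper function split_reference and the dictionary keys are assumptions made for this example. Undefined components are reported as None, which keeps them distinct from components that are present but empty.

```python
import re

# Regular expression from Appendix B; groups 2, 4, 5, 7, and 9 hold the
# scheme, authority, path, query, and fragment, respectively.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(reference: str) -> dict:
    """Split a URI reference into its five components without validating it."""
    match = URI_REFERENCE.match(reference)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),  # always defined, possibly empty
        "query": match.group(7),
        "fragment": match.group(9),
    }
```

Applied to "foo://info.example.com?fred", this yields an empty path and a query of "fred"; applied to "mailto:fred@example.com", it yields a path of "fred@example.com" and no authority.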
| 4.2 Relative Reference | 4.2. Relative Reference | |||
| A relative reference takes advantage of the hierarchical syntax | A relative reference takes advantage of the hierarchical syntax | |||
| (Section 1.2.3) in order to express a URI reference relative to the | (Section 1.2.3) to express a URI reference relative to the name space | |||
| name space of another hierarchical URI. | of another hierarchical URI. | |||
| relative-ref = relative-part [ "?" query ] [ "#" fragment ] | relative-ref = relative-part [ "?" query ] [ "#" fragment ] | |||
| relative-part = "//" authority path-abempty | relative-part = "//" authority path-abempty | |||
| / path-absolute | / path-absolute | |||
| / path-noscheme | / path-noscheme | |||
| / path-empty | / path-empty | |||
| The URI referred to by a relative reference, also known as the target | The URI referred to by a relative reference, also known as the target | |||
| URI, is obtained by applying the reference resolution algorithm of | URI, is obtained by applying the reference resolution algorithm of | |||
| Section 5. | Section 5. | |||
| A relative reference that begins with two slash characters is termed | A relative reference that begins with two slash characters is termed | |||
| a network-path reference; such references are rarely used. A | a network-path reference; such references are rarely used. A | |||
| relative reference that begins with a single slash character is | relative reference that begins with a single slash character is | |||
| termed an absolute-path reference. A relative reference that does | termed an absolute-path reference. A relative reference that does | |||
| not begin with a slash character is termed a relative-path reference. | not begin with a slash character is termed a relative-path reference. | |||
| A path segment that contains a colon character (e.g., "this:that") | A path segment that contains a colon character (e.g., "this:that") | |||
| cannot be used as the first segment of a relative-path reference | cannot be used as the first segment of a relative-path reference, as | |||
| because it would be mistaken for a scheme name. Such a segment must | it would be mistaken for a scheme name. Such a segment must be | |||
| be preceded by a dot-segment (e.g., "./this:that") to make a | preceded by a dot-segment (e.g., "./this:that") to make a relative- | |||
| relative-path reference. | path reference. | |||
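The three kinds of relative reference, and the dot-segment workaround for a first segment that contains a colon, can be sketched as follows. The function names are hypothetical, and the scheme pattern assumes the generic scheme syntax (a letter followed by letters, digits, "+", "-", or "."). The sketch classifies by prefix only and does not validate the rest of the reference.

```python
import re

# A reference whose prefix matches a scheme followed by ":" is a URI,
# not a relative reference.
SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:")


def classify_reference(reference: str) -> str:
    """Name the form of a URI reference based on its prefix."""
    if SCHEME_PREFIX.match(reference):
        return "URI"
    if reference.startswith("//"):
        return "network-path"
    if reference.startswith("/"):
        return "absolute-path"
    return "relative-path"


def protect_first_segment(path: str) -> str:
    """Prefix "./" when the first segment of a relative path contains ":"."""
    first_segment = path.split("/", 1)[0]
    return "./" + path if ":" in first_segment else path
```

With these definitions, "this:that" is classified as a URI, whereas the protected form "./this:that" is classified as a relative-path reference.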
| 4.3 Absolute URI | 4.3. Absolute URI | |||
| Some protocol elements allow only the absolute form of a URI without | Some protocol elements allow only the absolute form of a URI without | |||
| a fragment identifier. For example, defining a base URI for later | a fragment identifier. For example, defining a base URI for later | |||
| use by relative references calls for an absolute-URI syntax rule that | use by relative references calls for an absolute-URI syntax rule that | |||
| does not allow a fragment. | does not allow a fragment. | |||
| absolute-URI = scheme ":" hier-part [ "?" query ] | absolute-URI = scheme ":" hier-part [ "?" query ] | |||
| URI scheme specifications must define their own syntax such that all | URI scheme specifications must define their own syntax so that all | |||
| strings matching their scheme-specific syntax will also match the | strings matching their scheme-specific syntax will also match the | |||
| <absolute-URI> grammar. Scheme specifications are not responsible | <absolute-URI> grammar. Scheme specifications will not define | |||
| for defining fragment identifier syntax or usage, regardless of its | fragment identifier syntax or usage, regardless of its applicability | |||
| applicability to resources identifiable via that scheme, since | to resources identifiable via that scheme, as fragment identification | |||
| fragment identification is orthogonal to scheme definition. However, | is orthogonal to scheme definition. However, scheme specifications | |||
| scheme specifications are encouraged to include a wide range of | are encouraged to include a wide range of examples, including | |||
| examples, including examples that show use of the scheme's URIs with | examples that show use of the scheme's URIs with fragment identifiers | |||
| fragment identifiers when such usage is appropriate. | when such usage is appropriate. | |||
| 4.4 Same-document Reference | 4.4. Same-Document Reference | |||
| When a URI reference refers to a URI that is, aside from its fragment | When a URI reference refers to a URI that is, aside from its fragment | |||
| component (if any), identical to the base URI (Section 5.1), that | component (if any), identical to the base URI (Section 5.1), that | |||
| reference is called a "same-document" reference. The most frequent | reference is called a "same-document" reference. The most frequent | |||
| examples of same-document references are relative references that are | examples of same-document references are relative references that are | |||
| empty or include only the number sign ("#") separator followed by a | empty or include only the number sign ("#") separator followed by a | |||
| fragment identifier. | fragment identifier. | |||
| When a same-document reference is dereferenced for the purpose of a | When a same-document reference is dereferenced for a retrieval | |||
| retrieval action, the target of that reference is defined to be | action, the target of that reference is defined to be within the same | |||
| within the same entity (representation, document, or message) as the | entity (representation, document, or message) as the reference; | |||
| reference; therefore, a dereference should not result in a new | therefore, a dereference should not result in a new retrieval action. | |||
| retrieval action. | ||||
| Normalization of the base and target URIs prior to their comparison, | Normalization of the base and target URIs prior to their comparison, | |||
| as described in Section 6.2.2 and Section 6.2.3, is allowed but | as described in Sections 6.2.2 and 6.2.3, is allowed but rarely | |||
| rarely performed in practice. Normalization may increase the set of | performed in practice. Normalization may increase the set of same- | |||
| same-document references, which may be of benefit to some caching | document references, which may be of benefit to some caching | |||
| applications. As such, reference authors should not assume that a | applications. As such, reference authors should not assume that a | |||
| slightly different, though equivalent, reference URI will (or will | slightly different, though equivalent, reference URI will (or will | |||
| not) be interpreted as a same-document reference by any given | not) be interpreted as a same-document reference by any given | |||
| application. | application. | |||
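A minimal sketch of the same-document test described above, assuming the reference has already been resolved to a target URI and that no normalization is applied before comparison. The name `is_same_document` is hypothetical.

```python
def is_same_document(target_uri: str, base_uri: str) -> bool:
    """Compare target and base as plain strings, ignoring any fragment.

    Assumption: no normalization is performed, so equivalent but
    differently spelled URIs are reported as different documents.
    """
    target_without_fragment = target_uri.split("#", 1)[0]
    base_without_fragment = base_uri.split("#", 1)[0]
    return target_without_fragment == base_without_fragment
```

Because the comparison is a plain string match, a target that differs from the base only in the case of its scheme is not treated as a same-document reference here; an application that normalizes first could decide otherwise.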
| 4.5 Suffix Reference | 4.5. Suffix Reference | |||
| The URI syntax is designed for unambiguous reference to resources and | The URI syntax is designed for unambiguous reference to resources and | |||
| extensibility via the URI scheme. However, as URI identification and | extensibility via the URI scheme. However, as URI identification and | |||
| usage have become commonplace, traditional media (television, radio, | usage have become commonplace, traditional media (television, radio, | |||
| newspapers, billboards, etc.) have increasingly used a suffix of the | newspapers, billboards, etc.) have increasingly used a suffix of the | |||
| URI as a reference, consisting of only the authority and path | URI as a reference, consisting of only the authority and path | |||
| portions of the URI, such as | portions of the URI, such as | |||
| www.w3.org/Addressing/ | www.w3.org/Addressing/ | |||
| or simply a DNS registered name on its own. Such references are | or simply a DNS registered name on its own. Such references are | |||
| primarily intended for human interpretation, rather than for | primarily intended for human interpretation rather than for machines, | |||
| machines, with the assumption that context-based heuristics are | with the assumption that context-based heuristics are sufficient to | |||
| sufficient to complete the URI (e.g., most registered names beginning | complete the URI (e.g., most registered names beginning with "www" | |||
| with "www" are likely to have a URI prefix of "http://"). Although | are likely to have a URI prefix of "http://"). Although there is no | |||
| there is no standard set of heuristics for disambiguating a URI | standard set of heuristics for disambiguating a URI suffix, many | |||
| suffix, many client implementations allow them to be entered by the | client implementations allow them to be entered by the user and | |||
| user and heuristically resolved. | heuristically resolved. | |||
| While this practice of using suffix references is common, it should | Although this practice of using suffix references is common, it | |||
| be avoided whenever possible and never used in situations where | should be avoided whenever possible and should never be used in | |||
| long-term references are expected. The heuristics noted above will | situations where long-term references are expected. The heuristics | |||
| change over time, particularly when a new URI scheme becomes popular, | noted above will change over time, particularly when a new URI scheme | |||
| and are often incorrect when used out of context. Furthermore, they | becomes popular, and are often incorrect when used out of context. | |||
| can lead to security issues along the lines of those described in | Furthermore, they can lead to security issues along the lines of | |||
| [RFC1535]. | those described in [RFC1535]. | |||
| Since a URI suffix has the same syntax as a relative-path reference, | As a URI suffix has the same syntax as a relative-path reference, a | |||
| a suffix reference cannot be used in contexts where a relative | suffix reference cannot be used in contexts where a relative | |||
| reference is expected. As a result, suffix references are limited to | reference is expected. As a result, suffix references are limited to | |||
| those places where there is no defined base URI, such as dialog boxes | places where there is no defined base URI, such as dialog boxes and | |||
| and off-line advertisements. | off-line advertisements. | |||
| 5. Reference Resolution | 5. Reference Resolution | |||
| This section defines the process of resolving a URI reference within | This section defines the process of resolving a URI reference within | |||
| a context that allows relative references, such that the result is a | a context that allows relative references so that the result is a | |||
| string matching the <URI> syntax rule of Section 3. | string matching the <URI> syntax rule of Section 3. | |||
| 5.1 Establishing a Base URI | 5.1. Establishing a Base URI | |||
| The term "relative" implies that there exists a "base URI" against | The term "relative" implies that a "base URI" exists against which | |||
| which the relative reference is applied. Aside from fragment-only | the relative reference is applied. Aside from fragment-only | |||
| references (Section 4.4), relative references are only usable when a | references (Section 4.4), relative references are only usable when a | |||
| base URI is known. A base URI must be established by the parser | base URI is known. A base URI must be established by the parser | |||
| prior to parsing URI references that might be relative. A base URI | prior to parsing URI references that might be relative. A base URI | |||
| must conform to the <absolute-URI> syntax rule (Section 4.3): if the | must conform to the <absolute-URI> syntax rule (Section 4.3). If the | |||
| base URI is obtained from a URI reference, then that reference must | base URI is obtained from a URI reference, then that reference must | |||
| be converted to absolute form and stripped of any fragment component | be converted to absolute form and stripped of any fragment component | |||
| prior to use as a base URI. | prior to its use as a base URI. | |||
| The base URI of a reference can be established in one of four ways, | The base URI of a reference can be established in one of four ways, | |||
| discussed below in order of precedence. The order of precedence can | discussed below in order of precedence. The order of precedence can | |||
| be thought of in terms of layers, where the innermost defined base | be thought of in terms of layers, where the innermost defined base | |||
| URI has the highest precedence. This can be visualized graphically | URI has the highest precedence. This can be visualized graphically | |||
| as: | as follows: | |||
| .----------------------------------------------------------. | .----------------------------------------------------------. | |||
| | .----------------------------------------------------. | | | .----------------------------------------------------. | | |||
| | | .----------------------------------------------. | | | | | .----------------------------------------------. | | | |||
| | | | .----------------------------------------. | | | | | | | .----------------------------------------. | | | | |||
| | | | | .----------------------------------. | | | | | | | | | .----------------------------------. | | | | | |||
| | | | | | <relative-reference> | | | | | | | | | | | <relative-reference> | | | | | | |||
| | | | | `----------------------------------' | | | | | | | | | `----------------------------------' | | | | | |||
| | | | | (5.1.1) Base URI embedded in content | | | | | | | | | (5.1.1) Base URI embedded in content | | | | | |||
| | | | `----------------------------------------' | | | | | | | `----------------------------------------' | | | | |||
| | | | (5.1.2) Base URI of the encapsulating entity | | | | | | | (5.1.2) Base URI of the encapsulating entity | | | | |||
| | | | (message, representation, or none) | | | | | | | (message, representation, or none) | | | | |||
| | | `----------------------------------------------' | | | | | `----------------------------------------------' | | | |||
| | | (5.1.3) URI used to retrieve the entity | | | | | (5.1.3) URI used to retrieve the entity | | | |||
| | `----------------------------------------------------' | | | `----------------------------------------------------' | | |||
| | (5.1.4) Default Base URI (application-dependent) | | | (5.1.4) Default Base URI (application-dependent) | | |||
| `----------------------------------------------------------' | `----------------------------------------------------------' | |||
| 5.1.1 Base URI Embedded in Content | 5.1.1. Base URI Embedded in Content | |||
| Within certain media types, a base URI for relative references can be | Within certain media types, a base URI for relative references can be | |||
| embedded within the content itself such that it can be readily | embedded within the content itself so that it can be readily obtained | |||
| obtained by a parser. This can be useful for descriptive documents, | by a parser. This can be useful for descriptive documents, such as | |||
| such as tables of content, which may be transmitted to others through | tables of contents, which may be transmitted to others through | |||
| protocols other than their usual retrieval context (e.g., E-Mail or | protocols other than their usual retrieval context (e.g., email or | |||
| USENET news). | USENET news). | |||
| It is beyond the scope of this specification to specify how, for each | It is beyond the scope of this specification to specify how, for each | |||
| media type, a base URI can be embedded. The appropriate syntax, when | media type, a base URI can be embedded. The appropriate syntax, when | |||
| available, is described by the data format specification associated | available, is described by the data format specification associated | |||
| with each media type. | with each media type. | |||
| 5.1.2 Base URI from the Encapsulating Entity | 5.1.2. Base URI from the Encapsulating Entity | |||
| If no base URI is embedded, the base URI is defined by the | If no base URI is embedded, the base URI is defined by the | |||
| representation's retrieval context. For a document that is enclosed | representation's retrieval context. For a document that is enclosed | |||
| within another entity, such as a message or archive, the retrieval | within another entity, such as a message or archive, the retrieval | |||
| context is that entity; thus, the default base URI of a | context is that entity. Thus, the default base URI of a | |||
| representation is the base URI of the entity in which the | representation is the base URI of the entity in which the | |||
| representation is encapsulated. | representation is encapsulated. | |||
| A mechanism for embedding a base URI within MIME container types | A mechanism for embedding a base URI within MIME container types | |||
| (e.g., the message and multipart types) is defined by MHTML | (e.g., the message and multipart types) is defined by MHTML | |||
| [RFC2557]. Protocols that do not use the MIME message header syntax, | [RFC2557]. Protocols that do not use the MIME message header syntax, | |||
| but do allow some form of tagged metadata to be included within | but that do allow some form of tagged metadata to be included within | |||
| messages, may define their own syntax for defining a base URI as part | messages, may define their own syntax for defining a base URI as part | |||
| of a message. | of a message. | |||
| 5.1.3 Base URI from the Retrieval URI | 5.1.3. Base URI from the Retrieval URI | |||
| If no base URI is embedded and the representation is not encapsulated | If no base URI is embedded and the representation is not encapsulated | |||
| within some other entity, then, if a URI was used to retrieve the | within some other entity, then, if a URI was used to retrieve the | |||
| representation, that URI shall be considered the base URI. Note that | representation, that URI shall be considered the base URI. Note that | |||
| if the retrieval was the result of a redirected request, the last URI | if the retrieval was the result of a redirected request, the last URI | |||
| used (i.e., the URI that resulted in the actual retrieval of the | used (i.e., the URI that resulted in the actual retrieval of the | |||
| representation) is the base URI. | representation) is the base URI. | |||
| 5.1.4 Default Base URI | 5.1.4. Default Base URI | |||
| If none of the conditions described above apply, then the base URI is | If none of the conditions described above apply, then the base URI is | |||
| defined by the context of the application. Since this definition is | defined by the context of the application. As this definition is | |||
| necessarily application-dependent, failing to define a base URI using | necessarily application-dependent, failing to define a base URI by | |||
| one of the other methods may result in the same content being | using one of the other methods may result in the same content being | |||
| interpreted differently by different types of application. | interpreted differently by different types of applications. | |||
| A sender of a representation containing relative references is | A sender of a representation containing relative references is | |||
| responsible for ensuring that a base URI for those references can be | responsible for ensuring that a base URI for those references can be | |||
| established. Aside from fragment-only references, relative | established. Aside from fragment-only references, relative | |||
| references can only be used reliably in situations where the base URI | references can only be used reliably in situations where the base URI | |||
| is well-defined. | is well defined. | |||
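The order of precedence in Sections 5.1.1 through 5.1.4 can be pictured as a simple first-match lookup. The sketch below is illustrative only: the function and parameter names are hypothetical, each candidate is assumed to be already in absolute form, and `None` stands for "not available".

```python
from typing import Optional


def establish_base_uri(
    embedded: Optional[str] = None,       # 5.1.1: base URI embedded in content
    encapsulating: Optional[str] = None,  # 5.1.2: base URI of the enclosing entity
    retrieval: Optional[str] = None,      # 5.1.3: URI used to retrieve the entity
    default: Optional[str] = None,        # 5.1.4: application-dependent default
) -> Optional[str]:
    """Return the innermost available base URI with any fragment removed."""
    for candidate in (embedded, encapsulating, retrieval, default):
        if candidate is not None:
            # A base URI carries no fragment component.
            return candidate.split("#", 1)[0]
    return None
```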
| 5.2 Relative Resolution | 5.2. Relative Resolution | |||
| This section describes an algorithm for converting a URI reference | This section describes an algorithm for converting a URI reference | |||
| that might be relative to a given base URI into the parsed components | that might be relative to a given base URI into the parsed components | |||
| of the reference's target. The components can then be recomposed, as | of the reference's target. The components can then be recomposed, as | |||
| described in Section 5.3, to form the target URI. This algorithm | described in Section 5.3, to form the target URI. This algorithm | |||
| provides definitive results that can be used to test the output of | provides definitive results that can be used to test the output of | |||
| other implementations. Applications may implement relative reference | other implementations. Applications may implement relative reference | |||
| resolution using some other algorithm, provided that the results | resolution by using some other algorithm, provided that the results | |||
| match what would be given by this algorithm. | match what would be given by this one. | |||
| 5.2.1 Pre-parse the Base URI | 5.2.1. Pre-parse the Base URI | |||
| The base URI (Base) is established according to the procedure of | The base URI (Base) is established according to the procedure of | |||
| Section 5.1 and parsed into the five main components described in | Section 5.1 and parsed into the five main components described in | |||
| Section 3. Note that only the scheme component is required to be | Section 3. Note that only the scheme component is required to be | |||
| present in a base URI; the other components may be empty or | present in a base URI; the other components may be empty or | |||
| undefined. A component is undefined if its associated delimiter does | undefined. A component is undefined if its associated delimiter does | |||
| not appear in the URI reference; the path component is never | not appear in the URI reference; the path component is never | |||
| undefined, though it may be empty. | undefined, though it may be empty. | |||
| Normalization of the base URI, as described in Section 6.2.2 and | Normalization of the base URI, as described in Sections 6.2.2 and | |||
| Section 6.2.3, is optional. A URI reference must be transformed to | 6.2.3, is optional. A URI reference must be transformed to its | |||
| its target URI before it can be normalized. | target URI before it can be normalized. | |||
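One way to obtain the five components while keeping "undefined" distinct from "empty" is the regular expression given in Appendix B of the RFC. The Python sketch below uses it; the `Components` type and the `parse` name are illustrative choices, and `None` is used here to represent an undefined component.

```python
import re
from typing import NamedTuple, Optional


class Components(NamedTuple):
    scheme: Optional[str]
    authority: Optional[str]
    path: str                 # never undefined, though it may be empty
    query: Optional[str]
    fragment: Optional[str]


# Component-splitting pattern from Appendix B of RFC 3986.
_URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?", re.DOTALL
)


def parse(reference: str) -> Components:
    """Split a URI reference; a group that did not match yields None."""
    m = _URI_RE.match(reference)  # every part is optional, so this always matches
    return Components(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))
```

With this representation, "http://a?" yields an empty query (the "?" delimiter is present), whereas "http://a" yields an undefined one.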
| 5.2.2 Transform References | 5.2.2. Transform References | |||
| For each URI reference (R), the following pseudocode describes an | For each URI reference (R), the following pseudocode describes an | |||
| algorithm for transforming R into its target URI (T): | algorithm for transforming R into its target URI (T): | |||
| -- The URI reference is parsed into the five URI components | -- The URI reference is parsed into the five URI components | |||
| -- | -- | |||
| (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R); | (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R); | |||
| -- A non-strict parser may ignore a scheme in the reference | -- A non-strict parser may ignore a scheme in the reference | |||
| -- if it is identical to the base URI's scheme. | -- if it is identical to the base URI's scheme. | |||
| skipping to change at page 32, line 5 ¶ | skipping to change at page 32, line 38 ¶ | |||
| endif; | endif; | |||
| T.query = R.query; | T.query = R.query; | |||
| endif; | endif; | |||
| T.authority = Base.authority; | T.authority = Base.authority; | |||
| endif; | endif; | |||
| T.scheme = Base.scheme; | T.scheme = Base.scheme; | |||
| endif; | endif; | |||
| T.fragment = R.fragment; | T.fragment = R.fragment; | |||
| 5.2.3 Merge Paths | 5.2.3. Merge Paths | |||
| The pseudocode above refers to a "merge" routine for merging a | The pseudocode above refers to a "merge" routine for merging a | |||
| relative-path reference with the path of the base URI. This is | relative-path reference with the path of the base URI. This is | |||
| accomplished as follows: | accomplished as follows: | |||
| o If the base URI has a defined authority component and an empty | o If the base URI has a defined authority component and an empty | |||
| path, then return a string consisting of "/" concatenated with the | path, then return a string consisting of "/" concatenated with the | |||
| reference's path; otherwise, | reference's path; otherwise, | |||
| o Return a string consisting of the reference's path component | o return a string consisting of the reference's path component | |||
| appended to all but the last segment of the base URI's path (i.e., | appended to all but the last segment of the base URI's path (i.e., | |||
| excluding any characters after the right-most "/" in the base URI | excluding any characters after the right-most "/" in the base URI | |||
| path, or excluding the entire base URI path if it does not contain | path, or excluding the entire base URI path if it does not contain | |||
| any "/" characters). | any "/" characters). | |||
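The two cases of the merge routine can be sketched in Python as follows. The signature is an assumption made for illustration: the base authority is passed as `None` when undefined, and paths are plain strings.

```python
from typing import Optional


def merge(base_authority: Optional[str], base_path: str, reference_path: str) -> str:
    """Merge a relative-path reference with the path of the base URI."""
    if base_authority is not None and base_path == "":
        return "/" + reference_path
    # Keep everything up to and including the right-most "/" of the base
    # path. If the base path has no "/", rpartition returns two empty
    # strings first, so the whole base path is dropped.
    head, slash, _last_segment = base_path.rpartition("/")
    return head + slash + reference_path
```

For the base path "/b/c/d;p" and the reference path "g", this yields "/b/c/g".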
| 5.2.4 Remove Dot Segments | 5.2.4. Remove Dot Segments | |||
| The pseudocode also refers to a "remove_dot_segments" routine for | The pseudocode also refers to a "remove_dot_segments" routine for | |||
| interpreting and removing the special "." and ".." complete path | interpreting and removing the special "." and ".." complete path | |||
| segments from a referenced path. This is done after the path is | segments from a referenced path. This is done after the path is | |||
| extracted from a reference, whether or not the path was relative, in | extracted from a reference, whether or not the path was relative, in | |||
| order to remove any invalid or extraneous dot-segments prior to | order to remove any invalid or extraneous dot-segments prior to | |||
| forming the target URI. Although there are many ways to accomplish | forming the target URI. Although there are many ways to accomplish | |||
| this removal process, we describe a simple method using two string | this removal process, we describe a simple method using two string | |||
| buffers. | buffers. | |||
| 1. The input buffer is initialized with the now-appended path | 1. The input buffer is initialized with the now-appended path | |||
| components and the output buffer is initialized to the empty | components and the output buffer is initialized to the empty | |||
| string. | string. | |||
| 2. While the input buffer is not empty, loop: | 2. While the input buffer is not empty, loop as follows: | |||
| A. If the input buffer begins with a prefix of "../" or "./", | A. If the input buffer begins with a prefix of "../" or "./", | |||
| then remove that prefix from the input buffer; otherwise, | then remove that prefix from the input buffer; otherwise, | |||
| B. If the input buffer begins with a prefix of "/./" or "/.", | B. if the input buffer begins with a prefix of "/./" or "/.", | |||
| where "." is a complete path segment, then replace that | where "." is a complete path segment, then replace that | |||
| prefix with "/" in the input buffer; otherwise, | prefix with "/" in the input buffer; otherwise, | |||
| C. If the input buffer begins with a prefix of "/../" or "/..", | C. if the input buffer begins with a prefix of "/../" or "/..", | |||
| where ".." is a complete path segment, then replace that | where ".." is a complete path segment, then replace that | |||
| prefix with "/" in the input buffer and remove the last | prefix with "/" in the input buffer and remove the last | |||
| segment and its preceding "/" (if any) from the output | segment and its preceding "/" (if any) from the output | |||
| buffer; otherwise, | buffer; otherwise, | |||
| D. If the input buffer consists only of "." or "..", then remove | D. if the input buffer consists only of "." or "..", then remove | |||
| that from the input buffer; otherwise, | that from the input buffer; otherwise, | |||
| E. Move the first path segment in the input buffer to the end of | E. move the first path segment in the input buffer to the end of | |||
| the output buffer, including the initial "/" character (if | the output buffer, including the initial "/" character (if | |||
| any) and any subsequent characters up to, but not including, | any) and any subsequent characters up to, but not including, | |||
| the next "/" character or the end of the input buffer. | the next "/" character or the end of the input buffer. | |||
| 3. Finally, the output buffer is returned as the result of | 3. Finally, the output buffer is returned as the result of | |||
| remove_dot_segments. | remove_dot_segments. | |||
| Note that dot-segments are intended for use in URI references to | Note that dot-segments are intended for use in URI references to | |||
| express an identifier relative to the hierarchy of names in the base | express an identifier relative to the hierarchy of names in the base | |||
| URI. The remove_dot_segments algorithm respects that hierarchy by | URI. The remove_dot_segments algorithm respects that hierarchy by | |||
| removing extra dot-segments rather than treating them as an error or | removing extra dot-segments rather than treat them as an error or | |||
| leaving them to be misinterpreted by dereference implementations. | leaving them to be misinterpreted by dereference implementations. | |||
| The following illustrates how the above steps are applied for two | The following illustrates how the above steps are applied for two | |||
| example merged paths, showing the state of the two buffers after each | examples of merged paths, showing the state of the two buffers after | |||
| step. | each step. | |||
| STEP OUTPUT BUFFER INPUT BUFFER | STEP OUTPUT BUFFER INPUT BUFFER | |||
| 1 : /a/b/c/./../../g | 1 : /a/b/c/./../../g | |||
| 2E: /a /b/c/./../../g | 2E: /a /b/c/./../../g | |||
| 2E: /a/b /c/./../../g | 2E: /a/b /c/./../../g | |||
| 2E: /a/b/c /./../../g | 2E: /a/b/c /./../../g | |||
| 2B: /a/b/c /../../g | 2B: /a/b/c /../../g | |||
| 2C: /a/b /../g | 2C: /a/b /../g | |||
| 2C: /a /g | 2C: /a /g | |||
| skipping to change at page 33, line 46 ¶ | skipping to change at page 34, line 35 ¶ | |||
| STEP OUTPUT BUFFER INPUT BUFFER | STEP OUTPUT BUFFER INPUT BUFFER | |||
| 1 : mid/content=5/../6 | 1 : mid/content=5/../6 | |||
| 2E: mid /content=5/../6 | 2E: mid /content=5/../6 | |||
| 2E: mid/content=5 /../6 | 2E: mid/content=5 /../6 | |||
| 2C: mid /6 | 2C: mid /6 | |||
| 2E: mid/6 | 2E: mid/6 | |||
| Some applications may find it more efficient to implement the | Some applications may find it more efficient to implement the | |||
| remove_dot_segments algorithm using two segment stacks rather than | remove_dot_segments algorithm by using two segment stacks rather than | |||
| strings. | strings. | |||
| Note: Beware that some older, erroneous implementations will fail | Note: Beware that some older, erroneous implementations will fail | |||
| to separate a reference's query component from its path component | to separate a reference's query component from its path component | |||
| prior to merging the base and reference paths, resulting in an | prior to merging the base and reference paths, resulting in an | |||
| interoperability failure if the query component contains the | interoperability failure if the query component contains the | |||
| strings "/../" or "/./". | strings "/../" or "/./". | |||
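The two-buffer method in steps 1 through 3 translates fairly directly into Python. The following is a sketch rather than a reference implementation; the branches are labelled with the step letters used above, and the variable names are illustrative.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments using an input and an output buffer."""
    inp, out = path, ""                      # step 1
    while inp:                               # step 2
        if inp.startswith("../"):            # 2A
            inp = inp[3:]
        elif inp.startswith("./"):           # 2A
            inp = inp[2:]
        elif inp.startswith("/./"):          # 2B: "/./" becomes "/"
            inp = inp[2:]
        elif inp == "/.":                    # 2B: "." is the final segment
            inp = "/"
        elif inp.startswith("/../") or inp == "/..":   # 2C
            inp = "/" + inp[4:]              # replace the prefix with "/"
            # Drop the last output segment and its preceding "/", if any.
            out = out.rpartition("/")[0]
        elif inp in (".", ".."):             # 2D
            inp = ""
        else:                                # 2E: move one segment across
            end = inp.find("/", 1)
            if end == -1:
                end = len(inp)
            out += inp[:end]
            inp = inp[end:]
    return out                               # step 3
```

Applied to the two traced inputs, "/a/b/c/./../../g" gives "/a/g" and "mid/content=5/../6" gives "mid/6", matching the final rows of the tables above.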
| 5.3 Component Recomposition | 5.3. Component Recomposition | |||
| Parsed URI components can be recomposed to obtain the corresponding | Parsed URI components can be recomposed to obtain the corresponding | |||
| URI reference string. Using pseudocode, this would be: | URI reference string. Using pseudocode, this would be: | |||
| result = "" | result = "" | |||
| if defined(scheme) then | if defined(scheme) then | |||
| append scheme to result; | append scheme to result; | |||
| append ":" to result; | append ":" to result; | |||
| endif; | endif; | |||
| skipping to change at page 34, line 42 ¶ | skipping to change at page 35, line 42 ¶ | |||
| endif; | endif; | |||
| return result; | return result; | |||
| Note that we are careful to preserve the distinction between a | Note that we are careful to preserve the distinction between a | |||
| component that is undefined, meaning that its separator was not | component that is undefined, meaning that its separator was not | |||
| present in the reference, and a component that is empty, meaning that | present in the reference, and a component that is empty, meaning that | |||
| the separator was present and was immediately followed by the next | the separator was present and was immediately followed by the next | |||
| component separator or the end of the reference. | component separator or the end of the reference. | |||
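A Python rendering of the recomposition step might look like the sketch below, assuming undefined components are represented by `None` and empty components by the empty string. That representation is a choice made here for illustration.

```python
from typing import Optional


def recompose(
    scheme: Optional[str],
    authority: Optional[str],
    path: str,
    query: Optional[str],
    fragment: Optional[str],
) -> str:
    """Rebuild a URI reference string from its parsed components."""
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:        # an empty query still emits "?"
        result += "?" + query
    if fragment is not None:     # an empty fragment still emits "#"
        result += "#" + fragment
    return result
```

Testing against `None` rather than truthiness is what preserves the undefined/empty distinction: an empty query is written back as a bare "?", while an undefined one is omitted.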
| 5.4 Reference Resolution Examples | 5.4. Reference Resolution Examples | |||
| Within a representation with a well-defined base URI of | Within a representation with a well defined base URI of | |||
| http://a/b/c/d;p?q | http://a/b/c/d;p?q | |||
| a relative reference is transformed to its target URI as follows. | a relative reference is transformed to its target URI as follows. | |||
| 5.4.1 Normal Examples | 5.4.1. Normal Examples | |||
| "g:h" = "g:h" | "g:h" = "g:h" | |||
| "g" = "http://a/b/c/g" | "g" = "http://a/b/c/g" | |||
| "./g" = "http://a/b/c/g" | "./g" = "http://a/b/c/g" | |||
| "g/" = "http://a/b/c/g/" | "g/" = "http://a/b/c/g/" | |||
| "/g" = "http://a/g" | "/g" = "http://a/g" | |||
| "//g" = "http://g" | "//g" = "http://g" | |||
| "?y" = "http://a/b/c/d;p?y" | "?y" = "http://a/b/c/d;p?y" | |||
| "g?y" = "http://a/b/c/g?y" | "g?y" = "http://a/b/c/g?y" | |||
| "#s" = "http://a/b/c/d;p?q#s" | "#s" = "http://a/b/c/d;p?q#s" | |||
| skipping to change at page 35, line 31 ¶ | skipping to change at page 36, line 31 ¶ | |||
| "" = "http://a/b/c/d;p?q" | "" = "http://a/b/c/d;p?q" | |||
| "." = "http://a/b/c/" | "." = "http://a/b/c/" | |||
| "./" = "http://a/b/c/" | "./" = "http://a/b/c/" | |||
| ".." = "http://a/b/" | ".." = "http://a/b/" | |||
| "../" = "http://a/b/" | "../" = "http://a/b/" | |||
| "../g" = "http://a/b/g" | "../g" = "http://a/b/g" | |||
| "../.." = "http://a/" | "../.." = "http://a/" | |||
| "../../" = "http://a/" | "../../" = "http://a/" | |||
| "../../g" = "http://a/g" | "../../g" = "http://a/g" | |||
| 5.4.2 Abnormal Examples | 5.4.2. Abnormal Examples | |||
| Although the following abnormal examples are unlikely to occur in | Although the following abnormal examples are unlikely to occur in | |||
| normal practice, all URI parsers should be capable of resolving them | normal practice, all URI parsers should be capable of resolving them | |||
| consistently. Each example uses the same base as above. | consistently. Each example uses the same base as that above. | |||
| Parsers must be careful in handling cases where there are more ".." | Parsers must be careful in handling cases where there are more ".." | |||
| segments in a relative-path reference than there are hierarchical | segments in a relative-path reference than there are hierarchical | |||
| levels in the base URI's path. Note that the ".." syntax cannot be | levels in the base URI's path. Note that the ".." syntax cannot be | |||
| used to change the authority component of a URI. | used to change the authority component of a URI. | |||
| "../../../g" = "http://a/g" | "../../../g" = "http://a/g" | |||
| "../../../../g" = "http://a/g" | "../../../../g" = "http://a/g" | |||
| Similarly, parsers must remove the dot-segments "." and ".." when | Similarly, parsers must remove the dot-segments "." and ".." when | |||
| skipping to change at page 36, line 25 ¶ | skipping to change at page 37, line 29 ¶ | |||
| "./../g" = "http://a/b/g" | "./../g" = "http://a/b/g" | |||
| "./g/." = "http://a/b/c/g/" | "./g/." = "http://a/b/c/g/" | |||
| "g/./h" = "http://a/b/c/g/h" | "g/./h" = "http://a/b/c/g/h" | |||
| "g/../h" = "http://a/b/c/h" | "g/../h" = "http://a/b/c/h" | |||
| "g;x=1/./y" = "http://a/b/c/g;x=1/y" | "g;x=1/./y" = "http://a/b/c/g;x=1/y" | |||
| "g;x=1/../y" = "http://a/b/c/y" | "g;x=1/../y" = "http://a/b/c/y" | |||
| Some applications fail to separate the reference's query and/or | Some applications fail to separate the reference's query and/or | |||
| fragment components from the path component before merging it with | fragment components from the path component before merging it with | |||
| the base path and removing dot-segments. This error is rarely | the base path and removing dot-segments. This error is rarely | |||
| noticed, since typical usage of a fragment never includes the | noticed, as typical usage of a fragment never includes the hierarchy | |||
| hierarchy ("/") character, and the query component is not normally | ("/") character and the query component is not normally used within | |||
| used within relative references. | relative references. | |||
| "g?y/./x" = "http://a/b/c/g?y/./x" | "g?y/./x" = "http://a/b/c/g?y/./x" | |||
| "g?y/../x" = "http://a/b/c/g?y/../x" | "g?y/../x" = "http://a/b/c/g?y/../x" | |||
| "g#s/./x" = "http://a/b/c/g#s/./x" | "g#s/./x" = "http://a/b/c/g#s/./x" | |||
| "g#s/../x" = "http://a/b/c/g#s/../x" | "g#s/../x" = "http://a/b/c/g#s/../x" | |||
| Some parsers allow the scheme name to be present in a relative | Some parsers allow the scheme name to be present in a relative | |||
| reference if it is the same as the base URI scheme. This is | reference if it is the same as the base URI scheme. This is | |||
| considered to be a loophole in prior specifications of partial URI | considered to be a loophole in prior specifications of partial URI | |||
| [RFC1630]. Its use should be avoided, but is allowed for backward | [RFC1630]. Its use should be avoided but is allowed for backward | |||
| compatibility. | compatibility. | |||
| "http:g" = "http:g" ; for strict parsers | "http:g" = "http:g" ; for strict parsers | |||
| / "http://a/b/c/g" ; for backward compatibility | / "http://a/b/c/g" ; for backward compatibility | |||
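The abnormal examples above can be checked against an existing resolver. The following is a minimal sketch, assuming Python 3.5 or later (whose `urllib.parse.urljoin` is documented as following this specification's resolution rules) and assuming that the base URI is "http://a/b/c/d;p?q", as used by the earlier reference-resolution examples. The helper name `resolve` is hypothetical.

```python
from urllib.parse import urljoin

# Assumption: the base URI used by the reference-resolution examples.
BASE = "http://a/b/c/d;p?q"


def resolve(reference: str) -> str:
    """Resolve a relative reference against BASE (hypothetical helper)."""
    return urljoin(BASE, reference)


# Excess ".." segments, dot-segment removal, and dot-segments that appear
# only inside the query or fragment (which must be left untouched).
for ref in ["../../../g", "../../../../g", "./../g", "./g/.",
            "g;x=1/../y", "g?y/../x", "g#s/./x"]:
    print(f"{ref!r:18} -> {resolve(ref)}")
```

The query and fragment cases are the ones most likely to expose a resolver that merges paths before separating the components.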
| 6. Normalization and Comparison | 6. Normalization and Comparison | |||
| One of the most common operations on URIs is simple comparison: | One of the most common operations on URIs is simple comparison: | |||
| determining if two URIs are equivalent without using the URIs to | determining whether two URIs are equivalent without using the URIs to | |||
| access their respective resource(s). A comparison is performed every | access their respective resource(s). A comparison is performed every | |||
| time a response cache is accessed, a browser checks its history to | time a response cache is accessed, a browser checks its history to | |||
| color a link, or an XML parser processes tags within a namespace. | color a link, or an XML parser processes tags within a namespace. | |||
| Extensive normalization prior to comparison of URIs is often used by | Extensive normalization prior to comparison of URIs is often used by | |||
| spiders and indexing engines to prune a search space or reduce | spiders and indexing engines to prune a search space or to reduce | |||
| duplication of request actions and response storage. | duplication of request actions and response storage. | |||
| URI comparison is performed in respect to some particular purpose, | URI comparison is performed for some particular purpose. Protocols | |||
| and implementations with differing purposes will often be subject to | or implementations that compare URIs for different purposes will | |||
| differing design trade-offs in regards to how much effort should be | often be subject to differing design trade-offs in regards to how | |||
| spent in reducing aliased identifiers. This section describes a | much effort should be spent in reducing aliased identifiers. This | |||
| variety of methods that may be used to compare URIs, the trade-offs | section describes various methods that may be used to compare URIs, | |||
| between them, and the types of applications that might use them. | the trade-offs between them, and the types of applications that might | |||
| | use them. | |||
| 6.1 Equivalence | 6.1. Equivalence | |||
| Since URIs exist to identify resources, presumably they should be | Because URIs exist to identify resources, presumably they should be | |||
| considered equivalent when they identify the same resource. However, | considered equivalent when they identify the same resource. However, | |||
| such a definition of equivalence is not of much practical use, since | this definition of equivalence is not of much practical use, as there | |||
| there is no way for an implementation to compare two resources that | is no way for an implementation to compare two resources unless it | |||
| are not under its own control. For this reason, determination of | has full knowledge or control of them. For this reason, | |||
| equivalence or difference of URIs is based on string comparison, | determination of equivalence or difference of URIs is based on string | |||
| perhaps augmented by reference to additional rules provided by URI | comparison, perhaps augmented by reference to additional rules | |||
| scheme definitions. We use the terms "different" and "equivalent" to | provided by URI scheme definitions. We use the terms "different" and | |||
| describe the possible outcomes of such comparisons, but there are | "equivalent" to describe the possible outcomes of such comparisons, | |||
| many application-dependent versions of equivalence. | but there are many application-dependent versions of equivalence. | |||
| Even though it is possible to determine that two URIs are equivalent, | Even though it is possible to determine that two URIs are equivalent, | |||
| URI comparison is not sufficient to determine if two URIs identify | URI comparison is not sufficient to determine whether two URIs | |||
| different resources. For example, an owner of two different domain | identify different resources. For example, an owner of two different | |||
| names could decide to serve the same resource from both, resulting in | domain names could decide to serve the same resource from both, | |||
| two different URIs. Therefore, comparison methods are designed to | resulting in two different URIs. Therefore, comparison methods are | |||
| minimize false negatives while strictly avoiding false positives. | designed to minimize false negatives while strictly avoiding false | |||
| | positives. | |||
| In testing for equivalence, applications should not directly compare | In testing for equivalence, applications should not directly compare | |||
| relative references; the references should be converted to their | relative references; the references should be converted to their | |||
| respective target URIs before comparison. When URIs are being | respective target URIs before comparison. When URIs are compared to | |||
| compared for the purpose of selecting (or avoiding) a network action, | select (or avoid) a network action, such as retrieval of a | |||
| such as retrieval of a representation, fragment components (if any) | representation, fragment components (if any) should be excluded from | |||
| should be excluded from the comparison. | the comparison. | |||
| 6.2 Comparison Ladder | 6.2. Comparison Ladder | |||
| A variety of methods are used in practice to test URI equivalence. | A variety of methods are used in practice to test URI equivalence. | |||
| These methods fall into a range, distinguished by the amount of | These methods fall into a range, distinguished by the amount of | |||
| processing required and the degree to which the probability of false | processing required and the degree to which the probability of false | |||
| negatives is reduced. As noted above, false negatives cannot be | negatives is reduced. As noted above, false negatives cannot be | |||
| eliminated. In practice, their probability can be reduced, but this | eliminated. In practice, their probability can be reduced, but this | |||
| reduction requires more processing and is not cost-effective for all | reduction requires more processing and is not cost-effective for all | |||
| applications. | applications. | |||
| If this range of comparison practices is considered as a ladder, the | If this range of comparison practices is considered as a ladder, the | |||
| following discussion will climb the ladder, starting with those | following discussion will climb the ladder, starting with practices | |||
| practices that are cheap but have a relatively higher chance of | that are cheap but have a relatively higher chance of producing false | |||
| producing false negatives, and proceeding to those that have higher | negatives, and proceeding to those that have higher computational | |||
| computational cost and lower risk of false negatives. | cost and lower risk of false negatives. | |||
| 6.2.1 Simple String Comparison | 6.2.1. Simple String Comparison | |||
| If two URIs, considered as character strings, are identical, then it | If two URIs, when considered as character strings, are identical, | |||
| is safe to conclude that they are equivalent. This type of | then it is safe to conclude that they are equivalent. This type of | |||
| equivalence test has very low computational cost and is in wide use | equivalence test has very low computational cost and is in wide use | |||
| in a variety of applications, particularly in the domain of parsing. | in a variety of applications, particularly in the domain of parsing. | |||
| Testing strings for equivalence requires some basic precautions. | Testing strings for equivalence requires some basic precautions. | |||
| This procedure is often referred to as "bit-for-bit" or | This procedure is often referred to as "bit-for-bit" or | |||
| "byte-for-byte" comparison, which is potentially misleading. Testing | "byte-for-byte" comparison, which is potentially misleading. Testing | |||
| of strings for equality is normally based on pairwise comparison of | strings for equality is normally based on pair comparison of the | |||
| the characters that make up the strings, starting from the first and | characters that make up the strings, starting from the first and | |||
| proceeding until both strings are exhausted and all characters found | proceeding until both strings are exhausted and all characters are | |||
| to be equal, a pair of characters compares unequal, or one of the | found to be equal, until a pair of characters compares unequal, or | |||
| strings is exhausted before the other. | until one of the strings is exhausted before the other. | |||
| Such character comparisons require that each pair of characters be | This character comparison requires that each pair of characters be | |||
| put in comparable form. For example, should one URI be stored in a | put in comparable form. For example, should one URI be stored in a | |||
| byte array in EBCDIC encoding, and the second be in a Java String | byte array in EBCDIC encoding and the second in a Java String object | |||
| object (UTF-16), bit-for-bit comparisons applied naively will produce | (UTF-16), bit-for-bit comparisons applied naively will produce | |||
| errors. It is better to speak of equality on a | errors. It is better to speak of equality on a character-for- | |||
| character-for-character rather than byte-for-byte or bit-for-bit | character basis rather than on a byte-for-byte or bit-for-bit basis. | |||
| basis. In practical terms, character-by-character comparisons should | In practical terms, character-by-character comparisons should be done | |||
| be done codepoint-by-codepoint after conversion to a common character | codepoint-by-codepoint after conversion to a common character | |||
| encoding. | encoding. | |||
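The EBCDIC/UTF-16 case above can be illustrated with a short sketch. It assumes Python's "cp037" codec as a stand-in for one EBCDIC code page and "utf-16-be" for the UTF-16 form; the function name `same_characters` and the sample URI are hypothetical.

```python
def same_characters(a: bytes, a_encoding: str, b: bytes, b_encoding: str) -> bool:
    """Decode both byte strings to Unicode text, then compare the text.

    Python compares str objects code point by code point, which is the
    comparison the paragraph above recommends.
    """
    return a.decode(a_encoding) == b.decode(b_encoding)


uri = "http://example.com/a"
ebcdic_bytes = uri.encode("cp037")     # assumption: cp037 as the EBCDIC code page
utf16_bytes = uri.encode("utf-16-be")  # assumption: big-endian UTF-16, no BOM

# The raw octets differ even though the characters are identical.
print(ebcdic_bytes == utf16_bytes)
print(same_characters(ebcdic_bytes, "cp037", utf16_bytes, "utf-16-be"))
```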
| False negatives are caused by the production and use of URI aliases. | False negatives are caused by the production and use of URI aliases. | |||
| Unnecessary aliases can be reduced, regardless of the comparison | Unnecessary aliases can be reduced, regardless of the comparison | |||
| method, by consistently providing URI references in an | method, by consistently providing URI references in an already- | |||
| already-normalized form (i.e., a form identical to what would be | normalized form (i.e., a form identical to what would be produced | |||
| produced after normalization is applied, as described below). | after normalization is applied, as described below). | |||
| Protocols and data formats often choose to limit some URI comparisons | Protocols and data formats often limit some URI comparisons to simple | |||
| to simple string comparison, based on the theory that people and | string comparison, based on the theory that people and | |||
| implementations will, in their own best interest, be consistent in | implementations will, in their own best interest, be consistent in | |||
| providing URI references, or at least consistent enough to negate any | providing URI references, or at least consistent enough to negate any | |||
| efficiency that might be obtained from further normalization. | efficiency that might be obtained from further normalization. | |||
| 6.2.2 Syntax-based Normalization | 6.2.2. Syntax-Based Normalization | |||
| Implementations may use logic based on the definitions provided by | Implementations may use logic based on the definitions provided by | |||
| this specification to reduce the probability of false negatives. | this specification to reduce the probability of false negatives. | |||
| Such processing is moderately higher in cost than | This processing is moderately higher in cost than character-for- | |||
| character-for-character string comparison. For example, an | character string comparison. For example, an application using this | |||
| application using this approach could reasonably consider the | approach could reasonably consider the following two URIs equivalent: | |||
| following two URIs equivalent: | ||||
| example://a/b/c/%7Bfoo%7D | example://a/b/c/%7Bfoo%7D | |||
| eXAMPLE://a/./b/../b/%63/%7bfoo%7d | eXAMPLE://a/./b/../b/%63/%7bfoo%7d | |||
| Web user agents, such as browsers, typically apply this type of URI | Web user agents, such as browsers, typically apply this type of URI | |||
| normalization when determining whether a cached response is | normalization when determining whether a cached response is | |||
| available. Syntax-based normalization includes such techniques as | available. Syntax-based normalization includes such techniques as | |||
| case normalization, percent-encoding normalization, and removal of | case normalization, percent-encoding normalization, and removal of | |||
| dot-segments. | dot-segments. | |||
| 6.2.2.1 Case Normalization | 6.2.2.1. Case Normalization | |||
| For all URIs, the hexadecimal digits within a percent-encoding | For all URIs, the hexadecimal digits within a percent-encoding | |||
| triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore | triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore | |||
| should be normalized to use uppercase letters for the digits A-F. | should be normalized to use uppercase letters for the digits A-F. | |||
| When a URI uses components of the generic syntax, the component | When a URI uses components of the generic syntax, the component | |||
| syntax equivalence rules always apply; namely, that the scheme and | syntax equivalence rules always apply; namely, that the scheme and | |||
| host are case-insensitive and therefore should be normalized to | host are case-insensitive and therefore should be normalized to | |||
| lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is | lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is | |||
| equivalent to <http://www.example.com/>. The other generic syntax | equivalent to <http://www.example.com/>. The other generic syntax | |||
| components are assumed to be case-sensitive unless specifically | components are assumed to be case-sensitive unless specifically | |||
| defined otherwise by the scheme (see Section 6.2.3). | defined otherwise by the scheme (see Section 6.2.3). | |||
| 6.2.2.2 Percent-Encoding Normalization | 6.2.2.2. Percent-Encoding Normalization | |||
| The percent-encoding mechanism (Section 2.1) is a frequent source of | The percent-encoding mechanism (Section 2.1) is a frequent source of | |||
| variance among otherwise identical URIs. In addition to the case | variance among otherwise identical URIs. In addition to the case | |||
| normalization issue noted above, some URI producers percent-encode | normalization issue noted above, some URI producers percent-encode | |||
| octets that do not require percent-encoding, resulting in URIs that | octets that do not require percent-encoding, resulting in URIs that | |||
| are equivalent to their non-encoded counterparts. Such URIs should | are equivalent to their non-encoded counterparts. These URIs should | |||
| be normalized by decoding any percent-encoded octet that corresponds | be normalized by decoding any percent-encoded octet that corresponds | |||
| to an unreserved character, as described in Section 2.3. | to an unreserved character, as described in Section 2.3. | |||
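One way to apply the percent-encoding rule together with the hexadecimal case rule of Section 6.2.2.1 is sketched below. It assumes the unreserved set of Section 2.3 is ALPHA, DIGIT, "-", ".", "_", and "~"; the function name `normalize_percent_encoding` is hypothetical, and the sketch works on one component at a time so that decoded characters are never confused with delimiters.

```python
import re

# Assumption: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" (Section 2.3).
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(component: str) -> str:
    """Decode triplets for unreserved characters; uppercase the hex of the rest."""
    def fix(match: "re.Match[str]") -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                       # e.g. "%63" becomes "c"
        return "%" + match.group(1).upper()   # e.g. "%7b" becomes "%7B"
    # A single left-to-right pass: "%2541" stays "%2541" and is not decoded twice.
    return PCT_TRIPLET.sub(fix, component)


print(normalize_percent_encoding("/b/%63/%7bfoo%7d"))
```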
| 6.2.2.3 Path Segment Normalization | 6.2.2.3. Path Segment Normalization | |||
| The complete path segments "." and ".." are intended only for use | The complete path segments "." and ".." are intended only for use | |||
| within relative references (Section 4.1) and are removed as part of | within relative references (Section 4.1) and are removed as part of | |||
| the reference resolution process (Section 5.2). However, some | the reference resolution process (Section 5.2). However, some | |||
| deployed implementations incorrectly assume that reference resolution | deployed implementations incorrectly assume that reference resolution | |||
| is not necessary when the reference is already a URI, and thus fail | is not necessary when the reference is already a URI and thus fail to | |||
| to remove dot-segments when they occur in non-relative paths. URI | remove dot-segments when they occur in non-relative paths. URI | |||
| normalizers should remove dot-segments by applying the | normalizers should remove dot-segments by applying the | |||
| remove_dot_segments algorithm to the path, as described in | remove_dot_segments algorithm to the path, as described in | |||
| Section 5.2.4. | Section 5.2.4. | |||
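The following is a sketch of the remove_dot_segments routine as one reading of Section 5.2.4, with an input buffer consumed from the left and an output list of segments. It is an illustration, not the normative text; the step labels in the comments are assumptions about how that section's rules map onto the code.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a path (sketch of Section 5.2.4)."""
    output = []  # completed segments, each with its leading "/" if it had one
    while path:
        # Step A: drop a leading "../" or "./".
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        # Step B: a complete "." segment collapses to "/".
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        # Step C: a complete ".." segment also removes the last output segment.
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        # Step D: a lone "." or ".." is discarded.
        elif path in (".", ".."):
            path = ""
        # Step E: move the first segment (up to the next "/") to the output.
        else:
            next_slash = path.find("/", 1)
            if next_slash == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:next_slash])
                path = path[next_slash:]
    return "".join(output)


print(remove_dot_segments("/a/./b/../b/%63/%7bfoo%7d"))
```

Because popping from an empty output list is skipped, excess ".." segments cannot climb above the root of the path.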
| 6.2.3 Scheme-based Normalization | 6.2.3. Scheme-Based Normalization | |||
| The syntax and semantics of URIs vary from scheme to scheme, as | The syntax and semantics of URIs vary from scheme to scheme, as | |||
| described by the defining specification for each scheme. | described by the defining specification for each scheme. | |||
| Implementations may use scheme-specific rules, at further processing | Implementations may use scheme-specific rules, at further processing | |||
| cost, to reduce the probability of false negatives. For example, | cost, to reduce the probability of false negatives. For example, | |||
| since the "http" scheme makes use of an authority component, has a | because the "http" scheme makes use of an authority component, has a | |||
| default port of "80", and defines an empty path to be equivalent to | default port of "80", and defines an empty path to be equivalent to | |||
| "/", the following four URIs are equivalent: | "/", the following four URIs are equivalent: | |||
| http://example.com | http://example.com | |||
| http://example.com/ | http://example.com/ | |||
| http://example.com:/ | http://example.com:/ | |||
| http://example.com:80/ | http://example.com:80/ | |||
| In general, a URI that uses the generic syntax for authority with an | In general, a URI that uses the generic syntax for authority with an | |||
| empty path should be normalized to a path of "/"; likewise, an | empty path should be normalized to a path of "/". Likewise, an | |||
| explicit ":port", where the port is empty or the default for the | explicit ":port", for which the port is empty or the default for the | |||
| scheme, is equivalent to one where the port and its ":" delimiter are | scheme, is equivalent to one where the port and its ":" delimiter are | |||
| elided, and thus should be removed by scheme-based normalization. | elided and thus should be removed by scheme-based normalization. For | |||
| For example, the second URI above is the normal form for the "http" | example, the second URI above is the normal form for the "http" | |||
| scheme. | scheme. | |||
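The "http" rules above can be sketched as follows. The component-splitting regular expression is assumed to be the one given in this specification's Appendix B; it is used instead of a library splitter so that an empty query ("?") stays distinguishable from an absent one. The function name `normalize_http` is hypothetical, and only the default port stated in the text (80) is handled.

```python
import re

# Assumption: the component-splitting expression from Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def normalize_http(uri: str) -> str:
    """Apply the "http" scheme-based rules; return other URIs unchanged."""
    scheme, authority, path, query, fragment = URI_RE.match(uri).group(2, 4, 5, 7, 9)
    if scheme is None or scheme.lower() != "http" or authority is None:
        return uri
    userinfo, at, hostport = authority.rpartition("@")
    host, colon, port = hostport.rpartition(":")
    if not colon or port.endswith("]"):
        # No port delimiter, or the last ":" belongs to an IP literal.
        host, port = hostport, ""
    if port == "80":
        port = ""                 # default port: drop it and its ":" delimiter
    result = "http://" + userinfo + at + host.lower()
    if port:
        result += ":" + port
    result += path or "/"         # an empty path is equivalent to "/"
    if query is not None:
        result += "?" + query     # keep the "?" delimiter even when empty
    if fragment is not None:
        result += "#" + fragment
    return result


for uri in ["http://example.com", "http://example.com/",
            "http://example.com:/", "http://example.com:80/"]:
    print(normalize_http(uri))
```

The sketch does not attempt numeric port comparison (for example, "080"), percent-encoding normalization, or dot-segment removal; those would be applied separately.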
| Another case where normalization varies by scheme is in the handling | Another case where normalization varies by scheme is in the handling | |||
| of an empty authority component or empty host subcomponent. For many | of an empty authority component or empty host subcomponent. For many | |||
| scheme specifications, an empty authority or host is considered an | scheme specifications, an empty authority or host is considered an | |||
| error; for others, it is considered equivalent to "localhost" or the | error; for others, it is considered equivalent to "localhost" or the | |||
| end-user's host. When a scheme defines a default for authority and a | end-user's host. When a scheme defines a default for authority and a | |||
| URI reference to that default is desired, the reference should be | URI reference to that default is desired, the reference should be | |||
| normalized to an empty authority for the sake of uniformity, brevity, | normalized to an empty authority for the sake of uniformity, brevity, | |||
| and internationalization. If, however, either the userinfo or port | and internationalization. If, however, either the userinfo or port | |||
| subcomponent is non-empty, then the host should be given explicitly | subcomponents are non-empty, then the host should be given explicitly | |||
| even if it matches the default. | even if it matches the default. | |||
| Normalization should not remove delimiters when their associated | Normalization should not remove delimiters when their associated | |||
| component is empty unless licensed to do so by the scheme | component is empty unless licensed to do so by the scheme | |||
| specification. For example, the URI "http://example.com/?" cannot be | specification. For example, the URI "http://example.com/?" cannot be | |||
| assumed to be equivalent to any of the examples above. Likewise, the | assumed to be equivalent to any of the examples above. Likewise, the | |||
| presence or absence of delimiters within a userinfo subcomponent is | presence or absence of delimiters within a userinfo subcomponent is | |||
| usually significant to its interpretation. The fragment component is | usually significant to its interpretation. The fragment component is | |||
| not subject to any scheme-based normalization; thus, two URIs that | not subject to any scheme-based normalization; thus, two URIs that | |||
| differ only by the suffix "#" are considered different regardless of | differ only by the suffix "#" are considered different regardless of | |||
| the scheme. | the scheme. | |||
| Some schemes define additional subcomponents that consist of | Some schemes define additional subcomponents that consist of case- | |||
| case-insensitive data, giving an implicit license to normalizers to | insensitive data, giving an implicit license to normalizers to | |||
| convert such data to a common case (e.g., all lowercase). For | convert this data to a common case (e.g., all lowercase). For | |||
| example, URI schemes that define a subcomponent of path to contain an | example, URI schemes that define a subcomponent of path to contain an | |||
| Internet hostname, such as the "mailto" URI scheme, cause that | Internet hostname, such as the "mailto" URI scheme, cause that | |||
| subcomponent to be case-insensitive and thus subject to case | subcomponent to be case-insensitive and thus subject to case | |||
| normalization (e.g., "mailto:Joe@Example.COM" is equivalent to | normalization (e.g., "mailto:Joe@Example.COM" is equivalent to | |||
| "mailto:Joe@example.com" even though the generic syntax considers the | "mailto:Joe@example.com", even though the generic syntax considers | |||
| path component to be case-sensitive). | the path component to be case-sensitive). | |||
| Other scheme-specific normalizations are possible. | Other scheme-specific normalizations are possible. | |||
| 6.2.4 Protocol-based Normalization | 6.2.4. Protocol-Based Normalization | |||
| Web spiders, for which substantial effort to reduce the incidence of | Substantial effort to reduce the incidence of false negatives is | |||
| false negatives is often cost-effective, are observed to implement | often cost-effective for web spiders. Therefore, they implement even | |||
| even more aggressive techniques in URI comparison. For example, if | more aggressive techniques in URI comparison. For example, if they | |||
| they observe that a URI such as | observe that a URI such as | |||
| http://example.com/data | http://example.com/data | |||
| redirects to a URI differing only in the trailing slash | redirects to a URI differing only in the trailing slash | |||
| http://example.com/data/ | http://example.com/data/ | |||
| they will likely regard the two as equivalent in the future. This | they will likely regard the two as equivalent in the future. This | |||
| kind of technique is only appropriate when equivalence is clearly | kind of technique is only appropriate when equivalence is clearly | |||
| indicated by both the result of accessing the resources and the | indicated by both the result of accessing the resources and the | |||
| common conventions of their scheme's dereference algorithm (in this | common conventions of their scheme's dereference algorithm (in this | |||
| case, use of redirection by HTTP origin servers to avoid problems | case, use of redirection by HTTP origin servers to avoid problems | |||
| with relative references). | with relative references). | |||
| 7. Security Considerations | 7. Security Considerations | |||
| A URI does not in itself pose a security threat. However, since URIs | A URI does not in itself pose a security threat. However, as URIs | |||
| are often used to provide a compact set of instructions for access to | are often used to provide a compact set of instructions for access to | |||
| network resources, care must be taken to properly interpret the data | network resources, care must be taken to properly interpret the data | |||
| within a URI, to prevent that data from causing unintended access, | within a URI, to prevent that data from causing unintended access, | |||
| and to avoid including data that should not be revealed in plain | and to avoid including data that should not be revealed in plain | |||
| text. | text. | |||
| 7.1 Reliability and Consistency | 7.1. Reliability and Consistency | |||
| There is no guarantee that, having once used a given URI to retrieve | There is no guarantee that once a URI has been used to retrieve | |||
| some information, the same information will be retrievable by that | information, the same information will be retrievable by that URI in | |||
| URI in the future. Nor is there any guarantee that the information | the future. Nor is there any guarantee that the information | |||
| retrievable via that URI in the future will be observably similar to | retrievable via that URI in the future will be observably similar to | |||
| that retrieved in the past. The URI syntax does not constrain how a | that retrieved in the past. The URI syntax does not constrain how a | |||
| given scheme or authority apportions its name space or maintains it | given scheme or authority apportions its namespace or maintains it | |||
| over time. Such a guarantee can only be obtained from the person(s) | over time. Such guarantees can only be obtained from the person(s) | |||
| controlling that name space and the resource in question. A specific | controlling that namespace and the resource in question. A specific | |||
| URI scheme may define additional semantics, such as name persistence, | URI scheme may define additional semantics, such as name persistence, | |||
| if those semantics are required of all naming authorities for that | if those semantics are required of all naming authorities for that | |||
| scheme. | scheme. | |||
| 7.2 Malicious Construction | 7.2. Malicious Construction | |||
| It is sometimes possible to construct a URI such that an attempt to | It is sometimes possible to construct a URI so that an attempt to | |||
| perform a seemingly harmless, idempotent operation, such as the | perform a seemingly harmless, idempotent operation, such as the | |||
| retrieval of a representation, will in fact cause a possibly damaging | retrieval of a representation, will in fact cause a possibly damaging | |||
| remote operation to occur. The unsafe URI is typically constructed | remote operation. The unsafe URI is typically constructed by | |||
| by specifying a port number other than that reserved for the network | specifying a port number other than that reserved for the network | |||
| protocol in question. The client unwittingly contacts a site that is | protocol in question. The client unwittingly contacts a site running | |||
| running a different protocol service and data within the URI contains | a different protocol service, and data within the URI contains | |||
| instructions that, when interpreted according to this other protocol, | instructions that, when interpreted according to this other protocol, | |||
| cause an unexpected operation. A frequent example of such abuse has | cause an unexpected operation. A frequent example of such abuse has | |||
| been the use of a protocol-based scheme with a port component of | been the use of a protocol-based scheme with a port component of | |||
| "25", thereby fooling user agent software into sending an unintended | "25", thereby fooling user agent software into sending an unintended | |||
| or impersonating message via an SMTP server. | or impersonating message via an SMTP server. | |||
| Applications should prevent dereference of a URI that specifies a TCP | Applications should prevent dereference of a URI that specifies a TCP | |||
| port number within the "well-known port" range (0 - 1023) unless the | port number within the "well-known port" range (0 - 1023) unless the | |||
| protocol being used to dereference that URI is compatible with the | protocol being used to dereference that URI is compatible with the | |||
| protocol expected on that well-known port. Although IANA maintains a | protocol expected on that well-known port. Although IANA maintains a | |||
| registry of well-known ports, applications should make such | registry of well-known ports, applications should make such | |||
| restrictions user-configurable to avoid preventing the deployment of | restrictions user-configurable to avoid preventing the deployment of | |||
| new services. | new services. | |||
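A user-configurable check of this kind might look like the sketch below. The allow-list contents and the names `ALLOWED_WELL_KNOWN` and `may_dereference` are hypothetical; a real application would load the table from its configuration rather than hard-code it.

```python
from urllib.parse import urlsplit

# Hypothetical, user-configurable table: for each scheme, the well-known
# ports (0 - 1023) on which its protocol is expected to be spoken.
ALLOWED_WELL_KNOWN = {"http": {80}, "https": {443}, "ftp": {21}}


def may_dereference(uri: str, allowed=ALLOWED_WELL_KNOWN) -> bool:
    """Refuse well-known ports that are not registered for the URI's scheme."""
    parts = urlsplit(uri)
    port = parts.port  # None when no port is given; ValueError if malformed
    if port is None or port > 1023:
        return True    # default port, or outside the well-known range
    return port in allowed.get(parts.scheme, set())


print(may_dereference("http://example.com:25/"))   # SMTP port via an http URI
print(may_dereference("http://example.com:80/"))
```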
| When a URI contains percent-encoded octets that match the delimiters | When a URI contains percent-encoded octets that match the delimiters | |||
| for a given resolution or dereference protocol (for example, CR and | for a given resolution or dereference protocol (for example, CR and | |||
| LF characters for the TELNET protocol), such percent-encoded octets | LF characters for the TELNET protocol), these percent-encodings must | |||
| must not be decoded before transmission across that protocol. | not be decoded before transmission across that protocol. Transfer of | |||
| Transfer of the percent-encoding, which might violate the protocol, | the percent-encoding, which might violate the protocol, is less | |||
| is less harmful than allowing decoded octets to be interpreted as | harmful than allowing decoded octets to be interpreted as additional | |||
| additional operations or parameters, perhaps triggering an unexpected | operations or parameters, perhaps triggering an unexpected and | |||
| and possibly harmful remote operation. | possibly harmful remote operation. | |||
| 7.3 Back-end Transcoding | 7.3. Back-End Transcoding | |||
| When a URI is dereferenced, the data within it is often parsed by | When a URI is dereferenced, the data within it is often parsed by | |||
| both the user agent and one or more servers. In HTTP, for example, a | both the user agent and one or more servers. In HTTP, for example, a | |||
| typical user agent will parse a URI into its five major components, | typical user agent will parse a URI into its five major components, | |||
| access the authority's server, and send it the data within the | access the authority's server, and send it the data within the | |||
| authority, path, and query components. A typical server will take | authority, path, and query components. A typical server will take | |||
| that information, parse the path into segments and the query into | that information, parse the path into segments and the query into | |||
| key/value pairs, and then invoke implementation-specific handlers to | key/value pairs, and then invoke implementation-specific handlers to | |||
| respond to the request. As a result, a common security concern for | respond to the request. As a result, a common security concern for | |||
| server implementations that handle a URI, either as a whole or split | server implementations that handle a URI, either as a whole or split | |||
| into separate components, is proper interpretation of the octet data | into separate components, is proper interpretation of the octet data | |||
| represented by the characters and percent-encodings within that URI. | represented by the characters and percent-encodings within that URI. | |||
| Percent-encoded octets must be decoded at some point during the | Percent-encoded octets must be decoded at some point during the | |||
| dereference process. Applications must split the URI into its | dereference process. Applications must split the URI into its | |||
| components and subcomponents prior to decoding the octets, since | components and subcomponents prior to decoding the octets, as | |||
| otherwise the decoded octets might be mistaken for delimiters. | otherwise the decoded octets might be mistaken for delimiters. | |||
| Security checks of the data within a URI should be applied after | Security checks of the data within a URI should be applied after | |||
| decoding the octets. Note, however, that the "%00" percent-encoding | decoding the octets. Note, however, that the "%00" percent-encoding | |||
| (NUL) may require special handling and should be rejected if the | (NUL) may require special handling and should be rejected if the | |||
| application is not expecting to receive raw data within a component. | application is not expecting to receive raw data within a component. | |||
| Special care should be taken when the URI path interpretation process | Special care should be taken when the URI path interpretation process | |||
| involves the use of a back-end filesystem or related system | involves the use of a back-end file system or related system | |||
| functions. Filesystems typically assign an operational meaning to | functions. File systems typically assign an operational meaning to | |||
| special characters, such as the "/", "\", ":", "[", and "]" | special characters, such as the "/", "\", ":", "[", and "]" | |||
| characters, and special device names like ".", "..", "...", "aux", | characters, and to special device names like ".", "..", "...", "aux", | |||
| "lpt", etc. In some cases, merely testing for the existence of such | "lpt", etc. In some cases, merely testing for the existence of such | |||
| a name will cause the operating system to pause or invoke unrelated | a name will cause the operating system to pause or invoke unrelated | |||
| system calls, leading to significant security concerns regarding | system calls, leading to significant security concerns regarding | |||
| denial of service and unintended data transfer. It would be | denial of service and unintended data transfer. It would be | |||
| impossible for this specification to list all such significant | impossible for this specification to list all such significant | |||
| characters and device names; implementers should research the | characters and device names. Implementers should research the | |||
| reserved names and characters for the types of storage device that | reserved names and characters for the types of storage device that | |||
| may be attached to their application and restrict the use of data | may be attached to their applications and restrict the use of data | |||
| obtained from URI components accordingly. | obtained from URI components accordingly. | |||
| 7.4 Rare IP Address Formats | 7.4. Rare IP Address Formats | |||
| Although the URI syntax for IPv4address only allows the common, | Although the URI syntax for IPv4address only allows the common | |||
| dotted-decimal form of IPv4 address literal, many implementations | dotted-decimal form of IPv4 address literal, many implementations | |||
| that process URIs make use of platform-dependent system routines, | that process URIs make use of platform-dependent system routines, | |||
| such as gethostbyname() and inet_aton(), to translate the string | such as gethostbyname() and inet_aton(), to translate the string | |||
| literal to an actual IP address. Unfortunately, such system routines | literal to an actual IP address. Unfortunately, such system routines | |||
| often allow and process a much larger set of formats than those | often allow and process a much larger set of formats than those | |||
| described in Section 3.2.2. | described in Section 3.2.2. | |||
| For example, many implementations allow dotted forms of three | For example, many implementations allow dotted forms of three | |||
| numbers, wherein the last part is interpreted as a 16-bit quantity | numbers, wherein the last part is interpreted as a 16-bit quantity | |||
| and placed in the right-most two bytes of the network address (e.g., | and placed in the right-most two bytes of the network address (e.g., | |||
| a Class B network). Likewise, a dotted form of two numbers means the | a Class B network). Likewise, a dotted form of two numbers means | |||
| last part is interpreted as a 24-bit quantity and placed in the right | that the last part is interpreted as a 24-bit quantity and placed in | |||
| most three bytes of the network address (Class A), and a single | the right-most three bytes of the network address (Class A), and a | |||
| number (without dots) is interpreted as a 32-bit quantity and stored | single number (without dots) is interpreted as a 32-bit quantity and | |||
| directly in the network address. Adding further to the confusion, | stored directly in the network address. Adding further to the | |||
| some implementations allow each dotted part to be interpreted as | confusion, some implementations allow each dotted part to be | |||
| decimal, octal, or hexadecimal, as specified in the C language (i.e., | interpreted as decimal, octal, or hexadecimal, as specified in the C | |||
| a leading 0x or 0X implies hexadecimal; otherwise, a leading 0 | language (i.e., a leading 0x or 0X implies hexadecimal; a leading 0 | |||
| implies octal; otherwise, the number is interpreted as decimal). | implies octal; otherwise, the number is interpreted as decimal). | |||
| These additional IP address formats are not allowed in the URI syntax | These additional IP address formats are not allowed in the URI syntax | |||
| due to differences between platform implementations. However, they | due to differences between platform implementations. However, they | |||
| can become a security concern if an application attempts to filter | can become a security concern if an application attempts to filter | |||
| access to resources based on the IP address in string literal format. | access to resources based on the IP address in string literal format. | |||
| If such filtering is performed, literals should be converted to | If this filtering is performed, literals should be converted to | |||
| numeric form and filtered based on the numeric value, rather than a | numeric form and filtered based on the numeric value, and not on a | |||
| prefix or suffix of the string form. | prefix or suffix of the string form. | |||
| 7.5 Sensitive Information | 7.5. Sensitive Information | |||
| URI producers should not provide a URI that contains a username or | URI producers should not provide a URI that contains a username or | |||
| password which is intended to be secret: URIs are frequently | password that is intended to be secret. URIs are frequently | |||
| displayed by browsers, stored in clear text bookmarks, and logged by | displayed by browsers, stored in clear text bookmarks, and logged by | |||
| user agent history and intermediary applications (proxies). A | user agent history and intermediary applications (proxies). A | |||
| password appearing within the userinfo component is deprecated and | password appearing within the userinfo component is deprecated and | |||
| should be considered an error (or simply ignored) except in those | should be considered an error (or simply ignored) except in those | |||
| rare cases where the 'password' parameter is intended to be public. | rare cases where the 'password' parameter is intended to be public. | |||
| 7.6 Semantic Attacks | 7.6. Semantic Attacks | |||
| Because the userinfo subcomponent is rarely used and appears before | Because the userinfo subcomponent is rarely used and appears before | |||
| the host in the authority component, it can be used to construct a | the host in the authority component, it can be used to construct a | |||
| URI that is intended to mislead a human user by appearing to identify | URI intended to mislead a human user by appearing to identify one | |||
| one (trusted) naming authority while actually identifying a different | (trusted) naming authority while actually identifying a different | |||
| authority hidden behind the noise. For example | authority hidden behind the noise. For example | |||
| ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm | ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm | |||
| might lead a human user to assume that the host is 'cnn.example.com', | might lead a human user to assume that the host is 'cnn.example.com', | |||
| whereas it is actually '10.0.0.1'. Note that a misleading userinfo | whereas it is actually '10.0.0.1'. Note that a misleading userinfo | |||
| subcomponent could be much longer than the example above. | subcomponent could be much longer than the example above. | |||
| A misleading URI, such as the one above, is an attack on the user's | A misleading URI, such as that above, is an attack on the user's | |||
| preconceived notions about the meaning of a URI, rather than an | preconceived notions about the meaning of a URI rather than an attack | |||
| attack on the software itself. User agents may be able to reduce the | on the software itself. User agents may be able to reduce the impact | |||
| impact of such attacks by distinguishing the various components of | of such attacks by distinguishing the various components of the URI | |||
| the URI when rendered, such as by using a different color or tone to | when they are rendered, such as by using a different color or tone to | |||
| render userinfo if any is present, though there is no general | render userinfo if any is present, though there is no panacea. More | |||
| panacea. More information on URI-based semantic attacks can be found | information on URI-based semantic attacks can be found in [Siedzik]. | |||
| in [Siedzik]. | ||||
| 8. IANA Considerations | 8. IANA Considerations | |||
| URI scheme names, as defined by <scheme> in Section 3.1, form a | URI scheme names, as defined by <scheme> in Section 3.1, form a | |||
| registered name space that is managed by IANA according to the | registered namespace that is managed by IANA according to the | |||
| procedures defined in [BCP35]. No IANA actions are required by this | procedures defined in [BCP35]. No IANA actions are required by this | |||
| document. | document. | |||
| 9. Acknowledgments | 9. Acknowledgements | |||
| This specification is derived from RFC 2396 [RFC2396], RFC 1808 | This specification is derived from RFC 2396 [RFC2396], RFC 1808 | |||
| [RFC1808], and RFC 1738 [RFC1738]; the acknowledgments in those | [RFC1808], and RFC 1738 [RFC1738]; the acknowledgements in those | |||
| documents still apply. It also incorporates the update (with | documents still apply. It also incorporates the update (with | |||
| corrections) for IPv6 literals in the host syntax, as defined by | corrections) for IPv6 literals in the host syntax, as defined by | |||
| Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in | Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in | |||
| [RFC2732]. In addition, contributions by Gisle Aas, Reese Anschultz, | [RFC2732]. In addition, contributions by Gisle Aas, Reese Anschultz, | |||
| Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll, | Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll, | |||
| Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin | Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin | |||
| Duerst, Stefan Eissing, Clive D.W. Feather, Al Gilman, Tony Hammond, | Duerst, Stefan Eissing, Clive D.W. Feather, Al Gilman, Tony Hammond, | |||
| Elliotte Harold, Pat Hayes, Henry Holtzman, Ian B. Jacobs, Michael | Elliotte Harold, Pat Hayes, Henry Holtzman, Ian B. Jacobs, Michael | |||
| Kay, John C. Klensin, Graham Klyne, Dan Kohn, Bruce Lilly, Andrew | Kay, John C. Klensin, Graham Klyne, Dan Kohn, Bruce Lilly, Andrew | |||
| Main, Dave McAlpin, Ira McDonald, Michael Mealling, Ray Merkert, | Main, Dave McAlpin, Ira McDonald, Michael Mealling, Ray Merkert, | |||
| Stephen Pollei, Julian Reschke, Tomas Rokicki, Miles Sabin, Kai | Stephen Pollei, Julian Reschke, Tomas Rokicki, Miles Sabin, Kai | |||
| Schaetzl, Mark Thomson, Ronald Tschalaer, Norm Walsh, Marc Warne, | Schaetzl, Mark Thomson, Ronald Tschalaer, Norm Walsh, Marc Warne, | |||
| Stuart Williams, and Henry Zongaro are gratefully acknowledged. | Stuart Williams, and Henry Zongaro are gratefully acknowledged. | |||
| 10. References | 10. References | |||
| 10.1 Normative References | 10.1. Normative References | |||
| [ASCII] American National Standards Institute, "Coded Character | [ASCII] American National Standards Institute, "Coded Character | |||
| Set -- 7-bit American Standard Code for Information | Set -- 7-bit American Standard Code for Information | |||
| Interchange", ANSI X3.4, 1986. | Interchange", ANSI X3.4, 1986. | |||
| [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | |||
| Specifications: ABNF", RFC 2234, November 1997. | Specifications: ABNF", RFC 2234, November 1997. | |||
| [STD63] Yergeau, F., "UTF-8, a transformation format of ISO | [STD63] Yergeau, F., "UTF-8, a transformation format of | |||
| 10646", STD 63, RFC 3629, November 2003. | ISO 10646", STD 63, RFC 3629, November 2003. | |||
| [UCS] International Organization for Standardization, | [UCS] International Organization for Standardization, | |||
| "Information Technology - Universal Multiple-Octet Coded | "Information Technology - Universal Multiple-Octet Coded | |||
| Character Set (UCS)", ISO/IEC 10646:2003, December 2003. | Character Set (UCS)", ISO/IEC 10646:2003, December 2003. | |||
| 10.2 Informative References | 10.2. Informative References | |||
| [BCP19] Freed, N. and J. Postel, "IANA Charset Registration | [BCP19] Freed, N. and J. Postel, "IANA Charset Registration | |||
| Procedures", BCP 19, RFC 2978, October 2000. | Procedures", BCP 19, RFC 2978, October 2000. | |||
| [BCP35] Petke, R. and I. King, "Registration Procedures for URL | [BCP35] Petke, R. and I. King, "Registration Procedures for URL | |||
| Scheme Names", BCP 35, RFC 2717, November 1999. | Scheme Names", BCP 35, RFC 2717, November 1999. | |||
| [RFC0952] Harrenstien, K., Stahl, M. and E. Feinler, "DoD Internet | [RFC0952] Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet | |||
| host table specification", RFC 952, October 1985. | host table specification", RFC 952, October 1985. | |||
| [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", | [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", | |||
| STD 13, RFC 1034, November 1987. | STD 13, RFC 1034, November 1987. | |||
| [RFC1123] Braden, R., "Requirements for Internet Hosts - Application | [RFC1123] Braden, R., "Requirements for Internet Hosts - Application | |||
| and Support", STD 3, RFC 1123, October 1989. | and Support", STD 3, RFC 1123, October 1989. | |||
| [RFC1535] Gavron, E., "A Security Problem and Proposed Correction | [RFC1535] Gavron, E., "A Security Problem and Proposed Correction | |||
| With Widely Deployed DNS Software", RFC 1535, October | With Widely Deployed DNS Software", RFC 1535, | |||
| 1993. | October 1993. | |||
| [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A | [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A | |||
| Unifying Syntax for the Expression of Names and Addresses | Unifying Syntax for the Expression of Names and Addresses | |||
| of Objects on the Network as used in the World-Wide Web", | of Objects on the Network as used in the World-Wide Web", | |||
| RFC 1630, June 1994. | RFC 1630, June 1994. | |||
| [RFC1736] Kunze, J., "Functional Recommendations for Internet | [RFC1736] Kunze, J., "Functional Recommendations for Internet | |||
| Resource Locators", RFC 1736, February 1995. | Resource Locators", RFC 1736, February 1995. | |||
| [RFC1737] Masinter, L. and K. Sollins, "Functional Requirements for | [RFC1737] Sollins, K. and L. Masinter, "Functional Requirements for | |||
| Uniform Resource Names", RFC 1737, December 1994. | Uniform Resource Names", RFC 1737, December 1994. | |||
| [RFC1738] Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform | [RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform | |||
| Resource Locators (URL)", RFC 1738, December 1994. | Resource Locators (URL)", RFC 1738, December 1994. | |||
| [RFC1808] Fielding, R., "Relative Uniform Resource Locators", RFC | [RFC1808] Fielding, R., "Relative Uniform Resource Locators", | |||
| 1808, June 1995. | RFC 1808, June 1995. | |||
| [RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail | [RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail | |||
| Extensions (MIME) Part Two: Media Types", RFC 2046, | Extensions (MIME) Part Two: Media Types", RFC 2046, | |||
| November 1996. | November 1996. | |||
| [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | |||
| [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform | [RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform | |||
| Resource Identifiers (URI): Generic Syntax", RFC 2396, | Resource Identifiers (URI): Generic Syntax", RFC 2396, | |||
| August 1998. | August 1998. | |||
| [RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D. | [RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S., and D. | |||
| Jensen, "HTTP Extensions for Distributed Authoring -- | Jensen, "HTTP Extensions for Distributed Authoring -- | |||
| WEBDAV", RFC 2518, February 1999. | WEBDAV", RFC 2518, February 1999. | |||
| [RFC2557] Palme, F., Hopmann, A., Shelness, N. and E. Stefferud, | [RFC2557] Palme, J., Hopmann, A., and N. Shelness, "MIME | |||
| "MIME Encapsulation of Aggregate Documents, such as HTML | Encapsulation of Aggregate Documents, such as HTML | |||
| (MHTML)", RFC 2557, March 1999. | (MHTML)", RFC 2557, March 1999. | |||
| [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke, | [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D., and R. Petke, | |||
| "Guidelines for new URL Schemes", RFC 2718, November 1999. | "Guidelines for new URL Schemes", RFC 2718, November 1999. | |||
| [RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for | [RFC2732] Hinden, R., Carpenter, B., and L. Masinter, "Format for | |||
| Literal IPv6 Addresses in URL's", RFC 2732, December 1999. | Literal IPv6 Addresses in URL's", RFC 2732, December 1999. | |||
| [RFC3305] Mealling, M. and R. Denenberg, "Report from the Joint W3C/ | [RFC3305] Mealling, M. and R. Denenberg, "Report from the Joint | |||
| IETF URI Planning Interest Group: Uniform Resource | W3C/IETF URI Planning Interest Group: Uniform Resource | |||
| Identifiers (URIs), URLs, and Uniform Resource Names | Identifiers (URIs), URLs, and Uniform Resource Names | |||
| (URNs): Clarifications and Recommendations", RFC 3305, | (URNs): Clarifications and Recommendations", RFC 3305, | |||
| August 2002. | August 2002. | |||
| [RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, | [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, | |||
| "Internationalizing Domain Names in Applications (IDNA)", | "Internationalizing Domain Names in Applications (IDNA)", | |||
| RFC 3490, March 2003. | RFC 3490, March 2003. | |||
| [RFC3513] Hinden, R. and S. Deering, "Internet Protocol Version 6 | [RFC3513] Hinden, R. and S. Deering, "Internet Protocol Version 6 | |||
| (IPv6) Addressing Architecture", RFC 3513, April 2003. | (IPv6) Addressing Architecture", RFC 3513, April 2003. | |||
| [Siedzik] Siedzik, R., "Semantic Attacks: What's in a URL?", | [Siedzik] Siedzik, R., "Semantic Attacks: What's in a URL?", | |||
| April 2001, <http://www.giac.org/practical/gsec/ | April 2001, <http://www.giac.org/practical/gsec/ | |||
| Richard_Siedzik_GSEC.pdf>. | Richard_Siedzik_GSEC.pdf>. | |||
| Authors' Addresses | ||||
| Tim Berners-Lee | ||||
| World Wide Web Consortium | ||||
| Massachusetts Institute of Technology | ||||
| 77 Massachusetts Avenue | ||||
| Cambridge, MA 02139 | ||||
| USA | ||||
| Phone: +1-617-253-5702 | ||||
| Fax: +1-617-258-5999 | ||||
| EMail: timbl@w3.org | ||||
| URI: http://www.w3.org/People/Berners-Lee/ | ||||
| Roy T. Fielding | ||||
| Day Software | ||||
| 5251 California Ave., Suite 110 | ||||
| Irvine, CA 92617 | ||||
| USA | ||||
| Phone: +1-949-679-2960 | ||||
| Fax: +1-949-679-2972 | ||||
| EMail: fielding@gbiv.com | ||||
| URI: http://roy.gbiv.com/ | ||||
| Larry Masinter | ||||
| Adobe Systems Incorporated | ||||
| 345 Park Ave | ||||
| San Jose, CA 95110 | ||||
| USA | ||||
| Phone: +1-408-536-3024 | ||||
| EMail: LMM@acm.org | ||||
| URI: http://larry.masinter.net/ | ||||
| Appendix A. Collected ABNF for URI | Appendix A. Collected ABNF for URI | |||
| URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] | URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] | |||
| hier-part = "//" authority path-abempty | hier-part = "//" authority path-abempty | |||
| / path-absolute | / path-absolute | |||
| / path-rootless | / path-rootless | |||
| / path-empty | / path-empty | |||
| URI-reference = URI / relative-ref | URI-reference = URI / relative-ref | |||
| absolute-URI = scheme ":" hier-part [ "?" query ] | absolute-URI = scheme ":" hier-part [ "?" query ] | |||
| relative-ref = relative-part [ "?" query ] [ "#" fragment ] | relative-ref = relative-part [ "?" query ] [ "#" fragment ] | |||
| relative-part = "//" authority path-abempty | relative-part = "//" authority path-abempty | |||
| / path-absolute | / path-absolute | |||
| / path-noscheme | / path-noscheme | |||
| / path-empty | / path-empty | |||
| scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | |||
| authority = [ userinfo "@" ] host [ ":" port ] | authority = [ userinfo "@" ] host [ ":" port ] | |||
| userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) | userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) | |||
| host = IP-literal / IPv4address / reg-name | host = IP-literal / IPv4address / reg-name | |||
| port = *DIGIT | port = *DIGIT | |||
| IP-literal = "[" ( IPv6address / IPvFuture ) "]" | IP-literal = "[" ( IPv6address / IPvFuture ) "]" | |||
| IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) | IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) | |||
| IPv6address = 6( h16 ":" ) ls32 | IPv6address = 6( h16 ":" ) ls32 | |||
| / "::" 5( h16 ":" ) ls32 | / "::" 5( h16 ":" ) ls32 | |||
| / [ h16 ] "::" 4( h16 ":" ) ls32 | / [ h16 ] "::" 4( h16 ":" ) ls32 | |||
| / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 | / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 | |||
| / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 | / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 | |||
| / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 | / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 | |||
| / [ *4( h16 ":" ) h16 ] "::" ls32 | / [ *4( h16 ":" ) h16 ] "::" ls32 | |||
| / [ *5( h16 ":" ) h16 ] "::" h16 | / [ *5( h16 ":" ) h16 ] "::" h16 | |||
| / [ *6( h16 ":" ) h16 ] "::" | / [ *6( h16 ":" ) h16 ] "::" | |||
| h16 = 1*4HEXDIG | h16 = 1*4HEXDIG | |||
| ls32 = ( h16 ":" h16 ) / IPv4address | ls32 = ( h16 ":" h16 ) / IPv4address | |||
| IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | |||
| dec-octet = DIGIT ; 0-9 | ||||
| / %x31-39 DIGIT ; 10-99 | ||||
| / "1" 2DIGIT ; 100-199 | ||||
| / "2" %x30-34 DIGIT ; 200-249 | ||||
| / "25" %x30-35 ; 250-255 | ||||
| dec-octet = DIGIT ; 0-9 | reg-name = *( unreserved / pct-encoded / sub-delims ) | |||
| / %x31-39 DIGIT ; 10-99 | ||||
| / "1" 2DIGIT ; 100-199 | ||||
| / "2" %x30-34 DIGIT ; 200-249 | ||||
| / "25" %x30-35 ; 250-255 | ||||
| reg-name = *( unreserved / pct-encoded / sub-delims ) | path = path-abempty ; begins with "/" or is empty | |||
| / path-absolute ; begins with "/" but not "//" | ||||
| / path-noscheme ; begins with a non-colon segment | ||||
| / path-rootless ; begins with a segment | ||||
| / path-empty ; zero characters | ||||
| path = path-abempty ; begins with "/" or is empty | path-abempty = *( "/" segment ) | |||
| / path-absolute ; begins with "/" but not "//" | path-absolute = "/" [ segment-nz *( "/" segment ) ] | |||
| / path-noscheme ; begins with a non-colon segment | path-noscheme = segment-nz-nc *( "/" segment ) | |||
| / path-rootless ; begins with a segment | path-rootless = segment-nz *( "/" segment ) | |||
| / path-empty ; zero characters | path-empty = 0<pchar> | |||
| path-abempty = *( "/" segment ) | segment = *pchar | |||
| path-absolute = "/" [ segment-nz *( "/" segment ) ] | segment-nz = 1*pchar | |||
| path-noscheme = segment-nz-nc *( "/" segment ) | segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" ) | |||
| path-rootless = segment-nz *( "/" segment ) | ; non-zero-length segment without any colon ":" | |||
| path-empty = 0<pchar> | ||||
| segment = *pchar | pchar = unreserved / pct-encoded / sub-delims / ":" / "@" | |||
| segment-nz = 1*pchar | ||||
| segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" ) | ||||
| ; non-zero-length segment without any colon ":" | ||||
| pchar = unreserved / pct-encoded / sub-delims / ":" / "@" | query = *( pchar / "/" / "?" ) | |||
| query = *( pchar / "/" / "?" ) | fragment = *( pchar / "/" / "?" ) | |||
| fragment = *( pchar / "/" / "?" ) | ||||
| pct-encoded = "%" HEXDIG HEXDIG | pct-encoded = "%" HEXDIG HEXDIG | |||
| unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" | unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" | |||
| reserved = gen-delims / sub-delims | reserved = gen-delims / sub-delims | |||
| gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" | gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" | |||
| sub-delims = "!" / "$" / "&" / "'" / "(" / ")" | sub-delims = "!" / "$" / "&" / "'" / "(" / ")" | |||
| / "*" / "+" / "," / ";" / "=" | / "*" / "+" / "," / ";" / "=" | |||
| Appendix B. Parsing a URI Reference with a Regular Expression | Appendix B. Parsing a URI Reference with a Regular Expression | |||
| Since the "first-match-wins" algorithm is identical to the "greedy" | As the "first-match-wins" algorithm is identical to the "greedy" | |||
| disambiguation method used by POSIX regular expressions, it is | disambiguation method used by POSIX regular expressions, it is | |||
| natural and commonplace to use a regular expression for parsing the | natural and commonplace to use a regular expression for parsing the | |||
| potential five components of a URI reference. | potential five components of a URI reference. | |||
| The following line is the regular expression for breaking-down a | The following line is the regular expression for breaking-down a | |||
| well-formed URI reference into its components. | well-formed URI reference into its components. | |||
| ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? | ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? | |||
| 12 3 4 5 6 7 8 9 | 12 3 4 5 6 7 8 9 | |||
| skipping to change at page 51, line 39 ¶ | skipping to change at page 51, line 29 ¶ | |||
| $3 = //www.ics.uci.edu | $3 = //www.ics.uci.edu | |||
| $4 = www.ics.uci.edu | $4 = www.ics.uci.edu | |||
| $5 = /pub/ietf/uri/ | $5 = /pub/ietf/uri/ | |||
| $6 = <undefined> | $6 = <undefined> | |||
| $7 = <undefined> | $7 = <undefined> | |||
| $8 = #Related | $8 = #Related | |||
| $9 = Related | $9 = Related | |||
| where <undefined> indicates that the component is not present, as is | where <undefined> indicates that the component is not present, as is | |||
| the case for the query component in the above example. Therefore, we | the case for the query component in the above example. Therefore, we | |||
| can determine the value of the four components and fragment as | can determine the value of the five components as | |||
| scheme = $2 | scheme = $2 | |||
| authority = $4 | authority = $4 | |||
| path = $5 | path = $5 | |||
| query = $7 | query = $7 | |||
| fragment = $9 | fragment = $9 | |||
| and, going in the opposite direction, we can recreate a URI reference | Going in the opposite direction, we can recreate a URI reference from | |||
| from its components using the algorithm of Section 5.3. | its components by using the algorithm of Section 5.3. | |||
| Appendix C. Delimiting a URI in Context | Appendix C. Delimiting a URI in Context | |||
| URIs are often transmitted through formats that do not provide a | URIs are often transmitted through formats that do not provide a | |||
| clear context for their interpretation. For example, there are many | clear context for their interpretation. For example, there are many | |||
| occasions when a URI is included in plain text; examples include text | occasions when a URI is included in plain text; examples include text | |||
| sent in electronic mail, USENET news messages, and, most importantly, | sent in email, USENET news, and on printed paper. In such cases, it | |||
| printed on paper. In such cases, it is important to be able to | is important to be able to delimit the URI from the rest of the text, | |||
| delimit the URI from the rest of the text, and in particular from | and in particular from punctuation marks that might be mistaken for | |||
| punctuation marks that might be mistaken for part of the URI. | part of the URI. | |||
| In practice, URIs are delimited in a variety of ways, but usually | In practice, URIs are delimited in a variety of ways, but usually | |||
| within double-quotes "http://example.com/", angle brackets | within double-quotes "http://example.com/", angle brackets | |||
| <http://example.com/>, or just using whitespace | <http://example.com/>, or just by using whitespace: | |||
| http://example.com/ | http://example.com/ | |||
| These wrappers do not form part of the URI. | These wrappers do not form part of the URI. | |||
| In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may | In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may | |||
| need to be added to break a long URI across lines. The whitespace | have to be added to break a long URI across lines. The whitespace | |||
| should be ignored when extracting the URI. | should be ignored when the URI is extracted. | |||
| No whitespace should be introduced after a hyphen ("-") character. | No whitespace should be introduced after a hyphen ("-") character. | |||
| Because some typesetters and printers may (erroneously) introduce a | Because some typesetters and printers may (erroneously) introduce a | |||
| hyphen at the end of line when breaking a line, the interpreter of a | hyphen at the end of line when breaking it, the interpreter of a URI | |||
| URI containing a line break immediately after a hyphen should ignore | containing a line break immediately after a hyphen should ignore all | |||
| all whitespace around the line break, and should be aware that the | whitespace around the line break and should be aware that the hyphen | |||
| hyphen may or may not actually be part of the URI. | may or may not actually be part of the URI. | |||
| Using <> angle brackets around each URI is especially recommended as | Using <> angle brackets around each URI is especially recommended as | |||
| a delimiting style for a reference that contains embedded whitespace. | a delimiting style for a reference that contains embedded whitespace. | |||
| The prefix "URL:" (with or without a trailing space) was formerly | The prefix "URL:" (with or without a trailing space) was formerly | |||
| recommended as a way to help distinguish a URI from other bracketed | recommended as a way to help distinguish a URI from other bracketed | |||
| designators, though it is not commonly used in practice and is no | designators, though it is not commonly used in practice and is no | |||
| longer recommended. | longer recommended. | |||
| For robustness, software that accepts user-typed URI should attempt | For robustness, software that accepts user-typed URI should attempt | |||
| to recognize and strip both delimiters and embedded whitespace. | to recognize and strip both delimiters and embedded whitespace. | |||
| For example, the text: | For example, the text | |||
| Yes, Jim, I found it under "http://www.w3.org/Addressing/", | Yes, Jim, I found it under "http://www.w3.org/Addressing/", | |||
| but you can probably pick it up from <ftp://foo.example. | but you can probably pick it up from <ftp://foo.example. | |||
| com/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ | com/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ | |||
| ietf/uri/historical.html#WARNING>. | ietf/uri/historical.html#WARNING>. | |||
| contains the URI references | contains the URI references | |||
| http://www.w3.org/Addressing/ | http://www.w3.org/Addressing/ | |||
| ftp://foo.example.com/rfc/ | ftp://foo.example.com/rfc/ | |||
| http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING | http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING | |||
| Appendix D. Changes from RFC 2396 | Appendix D. Changes from RFC 2396 | |||
| D.1 Additions | D.1. Additions | |||
| An ABNF rule for URI has been introduced to correspond to one common | An ABNF rule for URI has been introduced to correspond to one common | |||
| usage of the term: an absolute URI with optional fragment. | usage of the term: an absolute URI with optional fragment. | |||
| IPv6 (and later) literals have been added to the list of possible | IPv6 (and later) literals have been added to the list of possible | |||
| identifiers for the host portion of an authority component, as | identifiers for the host portion of an authority component, as | |||
| described by [RFC2732], with the addition of "[" and "]" to the | described by [RFC2732], with the addition of "[" and "]" to the | |||
| reserved set and a version flag to anticipate future versions of IP | reserved set and a version flag to anticipate future versions of IP | |||
| literals. Square brackets are now specified as reserved within the | literals. Square brackets are now specified as reserved within the | |||
| authority component and not allowed outside their use as delimiters | authority component and are not allowed outside their use as | |||
| for an IP literal within host. In order to make this change without | delimiters for an IP literal within host. In order to make this | |||
| changing the technical definition of the path, query, and fragment | change without changing the technical definition of the path, query, | |||
| components, those rules were redefined to directly specify the | and fragment components, those rules were redefined to directly | |||
| characters allowed. | specify the characters allowed. | |||
| Since [RFC2732] defers to [RFC3513] for definition of an IPv6 literal | As [RFC2732] defers to [RFC3513] for definition of an IPv6 literal | |||
| address, which unfortunately lacks an ABNF description of | address, which, unfortunately, lacks an ABNF description of | |||
| IPv6address, we created a new ABNF rule for IPv6address that matches | IPv6address, we created a new ABNF rule for IPv6address that matches | |||
| the text representations defined by Section 2.2 of [RFC3513]. | the text representations defined by Section 2.2 of [RFC3513]. | |||
| Likewise, the definition of IPv4address has been improved in order to | Likewise, the definition of IPv4address has been improved in order to | |||
| limit each decimal octet to the range 0-255. | limit each decimal octet to the range 0-255. | |||
| Section 6 (Section 6) on URI normalization and comparison has been | Section 6, on URI normalization and comparison, has been completely | |||
| completely rewritten and extended using input from Tim Bray and | rewritten and extended by using input from Tim Bray and discussion | |||
| discussion within the W3C Technical Architecture Group. | within the W3C Technical Architecture Group. | |||
| D.2 Modifications | D.2. Modifications | |||
| The ad-hoc BNF syntax of RFC 2396 has been replaced with the ABNF of | The ad-hoc BNF syntax of RFC 2396 has been replaced with the ABNF of | |||
| [RFC2234]. This change required all rule names that formerly | [RFC2234]. This change required all rule names that formerly | |||
| included underscore characters to be renamed with a dash instead. In | included underscore characters to be renamed with a dash instead. In | |||
| addition, a number of syntax rules have been eliminated or simplified | addition, a number of syntax rules have been eliminated or simplified | |||
| to make the overall grammar more comprehensible. Specifications that | to make the overall grammar more comprehensible. Specifications that | |||
| refer to the obsolete grammar rules may be understood by replacing | refer to the obsolete grammar rules may be understood by replacing | |||
| those rules according to the following table: | those rules according to the following table: | |||
| +----------------+--------------------------------------------------+ | +----------------+--------------------------------------------------+ | |||
| | obsolete rule | translation | | | obsolete rule | translation | | |||
| +----------------+--------------------------------------------------+ | +----------------+--------------------------------------------------+ | |||
| | absoluteURI | absolute-URI | | | absoluteURI | absolute-URI | | |||
| | relativeURI | relative-part [ "?" query ] | | | relativeURI | relative-part [ "?" query ] | | |||
| | hier_part | ( "//" authority path-abempty / | | | hier_part | ( "//" authority path-abempty / | | |||
| | | path-absolute ) [ "?" query ] | | | | path-absolute ) [ "?" query ] | | |||
| | | | | | | | | |||
| | opaque_part | path-rootless [ "?" query ] | | | opaque_part | path-rootless [ "?" query ] | | |||
| | net_path | "//" authority path-abempty | | | net_path | "//" authority path-abempty | | |||
| | abs_path | path-absolute | | | abs_path | path-absolute | | |||
| | rel_path | path-rootless | | | rel_path | path-rootless | | |||
| | rel_segment | segment-nz-nc | | | rel_segment | segment-nz-nc | | |||
| | reg_name | reg-name | | | reg_name | reg-name | | |||
| | server | authority | | | server | authority | | |||
| | hostport | host [ ":" port ] | | | hostport | host [ ":" port ] | | |||
| | hostname | reg-name | | | hostname | reg-name | | |||
| skipping to change at page 55, line 5 | skipping to change at page 54, line 42 | |||
| | | / "(" / ")" | | | | / "(" / ")" | | |||
| | | | | | | | | |||
| | escaped | pct-encoded | | | escaped | pct-encoded | | |||
| | hex | HEXDIG | | | hex | HEXDIG | | |||
| | alphanum | ALPHA / DIGIT | | | alphanum | ALPHA / DIGIT | | |||
| +----------------+--------------------------------------------------+ | +----------------+--------------------------------------------------+ | |||
| Use of the above obsolete rules for the definition of scheme-specific | Use of the above obsolete rules for the definition of scheme-specific | |||
| syntax is deprecated. | syntax is deprecated. | |||
| Section 2 on characters has been rewritten to explain what characters | Section 2, on characters, has been rewritten to explain what | |||
| are reserved, when they are reserved, and why they are reserved even | characters are reserved, when they are reserved, and why they are | |||
| when not used as delimiters by the generic syntax. The mark | reserved, even when they are not used as delimiters by the generic | |||
| characters that are typically unsafe to decode, including the | syntax. The mark characters that are typically unsafe to decode, | |||
| exclamation mark ("!"), asterisk ("*"), single-quote ("'"), and open | including the exclamation mark ("!"), asterisk ("*"), single-quote | |||
| and close parentheses ("(" and ")"), have been moved to the reserved | ("'"), and open and close parentheses ("(" and ")"), have been moved | |||
| set in order to clarify the distinction between reserved and | to the reserved set in order to clarify the distinction between | |||
| unreserved and hopefully answer the most common question of scheme | reserved and unreserved and, hopefully, to answer the most common | |||
| designers. Likewise, the section on percent-encoded characters has | question of scheme designers. Likewise, the section on | |||
| been rewritten, and URI normalizers are now given license to decode | percent-encoded characters has been rewritten, and URI normalizers | |||
| any percent-encoded octets corresponding to unreserved characters. | are now given license to decode any percent-encoded octets | |||
| In general, the terms "escaped" and "unescaped" have been replaced | corresponding to unreserved characters. In general, the terms | |||
| with "percent-encoded" and "decoded", respectively, to reduce | "escaped" and "unescaped" have been replaced with "percent-encoded" | |||
| confusion with other forms of escape mechanisms. | and "decoded", respectively, to reduce confusion with other forms of | |||
| escape mechanisms. | ||||
| The ABNF for URI and URI-reference has been redesigned to make them | The ABNF for URI and URI-reference has been redesigned to make them | |||
| more friendly to LALR parsers and reduce complexity. As a result, | more friendly to LALR parsers and to reduce complexity. As a result, | |||
| the layout form of syntax description has been removed, along with | the layout form of syntax description has been removed, along with | |||
| the uric, uric_no_slash, opaque_part, net_path, abs_path, rel_path, | the uric, uric_no_slash, opaque_part, net_path, abs_path, rel_path, | |||
| path_segments, rel_segment, and mark rules. All references to | path_segments, rel_segment, and mark rules. All references to | |||
| "opaque" URIs have been replaced with a better description of how the | "opaque" URIs have been replaced with a better description of how the | |||
| path component may be opaque to hierarchy. The relativeURI rule has | path component may be opaque to hierarchy. The relativeURI rule has | |||
| been replaced with relative-ref to avoid unnecessary confusion over | been replaced with relative-ref to avoid unnecessary confusion over | |||
| whether or not they are a subset of URI. The ambiguity regarding the | whether they are a subset of URI. The ambiguity regarding the | |||
| parsing of URI-reference as a URI or a relative-ref with a colon in | parsing of URI-reference as a URI or a relative-ref with a colon in | |||
| the first segment has been eliminated through the use of five | the first segment has been eliminated through the use of five | |||
| separate path matching rules. | separate path matching rules. | |||
| The fragment identifier has been moved back into the section on | The fragment identifier has been moved back into the section on | |||
| generic syntax components and within the URI and relative-ref rules, | generic syntax components and within the URI and relative-ref rules, | |||
| though it remains excluded from absolute-URI. The number sign ("#") | though it remains excluded from absolute-URI. The number sign ("#") | |||
| character has been moved back to the reserved set as a result of | character has been moved back to the reserved set as a result of | |||
| reintegrating the fragment syntax. | reintegrating the fragment syntax. | |||
| The ABNF has been corrected to allow the path component to be empty. | The ABNF has been corrected to allow the path component to be empty. | |||
| This also allows an absolute-URI to consist of nothing after the | This also allows an absolute-URI to consist of nothing after the | |||
| "scheme:", as is present in practice with the "dav:" namespace | "scheme:", as is present in practice with the "dav:" namespace | |||
| [RFC2518] and the "about:" scheme used internally by many WWW browser | [RFC2518] and with the "about:" scheme used internally by many WWW | |||
| implementations. The ambiguity regarding the boundary between | browser implementations. The ambiguity regarding the boundary | |||
| authority and path has been eliminated through the use of five | between authority and path has been eliminated through the use of | |||
| separate path matching rules. | five separate path matching rules. | |||
| Registry-based naming authorities that use the generic syntax are now | Registry-based naming authorities that use the generic syntax are now | |||
| defined within the host rule. This change allows current | defined within the host rule. This change allows current | |||
| implementations, where whatever name provided is simply fed to the | implementations, where whatever name provided is simply fed to the | |||
| local name resolution mechanism, to be consistent with the | local name resolution mechanism, to be consistent with the | |||
| specification and removes the need to re-specify DNS name formats | specification. It also removes the need to re-specify DNS name | |||
| here. It also allows the host component to contain percent-encoded | formats here. Furthermore, it allows the host component to contain | |||
| octets, which is necessary to enable internationalized domain names | percent-encoded octets, which is necessary to enable | |||
| to be provided in URIs, processed in their native character encodings | internationalized domain names to be provided in URIs, processed in | |||
| at the application layers above URI processing, and passed to an IDNA | their native character encodings at the application layers above URI | |||
| library as a registered name in the UTF-8 character encoding. The | processing, and passed to an IDNA library as a registered name in the | |||
| server, hostport, hostname, domainlabel, toplabel, and alphanum rules | UTF-8 character encoding. The server, hostport, hostname, | |||
| have been removed. | domainlabel, toplabel, and alphanum rules have been removed. | |||
| The resolving relative references algorithm of [RFC2396] has been | The resolving relative references algorithm of [RFC2396] has been | |||
| rewritten using pseudocode for this revision to improve clarity and | rewritten with pseudocode for this revision to improve clarity and | |||
| fix the following issues: | fix the following issues: | |||
| o [RFC2396] section 5.2, step 6a, failed to account for a base URI | o [RFC2396] section 5.2, step 6a, failed to account for a base URI | |||
| with no path. | with no path. | |||
| o Restored the behavior of [RFC1808] where, if the reference | o Restored the behavior of [RFC1808] where, if the reference | |||
| contains an empty path and a defined query component, then the | contains an empty path and a defined query component, the target | |||
| target URI inherits the base URI's path component. | URI inherits the base URI's path component. | |||
| o The determination of whether a URI reference is a same-document | o The determination of whether a URI reference is a same-document | |||
| reference has been decoupled from the URI parser, simplifying the | reference has been decoupled from the URI parser, simplifying the | |||
| URI processing interface within applications in a way consistent | URI processing interface within applications in a way consistent | |||
| with the internal architecture of deployed URI processing | with the internal architecture of deployed URI processing | |||
| implementations. The determination is now based on comparison to | implementations. The determination is now based on comparison to | |||
| the base URI after transforming a reference to absolute form, | the base URI after transforming a reference to absolute form, | |||
| rather than on the format of the reference itself. This change | rather than on the format of the reference itself. This change | |||
| may result in more references being considered "same-document" | may result in more references being considered "same-document" | |||
| under this specification than would be under the rules given in | under this specification than there would be under the rules given | |||
| RFC 2396, especially when normalization is used to reduce aliases. | in RFC 2396, especially when normalization is used to reduce | |||
| However, it does not change the status of existing same-document | aliases. However, it does not change the status of existing | |||
| references. | same-document references. | |||
| o Separated the path merge routine into two routines: merge, for | o Separated the path merge routine into two routines: merge, for | |||
| describing combination of the base URI path with a relative-path | describing combination of the base URI path with a relative-path | |||
| reference, and remove_dot_segments, for describing how to remove | reference, and remove_dot_segments, for describing how to remove | |||
| the special "." and ".." segments from a composed path. The | the special "." and ".." segments from a composed path. The | |||
| remove_dot_segments algorithm is now applied to all URI reference | remove_dot_segments algorithm is now applied to all URI reference | |||
| paths in order to match common implementations and improve the | paths in order to match common implementations and to improve the | |||
| normalization of URIs in practice. This change only impacts the | normalization of URIs in practice. This change only impacts the | |||
| parsing of abnormal references and same-scheme references wherein | parsing of abnormal references and same-scheme references wherein | |||
| the base URI has a non-hierarchical path. | the base URI has a non-hierarchical path. | |||
| Appendix E. Instructions to RFC Editor | ||||
| Prior to publication as an RFC, please remove this section and the | ||||
| "Editorial Note" that appears after the Abstract. If [BCP35] or any | ||||
| of the normative references are updated prior to publication, the | ||||
| associated reference in this document can be safely updated as well. | ||||
| This document has been produced using the xml2rfc tool set; the XML | ||||
| version can be obtained via the URI listed in the editorial note. | ||||
| Index | Index | |||
| A | A | |||
| ABNF 11 | ABNF 11 | |||
| absolute 26 | absolute 27 | |||
| absolute-path 26 | absolute-path 26 | |||
| absolute-URI 26 | absolute-URI 27 | |||
| access 9 | access 9 | |||
| authority 16, 17 | authority 17, 18 | |||
| B | B | |||
| base URI 28 | base URI 28 | |||
| C | C | |||
| character encoding 4 | character encoding 4 | |||
| character 4 | character 4 | |||
| characters 11 | characters 8, 11 | |||
| coded character set 4 | coded character set 4 | |||
| D | D | |||
| dec-octet 20 | dec-octet 20 | |||
| dereference 9 | dereference 9 | |||
| dot-segments 22 | dot-segments 23 | |||
| F | F | |||
| fragment 16, 24 | fragment 16, 24 | |||
| G | G | |||
| gen-delims 12 | gen-delims 13 | |||
| generic syntax 6 | generic syntax 6 | |||
| H | H | |||
| h16 19 | h16 20 | |||
| hier-part 16 | hier-part 16 | |||
| hierarchical 10 | hierarchical 10 | |||
| host 18 | host 18 | |||
| I | I | |||
| identifier 5 | identifier 5 | |||
| IP-literal 19 | IP-literal 19 | |||
| IPv4 20 | IPv4 20 | |||
| IPv4address 20 | IPv4address 19, 20 | |||
| IPv6 19 | IPv6 19 | |||
| IPv6address 19, 20 | IPv6address 19, 20 | |||
| IPvFuture 19 | IPvFuture 19 | |||
| L | L | |||
| locator 7 | locator 7 | |||
| ls32 19 | ls32 20 | |||
| M | M | |||
| merge 32 | merge 32 | |||
| N | N | |||
| name 7 | name 7 | |||
| network-path 26 | network-path 26 | |||
| P | P | |||
| path 16, 22 | path 16, 22, 26 | |||
| path-abempty 22 | path-abempty 22 | |||
| path-absolute 22 | path-absolute 22 | |||
| path-empty 22 | path-empty 22 | |||
| path-noscheme 22 | path-noscheme 22 | |||
| path-rootless 22 | path-rootless 22 | |||
| path-abempty 16 | path-abempty 16, 22, 26 | |||
| path-absolute 16 | path-absolute 16, 22, 26 | |||
| path-empty 16 | path-empty 16, 22, 26 | |||
| path-rootless 16 | path-rootless 16, 22 | |||
| pchar 22 | pchar 23 | |||
| pct-encoded 12 | pct-encoded 12 | |||
| percent-encoding 12 | percent-encoding 12 | |||
| port 21 | port 22 | |||
| Q | Q | |||
| query 16, 23 | query 16, 23 | |||
| R | R | |||
| reg-name 20 | reg-name 21 | |||
| registered name 20 | registered name 20 | |||
| relative 10, 28 | relative 10, 28 | |||
| relative-path 26 | relative-path 26 | |||
| relative-ref 26 | relative-ref 26 | |||
| remove_dot_segments 32 | remove_dot_segments 33 | |||
| representation 9 | representation 9 | |||
| reserved 12 | reserved 12 | |||
| resolution 9, 28 | resolution 9, 28 | |||
| resource 5 | resource 5 | |||
| retrieval 9 | retrieval 9 | |||
| S | S | |||
| same-document 27 | same-document 27 | |||
| sameness 9 | sameness 9 | |||
| scheme 16, 16 | scheme 16, 17 | |||
| segment 22 | segment 22, 23 | |||
| segment-nz 22 | segment-nz 23 | |||
| segment-nz-nc 22 | segment-nz-nc 23 | |||
| sub-delims 12 | sub-delims 13 | |||
| suffix 27 | suffix 27 | |||
| T | T | |||
| transcription 7 | transcription 8 | |||
| U | U | |||
| uniform 4 | uniform 4 | |||
| unreserved 13 | ||||
| URI grammar | ||||
| absolute-URI 26 | ||||
| ALPHA 11 | ||||
| authority 16, 17 | ||||
| CR 11 | ||||
| dec-octet 20 | ||||
| DIGIT 11 | ||||
| DQUOTE 11 | ||||
| fragment 16, 24, 26 | ||||
| gen-delims 12 | ||||
| h16 19 | ||||
| HEXDIG 11 | ||||
| hier-part 16 | ||||
| host 17, 18 | ||||
| IP-literal 19 | ||||
| IPv4address 20 | ||||
| IPv6address 19, 20 | ||||
| IPvFuture 19 | ||||
| LF 11 | ||||
| ls32 19 | ||||
| mark 13 | ||||
| OCTET 11 | ||||
| path 22 | ||||
| path-abempty 16, 22 | ||||
| path-absolute 16, 22 | ||||
| path-empty 16, 22 | ||||
| path-noscheme 22 | ||||
| path-rootless 16, 22 | ||||
| pchar 22, 23, 24 | ||||
| pct-encoded 12 | ||||
| port 17, 21 | ||||
| query 16, 23, 26, 26 | ||||
| reg-name 20 | ||||
| relative-ref 25, 26 | ||||
| reserved 12 | ||||
| scheme 16, 16, 26 | ||||
| segment 22 | ||||
| segment-nz 22 | ||||
| segment-nz-nc 22 | ||||
| SP 11 | ||||
| sub-delims 12 | ||||
| unreserved 13 | unreserved 13 | |||
| URI 16, 25 | URI grammar | |||
| absolute-URI 27 | ||||
| ALPHA 11 | ||||
| authority 18 | ||||
| CR 11 | ||||
| dec-octet 20 | ||||
| DIGIT 11 | ||||
| DQUOTE 11 | ||||
| fragment 24 | ||||
| gen-delims 13 | ||||
| h16 20 | ||||
| HEXDIG 11 | ||||
| hier-part 16 | ||||
| host 19 | ||||
| IP-literal 19 | ||||
| IPv4address 20 | ||||
| IPv6address 20 | ||||
| IPvFuture 19 | ||||
| LF 11 | ||||
| ls32 20 | ||||
| OCTET 11 | ||||
| path 22 | ||||
| path-abempty 22 | ||||
| path-absolute 22 | ||||
| path-empty 22 | ||||
| path-noscheme 22 | ||||
| path-rootless 22 | ||||
| pchar 23 | ||||
| pct-encoded 12 | ||||
| port 22 | ||||
| query 24 | ||||
| reg-name 21 | ||||
| relative-ref 26 | ||||
| reserved 13 | ||||
| scheme 17 | ||||
| segment 23 | ||||
| segment-nz 23 | ||||
| segment-nz-nc 23 | ||||
| SP 11 | ||||
| sub-delims 13 | ||||
| unreserved 13 | ||||
| URI 16 | ||||
| URI-reference 25 | ||||
| userinfo 18 | ||||
| URI 16 | ||||
| URI-reference 25 | URI-reference 25 | |||
| userinfo 17, 18 | URL 7 | |||
| URI 16 | URN 7 | |||
| URI-reference 25 | userinfo 18 | |||
| URL 7 | ||||
| URN 7 | ||||
| userinfo 17, 18 | ||||
| Intellectual Property Statement | Authors' Addresses | |||
| Tim Berners-Lee | ||||
| World Wide Web Consortium | ||||
| Massachusetts Institute of Technology | ||||
| 77 Massachusetts Avenue | ||||
| Cambridge, MA 02139 | ||||
| USA | ||||
| Phone: +1-617-253-5702 | ||||
| Fax: +1-617-258-5999 | ||||
| EMail: [email protected] | ||||
| URI: http://www.w3.org/People/Berners-Lee/ | ||||
| Roy T. Fielding | ||||
| Day Software | ||||
| 5251 California Ave., Suite 110 | ||||
| Irvine, CA 92617 | ||||
| USA | ||||
| Phone: +1-949-679-2960 | ||||
| Fax: +1-949-679-2972 | ||||
| EMail: [email protected] | ||||
| URI: http://roy.gbiv.com/ | ||||
| Larry Masinter | ||||
| Adobe Systems Incorporated | ||||
| 345 Park Ave | ||||
| San Jose, CA 95110 | ||||
| USA | ||||
| Phone: +1-408-536-3024 | ||||
| EMail: [email protected] | ||||
| URI: http://larry.masinter.net/ | ||||
| Full Copyright Statement | ||||
| Copyright (C) The Internet Society (2005). | ||||
| This document is subject to the rights, licenses and restrictions | ||||
| contained in BCP 78, and except as set forth therein, the authors | ||||
| retain all their rights. | ||||
| This document and the information contained herein are provided on an | ||||
| "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS | ||||
| OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET | ||||
| ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, | ||||
| INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE | ||||
| INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED | ||||
| WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. | ||||
| Intellectual Property | ||||
| The IETF takes no position regarding the validity or scope of any | The IETF takes no position regarding the validity or scope of any | |||
| Intellectual Property Rights or other rights that might be claimed to | Intellectual Property Rights or other rights that might be claimed to | |||
| pertain to the implementation or use of the technology described in | pertain to the implementation or use of the technology described in | |||
| this document or the extent to which any license under such rights | this document or the extent to which any license under such rights | |||
| might or might not be available; nor does it represent that it has | might or might not be available; nor does it represent that it has | |||
| made any independent effort to identify any such rights. Information | made any independent effort to identify any such rights. Information | |||
| on the procedures with respect to rights in RFC documents can be | on the IETF's procedures with respect to rights in IETF Documents can | |||
| found in BCP 78 and BCP 79. | be found in BCP 78 and BCP 79. | |||
| Copies of IPR disclosures made to the IETF Secretariat and any | Copies of IPR disclosures made to the IETF Secretariat and any | |||
| assurances of licenses to be made available, or the result of an | assurances of licenses to be made available, or the result of an | |||
| attempt made to obtain a general license or permission for the use of | attempt made to obtain a general license or permission for the use of | |||
| such proprietary rights by implementers or users of this | such proprietary rights by implementers or users of this | |||
| specification can be obtained from the IETF on-line IPR repository at | specification can be obtained from the IETF on-line IPR repository at | |||
| http://www.ietf.org/ipr. | http://www.ietf.org/ipr. | |||
| The IETF invites any interested party to bring to its attention any | The IETF invites any interested party to bring to its attention any | |||
| copyrights, patents or patent applications, or other proprietary | copyrights, patents or patent applications, or other proprietary | |||
| rights that may cover technology that may be required to implement | rights that may cover technology that may be required to implement | |||
| this standard. Please address the information to the IETF at | this standard. Please address the information to the IETF at ietf- | |||
| [email protected]. | [email protected]. | |||
| Disclaimer of Validity | ||||
| This document and the information contained herein are provided on an | ||||
| "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS | ||||
| OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET | ||||
| ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, | ||||
| INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE | ||||
| INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED | ||||
| WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. | ||||
| Copyright Statement | ||||
| Copyright (C) The Internet Society (2004). This document is subject | ||||
| to the rights, licenses and restrictions contained in BCP 78, and | ||||
| except as set forth therein, the authors retain all their rights. | ||||
| Acknowledgment | Acknowledgement | |||
| Funding for the RFC Editor function is currently provided by the | Funding for the RFC Editor function is currently provided by the | |||
| Internet Society. | Internet Society. | |||
| End of changes. 326 change blocks. | ||||
| 1075 lines changed or deleted | 1041 lines changed or added | |||
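Appendix C, unchanged in substance between the two versions compared above, recommends that software accepting user-typed URIs recognize and strip both delimiters and embedded whitespace. The following is a minimal Python sketch of that advice, not an implementation taken from either document. The function name `extract_uri_references`, the two regular expressions, and the rule that a candidate must begin with a scheme are assumptions made for illustration. Only double-quote and angle-bracket delimiters are handled, and the hyphen-at-line-break ambiguity described in Appendix C is not resolved.

```python
import re

# Text wrapped in double-quotes or angle brackets. The negated character
# classes also match newlines, so a URI broken across lines is captured whole.
_DELIMITED = re.compile(r'"([^"]*)"|<([^<>]*)>')

# Heuristic assumed by this sketch: keep only candidates that start with a
# scheme (ALPHA followed by ALPHA / DIGIT / "+" / "-" / "." and then ":").
_SCHEME = re.compile(r'[A-Za-z][A-Za-z0-9+.-]*:')


def extract_uri_references(text):
    """Return delimited URIs found in text, without wrappers or whitespace."""
    found = []
    for match in _DELIMITED.finditer(text):
        inner = match.group(1) if match.group(1) is not None else match.group(2)
        # Whitespace added to break a long URI across lines is ignored.
        candidate = re.sub(r'\s+', '', inner)
        # Drop the formerly recommended "URL:" prefix if it is present.
        if candidate[:4].upper() == 'URL:':
            candidate = candidate[4:]
        if _SCHEME.match(candidate):
            found.append(candidate)
    return found


# The example text from Appendix C.
sample = '''Yes, Jim, I found it under "http://www.w3.org/Addressing/",
but you can probably pick it up from <ftp://foo.example.
com/rfc/>.  Note the warning in <http://www.ics.uci.edu/pub/
ietf/uri/historical.html#WARNING>.'''

for uri in extract_uri_references(sample):
    print(uri)
```

Run on the Appendix C example text, the sketch yields the three URI references listed in that appendix. Because of the scheme check, relative references and ordinary quoted phrases are skipped; a stricter tool would validate each candidate against the full URI-reference grammar instead.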