Diff: normuri.txt - normiri.txt

	normuri.txt	normiri.txt
	6. Normalization and Comparison

	One of the most common operations on URIs is simple comparison:	5. Normalization and Comparison
	determining if two URIs are equivalent without using the URIs to
	access their respective resource(s). A comparison is performed every
	time a response cache is accessed, a browser checks its history to
	color a link, or an XML parser processes tags within a namespace.
	Extensive normalization prior to comparison of URIs is often used by
	spiders and indexing engines to prune a search space or reduce
	duplication of request actions and response storage.

	URI comparison is performed in respect to some particular purpose,	Note: The structure and much of the material for this section is
		taken from section 6 of [RFCYYYY]; the differences are due to the
		specifics of IRIs.

		One of the most common operations on IRIs is simple comparison:
		determining if two IRIs are equivalent without using the IRIs or the
		mapped URIs to access their respective resource(s). A comparison is
		performed every time a response cache is accessed, a browser checks
		its history to color a link, or an XML parser processes tags within a
		namespace. Extensive normalization prior to comparison of IRIs may
		be used by spiders and indexing engines to prune a search space or
		reduce duplication of request actions and response storage.

		IRI comparison is performed in respect to some particular purpose,
	and implementations with differing purposes will often be subject to	and implementations with differing purposes will often be subject to
	differing design trade-offs with regard to how much effort should be	differing design trade-offs with regard to how much effort should be
	spent in reducing aliased identifiers. This section describes a	spent in reducing aliased identifiers. This section describes a
	variety of methods that may be used to compare URIs, the trade-offs	variety of methods that may be used to compare IRIs, the trade-offs
	between them, and the types of applications that might use them.	between them, and the types of applications that might use them.

	6.1 Equivalence	5.1 Equivalence

	Since URIs exist to identify resources, presumably they should be	Since IRIs exist to identify resources, presumably they should be
	considered equivalent when they identify the same resource. However,	considered equivalent when they identify the same resource. However,
	such a definition of equivalence is not of much practical use, since	such a definition of equivalence is not of much practical use, since
	there is no way for an implementation to compare two resources that	there is no way for an implementation to compare two resources that
	are not under its own control. For this reason, determination of	are not under its own control. For this reason, determination of
	equivalence or difference of URIs is based on string comparison,	equivalence or difference of IRIs is based on string comparison,
	perhaps augmented by reference to additional rules provided by URI	perhaps augmented by reference to additional rules provided by URI
	scheme definitions. We use the terms "different" and "equivalent" to	scheme definitions. We use the terms "different" and "equivalent" to
	describe the possible outcomes of such comparisons, but there are	describe the possible outcomes of such comparisons, but there are
	many application-dependent versions of equivalence.	many application-dependent versions of equivalence.

	Even though it is possible to determine that two URIs are equivalent,	Even though it is possible to determine that two IRIs are equivalent,
	URI comparison is not sufficient to determine if two URIs identify	IRI comparison is not sufficient to determine if two IRIs identify
	different resources. For example, an owner of two different domain	different resources. For example, an owner of two different domain
	names could decide to serve the same resource from both, resulting in	names could decide to serve the same resource from both, resulting in
	two different URIs. Therefore, comparison methods are designed to	two different IRIs. Therefore, comparison methods are designed to
	minimize false negatives while strictly avoiding false positives.	minimize false negatives while strictly avoiding false positives.

	In testing for equivalence, applications should not directly compare	In testing for equivalence, applications should not directly compare
	relative references; the references should be converted to their	relative references; the references should be converted to their
	respective target URIs before comparison. When URIs are being	respective target IRIs before comparison. When IRIs are being
	compared for the purpose of selecting (or avoiding) a network action,	compared for the purpose of selecting (or avoiding) a network action,
	such as retrieval of a representation, fragment components (if any)	such as retrieval of a representation, fragment components (if any)
	should be excluded from the comparison.	should be excluded from the comparison.

	6.2 Comparison Ladder	Applications using IRIs as identity tokens with no relationship to a
		protocol MUST use the Simple String Comparison (see Section 5.3.1).
		All other applications MUST select one of the comparison practices
		from the Comparison Ladder (see Section 5.3), or, after IRI-to-URI
		conversion, select one of the comparison practices from the URI
		comparison ladder [RFCYYYY], Section 6.2.

	A variety of methods are used in practice to test URI equivalence.	5.2 Preparation for Comparison

		Any kind of IRI comparison REQUIRES that all escapings or encodings
		in the protocol or format that carries an IRI are resolved. This is
		usually done when parsing the protocol or format. Examples of such
		escapings or encodings are entities and numeric character references
		in [HTML4] and [XML1]. As an example, http://example.org/ros&eacute;
		(in HTML), http://example.org/ros&#233; (in HTML or XML), and
		http://example.org/ros&#xE9; (in HTML or XML) all get resolved into
		what is denoted in this document (see Section 1.4) as
		http://example.org/ros&#xE9; (the "&#xE9;" here standing for the
		actual e-acute character, to compensate for the fact that this
		document cannot contain non-ASCII characters).
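
The resolution step can be illustrated with a minimal Python sketch. It assumes the carrying format is HTML and uses the standard library's html.unescape as a stand-in for whatever the real parser does; the variable names are hypothetical and not part of either draft.

```python
import html

# Three spellings an HTML document might use to carry the same IRI.
carried = [
    "http://example.org/ros&eacute;",  # named entity (HTML only)
    "http://example.org/ros&#233;",    # decimal numeric character reference
    "http://example.org/ros&#xE9;",    # hexadecimal numeric character reference
]

# Resolve the format-level escapes before any IRI comparison takes place.
resolved = [html.unescape(s) for s in carried]

# After resolution, all three strings contain the actual e-acute character.
print(resolved[0] == resolved[1] == resolved[2])
```

Only the escapes of the carrying format are resolved here; percent-encoding inside the IRI itself is left untouched.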

		Similar considerations apply to encodings such as Transfer Codings in
		HTTP (see [RFC2616]) and Content Transfer Encodings in MIME ([RFC2045]),
		although in these cases, the encoding is not based on characters, but
		on octets, and additional care is required to make sure that
		characters, and not just arbitrary octets, are compared (see Section
		5.3.1).

		5.3 Comparison Ladder

		A variety of methods are used in practice to test IRI equivalence.
	These methods fall into a range, distinguished by the amount of	These methods fall into a range, distinguished by the amount of
	processing required and the degree to which the probability of false	processing required and the degree to which the probability of false
	negatives is reduced. As noted above, false negatives cannot be	negatives is reduced. As noted above, false negatives cannot be
	eliminated. In practice, their probability can be reduced, but this	eliminated. In practice, their probability can be reduced, but this
	reduction requires more processing and is not cost-effective for all	reduction requires more processing and is not cost-effective for all
	applications.	applications.

	If this range of comparison practices is considered as a ladder, the	If this range of comparison practices is considered as a ladder, the
	following discussion will climb the ladder, starting with those	following discussion will climb the ladder, starting with those
	practices that are cheap but have a relatively higher chance of	practices that are cheap but have a relatively higher chance of
	producing false negatives, and proceeding to those that have higher	producing false negatives, and proceeding to those that have higher
	computational cost and lower risk of false negatives.	computational cost and lower risk of false negatives.

	6.2.1 Simple String Comparison	5.3.1 Simple String Comparison

	If two URIs, considered as character strings, are identical, then it	If two IRIs, considered as character strings, are identical, then it
	is safe to conclude that they are equivalent. This type of	is safe to conclude that they are equivalent. This type of
	equivalence test has very low computational cost and is in wide use	equivalence test has very low computational cost and is in wide use
	in a variety of applications, particularly in the domain of parsing.	in a variety of applications, particularly in the domain of parsing
		and when a definitive answer to the question of IRI equivalence is
		needed that is independent of the scheme used and can be calculated
		quickly and without accessing a network. An example of such a case
		is XML Namespaces ([XMLNamespace]).

	Testing strings for equivalence requires some basic precautions.	Testing strings for equivalence requires some basic precautions.
	This procedure is often referred to as "bit-for-bit" or	This procedure is often referred to as "bit-for-bit" or
	"byte-for-byte" comparison, which is potentially misleading. Testing	"byte-for-byte" comparison, which is potentially misleading. Testing
	of strings for equality is normally based on pairwise comparison of	of strings for equality is normally based on pairwise comparison of
	the characters that make up the strings, starting from the first and	the characters that make up the strings, starting from the first and
	proceeding until both strings are exhausted and all characters found	proceeding until both strings are exhausted and all characters found
	to be equal, a pair of characters compares unequal, or one of the	to be equal, a pair of characters compares unequal, or one of the
	strings is exhausted before the other.	strings is exhausted before the other.

	Such character comparisons require that each pair of characters be	Such character comparisons require that each pair of characters be
	put in comparable form. For example, should one URI be stored in a	put in comparable encoding form. For example, should one IRI be
	byte array in EBCDIC encoding, and the second be in a Java String	stored in a byte array in UTF-8 encoding form, and the second be in a
	object (UTF-16), bit-for-bit comparisons applied naively will produce	UTF-16 encoding form, bit-for-bit comparisons applied naively will
	errors. It is better to speak of equality on a	produce errors. It is better to speak of equality on a
	character-for-character rather than byte-for-byte or bit-for-bit	character-for-character rather than byte-for-byte or bit-for-bit
	basis. In practical terms, character-by-character comparisons should	basis. In practical terms, character-by-character comparisons should
	be done codepoint-by-codepoint after conversion to a common character	be done codepoint-by-codepoint after conversion to a common character
	encoding.	encoding form. When comparing character-by-character, the comparison
		function MUST NOT map IRIs to URIs, because such a mapping would
		create additional spurious equivalences. It follows that IRIs SHOULD
		NOT be modified when being transported if there is any chance that
		this IRI might be used as an identifier.
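
As a rough illustration of why byte-for-byte comparison misleads, the following Python sketch stores the same IRI in two encoding forms and compares it both ways. The function name same_iri and its parameters are assumptions made for this example only.

```python
def same_iri(a: bytes, a_encoding: str, b: bytes, b_encoding: str) -> bool:
    """Compare two stored IRIs code point by code point."""
    # Decoding yields Python str objects, which are sequences of code points,
    # so == performs the character-for-character comparison described above.
    # No IRI-to-URI mapping and no character normalization is applied.
    return a.decode(a_encoding) == b.decode(b_encoding)


iri = "http://example.org/ros\u00e9"
utf8_form = iri.encode("utf-8")
utf16_form = iri.encode("utf-16-be")

print(utf8_form == utf16_form)                                # naive byte comparison
print(same_iri(utf8_form, "utf-8", utf16_form, "utf-16-be"))  # code point comparison
```

The sketch deliberately leaves percent-encoded sequences alone, so "ros%C3%A9" and the e-acute form remain different under this test.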

	False negatives are caused by the production and use of URI aliases.	False negatives are caused by the production and use of IRI aliases.
	Unnecessary aliases can be reduced, regardless of the comparison	Unnecessary aliases can be reduced, regardless of the comparison
	method, by consistently providing URI references in an	method, by consistently providing IRI references in an
	already-normalized form (i.e., a form identical to what would be	already-normalized form (i.e., a form identical to what would be
	produced after normalization is applied, as described below).	produced after normalization is applied, as described below).
	Protocols and data formats often choose to limit some URI comparisons	Protocols and data formats often choose to limit some IRI comparisons
	to simple string comparison, based on the theory that people and	to simple string comparison, based on the theory that people and
	implementations will, in their own best interest, be consistent in	implementations will, in their own best interest, be consistent in
	providing URI references, or at least consistent enough to negate any	providing IRI references, or at least consistent enough to negate any
	efficiency that might be obtained from further normalization.	efficiency that might be obtained from further normalization.

	6.2.2 Syntax-based Normalization	5.3.2 Syntax-based Normalization

	Implementations may use logic based on the definitions provided by	Implementations may use logic based on the definitions provided by
	this specification to reduce the probability of false negatives.	this specification to reduce the probability of false negatives.
	Such processing is moderately higher in cost than	Such processing is moderately higher in cost than
	character-for-character string comparison. For example, an	character-for-character string comparison. For example, an
	application using this approach could reasonably consider the	application using this approach could reasonably consider the
	following two URIs equivalent:	following two IRIs equivalent:

	example://a/b/c/%7Bfoo%7D	example://a/b/c/%7Bfoo%7D/ros&#xE9;
	eXAMPLE://a/./b/../b/%63/%7bfoo%7d	eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9

	Web user agents, such as browsers, typically apply this type of URI	Web user agents, such as browsers, typically apply this type of IRI
	normalization when determining whether a cached response is	normalization when determining whether a cached response is
	available. Syntax-based normalization includes such techniques as	available. Syntax-based normalization includes such techniques as
	case normalization, percent-encoding normalization, and removal of	case normalization, character normalization, percent-encoding
	dot-segments.	normalization, and removal of dot-segments.

	6.2.2.1 Case Normalization	5.3.2.1 Case Normalization

	For all URIs, the hexadecimal digits within a percent-encoding	For all IRIs, the hexadecimal digits within a percent-encoding
	triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore	triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
	should be normalized to use uppercase letters for the digits A-F.	should be normalized to use uppercase letters for the digits A-F.

	When a URI uses components of the generic syntax, the component	When an IRI uses components of the generic syntax, the component
	syntax equivalence rules always apply; namely, that the scheme and	syntax equivalence rules always apply; namely, that the scheme and
	host are case-insensitive and therefore should be normalized to	US-ASCII only host are case-insensitive and therefore should be
	lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is	normalized to lowercase. For example, the URI
	equivalent to <http://www.example.com/>. The other generic syntax	<HTTP://www.EXAMPLE.com/> is equivalent to <http://www.example.com/>.
		Case equivalence for non-ASCII characters in IRI components that are
		IDNs is discussed in Section 5.3.3. The other generic syntax
	components are assumed to be case-sensitive unless specifically	components are assumed to be case-sensitive unless specifically
	defined otherwise by the scheme (see Section 6.2.3).	defined otherwise by the scheme.
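
A minimal Python sketch of these two case rules follows. It is an illustration under stated assumptions, not an implementation taken from either draft: the regular expression is a simplified split of the generic syntax, and normalize_case is a hypothetical name.

```python
import re

# Simplified split of the generic syntax: scheme, authority, everything else.
_GENERIC = re.compile(r"^(?:([^:/?#]+):)?(?://([^/?#]*))?(.*)$", re.DOTALL)
_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def normalize_case(iri: str) -> str:
    scheme, authority, rest = _GENERIC.match(iri).groups()
    out = ""
    if scheme is not None:
        out += scheme.lower() + ":"          # scheme is case-insensitive
    if authority is not None:
        userinfo, at, hostport = authority.rpartition("@")
        if hostport.isascii():               # only US-ASCII hosts are lowercased
            hostport = hostport.lower()
        out += "//" + userinfo + at + hostport
    out += rest                              # path, query, fragment: untouched
    # Uppercase the hex digits of every percent-encoding triplet last, so the
    # host lowercasing above cannot undo it.
    return _PCT.sub(lambda m: m.group(0).upper(), out)


print(normalize_case("HTTP://www.EXAMPLE.com/"))
```

Userinfo, path, query, and fragment keep their case, since they are assumed to be case-sensitive unless the scheme says otherwise.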

	6.2.2.2 Percent-Encoding Normalization	Creating schemes that allow case-insensitive syntax components
		containing non-US-ASCII characters should be avoided because such a
		case normalization may be culturally dependent and is always a complex
		operation. The only exception concerns non-ASCII host names for
		which the character normalization includes a mapping step derived
		from case folding.

	The percent-encoding mechanism (Section 2.1) is a frequent source of	5.3.2.2 Character Normalization
	variance among otherwise identical URIs. In addition to the case
	normalization issue noted above, some URI producers percent-encode
	octets that do not require percent-encoding, resulting in URIs that
	are equivalent to their non-encoded counterparts. Such URIs should
	be normalized by decoding any percent-encoded octet that corresponds
	to an unreserved character, as described in Section 2.3.

	6.2.2.3 Path Segment Normalization	The Unicode Standard [UNIV4] defines various equivalences between
		sequences of characters for various purposes. Unicode Standard Annex
		#15 [UTR15] defines various Normalization Forms for these
		equivalences, in particular Normalization Form C (NFC, Canonical
		Decomposition, followed by Canonical Composition) and Normalization
		Form KC (NFKC, Compatibility Decomposition, followed by Canonical
		Composition).

		Equivalence of IRIs MUST rely on the assumption that IRIs are
		appropriately pre-character-normalized, rather than applying
		character normalization when comparing two IRIs. The exceptions are
		conversion from a non-digital form, and conversion from a
		non-UCS-based character encoding to a UCS-based character encoding.
		In these cases, NFC or a normalizing transcoder using NFC MUST be
		used for interoperability. To avoid false negatives and problems
		with transcoding, IRIs SHOULD be created using NFC. Using NFKC may
		avoid even more problems, for example by choosing half-width Latin
		letters instead of full-width, and full-width Katakana instead of
		half-width.

		As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML
		Notation) is in NFC. On the other hand,
		http://www.example.org/re&#x301;sume&#x301;.html is not in NFC. The
		former uses precombined e-acute characters, the latter uses 'e'
		characters followed by combining acute accents. Both usages are
		defined to be canonically equivalent in [UNIV4].
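
The difference between the two spellings can be made concrete with Python's unicodedata module. This sketch only shows what an IRI creator might do to produce NFC; it is not meant as a step in the comparison function, which per the text above should not normalize.

```python
import unicodedata

# Precombined e-acute (U+00E9) versus 'e' followed by combining acute (U+0301).
precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"

# The two strings differ code point by code point ...
print(precomposed == decomposed)

# ... but applying NFC to the decomposed form yields the precomposed one.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```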

		Note: Because it is unknown how a particular sequence of characters
		is being treated with respect to character normalization, it would
		be inappropriate to allow third parties to normalize an IRI
		arbitrarily. This does not contradict the recommendation that
		when a resource is created, its IRI should be as
		character-normalized as possible (i.e. NFC or even NFKC). This
		is similar to the upper-case/lower-case problems in
		URIs.
		Some parts of a URI are case-insensitive (domain name). For
		others, it is unclear whether they are case-sensitive or
		case-insensitive, or something in between (e.g. case-sensitive,
		but if the wrong case is used, a multiple choice selection is
		provided instead of a direct negative result). The best recipe is
		that the creator uses a reasonable capitalization, and when
		transferring the URI, that capitalization is never changed.

		Various IRI schemes may allow the usage of Internationalized Domain
		Names (IDN) [RFC3490] either in the ireg-name part or elsewhere.
		Character Normalization also applies to IDNs, as discussed in Section
		5.3.3.

		5.3.2.3 Percent-Encoding Normalization

		The percent-encoding mechanism (Section 2.1 of [RFCYYYY]) is a
		frequent source of variance among otherwise identical IRIs. In
		addition to the case normalization issue noted above, some IRI
		producers percent-encode octets that do not require percent-encoding,
		resulting in IRIs that are equivalent to their non-encoded
		counterparts. Such IRIs should be normalized by decoding any
		percent-encoded octet sequence that corresponds to an unreserved
		character, as described in Section 2.3 of [RFCYYYY].

		For actual resolution, differences in percent-encoding (except for
		the percent-encoding of reserved characters) MUST always result in
		the same resource. For example, http://example.org/~user,
		http://example.org/%7euser and http://example.org/%7Euser must
		resolve to the same resource.

		If this kind of equivalence is to be tested, the percent-encoding of
		both IRIs to be compared has to be aligned, for example by converting
		both IRIs to URIs (see Section 3.1), eliminating escape differences
		in the resulting URIs, and making sure that the case of the
		hexadecimal characters in the percent-encoding is always the same
		(preferably upper case). If the IRI is to be passed to another
		application, or used further in some other way, its original form
		MUST be preserved; the conversion described here should be performed
		only for the purpose of local comparison.
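
One possible way to align percent-encoding for a local comparison is sketched below in Python. The function name is hypothetical, and the sketch assumes the input has already been converted to a URI, so that every non-ASCII character is percent-encoded.

```python
import re
import string

# Unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(uri: str) -> str:
    def fix(match):
        ch = chr(int(match.group(1), 16))
        if ch in _UNRESERVED:
            return ch                    # decode needlessly encoded octets
        return match.group(0).upper()    # keep the triplet, uppercase its hex
    return _PCT.sub(fix, uri)


for candidate in ("http://example.org/~user",
                  "http://example.org/%7euser",
                  "http://example.org/%7Euser"):
    print(normalize_percent_encoding(candidate))
```

The result is meant for comparison only; the original string is what should be stored or passed on.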

		5.3.2.4 Path Segment Normalization

	The complete path segments "." and ".." are intended only for use	The complete path segments "." and ".." are intended only for use
	within relative references (Section 4.1) and are removed as part of	within relative references (Section 4.1 of [RFCYYYY]) and are removed
	the reference resolution process (Section 5.2). However, some	as part of the reference resolution process (Section 5.2 of
	deployed implementations incorrectly assume that reference resolution	[RFCYYYY]). However, some implementations may incorrectly assume
	is not necessary when the reference is already a URI, and thus fail	that reference resolution is not necessary when the reference is
	to remove dot-segments when they occur in non-relative paths. URI	already an IRI, and thus fail to remove dot-segments when they occur
	normalizers should remove dot-segments by applying the	in non-relative paths. IRI normalizers should remove dot-segments by
	remove_dot_segments algorithm to the path, as described in	applying the remove_dot_segments algorithm to the path, as described
	Section 5.2.4.	in Section 5.2.4 of [RFCYYYY].
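
For orientation, here is a Python sketch of a dot-segment remover that follows the input-buffer/output-buffer loop of the referenced algorithm. It is an illustrative transcription written for this comparison, not text from either draft.

```python
def remove_dot_segments(path: str) -> str:
    output = []                          # completed segments, each with its "/"
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]              # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]              # "/../" becomes "/"
            if output:
                output.pop()             # and the previous segment is dropped
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


print(remove_dot_segments("/a/./b/../b/%63/%7bfoo%7d"))
```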

	6.2.3 Scheme-based Normalization	5.3.3 Scheme-based Normalization

	The syntax and semantics of URIs vary from scheme to scheme, as	The syntax and semantics of IRIs vary from scheme to scheme, as
	described by the defining specification for each scheme.	described by the defining specification for each scheme.
	Implementations may use scheme-specific rules, at further processing	Implementations may use scheme-specific rules, at further processing
	cost, to reduce the probability of false negatives. For example,	cost, to reduce the probability of false negatives. For example,
	since the "http" scheme makes use of an authority component, has a	since the "http" scheme makes use of an authority component, has a
	default port of "80", and defines an empty path to be equivalent to	default port of "80", and defines an empty path to be equivalent to
	"/", the following four URIs are equivalent:	"/", the following four IRIs are equivalent:

	http://example.com	http://example.com
	http://example.com/	http://example.com/
	http://example.com:/	http://example.com:/
	http://example.com:80/	http://example.com:80/
		In general, an IRI that uses the generic syntax for authority with an
	In general, a URI that uses the generic syntax for authority with an
	empty path should be normalized to a path of "/"; likewise, an	empty path should be normalized to a path of "/"; likewise, an
	explicit ":port", where the port is empty or the default for the	explicit ":port", where the port is empty or the default for the
	scheme, is equivalent to one where the port and its ":" delimiter are	scheme, is equivalent to one where the port and its ":" delimiter are
	elided, and thus should be removed by scheme-based normalization.	elided, and thus should be removed by scheme-based normalization.
	For example, the second URI above is the normal form for the "http"	For example, the second IRI above is the normal form for the "http"
	scheme.	scheme.
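
The "http" rules just described might be sketched in Python as follows. The function name normalize_http is hypothetical, and the sketch assumes the scheme has already been lowercased by case normalization.

```python
import re

# authority, path, and the remainder (query and fragment, delimiters included)
_HTTP = re.compile(r"^http://([^/?#]*)([^?#]*)(.*)$", re.DOTALL)


def normalize_http(iri: str) -> str:
    match = _HTTP.match(iri)
    if match is None:
        return iri                       # not an "http" IRI; leave it alone
    authority, path, rest = match.groups()
    # Remove an empty or default port together with its ":" delimiter.
    host, colon, port = authority.rpartition(":")
    if colon and port in ("", "80"):
        authority = host
    # An empty path is equivalent to "/".
    if path == "":
        path = "/"
    # Query and fragment delimiters are kept exactly as given.
    return "http://" + authority + path + rest


for candidate in ("http://example.com", "http://example.com/",
                  "http://example.com:/", "http://example.com:80/"):
    print(normalize_http(candidate))
```

Because the remainder is copied through unchanged, an IRI such as "http://example.com/?" keeps its "?" and does not collapse into the normal form of the four examples.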

	Another case where normalization varies by scheme is in the handling	Another case where normalization varies by scheme is in the handling
	of an empty authority component or empty host subcomponent. For many	of an empty authority component or empty host subcomponent. For many
	scheme specifications, an empty authority or host is considered an	scheme specifications, an empty authority or host is considered an
	error; for others, it is considered equivalent to "localhost" or the	error; for others, it is considered equivalent to "localhost" or the
	end-user's host. When a scheme defines a default for authority and a	end-user's host. When a scheme defines a default for authority and
	URI reference to that default is desired, the reference should be	an IRI reference to that default is desired, the reference should be
	normalized to an empty authority for the sake of uniformity, brevity,	normalized to an empty authority for the sake of uniformity, brevity,
	and internationalization. If, however, either the userinfo or port	and internationalization. If, however, either the userinfo or port
	subcomponent is non-empty, then the host should be given explicitly	subcomponent is non-empty, then the host should be given explicitly
	even if it matches the default.	even if it matches the default.

	Normalization should not remove delimiters when their associated	Normalization should not remove delimiters when their associated
	component is empty unless licensed to do so by the scheme	component is empty unless licensed to do so by the scheme
	specification. For example, the URI "http://example.com/?" cannot be	specification. For example, the IRI "http://example.com/?" cannot be
	assumed to be equivalent to any of the examples above. Likewise, the	assumed to be equivalent to any of the examples above. Likewise, the
	presence or absence of delimiters within a userinfo subcomponent is	presence or absence of delimiters within a userinfo subcomponent is
	usually significant to its interpretation. The fragment component is	usually significant to its interpretation. The fragment component is
	not subject to any scheme-based normalization; thus, two URIs that	not subject to any scheme-based normalization; thus, two IRIs that
	differ only by the suffix "#" are considered different regardless of	differ only by the suffix "#" are considered different regardless of
	the scheme.	the scheme.

	Some schemes define additional subcomponents that consist of	Some IRI schemes may allow the usage of Internationalized Domain
	case-insensitive data, giving an implicit license to normalizers to	Names (IDN) [RFC3490] either in their ireg-name part or elsewhere.
	convert such data to a common case (e.g., all lowercase). For	When in use in IRIs, those names SHOULD be validated using the
	example, URI schemes that define a subcomponent of path to contain an	ToASCII operation defined in [RFC3490], with the flags
	Internet hostname, such as the "mailto" URI scheme, cause that	"UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing an
	subcomponent to be case-insensitive and thus subject to case	invalid IDN cannot successfully be resolved. Validated IDN
	normalization (e.g., "mailto:Joe@Example.COM" is equivalent to	components of IRIs SHOULD be character normalized using the Nameprep
	"mailto:Joe@example.com" even though the generic syntax considers the	process [RFC3491]; however, for legibility purposes, they SHOULD NOT
	path component to be case-sensitive).	be converted into ASCII Compatible Encoding (ACE).

		Scheme-based normalization may also consider IDN components and their
		conversions to punycode as equivalent. As an example,
		http://r&#xE9;sum&#xE9;.example.org may be considered equivalent to
		http://xn--rsum-bpad.example.org.
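
The host equivalence in this example can be checked with a short Python sketch. It relies on Python's built-in "idna" codec, which implements the IDNA conversion of [RFC3490] including Nameprep. Which ToASCII flags the codec applies is not examined here, so treat this as an approximation. The helper name is hypothetical.

```python
def idn_hosts_equivalent(host_a: str, host_b: str) -> bool:
    # Convert both host names to their ASCII Compatible Encoding and compare.
    # Lowercasing covers hosts that are already ASCII, which the codec passes
    # through without case folding.
    return host_a.encode("idna").lower() == host_b.encode("idna").lower()


unicode_host = "r\u00e9sum\u00e9.example.org"
ace_host = "xn--rsum-bpad.example.org"

print(unicode_host.encode("idna"))
print(idn_hosts_equivalent(unicode_host, ace_host))
```

The ACE form is computed for comparison only; for legibility the IRI itself would keep the non-ASCII host.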

	Other scheme-specific normalizations are possible.	Other scheme-specific normalizations are possible.

	6.2.4 Protocol-based Normalization	5.3.4 Protocol-based Normalization

	Web spiders, for which substantial effort to reduce the incidence of	Web spiders, for which substantial effort to reduce the incidence of
	false negatives is often cost-effective, are observed to implement	false negatives is often cost-effective, are observed to implement
	even more aggressive techniques in URI comparison. For example, if	even more aggressive techniques in IRI comparison. For example, if
	they observe that a URI such as	they observe that an IRI such as

	http://example.com/data	http://example.com/data

	redirects to a URI differing only in the trailing slash	redirects to an IRI differing only in the trailing slash

	http://example.com/data/	http://example.com/data/

	they will likely regard the two as equivalent in the future. This	they will likely regard the two as equivalent in the future. This
	kind of technique is only appropriate when equivalence is clearly	kind of technique is only appropriate when equivalence is clearly
	indicated by both the result of accessing the resources and the	indicated by both the result of accessing the resources and the
	common conventions of their scheme's dereference algorithm (in this	common conventions of their scheme's dereference algorithm (in this
	case, use of redirection by HTTP origin servers to avoid problems	case, use of redirection by HTTP origin servers to avoid problems
	with relative references).	with relative references).

End of changes.
This html diff was produced by rfcdiff 1.16, available from http://www.levkowetz.com/ietf/tools/rfcdiff/