| normuri.txt | normiri.txt | |||
|---|---|---|---|---|
| 6. Normalization and Comparison | ||||
| One of the most common operations on URIs is simple comparison: | 5. Normalization and Comparison | |||
| determining if two URIs are equivalent without using the URIs to | ||||
| access their respective resource(s). A comparison is performed every | ||||
| time a response cache is accessed, a browser checks its history to | ||||
| color a link, or an XML parser processes tags within a namespace. | ||||
| Extensive normalization prior to comparison of URIs is often used by | ||||
| spiders and indexing engines to prune a search space or reduce | ||||
| duplication of request actions and response storage. | ||||
| URI comparison is performed in respect to some particular purpose, | Note: The structure and much of the material for this section is | |||
| taken from section 6 of [RFCYYYY]; the differences are due to the | ||||
| specifics of IRIs. | ||||
| One of the most common operations on IRIs is simple comparison: | ||||
| determining if two IRIs are equivalent without using the IRIs or the | ||||
| mapped URIs to access their respective resource(s). A comparison is | ||||
| performed every time a response cache is accessed, a browser checks | ||||
| its history to color a link, or an XML parser processes tags within a | ||||
| namespace. Extensive normalization prior to comparison of IRIs may | ||||
| be used by spiders and indexing engines to prune a search space or | ||||
| reduce duplication of request actions and response storage. | ||||
| IRI comparison is performed in respect to some particular purpose, | ||||
| and implementations with differing purposes will often be subject to | and implementations with differing purposes will often be subject to | |||
| differing design trade-offs in regards to how much effort should be | differing design trade-offs in regards to how much effort should be | |||
| spent in reducing aliased identifiers. This section describes a | spent in reducing aliased identifiers. This section describes a | |||
| variety of methods that may be used to compare URIs, the trade-offs | variety of methods that may be used to compare IRIs, the trade-offs | |||
| between them, and the types of applications that might use them. | between them, and the types of applications that might use them. | |||
| 6.1 Equivalence | 5.1 Equivalence | |||
| Since URIs exist to identify resources, presumably they should be | Since IRIs exist to identify resources, presumably they should be | |||
| considered equivalent when they identify the same resource. However, | considered equivalent when they identify the same resource. However, | |||
| such a definition of equivalence is not of much practical use, since | such a definition of equivalence is not of much practical use, since | |||
| there is no way for an implementation to compare two resources that | there is no way for an implementation to compare two resources that | |||
| are not under its own control. For this reason, determination of | are not under its own control. For this reason, determination of | |||
| equivalence or difference of URIs is based on string comparison, | equivalence or difference of IRIs is based on string comparison, | |||
| perhaps augmented by reference to additional rules provided by URI | perhaps augmented by reference to additional rules provided by URI | |||
| scheme definitions. We use the terms "different" and "equivalent" to | scheme definitions. We use the terms "different" and "equivalent" to | |||
| describe the possible outcomes of such comparisons, but there are | describe the possible outcomes of such comparisons, but there are | |||
| many application-dependent versions of equivalence. | many application-dependent versions of equivalence. | |||
| Even though it is possible to determine that two URIs are equivalent, | Even though it is possible to determine that two IRIs are equivalent, | |||
| URI comparison is not sufficient to determine if two URIs identify | IRI comparison is not sufficient to determine if two IRIs identify | |||
| different resources. For example, an owner of two different domain | different resources. For example, an owner of two different domain | |||
| names could decide to serve the same resource from both, resulting in | names could decide to serve the same resource from both, resulting in | |||
| two different URIs. Therefore, comparison methods are designed to | two different IRIs. Therefore, comparison methods are designed to | |||
| minimize false negatives while strictly avoiding false positives. | minimize false negatives while strictly avoiding false positives. | |||
| In testing for equivalence, applications should not directly compare | In testing for equivalence, applications should not directly compare | |||
| relative references; the references should be converted to their | relative references; the references should be converted to their | |||
| respective target URIs before comparison. When URIs are being | respective target IRIs before comparison. When IRIs are being | |||
| compared for the purpose of selecting (or avoiding) a network action, | compared for the purpose of selecting (or avoiding) a network action, | |||
| such as retrieval of a representation, fragment components (if any) | such as retrieval of a representation, fragment components (if any) | |||
| should be excluded from the comparison. | should be excluded from the comparison. | |||
| 6.2 Comparison Ladder | Applications using IRIs as identity tokens with no relationship to a | |||
| protocol MUST use the Simple String Comparison (see Section 5.3.1). | ||||
| All other applications MUST select one of the comparison practices | ||||
| from the Comparison Ladder (see Section 5.3), or, after IRI-to-URI | ||||
| conversion, select one of the comparison practices from the URI | ||||
| comparison ladder [RFCYYYY], Section 6.2. | ||||
| A variety of methods are used in practice to test URI equivalence. | 5.2 Preparation for Comparison | |||
| Any kind of IRI comparison REQUIRES that all escapings or encodings | ||||
| in the protocol or format that carries an IRI are resolved. This is | ||||
| usually done when parsing the protocol or format. Examples of such | ||||
| escapings or encodings are entities and numeric character references | ||||
| in [HTML4] and [XML1]. As an example, http://example.org/ros&eacute; | ||||
| (in HTML), http://example.org/ros&#233; (in HTML or XML), and | ||||
| http://example.org/ros&#xE9; (in HTML or XML) all get resolved into | ||||
| what is denoted in this document (see Section 1.4) as | ||||
| http://example.org/ros&#xE9; (the "&#xE9;" here standing for the | ||||
| actual e-acute character, to compensate for the fact that this | ||||
| document cannot contain non-ASCII characters). | ||||
| Similar considerations apply to encodings such as Transfer Codings in | ||||
| HTTP (see [RFC2616]) and Content Transfer Encodings in MIME [RFC2045], | ||||
| although in these cases, the encoding is not based on characters, but | ||||
| on octets, and additional care is required to make sure that | ||||
| characters, and not just arbitrary octets, are compared (see Section | ||||
| 5.3.1). | ||||
| 5.3 Comparison Ladder | ||||
| A variety of methods are used in practice to test IRI equivalence. | ||||
| These methods fall into a range, distinguished by the amount of | These methods fall into a range, distinguished by the amount of | |||
| processing required and the degree to which the probability of false | processing required and the degree to which the probability of false | |||
| negatives is reduced. As noted above, false negatives cannot be | negatives is reduced. As noted above, false negatives cannot be | |||
| eliminated. In practice, their probability can be reduced, but this | eliminated. In practice, their probability can be reduced, but this | |||
| reduction requires more processing and is not cost-effective for all | reduction requires more processing and is not cost-effective for all | |||
| applications. | applications. | |||
| If this range of comparison practices is considered as a ladder, the | If this range of comparison practices is considered as a ladder, the | |||
| following discussion will climb the ladder, starting with those | following discussion will climb the ladder, starting with those | |||
| practices that are cheap but have a relatively higher chance of | practices that are cheap but have a relatively higher chance of | |||
| producing false negatives, and proceeding to those that have higher | producing false negatives, and proceeding to those that have higher | |||
| computational cost and lower risk of false negatives. | computational cost and lower risk of false negatives. | |||
| 6.2.1 Simple String Comparison | 5.3.1 Simple String Comparison | |||
| If two URIs, considered as character strings, are identical, then it | If two IRIs, considered as character strings, are identical, then it | |||
| is safe to conclude that they are equivalent. This type of | is safe to conclude that they are equivalent. This type of | |||
| equivalence test has very low computational cost and is in wide use | equivalence test has very low computational cost and is in wide use | |||
| in a variety of applications, particularly in the domain of parsing. | in a variety of applications, particularly in the domain of parsing | |||
| and when a definitive answer to the question of IRI equivalence is | ||||
| needed that is independent of the scheme used and can be calculated | ||||
| quickly and without accessing a network. An example of such a case | ||||
| is XML Namespaces ([XMLNamespace]). | ||||
| Testing strings for equivalence requires some basic precautions. | Testing strings for equivalence requires some basic precautions. | |||
| This procedure is often referred to as "bit-for-bit" or | This procedure is often referred to as "bit-for-bit" or | |||
| "byte-for-byte" comparison, which is potentially misleading. Testing | "byte-for-byte" comparison, which is potentially misleading. Testing | |||
| of strings for equality is normally based on pairwise comparison of | of strings for equality is normally based on pairwise comparison of | |||
| the characters that make up the strings, starting from the first and | the characters that make up the strings, starting from the first and | |||
| proceeding until both strings are exhausted and all characters found | proceeding until both strings are exhausted and all characters found | |||
| to be equal, a pair of characters compares unequal, or one of the | to be equal, a pair of characters compares unequal, or one of the | |||
| strings is exhausted before the other. | strings is exhausted before the other. | |||
| Such character comparisons require that each pair of characters be | Such character comparisons require that each pair of characters be | |||
| put in comparable form. For example, should one URI be stored in a | put in comparable encoding form. For example, should one IRI be | |||
| byte array in EBCDIC encoding, and the second be in a Java String | stored in a byte array in UTF-8 encoding form, and the second be in a | |||
| object (UTF-16), bit-for-bit comparisons applied naively will produce | UTF-16 encoding form, bit-for-bit comparisons applied naively will | |||
| errors. It is better to speak of equality on a | produce errors. It is better to speak of equality on a | |||
| character-for-character rather than byte-for-byte or bit-for-bit | character-for-character rather than byte-for-byte or bit-for-bit | |||
| basis. In practical terms, character-by-character comparisons should | basis. In practical terms, character-by-character comparisons should | |||
| be done codepoint-by-codepoint after conversion to a common character | be done codepoint-by-codepoint after conversion to a common character | |||
| encoding. | encoding form. When comparing character-by-character, the comparison | |||
| function MUST NOT map IRIs to URIs, because such a mapping would | ||||
| create additional spurious equivalences. It follows that IRIs SHOULD | ||||
| NOT be modified when being transported if there is any chance that | ||||
| this IRI might be used as an identifier. | ||||
| False negatives are caused by the production and use of URI aliases. | False negatives are caused by the production and use of IRI aliases. | |||
| Unnecessary aliases can be reduced, regardless of the comparison | Unnecessary aliases can be reduced, regardless of the comparison | |||
| method, by consistently providing URI references in an | method, by consistently providing IRI references in an | |||
| already-normalized form (i.e., a form identical to what would be | already-normalized form (i.e., a form identical to what would be | |||
| produced after normalization is applied, as described below). | produced after normalization is applied, as described below). | |||
| Protocols and data formats often choose to limit some URI comparisons | Protocols and data formats often choose to limit some IRI comparisons | |||
| to simple string comparison, based on the theory that people and | to simple string comparison, based on the theory that people and | |||
| implementations will, in their own best interest, be consistent in | implementations will, in their own best interest, be consistent in | |||
| providing URI references, or at least consistent enough to negate any | providing IRI references, or at least consistent enough to negate any | |||
| efficiency that might be obtained from further normalization. | efficiency that might be obtained from further normalization. | |||
| 6.2.2 Syntax-based Normalization | 5.3.2 Syntax-based Normalization | |||
| Implementations may use logic based on the definitions provided by | Implementations may use logic based on the definitions provided by | |||
| this specification to reduce the probability of false negatives. | this specification to reduce the probability of false negatives. | |||
| Such processing is moderately higher in cost than | Such processing is moderately higher in cost than | |||
| character-for-character string comparison. For example, an | character-for-character string comparison. For example, an | |||
| application using this approach could reasonably consider the | application using this approach could reasonably consider the | |||
| following two URIs equivalent: | following two IRIs equivalent: | |||
| example://a/b/c/%7Bfoo%7D | example://a/b/c/%7Bfoo%7D/ros&#xE9; | |||
| eXAMPLE://a/./b/../b/%63/%7bfoo%7d | eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9 | |||
| Web user agents, such as browsers, typically apply this type of URI | Web user agents, such as browsers, typically apply this type of IRI | |||
| normalization when determining whether a cached response is | normalization when determining whether a cached response is | |||
| available. Syntax-based normalization includes such techniques as | available. Syntax-based normalization includes such techniques as | |||
| case normalization, percent-encoding normalization, and removal of | case normalization, character normalization, percent-encoding | |||
| dot-segments. | normalization, and removal of dot-segments. | |||
| 6.2.2.1 Case Normalization | 5.3.2.1 Case Normalization | |||
| For all URIs, the hexadecimal digits within a percent-encoding | For all IRIs, the hexadecimal digits within a percent-encoding | |||
| triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore | triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore | |||
| should be normalized to use uppercase letters for the digits A-F. | should be normalized to use uppercase letters for the digits A-F. | |||
| When a URI uses components of the generic syntax, the component | When an IRI uses components of the generic syntax, the component | |||
| syntax equivalence rules always apply; namely, that the scheme and | syntax equivalence rules always apply; namely, that the scheme and | |||
| host are case-insensitive and therefore should be normalized to | US-ASCII only host are case-insensitive and therefore should be | |||
| lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is | normalized to lowercase. For example, the URI | |||
| equivalent to <http://www.example.com/>. The other generic syntax | <HTTP://www.EXAMPLE.com/> is equivalent to <http://www.example.com/>. | |||
| Case equivalence for non-ASCII characters in IRI components that are | ||||
| IDNs is discussed in Section 5.3.3. The other generic syntax | ||||
| components are assumed to be case-sensitive unless specifically | components are assumed to be case-sensitive unless specifically | |||
| defined otherwise by the scheme (see Section 6.2.3). | defined otherwise by the scheme. | |||
| 6.2.2.2 Percent-Encoding Normalization | Creating schemes that allow case-insensitive syntax components | |||
| containing non US-ASCII characters should be avoided because such a | ||||
| case normalization may be culturally dependent and is always a complex | ||||
| operation. The only exception concerns non-ASCII host names for | ||||
| which the character normalization includes a mapping step derived | ||||
| from case folding. | ||||
| The percent-encoding mechanism (Section 2.1) is a frequent source of | 5.3.2.2 Character Normalization | |||
| variance among otherwise identical URIs. In addition to the case | ||||
| normalization issue noted above, some URI producers percent-encode | ||||
| octets that do not require percent-encoding, resulting in URIs that | ||||
| are equivalent to their non-encoded counterparts. Such URIs should | ||||
| be normalized by decoding any percent-encoded octet that corresponds | ||||
| to an unreserved character, as described in Section 2.3. | ||||
| 6.2.2.3 Path Segment Normalization | The Unicode Standard [UNIV4] defines various equivalences between | |||
| sequences of characters for various purposes. Unicode Standard Annex | ||||
| #15 [UTR15] defines various Normalization Forms for these | ||||
| equivalences, in particular Normalization Form C (NFC, Canonical | ||||
| Decomposition, followed by Canonical Composition) and Normalization | ||||
| Form KC (NFKC, Compatibility Decomposition, followed by Canonical | ||||
| Composition). | ||||
| Equivalence of IRIs MUST rely on the assumption that IRIs are | ||||
| appropriately pre-character-normalized, rather than applying | ||||
| character normalization when comparing two IRIs. The exceptions are | ||||
| conversion from a non-digital form, and conversion from a | ||||
| non-UCS-based character encoding to a UCS-based character encoding. | ||||
| In these cases, NFC or a normalizing transcoder using NFC MUST be | ||||
| used for interoperability. To avoid false negatives and problems | ||||
| with transcoding, IRIs SHOULD be created using NFC. Using NFKC may | ||||
| avoid even more problems, for example by choosing half-width Latin | ||||
| letters instead of full-width, and full-width Katakana instead of | ||||
| half-width. | ||||
| As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML | ||||
| Notation) is in NFC. On the other hand, | ||||
| http://www.example.org/re&#x301;sume&#x301;.html is not in NFC. The | ||||
| former uses precombined e-acute characters, the latter uses 'e' | ||||
| characters followed by combining acute accents. Both usages are | ||||
| defined to be canonically equivalent in [UNIV4]. | ||||
| Note: Because it is unknown how a particular sequence of characters | ||||
| is being treated with respect to character normalization, it would | ||||
| be inappropriate to allow third parties to normalize an IRI | ||||
| arbitrarily. This does not contradict the recommendation that | ||||
| when a resource is created, its IRI should be as | ||||
| character-normalized as possible (i.e. NFC or even NFKC). This | ||||
| is similar to the upper-case/lower-case problems in URIs. | ||||
| Some parts of a URI are case-insensitive (domain name). For | ||||
| others, it is unclear whether they are case-sensitive or | ||||
| case-insensitive, or something in between (e.g. case-sensitive, | ||||
| but if the wrong case is used, a multiple choice selection is | ||||
| provided instead of a direct negative result). The best recipe is | ||||
| that the creator uses a reasonable capitalization, and when | ||||
| transferring the URI, that capitalization is never changed. | ||||
| Various IRI schemes may allow the usage of Internationalized Domain | ||||
| Names (IDN) [RFC3490] either in the ireg-name part or elsewhere. | ||||
| Character Normalization also applies to IDNs, as discussed in Section | ||||
| 5.3.3. | ||||
| 5.3.2.3 Percent-Encoding Normalization | ||||
| The percent-encoding mechanism (Section 2.1 of [RFCYYYY]) is a | ||||
| frequent source of variance among otherwise identical IRIs. In | ||||
| addition to the case normalization issue noted above, some IRI | ||||
| producers percent-encode octets that do not require percent-encoding, | ||||
| resulting in IRIs that are equivalent to their non-encoded | ||||
| counterparts. Such IRIs should be normalized by decoding any | ||||
| percent-encoded octet sequence that corresponds to an unreserved | ||||
| character, as described in Section 2.3 of [RFCYYYY]. | ||||
| For actual resolution, differences in percent-encoding (except for | ||||
| the percent-encoding of reserved characters) MUST always result in | ||||
| the same resource. For example, http://example.org/~user, | ||||
| http://example.org/%7euser and http://example.org/%7Euser must | ||||
| resolve to the same resource. | ||||
| If this kind of equivalence is to be tested, the percent-encoding of | ||||
| both IRIs to be compared has to be aligned, for example by converting | ||||
| both IRIs to URIs (see Section 3.1), eliminating escape differences | ||||
| in the resulting URIs, and making sure that the case of the | ||||
| hexadecimal characters in the percent-encoding is always the same | ||||
| (preferably upper case). If the IRI is to be passed to another | ||||
| application, or used further in some other way, its original form | ||||
| MUST be preserved; the conversion described here should be performed | ||||
| only for the purpose of local comparison. | ||||
| 5.3.2.4 Path Segment Normalization | ||||
| The complete path segments "." and ".." are intended only for use | The complete path segments "." and ".." are intended only for use | |||
| within relative references (Section 4.1) and are removed as part of | within relative references (Section 4.1 of [RFCYYYY]) and are removed | |||
| the reference resolution process (Section 5.2). However, some | as part of the reference resolution process (Section 5.2 of | |||
| deployed implementations incorrectly assume that reference resolution | [RFCYYYY]). However, some implementations may incorrectly assume | |||
| is not necessary when the reference is already a URI, and thus fail | that reference resolution is not necessary when the reference is | |||
| to remove dot-segments when they occur in non-relative paths. URI | already an IRI, and thus fail to remove dot-segments when they occur | |||
| normalizers should remove dot-segments by applying the | in non-relative paths. IRI normalizers should remove dot-segments by | |||
| remove_dot_segments algorithm to the path, as described in | applying the remove_dot_segments algorithm to the path, as described | |||
| Section 5.2.4. | in Section 5.2.4 of [RFCYYYY]. | |||
| 6.2.3 Scheme-based Normalization | 5.3.3 Scheme-based Normalization | |||
| The syntax and semantics of URIs vary from scheme to scheme, as | The syntax and semantics of IRIs vary from scheme to scheme, as | |||
| described by the defining specification for each scheme. | described by the defining specification for each scheme. | |||
| Implementations may use scheme-specific rules, at further processing | Implementations may use scheme-specific rules, at further processing | |||
| cost, to reduce the probability of false negatives. For example, | cost, to reduce the probability of false negatives. For example, | |||
| since the "http" scheme makes use of an authority component, has a | since the "http" scheme makes use of an authority component, has a | |||
| default port of "80", and defines an empty path to be equivalent to | default port of "80", and defines an empty path to be equivalent to | |||
| "/", the following four URIs are equivalent: | "/", the following four IRIs are equivalent: | |||
| http://example.com | http://example.com | |||
| http://example.com/ | http://example.com/ | |||
| http://example.com:/ | http://example.com:/ | |||
| http://example.com:80/ | http://example.com:80/ | |||
| In general, a URI that uses the generic syntax for authority with an | In general, an IRI that uses the generic syntax for authority with an | |||
| empty path should be normalized to a path of "/"; likewise, an | empty path should be normalized to a path of "/"; likewise, an | |||
| explicit ":port", where the port is empty or the default for the | explicit ":port", where the port is empty or the default for the | |||
| scheme, is equivalent to one where the port and its ":" delimiter are | scheme, is equivalent to one where the port and its ":" delimiter are | |||
| elided, and thus should be removed by scheme-based normalization. | elided, and thus should be removed by scheme-based normalization. | |||
| For example, the second URI above is the normal form for the "http" | For example, the second IRI above is the normal form for the "http" | |||
| scheme. | scheme. | |||
| Another case where normalization varies by scheme is in the handling | Another case where normalization varies by scheme is in the handling | |||
| of an empty authority component or empty host subcomponent. For many | of an empty authority component or empty host subcomponent. For many | |||
| scheme specifications, an empty authority or host is considered an | scheme specifications, an empty authority or host is considered an | |||
| error; for others, it is considered equivalent to "localhost" or the | error; for others, it is considered equivalent to "localhost" or the | |||
| end-user's host. When a scheme defines a default for authority and a | end-user's host. When a scheme defines a default for authority and | |||
| URI reference to that default is desired, the reference should be | an IRI reference to that default is desired, the reference should be | |||
| normalized to an empty authority for the sake of uniformity, brevity, | normalized to an empty authority for the sake of uniformity, brevity, | |||
| and internationalization. If, however, either the userinfo or port | and internationalization. If, however, either the userinfo or port | |||
| subcomponent is non-empty, then the host should be given explicitly | subcomponent is non-empty, then the host should be given explicitly | |||
| even if it matches the default. | even if it matches the default. | |||
| Normalization should not remove delimiters when their associated | Normalization should not remove delimiters when their associated | |||
| component is empty unless licensed to do so by the scheme | component is empty unless licensed to do so by the scheme | |||
| specification. For example, the URI "http://example.com/?" cannot be | specification. For example, the IRI "http://example.com/?" cannot be | |||
| assumed to be equivalent to any of the examples above. Likewise, the | assumed to be equivalent to any of the examples above. Likewise, the | |||
| presence or absence of delimiters within a userinfo subcomponent is | presence or absence of delimiters within a userinfo subcomponent is | |||
| usually significant to its interpretation. The fragment component is | usually significant to its interpretation. The fragment component is | |||
| not subject to any scheme-based normalization; thus, two URIs that | not subject to any scheme-based normalization; thus, two IRIs that | |||
| differ only by the suffix "#" are considered different regardless of | differ only by the suffix "#" are considered different regardless of | |||
| the scheme. | the scheme. | |||
| Some schemes define additional subcomponents that consist of | Some IRI schemes may allow the usage of Internationalized Domain | |||
| case-insensitive data, giving an implicit license to normalizers to | Names (IDN) [RFC3490] either in their ireg-name part or elsewhere. | |||
| convert such data to a common case (e.g., all lowercase). For | When in use in IRIs, those names SHOULD be validated using the | |||
| example, URI schemes that define a subcomponent of path to contain an | ToASCII operation defined in [RFC3490], with the flags | |||
| Internet hostname, such as the "mailto" URI scheme, cause that | "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing an | |||
| subcomponent to be case-insensitive and thus subject to case | invalid IDN cannot successfully be resolved. Validated IDN | |||
| normalization (e.g., "mailto:Joe@Example.COM" is equivalent to | components of IRIs SHOULD be character normalized using the Nameprep | |||
| "mailto:Joe@example.com" even though the generic syntax considers the | process [RFC3491]; however, for legibility purposes, they SHOULD NOT | |||
| path component to be case-sensitive). | be converted into ASCII Compatible Encoding (ACE). | |||
| Scheme-based normalization may also consider IDN components and their | ||||
| conversions to punycode as equivalent. As an example, | ||||
| http://r&#xE9;sum&#xE9;.example.org may be considered equivalent to | ||||
| http://xn--rsum-bpad.example.org | ||||
| Other scheme-specific normalizations are possible. | Other scheme-specific normalizations are possible. | |||
| 6.2.4 Protocol-based Normalization | 5.3.4 Protocol-based Normalization | |||
| Web spiders, for which substantial effort to reduce the incidence of | Web spiders, for which substantial effort to reduce the incidence of | |||
| false negatives is often cost-effective, are observed to implement | false negatives is often cost-effective, are observed to implement | |||
| even more aggressive techniques in URI comparison. For example, if | even more aggressive techniques in IRI comparison. For example, if | |||
| they observe that a URI such as | they observe that an IRI such as | |||
| http://example.com/data | http://example.com/data | |||
| redirects to a URI differing only in the trailing slash | redirects to an IRI differing only in the trailing slash | |||
| http://example.com/data/ | http://example.com/data/ | |||
| they will likely regard the two as equivalent in the future. This | they will likely regard the two as equivalent in the future. This | |||
| kind of technique is only appropriate when equivalence is clearly | kind of technique is only appropriate when equivalence is clearly | |||
| indicated by both the result of accessing the resources and the | indicated by both the result of accessing the resources and the | |||
| common conventions of their scheme's dereference algorithm (in this | common conventions of their scheme's dereference algorithm (in this | |||
| case, use of redirection by HTTP origin servers to avoid problems | case, use of redirection by HTTP origin servers to avoid problems | |||
| with relative references). | with relative references). | |||
| End of changes. | ||||
This html diff was produced by rfcdiff 1.16, available from http://www.levkowetz.com/ietf/tools/rfcdiff/ | ||||
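
The syntax-based normalization steps of Section 5.3.2 (case normalization, percent-encoding normalization, and removal of dot-segments) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names `syntax_normalize` and `percent_normalize` are hypothetical, non-ASCII characters are percent-encoded as UTF-8 purely to align the two forms for local comparison (following the "converting both IRIs to URIs" suggestion in Section 5.3.2.3), no character normalization such as NFC is applied (Section 5.3.2.2 assumes IRIs are already normalized), and IDN handling is left out entirely. Relative references are assumed to have been resolved beforehand.

```python
import re
import string

# Unreserved characters of the generic syntax: ALPHA / DIGIT / "-" / "." / "_" / "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")
# Generic component split: scheme, authority, path, query, fragment.
COMPONENTS = re.compile(
    r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$",
    re.DOTALL,
)


def percent_normalize(text):
    """Align percent-encoding (assumption: for local comparison only)."""
    # Percent-encode non-ASCII characters as UTF-8 so both spellings align.
    text = "".join(
        c if ord(c) < 128 else "".join("%%%02X" % b for b in c.encode("utf-8"))
        for c in text
    )

    def fix(match):
        char = chr(int(match.group(1), 16))
        # Decode triplets for unreserved characters; uppercase the hex of the rest.
        return char if char in UNRESERVED else "%" + match.group(1).upper()

    return PCT_TRIPLET.sub(fix, text)


def remove_dot_segments(path):
    """Remove "." and ".." segments, following the algorithm named in the text."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./") or path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../") or path == "/..":
            path = "/" + path[4:]
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            end = path.find("/", 1)
            end = len(path) if end == -1 else end
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


def syntax_normalize(iri):
    """Return a comparison key; the original IRI is kept for any other use."""
    scheme, authority, path, query, fragment = COMPONENTS.match(iri).groups()
    out = []
    if scheme is not None:
        out.append(scheme.translate(ASCII_LOWER) + ":")
    if authority is not None:
        userinfo, at, hostport = authority.rpartition("@")
        # Only US-ASCII letters of the host are folded; a port is digits, so
        # lowercasing the whole host[:port] part leaves it unchanged.
        host = percent_normalize(percent_normalize(hostport).translate(ASCII_LOWER))
        out.append("//" + percent_normalize(userinfo) + at + host)
    out.append(remove_dot_segments(percent_normalize(path)))
    # Delimiters of empty components are preserved, as the text requires.
    if query is not None:
        out.append("?" + percent_normalize(query))
    if fragment is not None:
        out.append("#" + percent_normalize(fragment))
    return "".join(out)


a = "example://a/b/c/%7Bfoo%7D/ros\u00e9"
b = "eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9"
print(syntax_normalize(a) == syntax_normalize(b))
```

The two example IRIs from Section 5.3.2 produce the same key here, while an IRI such as http://example.com/? keeps its empty-query delimiter. The key is meant only for comparison; passing it on in place of the original IRI would go against the requirement that the original form be preserved.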