Character Model for the World Wide Web 1.0: Resource Identifiers

W3C Working Group Note

This version:
https://www.w3.org/TR/2019/NOTE-charmod-resid-20190502/
Latest published version:
https://www.w3.org/TR/charmod-resid/
Previous version:
https://www.w3.org/TR/2004/CR-charmod-resid-20041122/
Editor:
(W3C)
Former editors:
Martin J. Dürst
François Yergeau
Misha Wolf
Tex Texin

Abstract

This was originally intended to be an Architectural Specification providing authors of specifications, software developers, and content developers with a common reference for the use of resource identifiers building on the Universal Character Set, defined jointly by the Unicode Standard and ISO/IEC 10646.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.

This document has been retired.

Many of the materials in this document are stale and out of date; the W3C is maintaining this version solely as a historical reference, and currently has no plans to work on it further.

Section 3 of this document was formerly Section 7 of the Character Model for the World Wide Web 1.0: Fundamentals Last Call Working Draft published 25 February 2004. A more detailed change log is given in Appendix C, Changes.

For topics such as use of the terms 'character', 'encoding' and 'string', a reference processing model, choice and identification of character encodings, character escaping, and string indexing, see Character Model for the World Wide Web 1.0: Fundamentals [CharMod]. For normalization and string identity matching, see Character Model for the World Wide Web: String Matching [CharNorm].

This document was published by the Internationalization Working Group as a Working Group Note.

Comments regarding this document are welcome. Please send them to www-international@w3.org (archives).

Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the W3C Patent Policy. The group does not expect this document to become a W3C Recommendation.

This document is governed by the 1 March 2019 W3C Process Document.

1. Introduction

The goal of the Character Model for the World Wide Web is to facilitate use of the Web by all people, regardless of their language, script, writing system, and cultural conventions, in accordance with the W3C goal of universal access. One basic prerequisite to achieve this goal is to be able to transmit and process the characters used around the world in a well-defined and well-understood way.

The main target audience of this specification is W3C specification developers. This specification and parts of it can be referenced from other W3C specifications. It defines conformance criteria for W3C specifications as well as other specifications.

The character model described in this specification provides authors of specifications, software developers, and content developers with a common reference for consistent, interoperable text manipulation on the World Wide Web. Working together, these three groups can build a more international Web.

The topic addressed in this part of the Character Model for the World Wide Web is resource identifiers, and specifically their character encoding. A resource identifier is a compact string of characters for identifying an abstract or physical resource.

Other parts of the Character Model address the fundamental aspects of the model ([CharMod]) and normalization and string identity matching ([CharNorm]). For more background information, please see [CharMod].

Topics not yet addressed, or only barely touched on, include fuzzy matching and language tagging. Some of these topics may be addressed in future versions or parts of this specification.

At the core of the model is the Universal Character Set (UCS), defined jointly by the Unicode Standard [Unicode] and ISO/IEC 10646 [ISO/IEC 10646]. In this document, Unicode is used as a synonym for the Universal Character Set. The model will allow Web documents authored in the world's scripts (and on different platforms) to be exchanged, read, and searched by Web users around the world.

2. Conformance

This section explains the conditions that specifications, software, and Web content have to fulfill to be able to claim conformance to this specification.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC 2119].

Note

NOTE: RFC 2119 makes it clear that requirements that use SHOULD are not optional and must be complied with unless there are specific reasons not to: "This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course."

This specification defines conformance criteria for specifications. All conformance criteria are preceded by '[S]' where 'S' stands for specifications.

A specification conforms to this document if it:

  1. does not violate any conformance criteria preceded by [S],

  2. documents the reason for any deviation from criteria where the imperative is SHOULD, SHOULD NOT, or RECOMMENDED,

  3. where applicable, requires implementations conforming to the specification to conform to this document,

  4. where applicable, requires content conforming to the specification to conform to this document.

Note

NOTE: Requirements placed on specifications might indirectly cause requirements to be placed on implementations or content that claim to conform to those specifications. Likewise, requirements placed on content may affect implementations designed to produce such content, and so on.

Where this specification places requirements on processing, it is to be understood as a way to specify the desired external behavior. Implementations can use other means of achieving the same results, as long as observable behavior is not affected.

3. Character Encoding in Resource Identifiers

According to the definition in RFC 2396 [RFC 2396], URI references are restricted to a subset of US-ASCII, with an escaping mechanism, the %HH convention, to encode arbitrary byte values. However, the %HH convention by itself is of limited use because there is no definitive mapping from characters to bytes. Also, non-ASCII characters cannot be used directly. Internationalized Resource Identifiers (IRIs) [I-D IRI] solve both problems with a uniform approach that conforms to the Reference Processing Model.
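The missing character-to-byte mapping can be illustrated with a small Python sketch. It is not part of the original specification; the choice of the character U+00E9 and of ISO-8859-1 as the competing encoding are assumptions made purely for illustration. The same character yields two different %HH sequences depending on which encoding is applied before escaping.

```python
from urllib.parse import quote

# U+00E9 LATIN SMALL LETTER E WITH ACUTE (illustrative choice)
ch = "\u00e9"

# The %HH convention escapes bytes, so the result depends on the
# character encoding chosen before escaping.
latin1_form = quote(ch, encoding="iso-8859-1")
utf8_form = quote(ch, encoding="utf-8")

print(latin1_form)  # %E9
print(utf8_form)    # %C3%A9
```

A recipient that sees only `%E9` or `%C3%A9` cannot tell which character was intended unless the encoding is fixed by convention.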

C058 [S] Specifications that define protocol or format elements (e.g. HTTP headers, XML attributes, etc.) which are to be interpreted as URI references (or specific subsets of URI references, such as absolute URI references, URIs, etc.) SHOULD use Internationalized Resource Identifiers (IRIs) [I-D IRI] (or an appropriate subset thereof).

C059 [S] Specifications MUST define when the conversion from IRI references to URI references (or subsets thereof) takes place, in accordance with Internationalized Resource Identifiers (IRIs) [I-D IRI].

Note

NOTE: Many current specifications already contain provisions in accordance with Internationalized Resource Identifiers (IRIs) [I-D IRI]. For XML 1.0 [XML 1.0], see Section 4.2.2, External Entities. XML Schema Part 2: Datatypes [XML Schema-2] provides the anyURI datatype (see Section 3.2.17). The XML Linking Language (XLink) [XLink] provides the href attribute (see Section 5.4, Locator Attribute). Further information and links can be found at Internationalization: URIs and other identifiers [Info URI-I18N].

Note

NOTE: Document formats should allow IRIs to be used; handlers for protocols that do not currently support IRIs can convert the IRI to a URI when the IRI is dereferenced.
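As a rough illustration of converting an IRI to a URI at dereference time, the following Python sketch percent-escapes the UTF-8 bytes of every non-ASCII character and leaves ASCII characters untouched. The function name `iri_to_uri` and the example address are hypothetical. The sketch deliberately omits other steps a full conversion may involve, such as Unicode normalization and any special handling of host names, so it should be read as a simplified approximation rather than the procedure defined in [I-D IRI].

```python
def iri_to_uri(iri: str) -> str:
    """Simplified IRI-to-URI mapping (hypothetical helper).

    Each non-ASCII character is encoded as UTF-8 and every resulting
    byte is written as %HH. ASCII characters, including existing
    %HH escapes and URI delimiters, pass through unchanged.
    """
    parts = []
    for ch in iri:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            parts.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(parts)


# Hypothetical example IRI containing U+00E9 twice.
example = iri_to_uri("http://example.org/r\u00e9sum\u00e9")
print(example)  # http://example.org/r%C3%A9sum%C3%A9
```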

C060 [S] Specifications that define new syntax for URIs, such as a new URI scheme or a new kind of fragment identifier, MUST specify that characters outside the US-ASCII repertoire are encoded using UTF-8 and %HH-escaping.

This is in accordance with Guidelines for new URL Schemes [RFC 2718], Section 2.2.5.
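A specification following C060 might describe processing along the lines of the Python sketch below. The helper names `encode_fragment` and `decode_fragment` are hypothetical, and treating invalid UTF-8 as an error is an assumption of this sketch, not something the criterion itself requires.

```python
from urllib.parse import quote, unquote


def encode_fragment(text: str) -> str:
    """Escape a fragment identifier using UTF-8 and %HH (hypothetical helper)."""
    # safe="" escapes every reserved character as well as all non-ASCII ones.
    return quote(text, safe="", encoding="utf-8")


def decode_fragment(escaped: str) -> str:
    """Reverse encode_fragment, rejecting byte sequences that are not valid UTF-8."""
    return unquote(escaped, encoding="utf-8", errors="strict")


escaped = encode_fragment("\u65e5\u672c\u8a9e")
print(escaped)  # %E6%97%A5%E6%9C%AC%E8%AA%9E
restored = decode_fragment(escaped)

# %E9 is a lone ISO-8859-1 byte, not valid UTF-8, so strict decoding fails.
try:
    decode_fragment("%E9")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

Fixing the encoding to UTF-8 means the escaped form can be decoded without any out-of-band information about the character encoding.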

C061 [S] Specifications that define new syntax for URIs SHOULD also define the normalization requirements for the syntax they introduce.
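The Python sketch below suggests why normalization requirements matter for new URI syntax. Two canonically equivalent spellings of the same text produce different escaped forms unless a normalization step is applied first. The choice of NFC and the helper name `escape_nfc` are assumptions made for illustration; the criterion leaves the actual requirement to the specification defining the syntax.

```python
import unicodedata
from urllib.parse import quote

composed = "\u00e9"      # precomposed e with acute
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Without normalization, equivalent text escapes differently.
raw_composed = quote(composed, safe="")
raw_decomposed = quote(decomposed, safe="")
print(raw_composed)    # %C3%A9
print(raw_decomposed)  # e%CC%81


def escape_nfc(text: str) -> str:
    """Normalize to NFC, then escape as UTF-8 and %HH (hypothetical helper)."""
    return quote(unicodedata.normalize("NFC", text), safe="")


# After normalization both spellings yield the same identifier text.
print(escape_nfc(composed) == escape_nfc(decomposed))  # True
```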

A. References

A.1 Normative References

I-D IRI
Martin Dürst, Michel Suignard, Internationalized Resource Identifiers (IRIs), Internet-Draft, September 2004. (See https://www.w3.org/International/iri-edit/draft-duerst-iri-10.txt.) [NOTE: This reference will be updated once the IRI draft is available as an RFC.]
ISO/IEC 10646
ISO/IEC 10646:2003, Information technology -- Universal Multiple-Octet Coded Character Set (UCS), as, from time to time, amended, replaced by a new edition or expanded by the addition of new parts. (See https://www.iso.org/home.html for the latest version.)
RFC 2119
S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, IETF RFC 2119. (See https://www.ietf.org/rfc/rfc2119.txt.)
RFC 2396
T. Berners-Lee, R. Fielding, L. Masinter, Uniform Resource Identifiers (URI): Generic Syntax, IETF RFC 2396, August 1998. (See https://www.ietf.org/rfc/rfc2396.txt.) [NOTE: This reference will be updated once the successor to this document, draft-fielding-uri-rfc2396bis-07.txt, is available as an RFC.]
Unicode
The Unicode Consortium, The Unicode Standard, Version 4, ISBN 0-321-18578-1, as updated from time to time by the publication of new versions. (See http://www.unicode.org/standard/versions for the latest version and additional information on versions of the standard and of the Unicode Character Database).

A.2 Other References

CharMod
Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, Tex Texin, Character Model for the World Wide Web 1.0: Fundamentals, W3C Proposed Recommendation 22 November 2004. (See https://www.w3.org/TR/charmod/.)
CharNorm
Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, Tex Texin, Addison Phillips, Character Model for the World Wide Web 1.0: Normalization, W3C Working Draft 25 February 2004. (See https://www.w3.org/TR/charmod-norm/.)
Info URI-I18N
Internationalization: URIs and other identifiers. (See https://www.w3.org/International/O-URL-and-ident.)
RFC 2718
L. Masinter, H. Alvestrand, D. Zigmond, R. Petke, Guidelines for new URL Schemes, IETF RFC 2718, November 1999. (See https://www.ietf.org/rfc/rfc2718.txt.)
XLink
Steve DeRose, Eve Maler, David Orchard, Eds., XML Linking Language (XLink) Version 1.0, W3C Recommendation 27 June 2001. (See https://www.w3.org/TR/xlink.)
XML 1.0
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, François Yergeau, Eds., Extensible Markup Language (XML) 1.0, W3C Recommendation first published 10 February 1998, revised 4 February 2004. (See https://www.w3.org/TR/REC-xml/.)
XML Schema-2
Paul V. Biron, Ashok Malhotra, Eds., XML Schema Part 2: Datatypes Second Edition, W3C Recommendation first published 2 May 2001, revised 28 October 2004. (See https://www.w3.org/TR/xmlschema-2.)

B. List of conformance criteria (Non-Normative)

Below is a list of the conformance criteria in this specification, in document order. This list can be used to check specifications for conformance to this specification.

When doing so, the following points should be kept in mind:

C058 [S] Specifications that define protocol or format elements (e.g. HTTP headers, XML attributes, etc.) which are to be interpreted as URI references (or specific subsets of URI references, such as absolute URI references, URIs, etc.) SHOULD use Internationalized Resource Identifiers (IRIs) [I-D IRI] (or an appropriate subset thereof).
C059 [S] Specifications MUST define when the conversion from IRI references to URI references (or subsets thereof) takes place, in accordance with Internationalized Resource Identifiers (IRIs) [I-D IRI].
C060 [S] Specifications that define new syntax for URIs, such as a new URI scheme or a new kind of fragment identifier, MUST specify that characters outside the US-ASCII repertoire are encoded using UTF-8 and %HH-escaping.
C061 [S] Specifications that define new syntax for URIs SHOULD also define the normalization requirements for the syntax they introduce.

C. Changes (Non-Normative)

This document is based on Section 7 of the Character Model for the World Wide Web 1.0: Fundamentals Last Call Working Draft published 25 February 2004. Changes between Section 7 of that document and Section 3 of the present document are as follows:

In addition, the remaining parts of this document have changed compared to the corresponding parts of the above-mentioned Last Call Working Draft, as follows: The Introduction has been shortened to concentrate on the material in this document. The Conformance section has been reduced to take into account that this document only contains conformance criteria for specifications. The References section has been shortened by removing unrelated references. Where necessary, references have been updated. The reference to [I-D IRI] was moved to the Normative References subsection. A List of conformance criteria and this section on Changes have been added.

D. Acknowledgements (Non-Normative)

Tim Berners-Lee and James Clark provided important details. Asmus Freytag, Addison Phillips, and in early stages Ian Jacobs, provided significant help in the authoring and editing process. The W3C I18N WG and IG, as well as many others, provided many helpful comments and suggestions.