Implementation Report for Character Model for the World Wide Web 1.0: Fundamentals

number text implementations
C001 [S] [I] [C] Specifications, software and content MUST NOT require or depend on a one-to-one correspondence between characters and the sounds of a language. SSML
C002 [S] [I] [C] Specifications, software and content MUST NOT require or depend on a one-to-one mapping between characters and units of displayed text. CSS, XSL-FO, SVG
C003 [S] [I] [C] Protocols, data formats and APIs MUST store, interchange or process text data in logical order. everything that uses Unicode, very widely implemented
C075 [I] Independent of whether some implementation uses logical selection or visual selection, characters selected MUST be kept in logical order in storage. many implementations, in particular editors; SVG
C004 [S] Specifications of protocols and APIs that involve selection of ranges SHOULD provide for discontiguous logical selections, at least to the extent necessary to support implementation of visual selection on screen on top of those protocols and APIs. XPointer
C005 [S] [I] Specifications and software MUST NOT require nor depend on a single keystroke resulting in a single character, nor that a single character be input with a single keystroke (even with modifiers), nor that keyboards are the same all over the world. DOM Events,...
C006 [S] [I] Software that sorts or searches text for users SHOULD do so on the basis of appropriate collation units and ordering rules for the relevant language and/or application. XQuery, various OSes
C007 [S] [I] Where searching or sorting is done dynamically, particularly in a multilingual environment, the 'relevant language' SHOULD be determined to be that of the current user, and may thus differ from user to user. XQuery, various OSes
C066 [S] [I] Software that allows users to sort or search text SHOULD allow the user to select alternative rules for collation units and ordering. XQuery, various OSes
C008 [S] [I] Specifications and implementations of sorting and searching algorithms SHOULD accommodate text that contains any character in Unicode. several implementations of UCA and others (C008 mainly warns about some old problem)
C009 [S] [I] Specifications, software and content MUST NOT require or depend on a one-to-one relationship between characters and units of physical storage. XSLT, XQuery,...
C010 [S] When specifications use the term 'character' the specifications MUST define which meaning they intend. XML
C067 [S] Specifications SHOULD use specific terms, when available, instead of the general term 'character'. various specs
C013 [S] [C] Textual data objects defined by protocol or format specifications MUST be in a single character encoding. HTML, XML, CSS,...
C014 [S] All specifications that involve processing of text MUST specify the processing of text according to the Reference Processing Model, namely:
  1. Specifications MUST define text in terms of Unicode characters, not bytes or glyphs.
  2. For their textual data objects specifications MAY allow use of any character encoding which can be transcoded to a Unicode encoding form.
  3. Specifications MAY choose to disallow or deprecate some character encodings and to make others mandatory. Independent of the actual character encoding, the specified behavior MUST be the same as if the processing happened as follows:
    • The character encoding of any textual data object received by the application implementing the specification MUST be determined and the data object MUST be interpreted as a sequence of Unicode characters - this MUST be equivalent to transcoding the data object to some Unicode encoding form, adjusting any character encoding label if necessary, and receiving it in that Unicode encoding form.
    • All processing MUST take place on this sequence of Unicode characters.
    • If text is output by the application, the sequence of Unicode characters MUST be encoded using a character encoding chosen among those allowed by the specification.

  4. If a specification is such that multiple textual data objects are involved (such as an XML document referring to external parsed entities), it MAY choose to allow these data objects to be in different character encodings. In all cases, the Reference Processing Model MUST be applied to all textual data objects.
HTML, CSS, XML, XSLT, XQuery,...
C070 [S] Specifications SHOULD NOT arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive. HTML, XML, CSS
C077 [S] Specifications MUST NOT allow code points above U+10FFFF. HTML, XML, CSS
C079 [S] Specifications SHOULD NOT allow the use of codepoints reserved by Unicode for internal use. discouraged by XML 1.1
C078 [S] Specifications MUST NOT allow the use of surrogate code points. HTML, XML, CSS
C015 [S] Specifications MUST either specify a unique character encoding, or provide character encoding identification mechanisms such that the encoding of text can be reliably identified. HTML, XML, CSS,...
C016 [S] When designing a new protocol, format or API, specifications SHOULD require a unique character encoding. DOM, IRI->URI conversion, some IETF protocols
C017 [S] When basing a protocol, format, or API on a protocol, format, or API that already has rules for character encoding, specifications SHOULD use rather than change these rules. HTML->MIME, XML->MIME, RFC3023-based media types
C018 [S] When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32. DOM, IRIs, some IETF protocols
C020 [S] Specifications SHOULD avoid using the terms 'character set' and 'charset' to refer to a character encoding, except when the latter is used to refer to the MIME charset parameter or its IANA-registered values. The term 'character encoding', or in specific cases the terms 'character encoding form' or 'character encoding scheme', are RECOMMENDED. lots of specs
C021 [S] If the unique encoding approach is not taken, specifications SHOULD require the use of the IANA charset registry names, and in particular the names identified in the registry as 'MIME preferred names', to designate character encodings in protocols, data formats and APIs. recommended by XML
C022 [S] [I] [C] Character encodings that are not in the IANA registry SHOULD NOT be used, except by private agreement. XML
C023 [S] [I] [C] If an unregistered character encoding is used, the convention of using 'x-' at the beginning of the name MUST be followed. XML
C049 [I] [C] The character encoding of content SHOULD be chosen so that it maximizes the opportunity to directly represent characters (ie. minimizes the need to represent characters by markup means such as character escapes) while avoiding obscure encodings that are unlikely to be understood by recipients. wide practice on the Web
C034 [C] If facilities are offered for identifying character encoding, content MUST make use of them; where the facilities offered for character encoding identification include defaults (e.g. in XML 1.0 [XML 1.0]), relying on such defaults is sufficient to satisfy this identification requirement. wide (but not yet wide enough) practice on the Web
C024 [I] [C] Content and software that label text data MUST use one of the names required by the appropriate specification (e.g. the XML specification when editing XML text) and SHOULD use the MIME preferred name of a character encoding to label data in that character encoding. wide practice
C025 [I] [C] An IANA-registered charset name MUST NOT be used to label text data in a character encoding other than the one identified in the IANA registration of that name. wide practice
C026 [S] If the unique encoding approach is not chosen, specifications MUST designate at least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible character encodings and SHOULD choose at least one of UTF-8 or UTF-16 as required encoding forms (encoding forms that MUST be supported by implementations of the specification). XML
C027 [S] Specifications that require a default encoding MUST define either UTF-8 or UTF-16 as the default, or both if they define suitable means of distinguishing them. XML
C028 [S] Specifications MUST NOT propose the use of heuristics to determine the encoding of data. none known
C029 [I] Receiving software MUST determine the encoding of data from available information according to appropriate specifications. widely implemented (although it could be better)
C030 [I] When an IANA-registered charset name is recognized, receiving software MUST interpret the received data according to the encoding associated with the name in the IANA registry. widely implemented
C031 [I] When no charset is provided receiving software MUST adhere to the default character encoding(s) specified in the specification. widely implemented
C035 [S] Specifications MUST define conflict-resolution mechanisms (e.g. priorities) for cases where there is multiple or conflicting information about character encoding. HTML, XML
C033 [I] Software MUST completely implement the mechanisms for character encoding identification and conflict resolution. browsers, XML parsers
C073 [C] Publicly interchanged content SHOULD NOT use codepoints in the private use area. most Web pages
C076 [C] Content MUST NOT use a code point for any purpose other than that defined by its character encoding. most Web pages
C038 [S] Specifications MUST NOT require the use of private use area characters with particular assignments. most specs (bad exception that we are trying to avoid repeating: MathML 1.0)
C039 [S] Specifications MUST NOT require the use of mechanisms for defining agreements of private use code points. all known specs
C040 [S] [I] Specifications and implementations SHOULD NOT disallow the use of private use code points by private agreement. HTML, XML
C041 [S] Specifications MAY define markup to allow the transmission of symbols not in Unicode or to identify specific variants of Unicode characters. SVG, MathML
C068 [S] Specifications SHOULD allow the inclusion of or reference to pictures and graphics where appropriate, to eliminate the need to (mis)use character-oriented mechanisms for pictures or graphics. HTML, SVG
C042 [S] Specifications SHOULD NOT invent a new escaping mechanism if an appropriate one already exists. XHTML, SVG, SMIL,...
C043 [S] The number of different ways to escape a character SHOULD be minimized (ideally to one). CSS
C044 [S] Escape syntax SHOULD require either explicit end delimiters or a fixed number of characters in each character escape. Escape syntaxes where the end is determined by any character outside the set of characters admissible in the character escape itself SHOULD be avoided. HTML, XML
C045 [S] Whenever specifications define character escapes that allow the representation of characters using a number, the number MUST represent the Unicode code point of the character and SHOULD be in hexadecimal notation. HTML, XML, CSS
C046 [S] Escaped characters SHOULD be acceptable wherever their unescaped forms are; this does not preclude that syntax-significant characters, when escaped, lose their significance in the syntax. In particular, if a character is acceptable in identifiers and comments, then its escaped form should also be acceptable. CSS, would have been ideal for XML
C047 [I] [C] Escapes SHOULD only be used when the characters to be expressed are not directly representable in the format or the character encoding of the document, or when the visual representation of the character is unclear. most content on the Web
C048 [I] [C] Content SHOULD use the hexadecimal form of character escapes rather than the decimal form when there are both. several implementations, lots of content
C050 [S] Specifications SHOULD exclude compatibility characters in the syntactic elements (markup, delimiters, identifiers) of the formats they define. XML 1.0
C011 [S] Specifications SHOULD NOT define a string as a 'byte string'. all W3C specs
C012 [S] The 'character string' definition SHOULD be used by most specifications. HTML, XML, XSLT,...
C051 [S] [I] The character string is RECOMMENDED as a basis for string indexing. XSLT, XQuery
C052 [S] [I] A code unit string MAY be used as a basis for string indexing if this results in a significant improvement in the efficiency of internal operations when compared to the use of character string. DOM
C071 [S] [I] Grapheme clusters MAY be used as a basis for string indexing in applications where user interaction is the primary concern. not yet widely implemented
C074 [S] Specifications that define indexing in terms of grapheme clusters MUST either: a) define grapheme clusters in terms of default grapheme clusters as defined in Unicode Standard Annex #29, Text Boundaries [UTR #29], or b) define specifically how tailoring is applied to the indexing operation. not yet widely implemented
C072 [S] [I] The use of byte strings for indexing is NOT RECOMMENDED. all W3C specs
C053 [S] Specifications that need a way to identify substrings or point within a string SHOULD provide ways other than string indexing to perform this operation. regular expressions,...
C054 [I] [C] Users of specifications (software developers, content developers) SHOULD whenever possible prefer ways other than string indexing to identify substrings or point within a string. XSLT?
C055 [S] Specifications SHOULD understand and process single characters as substrings, and treat indices as boundary positions between counting units, regardless of the choice of counting units. XSLT/XQuery (for first part)
C056 [S] Specifications of APIs SHOULD NOT specify single characters or single 'units of encoding' as argument or return types. DOM
C057 [S] When the positions between the units are counted for string indexing, starting with an index of 0 for the position at the start of the string is the RECOMMENDED solution, with the last index then being equal to the number of counting units in the string. many examples in programming languages, unfortunately not XSLT
C062 [S] Since specifications in general need both a definition for their characters and the semantics associated with these characters, specifications SHOULD include a reference to the Unicode Standard, whether or not they include a reference to ISO/IEC 10646. many specs
C063 [S] A generic reference to the Unicode Standard MUST be made if it is desired that characters allocated after a specification is published are usable with that specification. A specific reference to the Unicode Standard MAY be included to ensure that functionality depending on a particular version is available and will not change over time. XML 1.1
C064 [S] All generic references to the Unicode Standard [Unicode] MUST refer to the latest version of the Unicode Standard available at the date of publication of the containing specification. XML 1.1
C065 [S] All generic references to ISO/IEC 10646 [ISO/IEC 10646] MUST refer to the latest version of ISO/IEC 10646 available at the date of publication of the containing specification. XML 1.1
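
Illustration for C009, C051, C052, C055, C057 and C072: a minimal Python sketch of how the same string yields different lengths depending on the counting unit chosen for string indexing. The function name counting_units and the sample string are assumptions made for this illustration; they are not taken from any implementation listed above. Grapheme clusters (C071, C074) are left out because the Python standard library does not provide UAX #29 segmentation.

```python
def counting_units(text):
    """Count one string in three of the units discussed in C051, C052 and C072.

    Hypothetical helper for illustration only.
    """
    return {
        # Character string (C051): Python str indexes by Unicode code point.
        "characters": len(text),
        # Code unit string (C052): UTF-16 code units, 2 bytes each.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Byte string (C072, NOT RECOMMENDED): bytes of the UTF-8 form.
        "utf8_bytes": len(text.encode("utf-8")),
    }


# Assumed sample: 'e', COMBINING ACUTE ACCENT (U+0301),
# MUSICAL SYMBOL G CLEF (U+1D11E, outside the Basic Multilingual Plane).
sample = "e\u0301\U0001D11E"
counts = counting_units(sample)

# C055/C057: indices are boundary positions starting at 0, the last index
# equals the number of counting units, and a single character is obtained
# as a substring between two adjacent boundaries.
last_boundary = len(sample)
clef = sample[2:3]
```

With this sample the three counts differ, so an index that is valid in one counting unit cannot be reused in another without conversion.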