
Public Last Call #2 Comments
Character Model for the World Wide Web 1.0
W3C Working Draft 30 April 2002

Only comments that are publicly visible are listed below.
Such comments sometimes contain links to other items that are not publicly visible.

Useful links

Character Model: Last Call #2 WD | Last Call #1 WD

Related documents: Comment submission form | Public Last Call #1 Comments

Related mail archives: www-i18n-comments

Last Call Comments

# | I | D | S (see key) | From | For | Ref | Description
C018 | Na | Na | N | Dan Connolly | - | 4.4 | Example unclear
  • Comment (received 2002-04-30) -- split/concat example looks good, I think (LCI-191)

    I'm not quite sure I understand it... maybe it's just my browser's poor rendering... but that string, 'cz', just looks like a normal ASCII string. I don't see how the accent thingy pops up when you delete the 'z'.

  • Decision: Not applicable.

    Rationale: We have classified this as 'not applicable' because you have not asked for any actual changes to the document. Your problem with understanding the cz example may have been due to rendering issues, but we doubt that, because the (non-combining) cedilla that we used in the draft to represent the combining cedilla is part of iso-8859-1, which has been rendered well since the very first browsers. We suspect that you may have overlooked the cedilla, a little low hook after the z. We will try to take your suggestion for a test into consideration for our next phase.

  • Our response (sent 2003-12-11) -- Notification

C023 | E | A | S | Jeremy Carroll | - | 4.4 | 'normalization-sensitive' unclear
  • Comment (received 2002-05-14) -- normalization-sensitive unclear

    I found the definition of normalization-sensitive in section 4.4 unclear.

  • Decision: Accepted.

  • Decision: Add examples (i) of operations which are not normalization-sensitive, and (ii) illustrating what we mean by inputs and outputs.

  • Our response (sent 2003-05-01) -- Notification

C024 | Na | Na | S | Jeremy Carroll | - | 3.2 | Is UTF-7 a unicode encoding form?
  • Comment (received 2002-05-14) -- UTF-7

    Is UTF-7 a unicode encoding form? (I am pretty ignorant about UTF-7 but I believe it exists and is a UCS).

  • Decision: Not applicable.

    We have classified this comment as 'not applicable', because it is only a question.

    Our answer is yes and no. UTF-7 can be considered a Unicode encoding form, or not. It is a Unicode encoding form to the extent that it encodes a sequence of Unicode characters. However, it does not map a character to an identifiable sequence of bytes, and has a number of other rather undesirable properties. It was designed for use in very special cases such as email, but has largely been replaced by UTF-8, and is no longer recommended for use, to the extent that we decided that the most adequate way to handle it in the Character Model was to completely ignore it.
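
    The point that UTF-7 does not map a character to one identifiable byte sequence can be seen in a minimal sketch, assuming Python's built-in "utf-7" codec. The variable names are illustrative only; the byte strings are ordinary UTF-7 spellings of the letter 'A'.

```python
# Sketch: two different UTF-7 byte sequences that decode to the same character.
# Uses only Python's built-in "utf-7" codec; all names here are illustrative.

direct = b"A"        # 'A' written directly as an ASCII byte
shifted = b"+AEE-"   # 'A' written as modified base64 of its UTF-16 code unit 0x0041

decoded_direct = direct.decode("utf-7")
decoded_shifted = shifted.decode("utf-7")

# Both spellings yield the same character, so the bytes alone do not
# identify "the" encoding of a given character.
same_character = decoded_direct == decoded_shifted

# A non-ASCII character survives a round trip, but it is carried inside a
# base64 run rather than as a fixed per-character byte sequence.
round_trip = "\u00e9".encode("utf-7").decode("utf-7")
```

    By contrast, UTF-8 assigns each character exactly one byte sequence, which is one reason it displaced UTF-7.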

C025 | T | A | N | Olu Ibidunni | - | 3.1.5 | In 4th example, should 'o' be 'ö'?
  • See also the following comments: C020 C154

  • Comment (received 2002-05-14) -- Typo

    The last sentence of the fourth example in section 3.1.5 has the character 'o'; should it be 'ö' instead?

  • Decision: Accepted.

  • Our response (sent 2003-02-17) -- Notification

  • Notification sent but bounced. See bounced note.

C026 | E | A | N | Ian Jacobs | - | 3.1.3 | mapping between character codes and units of displayed text
  • See also the following comments: C096

  • Comment (received 2002-05-24) -- mapping between character codes and units of displayed text

    In '[S] [I] Specifications and software MUST NOT assume a one-to-one mapping between character codes and units of displayed text.', does 'character codes' mean characters or character codes or both? Please clarify.

  • Decision: Accepted.

  • Decision: Change 'character codes' to 'characters'.

  • Our response (sent 2003-02-17) -- Notification
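
    Why a one-to-one mapping between characters and units of displayed text cannot be assumed is easy to show with a short sketch. It assumes Python's standard unicodedata module, and the helper approx_display_units is a hypothetical, deliberately rough stand-in for real text segmentation.

```python
import unicodedata

# One displayed unit ("é") written as two characters.
decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single character U+00E9

count_decomposed = len(decomposed)  # number of characters in the decomposed form
count_composed = len(composed)      # number of characters in the composed form


def approx_display_units(text):
    """Rough approximation only: count characters that are not combining marks.

    Real segmentation into user-perceived units is more involved; this helper
    is hypothetical and exists only to make the mismatch visible.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)
```

    Both spellings display as a single unit, yet their character counts differ, so a count of characters says little about what the user sees.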

C027 | Na | Na | O | Joseph Reagle | XML Sig WG | Various | XML Sig WG comments
C028 | Na | Na | N | Jeremy Carroll | RDF Core WG | Various | Endorsement from RDF Core
  • Comment (received 2002-05-27) -- Endorsement from RDF Core

    For the sections 3.4, 4, 6, 9, C, D RDF Core endorses the last call working draft. We have found earlier drafts helpful in identifying how best to meet our responsibilities to RDF users world wide. (However, we do not intend to address all the requirements of these sections in the version of the RDF recommendations currently in working draft).

  • Decision: Not applicable.

  • Rationale: We thank you for your endorsement. We have classified this comment as 'not applicable' because it does not suggest or imply any changes. We would like to note that the Character Model is written so as to make clear that specifications do not have to follow all the requirements, just those that apply in their specific case.

  • Our response (sent 2003-02-13) -- Notification

C029 | Na | Na | N | Jeremy Carroll | RDF Core WG | 2 | breadth of scope
  • Comment (received 2002-05-27) -- breadth of scope

    Concerning sections 1 and 2 RDF Core is concerned that the scope of charmod is overly broad. In particular, there appears to be no acknowledgement that some languages being defined by W3C working groups may not be intended as web languages and hence not have a need to address internationalization issues. There may be an implicit (and false) assumption that all W3C recommendations specify (only) web languages with processing models.

  • Our response (sent 2002-05-27) -- Re: breadth of scope

  • Comment (received 2002-05-28) -- RE: breadth of scope

  • Decision: Not applicable.

  • Rationale: We have classified this comment as 'not applicable', because it is too general. Each CharMod requirement applies only where applicable. For example, if a specification doesn't deal with sorting, then requirements related to sorting do not apply. Also, specifications that don't deal with text (e.g. a bitmap format) would therefore not have any applicable requirements (except e.g. for textual comments and other metainformation embedded in the format). We would also like to point out that the term 'processing model' is taken very widely here. Even if a specification does not have an explicitly defined processing model, it implicitly defines how to process (e.g. match) characters. As an example, RDF conforms to the processing model, on the level of the abstract syntax by virtue of the fact that the abstract syntax is expressed in Unicode, and on the level of RDF/XML by virtue of being based on XML.

  • Our response (sent 2003-02-13) -- Notification

C030 | E | N | N | Jeremy Carroll | RDF Core WG | 3.5 | non-universality of processing model
  • Comment (received 2002-05-27) -- non-universality of processing model

    For the section 3.5 RDF Core WG notes that the language is somewhat offputting for us as specification developers given that our specification explicitly does not have a processing model. We have no particular suggestions about this, nor would we object if the I18N WG chose not to address this issue.

  • Our response (sent 2002-05-27) -- Re: non-universality of processing model

  • Comment (received 2002-05-28) -- RE: non-universality of processing model

  • Decision: Noted.

    Rationale: We have classified this comment as 'Noted', because it did not contain any suggestions for changes.

    However, in order to address the misunderstanding that we think this comment exposes, we have added some text (just before C014):

    "Also, while this document uses the term Reference <emph>Processing</emph> Model and describes its properties in terms of processing, the model also applies to specifications that do not explicitly define a processing model."

    We hope that this clarifies the situation for RDF: Even if there is no processing model for RDF, on the level of text processing, RDF conforms to the Charmod Reference Processing Model because of the way the abstract syntax is defined in terms of Unicode characters and because of the way XML is used.

  • Our response (sent 2003-02-13) -- Notification

C031 | S | P | N | Jeremy Carroll | RDF Core WG | 8 | no dependency on IRI draft
  • See also the following comments: C059 C170

  • Comment (received 2002-05-27) -- no dependency on IRI draft

    The main concern of the RDF Core WG is section 8. Any normative section of charmod MUST NOT depend on the IETF IRI draft, which is not finished and is not yet stable. We draw attention to 'SHOULD use Internationalized Resource Identifiers (IRI) [I-D IRI]'. The IRI draft is only a draft, the reference to it is not normative, and the strength of this SHOULD dependency appears excessive ('not optional'). In particular, the IRI draft does not adequately address IRI equality (not merely functional equivalence in retrieval). Moreover, the bidi section presents a learning curve which developers are unlikely to want to climb before IRI has consensus around it. We have found the text in XLink section 5.4 and XML Erratum 26 adequately clear for some of the IRI questions, particularly those that are most pressing for RDF, and believe that charmod should merely:

    - reiterate such text;

    - reiterate the early uniform normalization model for IRIs when regarded as Unicode strings

  • Decision: Partially accepted.

    Rationale: Our plan is that the IRI ID, referenced in this section, will have been submitted for Proposed Standard by the time CharMod moves to the next stage. IRI equality is fully addressed in the latest IRI ID version.

  • Our response (sent 2003-02-13) -- Notification
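
    The interaction between normalization and identifier equality raised in this comment can be sketched as follows. This is an illustration only, assuming Python's standard unicodedata and urllib.parse modules and the general approach of escaping non-ASCII characters as the %HH form of their UTF-8 bytes; the function name escape_segment is hypothetical and does not come from the IRI draft, XLink, or CharMod.

```python
import unicodedata
from urllib.parse import quote


def escape_segment(segment):
    """Normalize a single path segment to NFC, then escape it as %HH of UTF-8.

    Hypothetical helper for illustration; it handles one segment only and
    escapes every character outside the unreserved ASCII set.
    """
    nfc = unicodedata.normalize("NFC", segment)
    return quote(nfc, safe="", encoding="utf-8")


# The same word typed with precomposed and with combining accents.
from_composed = escape_segment("r\u00e9sum\u00e9")
from_decomposed = escape_segment("re\u0301sume\u0301")

# Without the normalization step, the decomposed spelling escapes differently.
unnormalized = quote("e\u0301", safe="", encoding="utf-8")
```

    With normalization applied first, both spellings produce the same escaped string, so a plain string comparison is enough; without it, two identifiers that look identical compare as different.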

C032 | Na | Na | O | Jeremy Carroll | RDF Core WG | Various | Overview of RDF Core feedback
  • Comment (received 2002-05-27) -- Overview of RDF Core feedback

    The RDF Core WG has made feedback concerning the following sections of charmod:

    > 1. Introduction

    > 2. Conformance

    > 3.4 Strings

    > 3.5 Reference Processing Model

    > 4. Early Uniform Normalization

    > 6. String Identity Matching

    > 8. Character Encoding in URI References

    > 9. Referencing the Unicode Standard

    > A.2 Other References

    > C. Composing Characters

    > D. Resources for Normalization

    [...]

    RDF Core makes no comments on the other sections.

  • This comment lists the sections that have been commented on by the RDF Core WG. Please see the specific comments listed below.

  • This comment has been split into the following comments: C028 C029 C030 C031

C033 | E | P | N | Ian Jacobs | - | 3.1.6 | Use of word 'byte'
  • Comment (received 2002-05-28) -- Use of word 'byte'

    Proposal: Change 'the word bytes is generally considered to mean 8-bit bytes' to 'the word bytes is generally considered to mean 8 bits.'

  • Decision: Improve the sentence.

  • Decision: Partially accepted.

  • Rationale for 'Partially accepted': We think we can improve on the suggested wording.

  • Our response (sent 2003-02-17) -- Notification

C034 | S | A | S | Joseph Reagle | XML Sig WG | 3.6.3 | Private Use Code Points: Disagreement with our approach
C035 | S | A | N | Joseph Reagle | XML Sig WG | Various | 'All W3C specs must conform.'
  • Comment (received 2002-05-24) -- Re: 2nd Last Call for the Character Model for the WWW

    Please do not state 'All W3C specs must conform.' I think you should:

    a. state in the STATUS that the intent is that this will be used for W3C specifications.

    b. state, 'any spec wishing to conform must ...' and how, when, and what unforeseen exceptions might be permitted becomes a matter of W3C policy -- perhaps following Ian's suggestion.

  • Decision: Accepted

  • Rationale: We originally rejected this comment. Later, after extensive discussions, we were instructed by W3C Management that it is inappropriate for a W3C spec to directly enforce requirements on other specifications, and we have removed the relevant language. We have also been instructed to request a finding from the TAG corresponding to the text that we removed.

C036 | E | A | N | Joseph Reagle | XML Sig WG | 3.1.3 | Define 'logical order'
C037 | E | A | N | Joseph Reagle | XML Sig WG | 4.1.1 | Character 'í' hard to distinguish from 'i', particularly when italicized
C038 | E | A | N | Jim Melton | XML Query WG | 2 | Conformance of new vs. old specs
  • See also the following comments: C051 C088 C089 C135

  • Comment (received 2002-05-31) -- Conformance of new vs. old specs

    Section 2, 'Conformance', contains the following statements:

    [S] Every W3C specification MUST:

    1. conform to the requirements applicable to specifications,

    2. specify that implementations MUST conform to the requirements applicable to software, and

    3. specify that content created according to that specification MUST conform to the requirements applicable to content.

    [S] If an existing W3C specification does not conform to the requirements in this document, then the next version of that specification SHOULD be modified in order to conform.

    It seems strange that 'Every...specification MUST...conform to the requirements', but that existing specifications that do not conform 'SHOULD be modified'. While we assume that the intent is to require conformance by *new* specifications without mandating updates to existing specifications solely for conformance reasons, the wording is certainly surprising and could be made clearer.

  • Decision: Accepted

    You point out a clear inconsistency, which we fixed a while ago. We were later told that it is inappropriate for a W3C spec to directly enforce requirements on other specifications, and have removed the relevant language altogether. We have been instructed to request a finding from the TAG corresponding to the text that we removed. We will make sure that, if relevant, the inconsistency you pointed out will not reappear.

C039 | E | A | N | Jim Melton | XML Query WG | 3.1.5 | Determining relevant language for sorting
  • Comment (received 2002-05-31) -- determining relevant language for sorting

    Section 3.1.5, 'Units of collation', contains the statement 'Note that, where searching or sorting is done dynamically, particularly in a multilingual environment, the 'relevant language' should be determined to be that of the current user, and may thus differ from user to user.'

    While we agree that the user's language is frequently the most reasonable choice to be used in determining a collation to be used for various operations, it is most certainly not *always* the best choice. This is particularly true when massive amounts of data have been placed into a repository of some sort (e.g., a database) using the semantics of the data itself. For example, database systems often enhance retrieval performance through the use of special structures ('indexes') that are created long in advance of knowing what user might be retrieving the data. In such cases, it might be determined that the 'best' default language is the language of the data, or of the repository, or of some other entity.

    To ensure that the statement in Section 3.1.5 is interpreted to allow this situation, the phrase 'should be determined' should ;^) be replaced with 'SHOULD be determined'. By formalizing the term ('SHOULD'), the Character Model document properly recognizes that some applications require the ability to have different defaults.

  • Decision: Accepted

  • Our response (sent 2003-05-01) -- Notification
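
    The dependence of sort order on the 'relevant language' can be made concrete with a small sketch. The two key functions below are hand-built and hypothetical, covering only a handful of letters; they assume the well-known conventions that Swedish places 'å', 'ä', 'ö' after 'z' while German dictionary order sorts 'ö' together with 'o'. Real systems would use proper collation tailorings rather than anything like this.

```python
# Hand-built sort keys for illustration only; not real collation tables.

SWEDISH_LIKE_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
GERMAN_LIKE_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})


def swedish_like_key(word):
    # 'å', 'ä', 'ö' are separate letters sorting after 'z'.
    # Raises ValueError for letters outside this toy alphabet.
    return [SWEDISH_LIKE_ALPHABET.index(ch) for ch in word.lower()]


def german_like_key(word):
    # 'ö' sorts with 'o'; the original spelling breaks ties.
    return (word.lower().translate(GERMAN_LIKE_FOLD), word)


words = ["zebra", "öl", "orange"]
swedish_order = sorted(words, key=swedish_like_key)
german_order = sorted(words, key=german_like_key)
```

    The same three words come out in different orders under the two keys. An index built in advance has to commit to one such order, which is the situation the comment describes for repositories that choose the language of the data rather than that of the current user.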

C040 | E | A | N | Jim Melton | XML Query WG | 3.1.7 | How to avoid use of the term 'character'?
  • See also the following comments: C004 C138 C166

  • Comment (received 2002-05-31) -- How to avoid use of the term 'character'?

    Section 3.1.7, 'Summary', contains a paragraph that reads: '[S] When specifications use the term 'character' it MUST be clear which of the possible meanings they intend. [S] Specifications SHOULD avoid the use of the term 'character' if a more specific term is available.'

    This paragraph would be considerably more useful if it either contained a list of the possible meanings or contained a pointer to another location in the document that provides such a list.

  • Decision: Accepted

  • Decision: We'll add clarification.

  • Our response (sent 2003-05-01) -- Notification

C041 | S | A | N | Jim Melton | XML Query WG | 3.2 | Proprietary charset identifiers
  • See also the following comments: C139

  • Comment (received 2002-05-31) -- proprietary charset identifiers

    Section 3.2, 'Digital Encoding of Characters', list element 4, contains the phrase '... is identified by an IANA charset identifier.'

    In fact, there are a great many CESes that are identified by charset identifiers that are not assigned by IANA at all, but that are 'created' by proprietary means (e.g., corporations). The Character Model specification must not prohibit the use of CESes identified by charset identifiers assigned through other means.

    To correct this, simply change '...is identified by an IANA charset identifier.' to '...is identified by a unique identifier, such as an IANA charset identifier.'

  • Decision: Rejected.

  • See our response (below) for our rationale.

  • Our response (sent 2002-06-05) -- Re: proprietary charset identifiers

    [...] Please tell us, at your earliest convenience, whether you are satisfied with our decision or not. If not, please provide additional rationale.

  • Comment (received 2002-06-13) -- Re: proprietary charset identifiers

    [...] I assure you that I am not satisfied with the decision [...]

  • Decision: Accepted.

  • Note: We've made the requested change and will ask the XML Query WG whether they have further concerns about section 3.6.2:

    [S] If the unique encoding approach is not taken, specifications SHOULD mandate the use of the IANA charset registry names [...]

    If they do have such concerns, they need to raise a separate comment.

    After some discussion (see the last message on this from Jim Melton) we have decided to accept the comment as it was made. We have changed "... is identified by an IANA charset identifier." to "... is identified by a unique identifier, such as an IANA charset identifier."

    However, our exchange suggests that the XML Query WG may also not be okay with some of the wording in Section 3.6.2, which (among other things) says: "[S] If the unique encoding approach is not taken, specifications SHOULD mandate the use of the IANA charset registry names [...]"; if this is the case, please indicate so as soon as possible, or we will have to assume that this is okay with you.

C042 | S | A |  | Jim Melton | XML Query WG | 4.4 | Discussion of subsequent items
  • Comment (received 2002-05-31) -- DISCUSSION OF SUBSEQUENT ITEMS

    Section 4.4, 'Responsibility for Normalization', is the section in which the Query WG is most interested. Intense discussions have taken place in the past over the subject of when normalization should, should not, must, or must not be performed, and what components of an environment have responsibilities in that area. Implementers of data repositories that might contain vast quantities of data (e.g., database systems) have expressed particular concerns about this, observing that some applications involve the need to store data very quickly, but retrieve it in a less urgent fashion, while other applications place severe demands on retrieval but have fewer constraints on storing data. In other words, the demands of users of applications, not a rigid policy, must govern *some* aspects of the decision about when normalization is performed and by whom.

  • Decision: Accepted.

C043 | S | A |  | Jim Melton | XML Query WG | 4.4 | Objection to prohibition against receiver from normalizing text
  • Comment (received 2002-05-29) -- Objection to prohibition against receiver from normalizing text

    Section 4.4, 'Responsibility for Normalization', specifies one requirement (on web content) that states '[C] In order to conform to this specification, all text content on the Web MUST be in include-normalized form and SHOULD be in fully-normalized form.' While this statement expresses a clearly desirable situation, it is 100% guaranteed that the web will *never* contain only include-normalized text. The character model MUST (irony noted) recognize that fact and give guidance for dealing with such data. A preferred alternative, currently prohibited by the Character Model document, is to allow a *consuming* application to do the normalization. [The Character Model] currently prohibits such action, as addressed in the next comment. This prohibition is unreasonable and we believe that its inclusion will dramatically inhibit adoption of the Character Model by products and implementors.

  • Decision: Accepted.

C044 | S | A |  | Jim Melton | XML Query WG | 4.4 | Prohibition against normalizing suspect text
  • Comment (received 2002-05-31) -- Prohibition against normalizing suspect text

    In Section 4.4, 'Responsibility for Normalization', another requirement (on implementations and on specifications) states: '[S] [I] A text-processing component that receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first confirmed through inspection that the text is in normalized form, and MUST NOT normalize the suspect text. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.'

    We do not object to the observation that normalization-sensitive operations are best performed on normalized text. However, the requirement (as stated) clearly prohibits a consuming application from normalizing non-normalized text that it receives. We are quite opposed to this prohibition for a variety of reasons.

    One problem is that the requirement doesn't make it clear what the consuming application's behavior must be, but it seems reasonable to conclude that the consuming application must reject the un-normalized text. Of course, such an application that silently rejects such text is unlikely to be considered user-friendly, so we might guess that an error can be raised in some manner. But that makes the application very unhelpful in general, since users (e.g., of web browsers) often wish to access text regardless of how rigidly it conforms to the Character Model's requirements.

    The 'Private agreements' clause starts off in a promising manner, but then requires that the results of such agreements remain unhelpful to users of the applications, since the application is not allowed to produce 'observable results' based on handling un-normalized text.

    So, what can be done to support applications (and specifications) that must deal with text that cannot always be guaranteed to be normalized? We very much want certain classes of applications to be allowed to do normalization on un-normalized text and we are willing to participate in discussions that identify those classes of applications.

    A rather cynical way out of this dilemma that can be imagined is for an application (e.g., a database management system) to 'read' suspect text and then 'create' brand new normalized text that just happens to be character-for-character identical to the un-normalized text it received. That obviously implies degenerating into games just to get around conformance requirements; instead, we must fix the specifications and requirements themselves.

  • Decision: Accepted.
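
    The two receiver behaviours at issue in this comment can be sketched side by side. The sketch assumes Python 3.8 or later (for unicodedata.is_normalized) and uses plain NFC as a stand-in for CharMod's stricter include-normalized and fully-normalized forms; the function name receive and its flag are hypothetical.

```python
import unicodedata


def receive(text, normalize_if_needed):
    """Handle suspect text before any normalization-sensitive operation.

    Hypothetical helper: inspects the text first, then either normalizes it
    (the consumer-side behaviour the comment asks for) or refuses it (the
    behaviour the quoted requirement seems to imply).
    """
    if unicodedata.is_normalized("NFC", text):
        return text  # confirmed through inspection; safe to use as-is
    if normalize_if_needed:
        return unicodedata.normalize("NFC", text)
    raise ValueError("suspect text is not in NFC")


already_normalized = receive("\u00e9", normalize_if_needed=False)
repaired = receive("e\u0301", normalize_if_needed=True)
```

    With the flag off, the un-normalized input is rejected, which is the unhelpful outcome the comment warns about; with it on, the consumer gets usable text at the cost of changing what it received.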

C045 | S | A |  | Jim Melton | XML Query WG | 4.4 | Prohibition against interim unnormalized states
  • Comment (received 2002-05-31) -- Prohibition against interim unnormalized states

    Section 4.4, 'Responsibility for Normalization', contains a requirement that states: '[I] A text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text.'

    We believe that many implementors, on grounds of performance considerations, disagree with the requirement that normalization take place after each operation. While we recognize that the Note following the quoted requirement suggests a way to ease the performance issue (using what we call 'local normalization'), we believe that a couple of good examples will help ease implementors' concerns. More importantly, we believe that some application requirements are best satisfied by allowing such (normalization-sensitive) operations on text that has not yet been proven to be normalized. A requirement that such operations on such text result in fully-normalized text poses the unreasonable additional burden of doing complete normalization of possibly huge text without a clear application need to do so.

    We urge the document's editors to modify the requirements to recognize the facts that some applications require the ability to use and manipulate un-normalized text and that operations on such text need not necessarily be normalized. In XQuery, we have a normalize function to allow applications to *force* normalization when they require normalized text and we believe that this is a useful way forward.

  • Decision: Accepted.
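
    A short sketch shows why a modification such as concatenation can leave text un-normalized even when each input was normalized. It assumes Python's standard unicodedata module and uses NFC as the normalization form; the names nfc and concat_normalized are illustrative, and renormalizing the whole result is a simplification of the 'local normalization' idea, which would only need to touch the region around the join.

```python
import unicodedata


def nfc(text):
    return unicodedata.normalize("NFC", text)


left = "cafe"
right = "\u0301s"  # begins with COMBINING ACUTE ACCENT

# Each piece is unchanged by NFC normalization on its own.
both_parts_nfc = nfc(left) == left and nfc(right) == right

# Joining them puts the accent after 'e', and the result is no longer NFC.
joined = left + right
joined_is_nfc = nfc(joined) == joined


def concat_normalized(a, b):
    """Concatenate and renormalize (whole-string, for simplicity)."""
    return nfc(a + b)


result = concat_normalized(left, right)
```

    The join is the only place where normalization can break here, which is why an implementation can limit the extra work to a small window instead of reprocessing possibly huge text.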

C046 | Na | Na | O | David Fallside | XMLP WG | Various | XMLP WG response to Charmod LC#2
C047 | Na | Na | O | Tim Bray | - | Various | Comments on Character Model
C048 | Na | Na | O | Yin Leng Husband | WSArch WG | Various | WSArch WG review of Charmod LC #2
C049 | Na | Na | O | Norman Walsh | TAG | Various | TAG comments on Character Model for the World Wide Web 1.0
C050 | S | A | S | Cliff Schmidt | Microsoft | Various | CharMod restricts closed systems
  • Comment (received 2002-06-06) -- CharMod restricts closed systems

    In the effort to improve interoperability of text exchange across open applications on the Web, the Character Model should not restrict the ability for closed systems to leverage the Web and Web-based technologies. The term 'closed system', as used in this document, refers to a system designed to support organizations communicating among themselves based on a contract into which all parties have explicitly entered.

    The background of this spec states that 'the Web may be seen as a single, very large application...rather than as a collection of small independent applications.' Based on this premise, it is understandable why the CharMod spec chooses to require early normalization in a single canonical form. However, the Web and technologies that have developed to support the Web have also provided enormous value to closed systems, including intranet and extranet scenarios. The relationship between the evolution of the World Wide Web and its use in closed/private systems has been a mutually beneficial one. Private systems have benefited from the efficiencies of applying Web-developed standards and tools, which has in-turn increased the demand and support for these Web-enabling components. The current Character Model spec threatens to break this relationship by forcing restrictions on tools that are commonly used in closed systems, in order to exclusively support the goals of the open system Web.

    It is apparent that the I18N WG has solid reasons for preferring Normalization Form C as an interchange format for the Web; however, it is not likely to be the optimal choice for all applications. There are many legacy systems (both applications and operating systems) that use a decomposed character normalization. It will be difficult for organizations to justify why they should adopt CharMod-based technologies (such as XML 1.1 over XML 1.0), which require transcoding to a less optimal normalization form with no benefit for their closed system. This is likely to lead to fractured use of technologies such as XML 1.0/1.1.

    Finally, XML plays an important role as a data-interchange format in scalable, loosely coupled systems. The Character Model reduces XML to a format applicable only to natural language communication in one particular normalization form. This is unfortunate considering that vastly more bytes of machine-to-machine XML are transmitted than are people-to-people or people-machine bytes.

    The restrictions mandated by the Character Model limit the use of the Web and Web-based technologies for a large base of users. While supporting the vision for the Web as a 'single, very large application', the limitations to other uses of the Web do not appear to support the Character Model's goal to 'facilitate the use of the Web by all people'.

  • Decision: Accepted.

  • Decision: Changed to overall SHOULD in 4.4.

  • Our response (sent 2003-03-06) -- Notified and discussed during FTF mtg

  • Comment (received 2003-03-15) -- Satisfied

C051 | E | A | S | Cliff Schmidt | Microsoft | 2 | Inconsistent/Redundant Requirements for W3C Spec Conformance
  • See also the following comments: C038 C088 C089 C135

  • Comment (received 2002-06-06) -- Inconsistent/Redundant Requirements for W3C Spec Conformance

    '[S] Every W3C specification MUST:

    1. conform to the requirements applicable to specifications,

    2. specify that implementations MUST conform to the requirements applicable to software, and

    3. specify that content created according to that specification MUST conform to the requirements applicable to content.

    [S] If an existing W3C specification does not conform to the requirements in this document, then the next version of that specification SHOULD be modified in order to conform.'

    CONCERN: Stating that all specs 'MUST' conform, but that non-conforming specs 'SHOULD' be modified appears to be inconsistent.

    RECOMMENDATION: The conformance model should only apply to future specs (including future versions of current specs), instead of specifying different conformance levels for existing and future versions.

  • Decision: Accepted.

  • Our response (sent 2003-03-06) -- Notified and discussed during FTF mtg

  • Comment (received 2003-03-15) -- Satisfied

C052 | S | P | S | Cliff Schmidt | Microsoft | 2 | W3C Spec Conformance
  • Comment (received 2002-06-06) -- W3C Spec Conformance

    '[S] Every W3C specification MUST:

    1. conform to the requirements applicable to specifications,

    2. specify that implementations MUST conform to the requirements applicable to software, and

    3. specify that content created according to that specification MUST conform to the requirements applicable to content.

    [S] If an existing W3C specification does not conform to the requirements in this document, then the next version of that specification SHOULD be modified in order to conform.'

    CONCERN: The CharMod's requirement that all specs and related implementations conform to the entire CharMod spec will force non-NFC based applications to perform round trip transcoding to/from NFC in order to use the Web, even in closed system scenarios (e.g. extranets). This will also affect intranet scenarios as corporate systems are forced to jump through hoops in order to satisfy text processors (such as XML parsers) that are required to reject non-NFC text. The costs certainly outweigh the benefits for closed systems. However, it is clear that a recommended conformance level would improve open system interoperability.

    RECOMMENDATION: Replace conformance paragraph and included list with the following sentence:

    '[S] Future W3C specifications (including future versions of existing specifications) MUST reference this specification as W3C recommended guidance for interoperable Web applications.'

  • Decision: Partially accepted. The concerns about NFC have been addressed in section 4.4

  • Our response (sent 2003-03-06) -- Notified and discussed during FTF mtg

  • Comment (received 2003-03-15) -- Satisfied

C053 | E | P | S | Cliff Schmidt | Microsoft | 3.5 | Full Range of Unicode Code Points Not Allowed in XML
  • Comment (received 2002-06-06) -- Full Range of Unicode Code Points Not Allowed in XML

    '[S] Specifications SHOULD allow the use of the full range of Unicode code points from U+0000 to U+10FFFF inclusive; code points above U+10FFFF MUST NOT be used.'

    CONCERN: If this is truly a goal for text on the Web, users should understand why XML is unable to achieve this. As a high profile W3C spec, readers are likely to notice the inconsistent message. Does this mean that I18N believes that XML (1.1 or some later version) should support the characters 0x0-0x1F?

    RECOMMENDATION: If XML 1.1 is unable to achieve this goal, the Character Model spec should either remove this requirement or explain the discrepancy.

  • Decision: Partially accepted.

  • Rationale for 'Partially accepted': It is up to each specification to provide the specific reason(s) for deviating from a 'SHOULD' requirement. We have however amended the text to read: 'Specifications SHOULD not arbitrarily exclude characters from the full range of Unicode code points from U+0000 to U+10FFFF inclusive;'.

  • Our response (sent 2003-03-06) -- Notified and discussed during FTF mtg

  • Comment (received 2003-03-15) -- Satisfied
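
The discrepancy raised in this comment can be made concrete with a minimal sketch. It assumes the `Char` production of XML 1.0 (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]); the function names are hypothetical and not taken from any specification.

```python
def is_xml10_char(cp):
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def excluded_c0_controls():
    """List the code points in U+0000..U+001F that XML 1.0 does not allow."""
    return [cp for cp in range(0x20) if not is_xml10_char(cp)]


if __name__ == "__main__":
    # Only TAB, LF and CR survive from the C0 control range.
    print(["U+%04X" % cp for cp in excluded_c0_controls()])
```

Under this production, most of the range 0x0-0x1F mentioned in the comment is excluded, which is the kind of deviation a specification would be expected to justify.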

C054 E R S Cliff Schmidt
Microsoft
4.2.3 Definition of 'Fully-Normalized'
  • Comment (received 2002-06-06) -- Definition of 'Fully-Normalized'

    'Text is fully-normalized if:

    1. the text is in a Unicode encoding form, is include-normalized and none of the constructs comprising the text begin with a composing character or a character escape representing a composing character; or

    2. the text is in a legacy encoding and, if it were transcoded to a Unicode encoding form by a normalizing transcoder, the resulting text would satisfy clause 1 above.'

    CONCERN: Based on previous definitions, 'Unicode-normalized' may be a more precise term than 'Unicode encoding form' (if the implication is that full normalization requires include-normalization, which requires Unicode normalization as defined in 4.2.1).

    RECOMMENDATION: Refer to text that is 'Unicode-normalized' (possibly linked to the definition [in 4.2.1]), instead of 'Unicode encoding form'.

  • Decision: Partially accepted.

  • Rationale: Checking reveals that we could go either way; change record to partially accepted, but no change.

  • Our response (sent 2003-03-06) -- Notified and discussed during FTF mtg

  • Comment (received 2003-03-15) -- Satisfied
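
As a rough illustration of clause 1 of the quoted definition, the sketch below treats a whole string as a single construct and checks two things: that it is in NFC, and that it does not begin with a combining character. This is a simplification under stated assumptions: it ignores include-normalization and character escapes, and it approximates CharMod's 'composing character' by a non-zero canonical combining class, which is narrower than the draft's definition. The function name is hypothetical.

```python
import unicodedata


def looks_fully_normalized(text):
    """Approximate check: text is NFC and does not start with a combining mark."""
    if unicodedata.normalize("NFC", text) != text:
        return False
    # A leading combining mark could compose with whatever text precedes it
    # once the construct is embedded in a larger document.
    if text and unicodedata.combining(text[0]) != 0:
        return False
    return True


if __name__ == "__main__":
    for sample in ["\u00e7", "c\u0327", "\u0327z"]:
        print(ascii(sample), looks_fully_normalized(sample))
```

The third sample is in NFC on its own, yet fails the check because of its leading combining cedilla.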

C055 S R S Cliff Schmidt
Microsoft
4.4 Mandating NFC for All Web Content
  • Comment (received 2002-06-06) -- Mandating NFC for All Web Content

    '[C] In order to conform to this specification, all text content on the Web MUST be in include-normalized form and SHOULD be in fully-normalized form.

    [S] Specifications of text-based formats and protocols MUST, as part of their syntax definition, require that the text be in normalized form.'

    CONCERN: This restriction is currently applied to 'all text content on the Web' when only content intended for interoperability in open systems will necessarily benefit from it. Although the first requirement for content producers ([C]) could be interpreted to have no impact on intranet scenarios, the second requirement for specifications ([S]) will impact the tools that intranet scenarios have been depending on.

    RECOMMENDATION: Replace above text with text similar to:

    '[C] In order to conform to this specification, all text content on the Web intended for consumption by foreign systems MUST be in include-normalized form and SHOULD be in fully-normalized form.

    [S] Specifications of text-based formats and protocols MUST, as part of their syntax definition, reference the above requirement.'

  • Decision: Rejected. Note, however, the relaxation of the language in section 4.4.

  • Rationale: 'Foreign systems' is undefined and undefinable.

  • Our response (sent 2003-03-06) -- Notified and discussed during FTF mtg

  • Comment (received 2003-03-15) -- Satisfied

C056 S P S Cliff Schmidt
Microsoft
4.4 Text-Processors MUST Perform Normalization Checking
  • Comment (received 2002-06-06) -- Text-Processors MUST Perform Normalization Checking

    '[S] [I] A text-processing component that receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first confirmed through inspection that the text is in normalized form, and MUST NOT normalize the suspect text. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.'

    CONCERN: This requirement will force technologies such as XML parsers to be tied to the latest list of NFC disallowed diacritic characters in order to check normalization. Additionally, in some cases ('MAYBE' cases) NFC checks require text processors to scan backwards through a text stream in order to confirm normalization status. This will require major architectural changes for any processors designed to break a text stream into separate smaller windows for efficient processing, because no previously processed buffer can be thrown away until it is no longer needed to confirm the validity of any diacritic code points at the start of the next buffer. Text processors that expand character entities today at least have the ability to note the ‘&’ flag. It is also worth noting that optimizers of normalization checks will observe that all code points < 0x341 are always allowable. This would result in non-English based texts being disproportionately impacted by normalization checks. Finally, this requirement forces the redefinition of XML to allow for only NFC text.

    RECOMMENDATION: Replace the above text with text similar to:

    '[S] [I] Text-processing components MAY include an option to verify that suspect text is in normalized form. Text-processing components MUST NOT normalize the suspect text without specific direction.'

  • Decision: Partially accepted.

    Rationale: Try to add a note explaining that in the base case, only a one-character lookahead is needed. In the long term, try to move material about 'composing' characters to UAX 15.

  • Our response (sent 2003-03-06) -- Notified and discussed during FTF mtg

  • Comment (received 2003-03-15) -- Satisfied
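
The buffering concern in this comment can be sketched as follows. Two chunks can each be in NFC while their concatenation is not, so a checker that works on windows has to carry some context across the boundary. The sketch assumes it is enough to carry the last starter (a character with canonical combining class 0) and any marks after it; the function names are hypothetical, and this is an illustration rather than an optimized or complete checker.

```python
import unicodedata


def is_nfc(text):
    """Return True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


def chunks_are_nfc(chunks):
    """Check a chunked stream for NFC, carrying context across chunk boundaries."""
    carry = ""
    for chunk in chunks:
        window = carry + chunk
        if not window:
            continue
        if not is_nfc(window):
            return False
        # Keep the last starter and everything after it: characters at the
        # start of the next chunk may still combine with this tail.
        i = len(window) - 1
        while i > 0 and unicodedata.combining(window[i]) != 0:
            i -= 1
        carry = window[i:]
    return True


if __name__ == "__main__":
    chunks = ["c", "\u0327z"]  # 'c', then combining cedilla + 'z'
    print(all(is_nfc(c) for c in chunks))  # per-chunk check passes
    print(chunks_are_nfc(chunks))          # boundary-aware check fails
```

In this sketch only a short tail of the previous buffer is retained, not the whole buffer.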

C057 S P S Cliff Schmidt
Microsoft
4.4 Content Producers and Proxies
  • Comment (received 2002-06-06) -- Content Producers and Proxies

    'NOTE: As an optimization, it is perfectly acceptable for a system to define the producer to be the actual producer (e.g. a small device) together with a remote component (e.g. a server serving as a kind of proxy) to which normalization is delegated. In such a case, the communications channel between the device and proxy server is considered to be internal to the system, not part of the Web. Only data normalized by the proxy server is to be exposed to the Web at large, as shown in the illustration below:'

    CONCERN: Although this note seems to allow closed systems to define their own boundaries, these systems will still be prevented from leveraging technologies based on CharMod, without first normalizing.

    RECOMMENDATION: If the CharMod spec continues to mandate that all text processors must check normalization, this note should point out that such processors could not be used between devices and their proxies.

  • Decision: Partially accepted.

  • Our response (sent 2003-03-06) -- Notified and discussed during FTF mtg

  • Comment (received 2003-03-15) -- Satisfied

C058 E R S Cliff Schmidt
Microsoft
4.4 Web Repositories
  • Comment (received 2002-06-06) -- Web Repositories

    'A similar case would be that of a Web repository receiving content from a user and noticing that the content is not properly normalized. If the user so requests, it would certainly be proper for the repository to normalize the content on behalf of the user, the repository becoming effectively part of the producer for the duration of that operation.'

    CONCERN: As noted in issue, 'Content Producers and Proxies', this scenario is not possible if XML is to be used between the user and the Web repository. This scenario also seems to imply that users may very liberally interpret the boundaries of content production. Could one also claim that the transfer from one repository to another repository was also contained within the realm of content production? What if the second repository was the final destination for the content; does this mean the content never in fact needed to be normalized?

    This scenario seems to encourage users to get around normalization requirements where impractical or inappropriate, yet leaves them with a confusing message and no legal tools to work with (since tools such as XML parsers will only accept normalized text anyway).

    RECOMMENDATION: Delete this paragraph.

  • Decision: Rejected.

  • Rationale: Defining the boundary of content production is outside the scope of this specification.

  • Our response (sent 2003-03-06) -- Notified and discussed during FTF mtg

  • Comment (received 2003-03-15) -- Satisfied

C059 S A N Cliff Schmidt
Microsoft
8 IRIs
  • See also the following comments: C031 C170

  • Comment (received 2002-06-06) -- IRIs

    '[S] W3C specifications that define protocol or format elements (e.g. HTTP headers, XML attributes, etc.) which are to be interpreted as URI references (or specific subsets of URI references, such as absolute URI references, URIs, etc.) SHOULD use Internationalized Resource Identifiers (IRI) [I-D IRI] (or an appropriate subset thereof).'

    CONCERN: Although other W3C specifications support goals similar to those of the IRI proposal, we hesitate to endorse this section of the CharMod spec until the IRI draft has undergone further review.

    RECOMMENDATION: Considering the W3C practice to keep the maturity level of a technical report within one level of any technical report on which it depends, the Character Model should not be considered for Recommendation until the IRI proposal has reached RFC status. Although the precedent has typically referred to W3C dependencies, it seems reasonable that any dependency on a spec outside the W3C should be judged by criteria at least as strong as those imposed on the W3C.

  • Decision: Accepted.

    Our plan is that the IRI ID, referenced in this section, will have been submitted for Proposed Standard by the time CharMod moves to the next stage. IRI equality is fully addressed in the latest IRI ID version.

C060 Na Na N David Fallside
XMLP WG
2 XML Protocol LC#2 review question on implementation testing
  • Comment (received 2002-05-30) -- XML Protocol LC#2 review question on implementation testing

    In reviewing the Charmod LC#2, the XML Protocol WG has a request in relation to implementation conformance.

    - Section 2 Conformance, 3rd paragraph, last sentence

    - '[S] [I] [C] In order to conform to this document, specifications MUST NOT violate any requirements preceded by [S], software MUST NOT violate any requirements preceded by [I], and content MUST NOT violate any requirements preceded by [C].'

    - The XML Protocol WG has produced a protocol specification, which makes various testable assertions, as well as a Collection of Tests each showing whether an assertion is implemented in the protocol processor. As such, the tests do not check for conformance to other specifications (e.g. XML 1.0). The XML Protocol WG asks the I18n WG to comment on whether the [I] implementation requirements of Charmod apply to the XMLP Test Collection.

  • Decision: Not applicable

  • Rationale: We have classified this comment as 'not applicable', because it is a question, not a comment leading to a potential change of the Character Model. The test suite should test for CharMod-related requirements in the specification(s) being tested. The tests should conform to [C] requirements (except where they are wrong on purpose). If the test collection includes code, then that should also conform to [I] requirements.

C061 E Na N David Fallside
XMLP WG
1.1 'All W3C specifications must conform to this document'
  • Comment (received 2002-05-30) -- XMLP WG response to Charmod LC#2

    Goals and Scope, second last paragraph

    'All W3C specifications must conform to this document (see section 2 Conformance).'

    This statement seems too comprehensive and probably needs qualification. What about existing W3C specifications or a W3C specification whose status is very close to LC (e.g. XMLP's)?

    Suggest: 'All W3C specifications published after [a certain date or event such as this Charmod becoming a recommendation] must conform to this document (... etc).'

  • Decision: Rejected

  • Rationale for 'Rejected': This para states the general principle and refers to section 2 for details. The various requirements will come into force once the CharMod spec becomes a REC.

  • New decision: Not applicable

    Rationale: We have classified this comment as 'not applicable' because we have been told that it is inappropriate for a W3C spec to directly enforce requirements on other specifications, and have removed the relevant language from section 2. We still define conformance to CharMod. We have been instructed to request a finding from the TAG corresponding to the text that we removed. So CharMod will be enforced by the fact of being a REC, coupled with an eventual TAG finding and ongoing reviews of relevant specs by the I18N WG.

C062 Na Na C David Fallside
XMLP WG
4.4 'XML protocol need not normalize application payloads or check to insure that they are normalized'
  • Comment (received 2002-05-30) -- XMLP WG response to Charmod LC#2

    Responsibility for Normalization, 8th paragraph, 1st sentence

    '[S] Specifications of text-based formats and protocols MUST, as part of their syntax definition, require that the text be in normalized form'

    In our previous response to the Charmod WD review [1], we said

    'The XML Protocol processor will defer to applications any normalization (early or late) that may be required for sending and/or receiving application payload(s).'

    We received confirmation from i18n WG that that is acceptable:

        > XML protocol need not normalize application payloads

        > or check to insure that they are normalized

        Correct.

    The XMLP WG would like to see clarification of this [S] requirement in relation to the payload's normalization or lack thereof.

  • Decision: We shall write to the XMLP WG explaining why CharMod does not require XMLP to be responsible for the N11N status of the payload.

C063 Na Na C David Fallside
XMLP WG
4.4 Is text to be normalized when forwarded?
  • Comment (received 2002-05-30) -- XMLP WG response to Charmod LC#2

    Responsibility for Normalization, 8th paragraph, 1st sentence

    '[S] Specifications of text-based formats and protocols MUST, as part of their syntax definition, require that the text be in normalized form'

    In a previous review, we asked the question: 'May intermediaries re-send payloads (either normalized or un-normalized) untouched, even though they may change the protocol envelope?' From this requirement, do we take it that re-sent text is to be normalized when forwarded? However, please see comment [C062].

  • Decision: We shall write to the XMLP WG explaining why CharMod does not require XMLP to be responsible for the N11N status of the payload.

C064 E R C David Fallside
XMLP WG
4.4 Give the reason(s) for prohibition against normalizing suspect text
  • Comment (received 2002-05-30) -- XMLP WG response to Charmod LC#2

    Responsibility for Normalization, 9th paragraph, 1st sentence

    '[S] [I] A text-processing component that receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first confirmed through inspection that the text is in normalized form, and MUST NOT normalize the suspect text'

    It would be helpful to give the reason(s) for the prohibition against normalizing the suspect text.

  • Decision: Rejected.

  • Rationale: Covered by section 4.1.

C065 Na Na C David Fallside
XMLP WG
4.4 Does rejected un-normalized text have to be normalized before it is returned to sender?
  • Comment (received 2002-05-30) -- XMLP WG response to Charmod LC#2

    Responsibility for Normalization, 9th paragraph, 1st sentence

    '...and MUST NOT normalize the suspect text'

    In a previous review, we asked the question: 'If un-normalized text is rejected and returned to sender, does it have to be normalized before transmission?' From this requirement, do we take it that rejected un-normalized text is to remain un-normalized when returned?

  • Decision: We shall write to the XMLP WG explaining that this should be handled in the same manner as a payload that is not well-formed.

C066 Na Na David Fallside
XMLP WG
4.4 Specifications that define a mechanism for producing a document SHOULD require that the final output be normalized
  • Comment (received 2002-05-30) -- XMLP WG response to Charmod LC#2

    Responsibility for Normalization, last [S] requirement

    '[S] Specifications that define a mechanism (for example an API or a defining language) for producing a document SHOULD require that the final output of this mechanism be normalized.'

    Is a document text-based format? Is this requirement covered by the earlier one - '[S] Specifications of text-based formats and protocols MUST, as part of their syntax definition, require that the text be in normalized form.'? However, the earlier requirement is stronger ('MUST') than this one ('SHOULD').

  • Q: Is a document text-based format? A: Yes. [We must clarify that we mean the textual parts of documents]

  • Q: Is this requirement covered by the earlier one? A: No. There may not be a spec for the output document, as in the case of plain text, or the spec may not (yet) require N11N.

C067 S P S Tim Bray
-
3.1.5 Collation
  • Comment (received 2002-05-30) -- Comments on Character Model

    [S] [I] Software that sorts or searches text for users MUST do so on the basis of appropriate collation units and ordering rules for the relevant language and/or application.

    Hmm, there are cases where you just don't know the language, and even if you do, is this a requirement in the general case for things like XQuery? I think there are scenarios where it's reasonable to say a particular module shall order things by Unicode character number order and that's all there is to it. I think this should be rewritten to say that IF strings are being collated, they MUST be collated EITHER in the order appropriate to the language they're in, or, if that's not possible, by Unicode character number.

  • Decision: Partially accepted.

  • Note: We'll change the 'MUST' to a 'SHOULD'.

  • Our response (sent 2003-05-01) -- Notification

  • Comment (received 2003-05-01) -- Satisfied
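
The 'Unicode character number order' fallback discussed in this comment can be sketched as below. Python compares strings code point by code point, so a plain sort gives that order; the word list is invented for illustration.

```python
def sort_by_code_point(strings):
    """Sort strings by Unicode code point order, with no language awareness."""
    return sorted(strings)


if __name__ == "__main__":
    words = ["zebra", "Zebra", "apple", "\u00c4pfel"]
    # 'Z' (U+005A) < 'a' (U+0061) < 'z' (U+007A) < 'Ä' (U+00C4)
    print(sort_by_code_point(words))
```

The result places 'Zebra' before 'apple' and 'Äpfel' after 'zebra', which few users would expect. A language-aware ordering would need a collation facility (for example `locale.strxfrm` in the Python standard library, whose behaviour depends on the locales installed on the system).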

C068 S P S Tim Bray
-
3.6 Unique Character Encoding
  • See also the following comments: C114

  • Comment (received 2002-05-30) -- Comments on Character Model

    [S] When designing a new protocol, format or API, specifications SHOULD mandate a unique character encoding.

    No. If the format is in XML and has likely usage scenarios which include creation by humans, this is a good enough reason to just go by the XML rules. For example, I habitually compose XML documents in ISO-8859-1, which suits my needs as a user of European languages. I see no reason whatsoever why a specification should invalidate either my habits or those of a Japanese author who wants to use some flavor of JIS.

    OK, I guess this argument could fall under the exception clause of SHOULD, but I'd go so far as to add

    [S] When designing an XML-based protocol which is apt to be authored by humans, specifications MUST NOT limit the use of character encodings beyond the rules provided by XML.

  • Decision: Partially accepted.

  • Rationale: We have added: "[S] When basing a protocol, format, or API on a protocol, format, or API that already has rules for character encoding, specifications SHOULD use rather than change these rules." and have added XML as an example. As said elsewhere, we prefer not to have requirements specific to a particular format. Also, the 'authored by humans' part is not necessarily true; in general, humans care about the actual text and about the tools they use, not about encodings.

  • Our response (sent 2004-01-16) -- Notification

  • Comment (received 2004-01-24) -- Satisfied

C069 E P S Tim Bray
-
3.6.2 Admissibility of UTF-*
  • Comment (received 2002-05-30) -- Comments on Character Model

    The paragraph beginning

    '[S] If the unique encoding approach is not chosen, specifications MUST designate at least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible... '

    is fine, but if the format uses XML, then XML's rules cover this and in fact require that UTF-8 and -16 are both admissible; which takes priority over the language here and this should be noted.

  • Decision: Partially accepted.

  • Note: Covered by our edit resulting from C114 and your previous comment C068.

  • Our response (sent 2004-01-16) -- Notification

  • Comment (received 2004-01-24) -- Satisfied
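
As background to the encoding comments C068 and C069, the sketch below shows one short text serialized in UTF-8, UTF-16 and ISO-8859-1. The byte sequences differ, but each decodes back to the same characters. The sample text and the function name are assumptions made for illustration.

```python
def encode_in(text, encodings):
    """Encode text in each named encoding and return a dict of byte strings."""
    return {name: text.encode(name) for name in encodings}


if __name__ == "__main__":
    text = "caf\u00e9"
    encoded = encode_in(text, ["utf-8", "utf-16", "iso-8859-1"])
    for name, data in encoded.items():
        # Every serialization round-trips to the same character sequence.
        print(name, data, data.decode(name) == text)
```

ISO-8859-1 can represent this sample but not arbitrary Unicode text, which is one reason the draft asks that at least UTF-8 or UTF-16 be admissible.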

C070 S N S Tim Bray
-
4 Early Uniform Normalization
  • Comment (received 2002-05-30) -- Comments on Character Model

    I am unable to develop an intelligent opinion as to the cost-benefit trade-off of Early Uniform Normalization and will remain unable to do so without hard information as to the cost. For example, if there was a C-language library available unencumbered by licensing issues which had a memory footprint smaller than say 10k and which ran at I/O speeds, you could reasonably argue that this is a cost effectively equal to zero. On the other hand, if E.U.N. requires a memory footprint of 256K or, worse, understanding and linking to the entire ICU library (blecch), the cost is likely to be unacceptable in a large class of applications.

    There's a normalizer demo at Unicode.org referenced from Appendix D, which suggests that a few hundred lines of Java suffice, but I haven't had time to build the tables or to really think about whether they are being done in the best possible way.

    I think my blockage on this point will be shared by the AC members who will eventually be asked to express an opinion on E.U.N. So I think somebody owes the world the gift of a few quantitative research results on these numbers.

  • Decision: Noted.

    We agree with the sentiment. Refer to some mails by Mark about cost of checking/normalizing. Doing normalization really early (when data is input or converted) is usually very cheap because it can be done by design (e.g. keyboards with dead keys, conversion from a specific legacy encoding).

  • Decision: Noted.

    We agree that this is an important consideration. Please refer to some earlier mails by Mark Davis about cost of checking/normalizing. Doing normalization really early (when data is input or converted) is usually very cheap because it can be done by design (e.g. keyboards with dead keys, conversion from a specific legacy encoding). Normalization is indeed best run at i/o speed, but this should be human input speed rather than network i/o speed.

    A general normalization algorithm needs significantly more than 10KB footprint. But there is quite a wide range of possible tradeoffs between speed and footprint.

    We have added references to implementations and additional material in Appendix D, Resources for Normalization (http://www.w3.org/International/Group/charmod-edit/#sec-n11n-resources). There is also an FAQ at http://www.unicode.org/faq/normalization.html. An implementation that I (MD) did for just *checking* NFC came in under 50KB (in C). Mark reported 110KB for actual normalization to NFC (in Java).

  • Our response (sent 2004-01-16) -- Notification

  • Comment (received 2004-01-24) -- Satisfied

C071 E R D Tim Bray
-
6 Bit-by-bit identity
  • Comment (received 2002-05-30) -- Comments on Character Model

    List item 4. 'Testing for bit-by-bit identity.'

    <pedantry intensity='severe'>This may be the way you do it but I think it's the wrong way to talk about it. The point about Unicode is that it says a character is a thingie identified by number which has a bunch of properties. At the end of the day, what you want people to do is to normalize the data in computer storage to a series of non-negative integers and when testing for equality, if you have two sequences of non-negative integers which are equal in length and pairwise equal in value, then you have equality. It is conceivable in theory that the integer values are stored differently in two parts of the same program; and in practice, who knows what lurks inside a Perl 'scalar', and what really happens when perl processes the '==' operator? So I think that item 4 should say the strings are pairwise numerically equal by code point and leave it at that.</pedantry>

  • Decision: Rejected.

  • Rationale: What actually happ