Public Last Call #3 Comments
Character Model for the World Wide Web 1.0: Fundamentals
W3C Working Draft 25 February 2004

Useful links

Character Model: Last Call #3 WD | Last Call #2 WD | Last Call #1 WD

Related documents: Public Last Call #1 Comments | Public Last Call #2 Comments

Related mail archives: www-i18n-comments

Last Call Comments

# | See key (I D S) | From | For | Ref | Description
LC002 | EAN | Tom Milo | - | B | Re: Character Model: Two new documents and Last Call
  • Comment (received 2004-02-26) -- Re: Character Model: Two new documents and Last Call

    Consider using a real Arabic example for Characters, Keystrokes and Glyphs. If you want to illustrate two instances of lam-alif, each keyed in differently, don't use nonsense /laalaagaga/ لالاغغ. ...

  • Behdad Esfahbod suggests: لالایی - a Persian word meaning 'lullaby', pronounced /laalaayee/

  • Najib Tounsi suggests: تتلألأ - an Arabic word which means "it shines", "it sparkles", where the subject "it" is of female gender. Pronounced: ta ta la' la'

  • Discussed:

  • Decision: Accepted. We used the Persian word suggested above.

LC003 | SNN | Markus Scherer | - | Overall | charmod vs. UTF-16/32
  • Comment (received 2004-02-27) -- charmod vs. UTF-16/32

    The names UTF-16 and UTF-32 are each used for an encoding form and an encoding scheme. charmod should mention this, and mention that the encoding scheme versions use Byte Order Marks (BOMs) while the encoding forms don't.

  • Decision: Noted and deferred.

    We agree with the first part of the sentence, but we do not yet have enough consensus to talk about the BOM in this version of the document.

LC004 | SNN | Markus Scherer | - | Overall | charmod vs. UTF-16/32
  • Comment (received 2004-02-27) -- charmod vs. UTF-16/32

    It should be explicitly permissible to recognize that a document uses the UTF-16 encoding scheme by its BOM, if it is present. This is common practice for HTML and XML and has proven valuable because these encoding schemes are not compatible with ASCII byte streams.

  • Decision: Noted and deferred

    We agree, but we do not yet have enough consensus to talk about the BOM in this version of the document.

LC005 | SNN | Markus Scherer | - | Overall | charmod vs. UTF-16/32
  • Comment (received 2004-02-27) -- charmod vs. UTF-16/32

    There are BOM-like signature byte sequences for other Unicode encodings as well, such as UTF-32 and SCSU. Justification as before; UTF-8 is not always the most desirable encoding.

  • Decision: Noted and deferred

    We agree, but we do not yet have enough consensus to talk about the BOM in this version of the document.
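The signature detection raised in LC003 to LC005 can be sketched as follows. This is only an illustration, not text from the Character Model: the function name `sniff_signature` and the labels are hypothetical, and the SCSU signature bytes (0E FE FF) are taken from the Unicode SCSU documentation rather than from this page.

```python
import codecs

# Longer signatures come first: the UTF-32LE signature (FF FE 00 00) begins
# with the UTF-16LE signature (FF FE), so UTF-32 must be tested before UTF-16.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (b"\x0e\xfe\xff", "SCSU"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_signature(data: bytes):
    """Return (label, signature_length), or (None, 0) if no signature is present."""
    for signature, label in SIGNATURES:
        if data.startswith(signature):
            return label, len(signature)
    return None, 0


# Encoding scheme vs. encoding form, as seen through Python's codecs:
# "utf-16" (scheme) consumes a leading BOM, "utf-16-le" (form-like) keeps it.
sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
as_scheme = sample.decode("utf-16")
as_form = sample.decode("utf-16-le")
```

Such sniffing is a heuristic: UTF-16LE data that starts with a BOM followed by U+0000 is byte-for-byte indistinguishable from a UTF-32LE signature.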

LC006 | SRN | Markus Scherer | - | 6.2 | charmod vs. UTF-16/32
  • Comment (received 2004-02-27) -- charmod vs. UTF-16/32

    charmod C051/C052 prefers code point indexing (called "character string indexing"). This will lead to inefficiencies because most implementations will use UTF-16 strings. It would be better to recommend UTF-16 code unit indexing. (See UTN #12 http://www.unicode.org/notes/tn12/)

  • Discussed:

  • Decision: Rejected. The 'character string' provides a good balance between user requirements (ideally count in terms of grapheme clusters) and implementation requirements (count in terms of code units). Also, it takes into account that specifications (in particular those related to XML) are written in terms of characters, not code units.

    We would like to point out that we have carefully listed the alternatives and the reasons for when to use them in C052, C071, and so on, so that readers of the Character Model (writers of specifications) should be able to make the best decision on their own.

    Although we understand performance concerns about calculating string length, we haven't heard any complaints about this, e.g. from implementers of XSLT. Also, in cases where it really does become a bottleneck, e.g. finding a certain character position in an extremely long string encoded in UTF-16 (or for that matter in UTF-8), there are techniques for optimization (e.g. building an index of every 1000th character position for a string 1M characters long, to be used to speed up subsequent indexing operations).

    Also, strings in general are not as easy to use as they may seem. For some interesting background, please see http://www.joelonsoftware.com/articles/fog0000000319.html.
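The indexing optimization mentioned in this response can be sketched roughly as below. It assumes a UTF-16 backing store and records the code unit offset of every `step`-th code point; the helper names (`utf16_units`, `build_checkpoints`, `code_unit_offset`) are hypothetical, and a Python `str` stands in for the stored text purely for illustration.

```python
from typing import List


def utf16_units(ch: str) -> int:
    """Number of UTF-16 code units needed for one code point."""
    return 2 if ord(ch) > 0xFFFF else 1


def build_checkpoints(text: str, step: int = 1000) -> List[int]:
    """Code unit offsets of code points 0, step, 2*step, ... in UTF-16."""
    checkpoints = [0]
    offset = 0
    for i, ch in enumerate(text):
        if i > 0 and i % step == 0:
            checkpoints.append(offset)
        offset += utf16_units(ch)
    return checkpoints


def code_unit_offset(text: str, checkpoints: List[int], index: int,
                     step: int = 1000) -> int:
    """Map a code point index to a UTF-16 code unit offset.

    Only the characters between the nearest checkpoint and `index`
    are scanned, at most `step` of them for an in-range index.
    """
    k = min(index // step, len(checkpoints) - 1)
    offset = checkpoints[k]
    for ch in text[k * step:index]:
        offset += utf16_units(ch)
    return offset


# Mixed BMP and supplementary characters; a small step keeps the demo short.
demo = "a\U00010000b" * 3
demo_checkpoints = build_checkpoints(demo, step=2)
```

The trade-off is memory for lookup time: one stored offset per `step` characters bounds the scan for any character position.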

LC007 | EAN | fantasai | - | 3.2 | poor example of multi-letter phonemes
  • Comment (received 2004-03-04) -- poor example of multi-letter phonemes

    # for example 'wr' and 'ng' in "writing" ... For the 'wr' in writing, it is generally perceived that the w is silent and the 'r' alone gives its sound. And then for the 'ng', unless they've taken Linguistics, most English speakers don't notice that it's a separate phoneme. So I suggest you use "thing", because "th" is definitely a single phoneme.

  • Decision: Accepted

LC008 | EAN | fantasai | - | 3.3 | describing 'logical' order
  • Comment (received 2004-03-04) -- describing 'logical' order

    The spec mentions that logical ordering "benefits accessibility, searching, and collation". It would be illustrative if you mention rendering the text to speech. Visual order is irrelevant to speech, the logical ordering in it is very clear, it's a concrete example, and it demonstrates the *naturalness* of logical ordering.

  • Decision: Accepted

LC009 | SRN | Frank Ellermann | - | 4.5 | C069
  • Comment (received 2004-03-03) -- C069

    C069 Content SHOULD NOT misuse character technology for pictures or graphics.

    Please add examples for smileys like ":-)" resp. characters like U+263A / U+263B (white / black smiling face). Maybe add links to corresponding CSS and accessibility documents.

  • Discussed:

  • Decision: Rejected. We have decided to reject your comment, but would like to thank you for making it, because it has helped us get more clarity on what exactly we should say.

    We agree that C069, as it was written, at least in some interpretations, would have prohibited ASCII art and ASCII smilies, and potentially even Unicode smilies and so on. While we do not think that ASCII art and ASCII smilies are necessarily a good idea, and in particular there are accessibility issues, we note that there is quite a widespread practice, and that with respect to accessibility, it is the expertise of a separate group, and a separate spec, that is most qualified to decide this (WCAG 1.0 has some techniques that mention ASCII art, but doesn't prohibit it outright). So we decided to defer the question of what to say about ASCII art and so on, and decided to remove C069, and insert a much more specific conformance requirement into the spec, placed somewhat earlier after the Note after C073:

    >>>>>>>>

    C076 [C] Content MUST NOT use a code point for any purpose other than that defined by its character encoding.

    This prohibits the construction of fonts that misuse e.g. iso-8859-1 to represent different scripts, characters, or symbols than what is actually encoded in iso-8859-1.

    >>>>>>>>

    This is the major misuse that we tried to address with C069, in somewhat too general a fashion. In an ASCII smiley, a ')' is still a ')' as defined in ASCII; it's just used in a different way than usual, but neither the character model nor Unicode says how characters can be used and how not.

LC010 | NaRN | Frank Ellermann | - | 4.6 | C048
  • Comment (received 2004-03-03) -- C048

    "content SHOULD use the hexadecimal form of character escapes when there is one."

    This translates to "content should not be visible with any browser" in the case of Netscape 3.x (and 4.x, but that's a known problem). Based on the explanations in chapter 2 C048 would transform perfectly valid (X)HTML content to "non-conforming" content.

  • Discussed:

  • Decision: Rejected. We have assumed that your comment asks for removing C048 to avoid problems with browsers such as Netscape 3.x and 4.x. Under that assumption, we have rejected your comment.

    We would like to note that not only do these browsers not deal with hexadecimal character references, they are also very bad at dealing with character references in general according to the reference processing model. In particular, for Netscape 4.x, one has to label a document as UTF-8 in order for arbitrary (decimal!) character references to take effect. Given the very poor, if not non-existent, support for the very basics of the Character Model in those browser versions, we do not feel that it is appropriate to remove C048, which otherwise is undisputed. In addition, browser statistics (see e.g. http://www.w3schools.com/browsers/browsers_stats.asp) show that the percentage of these browsers is declining steadily and has reached very low numbers.

    We would also like to note that C048 is only a SHOULD, so this still allows the use of decimal numeric character references in situations where backwards compatibility with such kinds of browsers is really important, e.g. in intranet environments with very slow upgrade cycles.

    Please note that the wording of C048 has changed to "Content SHOULD use the hexadecimal form of character escapes rather than the decimal form when there are both." to avoid saying anything about the relative preference of named character entities vs. numeric character references. But this should be only marginally related to your comment.
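To make the two escape forms discussed under C048 concrete, here is a small sketch. The helper names `hex_reference` and `decimal_reference` are hypothetical, and the example character is the U+263A smiley mentioned in LC009. Both forms resolve to the same character; the hexadecimal one simply lines up with the U+XXXX notation.

```python
import html


def hex_reference(ch: str) -> str:
    """Hexadecimal numeric character reference, e.g. '&#x263A;' for U+263A."""
    return f"&#x{ord(ch):X};"


def decimal_reference(ch: str) -> str:
    """Decimal numeric character reference for the same character."""
    return f"&#{ord(ch)};"


smiley = "\u263A"  # WHITE SMILING FACE
hex_form = hex_reference(smiley)
decimal_form = decimal_reference(smiley)

# html.unescape resolves both forms to the same character.
resolved = (html.unescape(hex_form), html.unescape(decimal_form))
```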

LC011 | EAS | Tim Bray | - | 1.2 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    second <ul>, second point:

    "Non-ASCII characters [ISO/IEC 646] are being used"

    awkward: the reference is to ASCII and not to "non-ASCII characters". I suggest

    "Characters outside the ASCII [ISO/IEC 646] repertoire are being used"

  • Discussed:

  • Decision: Accepted. We used the proposed wording.

  • Comment (received 2004-07-29) -- Satisfied

  • See also clarification

LC012 | ENS | Tim Bray | - | 1.2 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    second <ul>, third point:

    "More and more APIs are defined, not just protocols and format"

    So what? Why is this point here? Either remove it or explain how it relates to i18n.

  • Decision: Noted. We have classified this comment as 'noted', which means that we acknowledge the point, but don't think that a change to the specification is necessary.

    APIs often require more detailed specifications than protocols or formats:

    - APIs are often used on lower-granularity units than protocols and formats.

    - APIs often work on a single machine, and favor efficiency over (cross-architecture) interoperability.

    - Protocols and formats often only move data, whereas APIs manipulate data.

    For I18N, this means that more details, e.g. regarding Unicode, may have to be specified for APIs than for protocols and formats. This is explained in detail a couple of paragraphs later.

  • Comment (received 2004-07-29) -- Satisfied

  • See also clarification

LC013 | ENS | Tim Bray | - | 1.2 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    just below previous:

    "In short, the Web may be seen as a single, very large application..."

    this paragraph may or may not be true and is orthogonal to i18n (I think) so either remove it or explain why it matters

  • Decision: Noted. We have classified this comment as 'noted'. This means that while it raises a valid point, we have decided not to change the specification.

    The fact that the Web can be seen as a single, very large application (in the sense that data flows through all the pieces without any absolute boundaries) is indeed very important, in particular for the use of Unicode as a common reference point in the Character Model. Without such a reference, binary data would be exchanged without any chance of comparing two text strings (e.g. if they are in incompatible encodings). This also increases the requirement for Web-wide agreements on things such as counting characters. So this is indeed relevant to i18n, and is to quite some extent actually explained before and after the text in question.

  • Comment (received 2004-07-29) -- Satisfied

  • See also clarification

LC014 | EPS | Tim Bray | - | 1.2 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    3rd last para:

    "It should be noted that such aspects also exist in legacy encoding"

    Awkward language, suggest ".. that such issues also exist for ..."

  • Decision: Partially accepted. Changed 'in' to 'for', but not 'aspects' because this word is used before.

  • Comment (received 2004-07-29) -- Satisfied

  • See also clarification

LC015 | EPS | Tim Bray | - | 1.3 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    The first sentence, beginning "For the purpose of this specification..." totally baffles me. The notion of the "producer" of text data is entirely self-explanatory, and this sentence is unnecessary, and also confusing because most people don't have an internal world-view that distinguishes "products" and "formats". I don't. I suggest

    "This specification distinguishes between the roles of <b>producer</b> and <b>recipient</b> of text data. In a networked information system, a software module may be both a producer and a recipient."

  • Discussed: see notes

    RI finds only one other use of 'producer' and 'recipient'.

  • Discussed: see notes

  • Decision: Partially accepted. We removed the first paragraph and note in section 3.1, since those definitions were not needed for this document. We will use your proposed text for the Normalization document, where these definitions are needed.

  • Comment (received 2004-07-29) -- Satisfied

LC016 | EAS | Tim Bray | - | 2 | Review of WD-charmod-20040225
LC017 | EAS | Tim Bray | - | 2 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    items 3 and 4 in the <ol>

    I think "where applicable" is a little stronger and smoother than "if applicable"

  • Decision: Accepted

  • Comment (received 2004-07-29) -- Satisfied

LC018 | EAS | Tim Bray | - | 2 | Review of WD-charmod-20040225
LC019 | EAS | Tim Bray | - | 3.1 | Review of WD-charmod-20040225
LC020 | EAS | Tim Bray | - | 3.3 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    first para after the <ul>

    "Each glyph can be represented by a number of different glyph images; a set of glyph images makes up a font."

    The part before the semicolon is very awkward and I'm not sure I understand what it's saying. Maybe an example? Are you saying that even though é is a single character, the standalone accent is also in the font even if you can't use it standalone?

  • Decision: Accepted

  • Comment (received 2004-07-29) -- Satisfied

LC021 | EAS | Tim Bray | - | 3.3 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    material on selection

    This section needs either to be split or a new section 3.3.1 selection. There is a clear transition at the paragraph beginning "In the presence of bidirectional text..." from talking about directionality to talking about selection. In fact, you could make a case for the paragraph beginning "Some scripts, in particular Arabic..." being a standalone section. The material here on selection and bidirectionality is excellent and the usefulness would be better if it had a section number so people could reference it.

  • Decision: Accepted

  • Comment (received 2004-07-29) -- Satisfied

LC022 | EAS | Tim Bray | - | 4.1 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    first two sentences

    the phrase "in particular on the WWW" is wrong, it's no more necessary to encode chars here than anywhere else. I suggest "On the WWW, as in any computing environment, characters must be encoded to be of any use."

    The second sentence beginning "In fact, much of the information..." is pure fluff, I suggest just losing it. By byte count, the amount of text flowing around the network has been a small minority since the creation of alt.sex.pictures, which predates the web by a few years. You don't need to convince anyone that there's text out there and that encoding it is important.

  • Decision: Accepted

  • Comment (received 2004-07-29) -- Satisfied

LC023 | EAS | Tim Bray | - | 4.3 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    first para

    "... where no markup or programing language applies." Non-idiomatic, suggest "(not in the context of markup or a programming language)"

  • Decision: Accepted

  • Comment (received 2004-07-29) -- Satisfied

LC024 | SAS | Tim Bray | - | 4.3 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    Para beginning "Unicode contains some code points for internal use..."

    Shouldn't the "should not" here be a MUST not? No spec should *ever* specify sending a surrogate, except implicitly as part of an astral-plane character.

  • Discussed:

  • Decision: Accepted. We have split the requirement into two, making it a MUST NOT for surrogates, and a SHOULD NOT for other code points. The main reason for the distinction is that surrogates are the biggest area, and therefore the easiest to exclude. From there on, it's a bit of a slippery slope, with a decreasing return on investment. An example would be the U+??FFFE and U+??FFFF code points at the end of each plane. They are clearly not allowed, but a spec might want to make its own decision on whether to formally disallow them, based on efficiency considerations.

  • Comment (received 2004-07-29) -- Satisfied

  • See also clarification
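The split described in this decision can be sketched as a simple check. This is a hypothetical illustration, not wording from the spec: the function names and the tuple layout are assumptions. Only the plane-end code points named above are covered; Unicode also defines the noncharacter range U+FDD0..U+FDEF, which this sketch ignores.

```python
def is_surrogate(cp: int) -> bool:
    """True for surrogate code points U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_plane_end_noncharacter(cp: int) -> bool:
    """True for U+??FFFE and U+??FFFF, the last two code points of each plane."""
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def check_text(text: str):
    """List (position, kind, requirement level) for each questionable code point."""
    problems = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if is_surrogate(cp):
            problems.append((i, "surrogate", "MUST NOT"))
        elif is_plane_end_noncharacter(cp):
            problems.append((i, "noncharacter", "SHOULD NOT"))
    return problems


# A lone surrogate and a plane-end noncharacter are flagged;
# a supplementary character such as U+10400 is a single valid code point.
findings = check_text("a\ud800b\uffff\U00010400")
```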

LC025 | SRS | Tim Bray | - | 4.3 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    C016

    This is controversial. I think in general this is reasonable, with the single exception of doing what XML did and blessing both UTF-8 and UTF-16. The problem with a single encoding is that it forces people to choose between being Java/C# friendly (UTF-16) and C/C++ friendly (UTF-8). Later on, you in fact seem to agree with this point. Furthermore it's trivially easy to distinguish between UTF-8 and UTF-16 if you specify a BOM. But I think that if I were defining the next CSS or equivalent I'd like to be able to say "UTF-8 or UTF-16" without feeling guilty.

  • Discussed:

  • Decision: Rejected. We have decided to reject this comment. The argument about having to choose between Java/C# friendly and C/C++ friendly has been countered on www-tag: in terms of programming, an explicit decoding step has to be used anyway, e.g. in Java to deal with endianness issues, and interoperability and speed are not increased by adding more encodings because in the general case, all encodings have to be addressed. Also, we note that the recent focus on abstract representations should make it possible, for example, to pass data directly as characters between two Java programs or processes.

    In addition, we note that we don't know of any technology that currently would allow exactly UTF-8 and UTF-16 but nothing else (as opposed to XML, which allows lots of other encodings). This would mean that it would be impossible to show implementation experience for such a combination. This seems to be in accordance with a well-known (at least in the IETF) saying for spec design: "zero, one, many".

    In the case of (the next version of) CSS, this wouldn't really apply, because CSS, at least currently, like XML allows a wide range of character encodings. Also, it is very ASCII-heavy, more so on average than XML, so that UTF-16 is less important.

  • Comment (received 2004-07-29) -- Satisfied

  • See also clarification
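The remark that ASCII-heavy formats such as CSS gain little from UTF-16 can be checked with a quick size comparison. The helper name `encoded_sizes` and the sample strings below are made up for illustration.

```python
def encoded_sizes(text: str):
    """Return (UTF-8 byte count, UTF-16 byte count without BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical samples: an ASCII-only style rule and a short Japanese string.
css_rule = "body { color: red; font-family: serif; }"
japanese = "日本語"

css_utf8, css_utf16 = encoded_sizes(css_rule)
ja_utf8, ja_utf16 = encoded_sizes(japanese)
```

For pure ASCII, UTF-16 takes exactly twice the bytes of UTF-8, while for the three-character Japanese sample UTF-16 is the smaller of the two (6 bytes versus 9).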

LC026 | ENS | Tim Bray | - | various | Review of WD-charmod-20040225
LC027 | SPS | Tim Bray | - | 4.4.2 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    C033

    This is fuzzy and doesn't actually tell me anything that I can use. Either remove it or beef it up with examples.

  • Discussed: see notes

  • Decision: Partially accepted

    We felt that the first part of the sentence had meaning and value, but removed the second part [ " and SHOULD implement them in such a way that they are easy to use (for instance in HTTP servers)" ].

  • Comment (received 2004-07-29) -- Satisfied

  • See also clarification

LC028 | EAS | Tim Bray | - | 4.4.2 | Review of WD-charmod-20040225
LC029 | EPS | Tim Bray | - | 4.4.2 | Review of WD-charmod-20040225
LC030 | EPS | Tim Bray | - | 4.6 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    last item in <ul>

    Item #3 is fuzzy. I think what you really mean is

    3. Expressing characters that can't be input directly (e.g. because of keyboard limitations).

    4. Expressing characters that can't be displayed (e.g. because of font limitations)

  • Discussed: see notes

  • Decision: Partially accepted. This is a formal definition of an escape, rather than a statement of purpose, so we feel that point 3 is fine. We did, however, change 'character codes' to 'encoded characters'.

  • Comment (received 2004-07-29) -- Satisfied

  • See also clarification

LC031 | EAS | Tim Bray | - | 4.6 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    Third EXAMPLE

    This is incorrect. Within CDATA sections, &#xD801; is perfectly legal and just encodes a string of 8 ASCII characters. Outside of CDATA sections, &#xD801; is illegal, but that's an XML thing, not a CDATA section thing.

  • Discussed:

  • Decision: Accepted. Changed "CDATA sections do not allow the expression of unrepresentable characters and in fact prevent their expression using numeric character references." to "CDATA sections prevent the expression of characters using numeric character references."

  • Comment (received 2004-07-29) -- Satisfied

  • See also clarification
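The behaviour described in this comment can be observed with Python's standard XML parser. The sketch below assumes `xml.etree.ElementTree` backed by expat, as in CPython; the element name `r` and the helper `parses` are hypothetical.

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """True if the document is accepted by the XML parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Inside a CDATA section the reference is not expanded: it is 8 literal characters.
inside_cdata = ET.fromstring("<r><![CDATA[&#xD801;]]></r>").text

# Outside CDATA, a reference to a lone surrogate makes the document ill-formed.
surrogate_outside_ok = parses("<r>&#xD801;</r>")

# An ordinary reference is expanded outside CDATA but left as-is inside it.
expanded = ET.fromstring("<r>&#x41;</r>").text
literal = ET.fromstring("<r><![CDATA[&#x41;]]></r>").text
```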

LC032 | ERS | Tim Bray | - | 4.6 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    C048

    Seems silly. We're pretty well deprecating everything except Unicode right, so this vague notion of "character set standards" is useless. And you already said use hex for Unicode.

  • Discussed:

  • Decision: Rejected. Charmod does not deprecate everything but Unicode (although it shows a clear and intentional preference). Also, the earlier requirement for hex escapes (C045) applies to specifications that define escape syntaxes, while this one applies to content (and implementations that generate content).

  • Comment (received 2004-07-29) -- Satisfied

  • See also clarification

LC033 | ERS | Tim Bray | - | 4.6 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    C049

    The notion of a "character encoding based on Unicode" is jarring here. Doesn't the whole document say "use Unicode"?

  • Discussed: see notes

  • Decision: Rejected. Charmod does not deprecate everything but Unicode (although it shows a clear and intentional preference). C049 includes things like using iso-8859-1 or windows-1252 for western European languages, or shift_jis and similar encodings for Japanese, and so on.

  • Comment (received 2004-07-29) -- Satisfied

  • See also clarification

LC034 | EAS | Tim Bray | - | 6.2 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    C056

    I think it would be helpful to link back to the section where you show that a character does not map to a single unit of sound or display or input, as another good reason for this constraint.

  • Discussed: see notes

  • Decision: Accepted. We extended the example just after C056 to point to the relevant section.

  • Comment (received 2004-07-29) -- Satisfied

  • See also clarification

LC035 | SAS | Tim Bray | - | 7 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    C058

    Can you proceed to recommendation with this dependency on IRIs, which are not yet cooked?

  • Discussed: see notes

  • Decision: Accepted. We have accepted this comment. As a result of this and other comments, we have split the character model again, creating a separate part that only deals with IRIs. We plan to move that to CR, and only proceed to PR when the IRI spec has progressed further in the IETF (e.g. is published as a Proposed Standard RFC).

    We would like to note that the IRI spec has recently made considerable progress: IETF last call has ended successfully, and IESG approval may be close.

  • Comment (received 2004-07-29) -- Satisfied

  • See also clarification

LC036 | ERS | Tim Bray | - | 4.6 | Review of WD-charmod-20040225
  • Comment (received 2004-03-05) -- Review of WD-charmod-20040225

    C062

    I agree with this, could we strengthen it to say MUST reference Unicode? Anyone defining a protocol or language that has text in it had better say the text is unicode and if they say so, should really have a normative reference, right? Is there any situation we can imagine where it would be OK to not have such a reference?

  • Discussed:

  • Decision: Rejected. Not all specs need a reference to Unicode (or to 10646, for that matter). An example would be the xml:base spec, which doesn't involve characters except indirectly through XML and URIs. However, it is difficult to clearly define when a spec does or does not depend on character definitions and semantics. If C062 changed to a MUST, it would need to have a qualifier (e.g. "if the spec depends on character definitions and semantics...") which would make the MUST clause untestable. Therefore, this should remain a SHOULD, providing an escape hatch for specs that legitimately do not require a Unicode reference.

  • Comment (received 2004-07-29) -- Satisfied

  • See also clarification

LC037 | EAN | Susan Lesch | W3C Communications Team | Overall | Background color of images
LC038 | EAN | Susan Lesch | W3C Communications Team | Overall | XML Spec XSLT
  • Comment (received 2004-03-12) -- XML Spec XSLT

    The document looks like the output of an old version of the XML Spec XSLT stylesheet. Would you be able to use a newer version? A problem with "dt.label { display: run-in; }" was fixed about a year and a half ago [1] by deleting it. It makes the references section hard to read in Mac IE [2]. [1] http://lists.w3.org/Archives/Public/spec-prod/2002JulSep/0009.html [2] http://www.w3.org/2004/03/12-charmod.png

  • Discussed: see notes

  • Decision: Accepted, but we fixed the XSLT we are using rather than upgrading to the current version of XMLSpec, because we have added numerous extensions and don't feel we have the time or need to redo things.

LC039 | EAN | Susan Lesch | W3C Communications Team | 4.3 | Minor editorial
  • Comment (received 2004-03-12) -- Minor editorial

    In "the important aspect is that everything is text" I would link "text" to its definition.

  • Decision: Accepted

LC040 | TAN | Susan Lesch | W3C Communications Team | 4.3 | Typo
  • Comment (received 2004-03-12) -- Typo

    s/accessibility/accessiblity/

  • Decision: Accepted as s/accessiblity/accessibility/

LC041 | TAN | Susan Lesch | W3C Communications Team | 2 | Typos
  • Comment (received 2004-03-12) -- Typos

    s/A specification conforms to this document if they:/A specification conforms to this document if it:/

    s/Implementations (software) conform to this document if it does not violate/An implementation (software) conforms to this document if it does not violate/

  • Decision: Accepted

LC042 | EPN | Susan Lesch | W3C Communications Team | A | URIs in citations
LC043 | SPS | Dan Connolly | - | 3.2 | conformance to "software MUST NOT assume" measurable?
  • Comment (received 2004-03-18) -- conformance to "software MUST NOT assume" measurable?

    Regarding: C001 [S] [I] [C] Specifications, software and content MUST NOT assume that there is a one-to-one correspondence between characters and the sounds of a language.

    How does one test/measure/observe/demonstrate that? Would you please point me at a test case?

    I think it's fine to write: Take care not to assume a one-to-one correspondence between characters and sounds of a language. followed by the examples you give, but I don't see how making this a conformance clause is helpful.

    This applies to C002 and C003 as well. [[[ on the submission form: Hmm... I want to cc this comment to some colleagues. This form doesn't help. I suppose the privacy policy is reasonably clear, but an explicit link to W3C's privacy policy (I assume we have one) seems in order. Ah... 2 step confirmation is good. ]]]

  • Discussed: see notes

  • Decision: Partially accepted. We have changed the wording from:

    C001 [S][I][C] Specifications, software and content MUST NOT >assume that there is< a one-to-one correspondence between characters and the sounds of a language.

    to

    C001 [S][I][C] Specifications, software and content MUST NOT >require or depend on< a one-to-one correspondence between characters and the sounds of a language.

    and have made the same change for C002 and C003. This avoids the issue that specifications, implementation, and content don't really make 'assumptions'.

    As for conformance, we would like to first point out that all the conformance criteria in the Character Model are predicated on whether a given criterion actually applies to a given technology. So technology that does not deal with the auditory representation of language (i.e. most W3C specifications) is not affected by this criterion. Technology that is affected (e.g. VoiceXML and in particular SSML) can be checked.

    If SSML, for example, tried to do text-to-speech conversion by defining a format for a table that would only associate single phonemes with single characters, it would very clearly not conform to the character model. But as you can check at http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/#S3.1.9, SSML's definition of written-to-spoken correspondence using the <phoneme> element allows definitions on whole words or larger pieces of text, so it is conformant. With this example, we hope to have shown that conformance of specifications can indeed be checked.

    To be even more concrete, one could easily collect a series of examples (starting with those mentioned in the spec, such as "thing"), where there is not a one-to-one correspondence between characters and phonemes, and check whether specs, implementations, and so on that deal with such correspondences can handle them.

  • Comment (received 2004-10-12) -- Satisfied

LC044 | SRD | Dan Connolly | - | 3.7 | define 'character' once and for all
  • Comment (received 2004-03-18) -- define 'character' once and for all

    Regarding... C010 [S] When specifications use the term 'character' the specifications MUST define which meaning they intend.

    How is a choice of definitions of this term useful? Please let charmod export exactly one definition of the term "character". If you're going to speak of conformance of other specifications to this one, let conforming specifications use exactly that one definition.

    The definitions in section 3.7 look OK, to me; to wit: a character can be defined informally as a small logical unit of text. Text is then defined as sequences of characters. That occurrence of 'character' should be marked up specially, not the previous one ("The term character is used differently in a variety of contexts ...")

  • Discussed: see notes

  • Decision: Rejected. The definition for 'character' currently available in the document ("a character can be defined informally as a small logical unit of text") is too fuzzy to be directly useful in other specifications. Having a single, very precise, definition of 'character' is not really feasible, because different kinds of specifications may need different definitions. Also, in C067, we advise using more specific terms if available. The wide range of ways to look at the phenomenon of a 'character', and to define the term 'character', should become obvious to the reader after reading Section 3 of the Character Model.

  • Comment (received 2004-10-12) -- Dissatisfied

  • Discussed Decided to leave as dissatisfied.

LC045SASMartin Dürst-8Last call comment on Charmod (Fundamentals)
LC046SASDan Connolly-4.6appropriate mechanism exists... says who?
  • Comment (received 2004-03-18) -- appropriate mechanism exists... says who?

    Regarding... C042 [S] Specifications MUST NOT invent a new escaping mechanism if an appropriate one already exists.

    How is this to be tested/measured/observed? How is existence of an appropriate mechanism to be determined? I don't see how making that a conformance clause helps. At least change it to SHOULD NOT.

  • Discussed:

  • Discussed:

  • Decision: Accepted Changed "Specifications MUST NOT invent a new escaping mechanism if an appropriate one already exists." to "Specifications SHOULD NOT invent a new escaping mechanism if an appropriate one already exists."

    This can indeed only be observed by humans looking at a specification and comparing it with known pre-existing escaping mechanisms, and this will include some judgement. However, we think that it is better to have this conformance criterion to make such judgement explicit rather than to have spec writers come up with new mechanisms all the time.

  • Comment (received 2004-10-12) -- Satisfied

LC047SASDan Connolly-7IRI section needs too much testing to go in Fundamentals
LC048SPNDominique Hazaël-Massieux-OverallSupport for DanC's comment re conformance
LC049SANFrank Ellermann-4.6C049
  • Comment (received 2004-03-03) -- C049

    The character encoding of a document SHOULD be chosen so that it maximizes the opportunity to directly represent characters and minimizes the need to represent characters by markup means such as character escapes.

    Use windows-1252 and its code points 0x80, 0x84, 0x93, and 0x94 instead of Latin-1 and &euro; &bdquo; &ldquo; &rdquo;

    Are you sure? I like it. But other character encodings like PC-Multilingual-850+euro are too obscure for the WWW. There can be very good reasons to use US-ASCII and some character escapes instead of UTF-8, e.g. my favourite browser doesn't support Unicode, my favourite editor doesn't know UTF-8, etc.

  • Discussed:

  • Discussed:

  • Decision: Accepted Moved requirement C049 and following note from 4.6 to middle of 4.4.2. Changed "The character encoding of a document SHOULD be chosen so that it maximizes the opportunity to directly represent characters and minimizes the need to represent characters by markup means such as character escapes." to "The character encoding of a document SHOULD be chosen so that it maximizes the opportunity to directly represent characters (ie. minimizes the need to represent characters by markup means such as character escapes) while avoiding obscure encodings that are unlikely to be understood by recipients."
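
    The trade-off raised in this comment can be illustrated with a minimal sketch in Python. It is not part of the comment or of the Working Group's reply; it only assumes Python's built-in "cp1252" codec as a stand-in for windows-1252 and uses numeric character references as the escape form.

```python
# The four windows-1252 code points named in the comment.
raw = bytes([0x80, 0x84, 0x93, 0x94])

# Decoded as windows-1252 they are the euro sign and three quotation marks.
text = raw.decode("cp1252")
print([f"U+{ord(ch):04X}" for ch in text])

# In US-ASCII none of them can be represented directly, so each one
# has to become a character escape (here: a numeric character reference).
escaped = text.encode("ascii", errors="xmlcharrefreplace")
print(escaped)

# In UTF-8 all four are represented directly, with no escapes.
direct = text.encode("utf-8")
print(len(raw), len(escaped), len(direct))
```

    Both windows-1252 and UTF-8 represent the characters directly; whether recipients are likely to understand the chosen encoding is the second half of the revised C049 and cannot be shown in code.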

LC050EANPhilippe Le Hégaret-7IRI Reference should be normative
  • Comment (received 2004-03-26) -- IRI Reference should be normative

    See comment title.

  • Discussion:

  • Discussion: question is: does the CharMod text require it to be a normative reference?

  • Decision: Accepted We moved ID-IRI reference from non-normative to normative references. Updated document links to draft-duerst-iri-10.txt. Added following note: "[NOTE: This reference should be taken to point to the RFC once the IRI draft has progressed to that stage.]" Note that as a result of other comments, we have moved the section about IRIs to a separate document.

LC051SANChris LilleyTAG4.5Pi fonts and PUA
  • Comment (received 2004-03-29) -- Pi fonts and PUA

    TAG agreed to this comment at its 22 March 2004 TAG teleconference http://www.w3.org/2004/03/22-tag-summary.html

    The TAG believes that its comment C125 regarding the Private Use Area (PUA) on the previous last call has been substantially addressed.

    We note one additional issue in the new text. It discourages an existing use (encoding of pi or symbol fonts); on the one hand this is good because inline graphics should be used for graphics, and it says so

    C068 [S] Specifications SHOULD allow the inclusion of or reference to pictures and graphics where appropriate, to eliminate the need to (mis)use character-oriented mechanisms for pictures or graphics. C069 [C] Content SHOULD NOT misuse character technology for pictures or graphics.

    On the other hand, we worry that this might inadvertently encourage people to encode pi or symbol fonts on the ascii range, which is worse than using the PUA! For unencoded characters, or symbols, the PUA is appropriate. To guard against this possibility we suggest adding the following text - perhaps a new conformance requirement after C069 or an extension of C069:

    C0xx [I][C] Fonts for characters not yet in Unicode, or for graphical symbols, SHOULD use the PUA rather than overloading existing characters with unrelated glyphs.

  • Decision: Accepted.

  • Discussion:

  • Discussed

  • Decision: Accepted We have added some new text:

    >>>>>>>>

    C076 [C] Content MUST NOT use a code point for any purpose other than that defined by its character encoding.

    This prohibits the construction of fonts that misuse e.g. iso-8859-1 to represent different scripts, characters, or symbols than what is actually encoded in iso-8859-1.

    >>>>>>>>

    This is just after C073, which says that content on the Web SHOULD NOT use the PUA. By having C076 being a MUST and C073 a SHOULD, it is clear that if symbols not encoded in Unicode have to be represented, they have to go into the PUA rather than into some assigned or reserved area.

    On the other hand, we have removed C069 because it was too general and covered e.g. things like ASCII art, which is an issue of use of characters rather than encoding of characters.

LC052EANChris LilleyTAG3.3C004 ambiguous
  • Comment (received 2004-03-29) -- C004 ambiguous

    TAG agreed to this comment at its 22 March 2004 TAG teleconference http://www.w3.org/2004/03/22-tag-summary.html

    >> C004 [S] Specifications of protocols and APIs that involve

    >> selection of ranges SHOULD provide for discontiguous selections, at

    >> least to the extent necessary to support implementation of visual

    >> selection on screen on top of those protocols and APIs.

    TAG is pleased by the changes made to this section. We still feel that there is ambiguity there which would be removed by saying "discontiguous logical selections" in C004, which is the type of discontiguity needed for visual selection.

  • Discussion:

  • Decision: Accepted. Changed 'discontiguous selections' to 'discontiguous logical selections'.

LC053SANChris LilleyTAG7Please remove IRIs
LC054EPSKarl DubostQA3.2KD-001
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    C001 [S] [I] [C] Specifications, software and content MUST NOT assume that there is a one-to-one correspondence between characters and the sounds of a language.

    ===> How do you test that for each implementations [S][I][C]? What will be the three tests that you will be able to create to demonstrate the implementability of this during the CR period where you will seek for implementation? If you can't design a test for it, it means that your assertion is not testable, therefore not implementable. I think one of the problems comes from the "assume".

    Imagine a language where you have "a one-to-one correspondence between characters and the sounds of a language". If the software implements only this language because it's a specific use for only this language. It means that it's not conformant to C001, even if this software does the correct thing.

  • Discussion: note that this is linked to LC055, LC056 and LC068

  • Decision: Partially accepted For 3.2, C001; 3.3, C002; 3.4, C005; 3.6, C009: replaced "MUST NOT assume" with "MUST NOT require or depend on".

    We have changed the wording from: "C001 [S][I][C] Specifications, software and content MUST NOT >>assume that there is<< a one-to-one correspondence between characters and the sounds of a language." to "C001 [S][I][C] Specifications, software and content MUST NOT >>require or depend on<< a one-to-one correspondence between characters and the sounds of a language."

    This avoids the issue that specifications, implementation, and content don't really make 'assumptions'.

    As for conformance, we would like to first point out that all the conformance criteria in the Character Model are predicated on whether a given criterion actually applies to a given technology. So technology that does not deal with the auditory representation of language (i.e. most W3C specifications) is not affected by this criterion. Technology that is affected (e.g. VoiceXML and in particular SSML) can be checked.

    If SSML, for example, tried to do text-to-speech conversion by defining a format for a table that would only associate single phonemes with single characters, it would very clearly not conform to the character model. But as you can check at http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/#S3.1.9, SSML definitions of written to spoken correspondence using the <phoneme> element allow definitions on whole words or larger pieces of text, so it is conformant. With this example, I hope that we have shown that conformance of specifications can indeed be checked.

    To be even more concrete, one could easily collect a series of examples (starting with those mentioned in the spec, such as "thing"), where there is not a one-to-one correspondence between characters and phonemes, and check whether specs, implementations,... that deal with such correspondences can handle them.

    As for implementability, there are a lot of text-to-speech engines, and a lot of speech detection engines, that do not require or depend on a one-to-one correspondence, so it is very clear that this can be implemented.

    As for your point of "If the software implements only this language because it's a specific use for only this language", yes, such software would not conform to the character model. From the viewpoint of the character model, this would be on purpose; in the age of the World Wide Web, it is a bad idea to create software that can handle only one language, and it is a bad idea to create software that has language-related issues hard-coded when it can easily be made configurable.

  • Comment (received 2004-10-12) -- Satisfied

LC055EPSKarl DubostQA3.3KD-002
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    C002 [S] [I] [C] Specifications, software and content MUST NOT assume a one-to-one mapping between characters and units of displayed text.

    ===> Same comment than KD-001 How do you test that for each implementations [S][I][C]? What will be the three tests that you will be able to create to demonstrate the implementability of this during the CR period where you will seek for implementation? If you can't design a test for it, it means that your assertion is not testable, therefore not implementable. I think one of the problems comes from the "assume".

    Imagine a language where you have "a one-to-one mapping between characters and units of displayed text". If the software implements only this language because it's a specific use for only this language. It means that it's not conformant to C002, even if this software does the correct thing.

  • Discussion:

  • Discussion:

  • Decision: Partially accepted Our reply is basically the same as that for LC054.

    We replaced "MUST NOT assume" with "MUST NOT require or depend on". We note that this is testable with very simple examples, some of which can be found in the spec itself. Implementations dealing with only a single language may not conform to the character model, and that is by design; it's the goal of the character model to make sure that specs and software can deal with as many languages as possible.

  • Comment (received 2004-10-12) -- Satisfied
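
    One "very simple example" of the kind the reply mentions might look like the following sketch. It is an illustration added here, not a test from the spec, and it only counts code points with Python's standard unicodedata module; what actually counts as a unit of displayed text depends on the font and the rendering engine.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: two characters that are
# normally displayed as a single unit.
decomposed = "e\u0301"

# The canonically equivalent precomposed form is a single character.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))

# Software that requires one character per displayed unit would treat
# these two equivalent strings differently.
print(decomposed == composed)
```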

LC056EPSKarl DubostQA3.4KD-003
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    C005 [S] [I] Specifications and software MUST NOT assume that a single keystroke results in a single character, nor that a single character can be input with a single keystroke (even with modifiers), nor that keyboards are the same all over the world.

    ===> Same comment than KD-001 How do you test that for each implementations [S][I]? What will be the two tests that you will be able to create to demonstrate the implementability of this during the CR period where you will seek for implementation? If you can't design a test for it, it means that your assertion is not testable, therefore not implementable. I think one of the problems comes from the "assume".

    Imagine a language where you have "a single keystroke results in a single character". If the software implements only this language because it's a specific use for only this language. It means that it's not conformant to C005, even if this software does the correct thing.

    Could the following solve your problem?

    "C005 Specifications and software MUST authorize complex input methods where there is single keystroke doesn't result in a single character... ... "

  • Discussion:

  • Discussion:

  • Decision: Partially accepted Our reply is basically the same as that for LC054.

    We replaced "MUST NOT assume" with "MUST NOT require or depend on". We note that this is testable with very simple examples, some of which can be found in the spec itself. Implementations dealing with only a single language may not conform to the character model, and that is by design; it's the goal of the character model to make sure that specs and software can deal with as many languages as possible.

  • Comment (received 2004-10-12) -- Satisfied

LC057ERSKarl DubostQA3.5KD-004
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    C008 [S] [I] Specifications and implementations of sorting and searching algorithms SHOULD accommodate all characters in Unicode.

    ===> What's happening if you implement all western languages but not asian because the context of applications do not make it necessary. Do I still have to implement everything? If not how can I be conformant?

  • Discussion:

  • Decision: Rejected You write: ===> What's happening if you implement all western languages but not asian because the context of applications do not make it necessary. Do I still have to implement everything? If not how can I be conformant?

    As we have already explained in our responses to LC054-56, the goal of the character model is to cover as many languages/scripts/characters as possible. On the WWW, you never know what input you get. If an implementation blows up just because it is unable to do anything with Asian characters, that would be very bad. Please note that we do not require any particular sort order for any character; simply sorting 'unknown' characters by codepoint would be okay.

  • Comment (received 2004-10-12) -- Satisfied
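
    The fallback described in the reply, sorting by codepoint when no better order is known, can be sketched as follows. The function name and the sample words are made up for illustration; a real implementation would normally apply language-specific collation first and fall back to this only for characters it has no rules for.

```python
def codepoint_key(s: str) -> tuple:
    """Sort key that orders strings by the code points of their characters."""
    return tuple(ord(ch) for ch in s)

# Hypothetical input mixing Latin, Greek and Han characters.
words = ["zebra", "\u6771\u4eac", "apple", "\u03a9mega", "\u00e9clair"]

# No character makes the sort fail; every string gets some position.
for word in sorted(words, key=codepoint_key):
    print(word)
```

    The result is not a linguistically meaningful order, but it accommodates every character in Unicode, which is all that C008 asks for.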

LC058EPSKarl DubostQA3.6KD-005
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    C009 [S] [I] [C] Specifications, software and content MUST NOT assume a one-to-one relationship between characters and units of physical storage.

    ===> Same comment than KD-001. Make it testable.

  • Discussion:

  • Discussion:

  • Decision: Partially accepted Our reply is basically the same as that for LC054.

    We replaced "MUST NOT assume" with "MUST NOT require or depend on". We note that this is testable with very simple examples, some of which can be found in the spec itself. Implementations dealing with only a single language may not conform to the character model, and that is by design; it's the goal of the character model to make sure that specs and software can deal with as many languages as possible.

  • Comment (received 2004-10-12) -- Satisfied
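
    For C009, a simple check of this kind could count how many storage units a single character needs in different encodings. The sketch below is an illustration added here, not taken from the spec; the helper name storage_units is hypothetical.

```python
def storage_units(ch: str) -> tuple:
    """Return (UTF-8 code units, UTF-16 code units) needed for one character."""
    utf8_units = len(ch.encode("utf-8"))            # one unit is one byte
    utf16_units = len(ch.encode("utf-16-le")) // 2  # one unit is two bytes, no BOM
    return utf8_units, utf16_units

samples = {
    "LATIN SMALL LETTER A": "a",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "MUSICAL SYMBOL G CLEF": "\U0001d11e",
}

for name, ch in samples.items():
    print(name, storage_units(ch))
```

    Only the first character takes exactly one unit in both encodings, so software that depends on one unit per character would mishandle the other three.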

LC059EASKarl DubostQA3.7KD-006
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    C067 [S] Specifications SHOULD avoid the use of the term 'character' if a more specific term is available.

    ===> Not testable. avoid is like assume, there's a notion of intention, of vague choice. You could say:

    "Specifications SHOULD use specific terms, when it's available, instead of the general term 'character'."

  • Discussion:

  • Decision: Accepted We substituted the proposed wording.

  • Comment (received 2004-08-02) -- Satisfied

LC060EASKarl DubostQA4.4.1KD-007
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    C018 [S] When a unique character encoding is mandated, the character encoding MUST be UTF-8, UTF-16 or UTF-32. C019 [S] If a unique character encoding is mandated and compatibility with US-ASCII is desired, UTF-8 (see [RFC 3629]) is RECOMMENDED. In other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate. Possible reasons for choosing one of these include efficiency of internal processing and interoperability with other processes.

    ===> Please separate the part about APIs. Basically, jump a line ;) The clue for now is just visual which means, it's not anymore visible nor accessible without colors. It can lead to misunderstanding.

  • Discussion:

  • Discussion:

  • Decision: Accepted We split each conformance criterion out from surrounding text as a separate paragraph.

  • Comment (received 2004-08-02) -- Satisfied

LC061EASKarl DubostQA4.4.2KD-008
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    C027 [S] Specifications MAY define either UTF-8 or UTF-16 as a default encoding form (or both if they define suitable means of distinguishing them), but they MUST NOT use any other character encoding as a default.

    ===> Double assertions make difficult to understand and analyse what is the exact conformance clause. Try to wrap up in one or separate it.

  • Discussion:

  • Decision: Accepted Replaced "Specifications MAY define either UTF-8 or UTF-16 as a default encoding form (or both if they define suitable means of distinguishing them), but they MUST NOT use any other character encoding as a default." with "Specifications that mandate a default encoding MUST define either UTF-8 or UTF-16 as the default, or both if they define suitable means of distinguishing them."

  • Comment (received 2004-08-02) -- Satisfied

LC062EASKarl DubostQA4.4.2KD-009
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    C032 [I] Receiving software MAY recognize as many character encodings and as many charset names and aliases for them as appropriate.

    ===> Jump a line. AND it's not testable. That's a good recommendation but you can't really test it. It encourages people to support as much as possible but it's not a requirement or you have to define clearly and without ambiguities appropriate.

  • Discussion:

  • Decision: Accepted We relegated this conformance criterion to simple descriptive text.

  • Comment (received 2004-08-02) -- Satisfied

LC063EASKarl DubostQA4.4.2KD-010
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    C033 [I] Software MUST completely implement the mechanisms for character encoding identification and SHOULD implement them in such a way that they are easy to use (for instance in HTTP servers).

    ===> same comment than KD-008. Double assertions.

  • Discussion:

  • Decision: Accepted The second part of this requirement has already been removed as a result of earlier discussions, so this comment is now moot.

  • Comment (received 2004-08-02) -- Satisfied

LC064SASKarl DubostQA4.5KD-011
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    C069 [C] Content SHOULD NOT misuse character technology for pictures or graphics.

    ===> I perfectly understand the rationale behind this comment, but it might lead to a strictness which for example might block someone who will use character technology for an artistic project. Though not that it's fondamental anywhere. But I'm not sure, it achieves something. Could you give more examples with this requirement, why it's bad, how does it lead to problem, etc?

    For example, does that mean you forbid all possibilities of ascii arts.... or even smileys :))))

    For example in your own specification you are using [S], then the characters "[" and "]". Is it a valid usage of this character in american english language or is a graphical abuse? to make it like a button. Where elsewhere you are using it for marking a reference to a document. Do you mean in fact:

    <span class="requirement-type"><img src="specificationbutton" alt="Specification"></span>

    or

    <abbr class="requirement-type" title="Specification">S.</abbr>

  • Discussion:

  • Discussed

  • Decision: Accepted We have removed C069, which was too general, because we don't want to discuss the use of characters so much as the encoding of characters. We have added new text, just after C073

    >>>>>>>>

    C076 [C] Content MUST NOT use a code point for any purpose other than that defined by its character encoding.

    This prohibits the construction of fonts that misuse e.g. iso-8859-1 to represent different scripts, characters, or symbols than what is actually encoded in iso-8859-1.

    >>>>>>>>

    in order to not lose the main issue for which C069 was originally introduced.

  • Comment (received 2004-10-12) -- Satisfied

LC065ERSKarl DubostQA6.1KD-012
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    C012 [S] The 'character string' definition of a string is generally the most useful and SHOULD be used by most specifications, following the examples of Production [2] of XML 1.0 [XML 1.0], the SGML declaration of HTML 4.0 [HTML 4.01], and the character model of RFC 2070 [RFC 2070].

    ===> you may want to rephrase that sentence as: "The 'character string' definition of a string SHOULD be used by most specifications..."

  • Discussion:

  • Decision: Rejected We don't see how this proposed change improves the text.

  • Comment (received 2004-08-02) -- Satisfied

LC066EASKarl DubostQA8KD-013
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    C062 [S] Since specifications in general need both a definition for their characters and the semantics associated with these characters, specifications SHOULD include a reference to the Unicode Standard, whether or not they include a reference to ISO/IEC 10646. By providing a reference to the Unicode Standard implementers can benefit from the wealth of information provided in the standard and on the Unicode Consortium Web site.

    ===> Jump a line

  • Discussion:

  • Decision: Accepted This is already covered by LC060.

  • Comment (received 2004-08-02) -- Satisfied

LC067ERSKarl DubostQA8KD-014
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    C064 [S] All generic references to the Unicode Standard [Unicode] MUST refer to the latest version of the Unicode Standard available at the date of publication of the containing specification.

    ===> Will it block some republication. Imagine you republished a specification for erratas and fixing typos. But you are referring to an old version of Unicode. Do you have to modify the specification to make it conformant to Charmod? Which means that it can lead to a complete remodeling of a spec where you have things which could be strongly dependant on that references. (Just trying here to get the rabbits out of the bush)

  • Discussion:

  • Decision: Rejected In your comment, you mention the case that a spec depends on a particular version of Unicode. In this case, it is not a generic reference, but a specific reference. The difference is given in http://www.w3.org/TR/charmod/#C063. If a spec follows this, then it will use a generic reference to indicate that future codepoints allocated can be used, and it will use a specific reference if it has to reference a particular version of e.g. normalization, character properties, or so. Then if that spec is updated, the generic reference can be updated to the latest version of Unicode without problems, but the specific reference is not changed, unless there is an explicit decision that the newer version e.g. of normalization or character properties should be used.

  • Comment (received 2004-10-12) -- Satisfied

LC068EPSKarl DubostQA-KD-015
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    "MUST NOT assume" is a bad terminology. You are often using this term to explain to software developers and specifications writers that if they are creating a *generic international* application, they have to be careful. The problem is that it makes it NOT testable at all. You have to find a way to turn your requirements that will make them testable. A software can sometimes be a piece of code which is a Library that will implement perfectly the support for ONE language, without respecting what you are saying in this document.

    You may want to precise also at the begining of your document, that this specification is made for people implementing and developing things for a multilingual context and use. It will avoid to have to precise at the start of each sentence. "When you implement a international [S][I][C], blabla MUST..." I precise that which seems obvious but which is in fact not clear in your introduction. OR If I'm a developer of an application, a library which deals with only one language:

    - Should I care about this spec?

    - if yes, can I be conformant?

    exemple: Do I have to care about chinese input method if I'm creating a spell checker library for an english scrabble game? How do I answer to C005?)

    You did it for example in C006 by adding "for the relevant language and/or application."

  • Discussion: note that this is linked to LC054, LC055 and LC056

  • Discussion:

  • Decision: Partially accepted For 3.2, C001; 3.3, C002; 3.4, C005; 3.6, C009: replaced "MUST NOT assume" with "MUST NOT require or depend on".

  • Comment (received 2004-08-02) -- Satisfied

LC069ENNKarl DubostQA-KD-016
  • Comment (received 2004-03-29) -- [QA Review] CharMod for the Web 1.0: Fundamentals WD 25 Feb 2004

    There's a need for a glossary where you will define the terms. Maybe you could expand the terminology section and use the specific markup for it. A benefit of that is that the W3C glossary will be enriched and make it easier to have a controlled vocabulary of terms used at W3C.

  • Discussion:

  • Decision: Noted and deferred We agree that having a glossary would be a good idea, but given our current resources, we have had to give priority to moving the spec on. We may be able to come back to create a glossary at a later stage, or it may be possible to extract (at least some of the) terms from the document, because to a large extent, terms and their usage is already marked up.

LC070EPSBjörn Höhrmann-1.2Clarify "legacy encoding"
  • Comment (received 2004-04-08) -- Clarify "legacy encoding"

    Section 1.2, Background: [...] It should be noted that such aspects also exist in legacy encodings (where legacy encoding is taken to mean any character encoding not based on Unicode), and in many cases have been inherited by Unicode in one way or another from such legacy encodings. [...]

    It is not clear to me what it means for an encoding to be based on Unicode. Is US-ASCII a legacy encoding (there is a complete mapping to Unicode hence it appears to be based on Unicode)? Is UTF-7 a legacy encoding (a UTF clearly is based on Unicode, isn't it)? Or CESU-8? I would suggest to define e.g. "Unicode Encoding" (or the existing "Unicode encoding form") to mean UTF-8/16/32 and "legacy encoding" to mean all other encodings.

  • Discussed

  • Discussed

  • Decision: Partially accepted We removed 'legacy encoding' as a formally defined term from CharMod Fundamentals. We will revisit this for CharMod Normalization.

  • Comment (received 2004-10-27) -- Notification response

  • Discussed

  • The one remaining use of the word 'legacy' was in the introductory text, and was not used in a technical way, but we have removed that.

  • Comment (received 2004-10-29) -- Satisfied

LC071SRDBjörn Höhrmann-4.4APIs vs. physical string representations
  • Comment (received 2004-04-08) -- APIs vs. physical string representations

    4.4, Choice and Identification of Character Encodings: [...] C016 [S] When designing a new protocol, format or API, specifications SHOULD mandate a unique character encoding. [...]

    This would only be a good thing if everyone adheres to this approach. Consider the DOM, it requires UTF-16 yet many DOM implementations are non-conforming just because it made more sense for them to use UTF-8, for example if the programming language uses UTF-8 for strings and the DOM implementation is based on this string type. In fact, if the language provides a Unicode string type, the internal storage should not matter for an API specification. I am fine with this for protocols and content.

  • Discussion:

  • Decision: Rejected You raise a valid point mentioning that although the DOM specifies to use UTF-16, not all implementations follow this. When the DOM was created, we were apparently more worried about inter-language compatibility than necessary, but not only that, we were also worried about intra-language compatibility. Today, most major languages have their model for how to deal with Unicode; when DOM1 was created, that wasn't the case. In particular, people were pointing to Corba, which does things like character encoding negotiation, which would not at all have been suited for DOM.

    Going back to the text of C016: "When designing a new protocol, format or API, specifications SHOULD require a unique character encoding.", we would like to point out that it doesn't require APIs across languages to use the same encoding. We would also like to point out that e.g. the DOM1 spec is very careful to avoid using the word API for DOM itself (see e.g. http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/). In addition, we would like to point out that C017, "When basing a protocol, format, or API on a protocol, format, or API that already has rules for character encoding, specifications SHOULD use rather than change these rules." would provide strong justification for a spec like DOM that would want to leave the question of character encoding to each language binding.

    So therefore, we don't think that the current wording causes any problems.

  • Comment (received 2004-11-01) -- Dissatisfied

LC072ENaOBjörn Höhrmann-OverallEditorial suggestions
  • Comment (received 2004-04-08) -- Editorial suggestions

    I think the use of abbreviations like CCS, CEF, CES, etc. reduces the readability of the document. While it might be convenient to use abbreviated forms in discussions, they make the document more difficult to read, especially because these look too similar. You cannot expect that someone not familiar with the issues involved could easily understand a paragraph like

    [...] A CES is a mapping of the code units of a CEF into well-defined sequences of bytes, taking into account the necessary specification of byte-order for multi-byte base datatypes and including in some cases switching schemes between the code units of multiple CESes (an example is ISO 2022). A CES, together with the CCSes ... [...]

    It is already quite difficult to differentiate between terms like "code point" and "code unit". Please spell these out more often. It might also be helpful to include (simplified) definitions of these terms for each occurrence, like

    <span title='a mapping from a repertoire of characters to a set of non-negative integers'>Coded Character Set</span> and/or make occurrences links to the definitions, for example XML Schema uses constructs like <a href="#dt-value-space" class="termref"><span class="arrow" >·</span>value space<span class="arrow">·</span></a>

    It would also be helpful to have a summary glossary of the terms used in the document, this would help to create the title attributes suggested above and simplifies lookup for these terms.

    If there is any chance you could renumber the Cxxx codes to bring them back in order, please do. While it might be convenient not to break links and references to the document, having many of them out of order is quite confusing. The document split already breaks references, for example css3-selectors CR references the normalization part which is no longer included in the latest version of the document it references, hence it appears there is only a minimal additional cost here.

    Please make sure to publish checklists containing the conformance requirements (complete and by product) along with the CR. This would be a great help for your audience (specification and implementation reviews in particular).

    Please make Cxxx identifiers links to that section, i.e., turn e.g. <a id="C013" name="C013"> into <a id="C013" name="C013" href="#C013"> so I can copy and paste pointers to specific guidelines more easily. That these are links could be hidden through style sheets, if you think this reduces readability.

  • Discussion: see notes

    This comment has been split into the following comments: LC086, LC087, LC088, LC089

LC073SASBjörn Höhrmann-4.4.1Strike C019
  • Comment (received 2004-04-08) -- Strike C019

    Section 4.4.1, Mandating a unique character encoding

    [...] C018 [S] When a unique character encoding is mandated, the character encoding MUST be UTF-8, UTF-16 or UTF-32. C019 [S] If a unique character encoding is mandated and compatibility with US-ASCII is desired, UTF-8 (see [RFC 3629]) is RECOMMENDED. [...]

    I think C019 should be removed, C018 requires the use of UTF-8, UTF-16 or UTF-32 and among them only UTF-8 is US-ASCII compatible in any meaningful interpretation of "compatible". If you consider it worth mentioning that US-ASCII and UTF-8 are compatible, put that into the normal text.

  • Discussion:

  • Discussed (First part of minutes)

  • Discussion:

  • Decision: Accepted We have replaced C019 with the following sentence: "US-ASCII is upwards-compatible with UTF-8 (an US-ASCII string is also an UTF-8 string, see [RFC 3629]), and UTF-8 is therefore appropriate if compatibility with US-ASCII is desired."

  • Comment (received 2004-11-01) -- Satisfied
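The compatibility claim in the replacement sentence can be checked directly. The following is a minimal sketch, not part of the specification; the sample string is an arbitrary assumption chosen for illustration.

```python
# A US-ASCII string is also a UTF-8 string: the bytes are identical
# and decode to the same characters under either encoding.
sample = "charset=utf-8"  # arbitrary ASCII-only sample (assumption)

ascii_bytes = sample.encode("ascii")
utf8_bytes = sample.encode("utf-8")
same_as_ascii = utf8_bytes == ascii_bytes
round_trips = ascii_bytes.decode("utf-8") == sample

# UTF-16 is not compatible in this sense: each of these characters
# takes one 16-bit code unit, i.e. two bytes.
utf16_bytes = sample.encode("utf-16-be")
utf16_differs = utf16_bytes != ascii_bytes
```

Of the three encodings permitted by C018, only UTF-8 leaves ASCII-only data unchanged at the byte level, which is the point the commenter made.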

LC074EASBjörn Höhrmann-4.6Improved text for C047
  • Comment (received 2004-04-08) -- Improved text for C047

    Section 4.6, Character Escaping

    [...] C047 [I] [C] Escapes SHOULD be avoided when the characters to be expressed are representable in the character encoding of the document. [...]

    I think <http://www.w3.org/TR/xslt-xquery-serialization/> provides a better rule in this regard, "Entity references and character references should be used only where the character is not present in the selected encoding, or where the visual representation of the character is unclear (as with &nbsp;, for example)", the wording should be adopted.

    These rules also seem to suggest that I should use &#x0026; in place of &amp; in XML documents, it should be clarified that this is not meant.

  • Discussed

  • Discussed

  • Decision: Accepted We now say: "Escapes should only be used when the characters to be expressed are not directly representable in the format or the encoding of the document, or when the visual representation of the character is unclear.", and we added a note referring to the &nbsp; as an example.

  • Comment (received 2004-10-27) -- Satisfied
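As an illustration of the revised wording, the sketch below escapes only the characters that the chosen encoding cannot represent. It relies on Python's built-in `xmlcharrefreplace` error handler; the function name `escape_unrepresentable` and the sample text are assumptions made for this example. It does not handle markup-significant characters such as `&` or `<`, which still need escapes like `&amp;` for syntactic reasons.

```python
def escape_unrepresentable(text: str, encoding: str) -> bytes:
    """Encode text, emitting numeric character references only for
    characters that the target encoding cannot represent."""
    return text.encode(encoding, errors="xmlcharrefreplace")


sample = "caf\u00e9 \u2601"  # 'café' followed by U+2601 CLOUD

# ISO-8859-1 can represent U+00E9 directly but not U+2601,
# so only the latter becomes a character reference.
latin1 = escape_unrepresentable(sample, "iso-8859-1")

# UTF-8 can represent both characters, so no escapes are produced.
utf8 = escape_unrepresentable(sample, "utf-8")
```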

LC075SPDBjörn Höhrmann-6.2Arguments vs. return types
  • Comment (received 2004-04-08) -- Arguments vs. return types

    Section 6.1, String concepts:

    [...] C056 [S] Specifications of APIs SHOULD NOT specify single character or single encoding-unit arguments. EXAMPLE: uppercase('ß') cannot return the proper result (the two-character string 'SS') if the return type of the uppercase function is defined to be a single character. [...]

    The return type of a function is not an argument, hence the example does not match the actual requirement. Also note that "encoding-unit" is not defined in the document; the closest term is probably "units of encoding", if this is the intended meaning, the definition of "units of encoding" should be properly marked up as definition and "encoding-unit" be replaced by "units of encoding".

    I do not like this rule, uppercase('ß') is already a good example for an exception, it takes a single character as argument. But you probably want to state this for return types not arguments anyway. I would also prefer to state that this is not recommended over SHOULD NOT (since you might want uppercase transformations that do not change string length, for example).

  • Discussion:

  • Discussion:

  • Decision: Partially Accepted Changed C056 from "Specifications of APIs SHOULD NOT specify single character or single encoding-unit arguments." to "Specifications of APIs SHOULD NOT specify single characters or single units of encoding as argument or return types."

    We agree that return types should also be mentioned, and that 'encoding-unit' has to be replaced by 'units of encoding'.

    However, we disagree with your counterexample. The fact that an 'uppercase' function can take a single character, even an sz (ß), as an argument in some cases doesn't prove that there are no cases where it becomes necessary to hand over more than one character at a time to a function for proper uppercasing. Therefore, in general, both the arguments and the return type should be strings.

  • Comment (received 2004-10-27) -- Dissatisfied

  • Discussed

  • Our response (sent 2004-10-28) -- Re: Your comments on Character Model Fundamentals [LC070, LC074, LC075, LC079, LC080, LC081, LC082, LC083, LC084, LC085, LC086, LC087, LC088, LC089]

    See also following mails with same subject.

  • Discussed We decided to stay with his objection. The text is the way it is for a reason and it is not a MUST in any case.
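The example quoted from C056 can be reproduced with any string-based case-mapping function. The sketch below uses Python's `str.upper`, which takes and returns strings; it is offered only as an illustration of why a single-character return type is too narrow.

```python
# Uppercasing one character can yield a two-character string.
upper_sharp_s = "\u00df".upper()  # U+00DF LATIN SMALL LETTER SHARP S

# The same mapping applied inside a longer string changes its length.
word = "stra\u00dfe"
upper_word = word.upper()

length_grew = len(upper_word) > len(word)
```

A function declared to return a single character could not represent `upper_sharp_s`, which is the situation C056 is intended to avoid.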

LC076SPDBjörn Höhrmann-4.5Clarify "character technology misuse"
  • Comment (received 2004-04-08) -- Clarify "character technology misuse"

    Section 4.5, Private use code points:

    [...] C069 [C] Content SHOULD NOT misuse character technology for pictures or graphics. [...]

    C068 and C069 are misplaced in section 4.5, the section is about private use code points, not about proper use of characters. There should be a clarification what the specification considers misuse. Unicode contains a number of characters that could be used as a replacement for many common graphics or ASCII art on the web. Is it misuse to use e.g. <span style = 'font-size: xx-large'>&#x2601;</span> in a weather forecast? Or should I use <img src = 'cloud' ...>

    The former is in fact troublesome, maybe it should be written as e.g. <abbr title='cloudy'>&#x2601;</abbr> but then I am not sure how a voice browser would render U+2601 anyway. Maybe a number of Unicode characters are just a more advanced form of ASCII art and should indeed be replaced by graphics with proper (often empty) alternate text... There should be at least some discussion with illustrative examples to clarify what this means and what issues are involved, this requirement would otherwise have no real effect in practice.

  • Discussed

  • Decision: Partially accepted You are correct that C068 and C069, strictly speaking, do not belong in the PUA section, but they are in that section because they have a very strong connection to the other things in that section.

    You are correct to raise the question about ASCII art and Unicode smilies. Given your and others' comments, we have realized that C069 is too general. The various factors affecting the use of ASCII art and Unicode smilies are not questions of character encoding. Insofar as they are accessibility issues, they should be and are being addressed in the relevant specifications. We have therefore removed C069 and instead added new text, just after C073:

    >>>>>>>>

    C076 [C] Content MUST NOT use a code point for any purpose other than that defined by its character encoding.

    This prohibits the construction of fonts that misuse e.g. iso-8859-1 to represent different scripts, characters, or symbols than what is actually encoded in iso-8859-1.

    >>>>>>>>

    in order not to lose the main issue for which C069 was originally introduced.

  • Comment (received 2004-10-27) -- Dissatisfied

  • Discussed

  • Our response (sent 2004-10-28) -- RE: Your comments on Character Model Fundamentals [LC076]

    See also following mails with same subject.

  • Discussed Keep as dissatisfied. See notes in minutes.

LC077ERDBjörn Höhrmann-1.3Use uppercase hhhh
  • Comment (received 2004-04-08) -- Use uppercase hhhh

    Section 1.3, Terminology and Notation:

    [...] Unicode code points are denoted as U+hhhh, where "hhhh" is a sequence of at least four, and at most six hexadecimal digits. [...]

    please use HHHH as you use uppercase hexadecimal digits in the document.

  • Discussion:

  • Decision: Rejected We have rejected this request because we feel that the uppercase string U+HHHH is inferior in appearance compared to the string U+hhhh and that the latter is more common when giving an example of Unicode Scalar Values. In particular, the Unicode standard, v4.0, on page xxxiv ("Notational Conventions") introduces the USV notation with lowercase ('x' and 'y' in this case).

  • Comment (received 2004-11-01) -- Dissatisfied

  • Discussed Leave as dissatisfied.
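For reference, the notation quoted from section 1.3 (at least four and at most six hexadecimal digits) can be produced as follows. This is a minimal sketch; the helper name `format_code_point` is hypothetical, and the digits are written in uppercase, as the commenter observes the document does.

```python
def format_code_point(cp: int) -> str:
    """Format a Unicode code point in U+hhhh notation, padding to at
    least four hexadecimal digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return f"U+{cp:04X}"


short = format_code_point(0xDF)        # padded to four digits
cloud = format_code_point(0x2601)      # exactly four digits
g_clef = format_code_point(0x1D11E)    # five digits, no padding needed
```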

LC078SRDBjörn Höhrmann-2Specs must require specs to conform
  • Comment (received 2004-04-08) -- Specs must require specs to conform

    The definition of conformance for specifications lacks an item like 5. if applicable, requires specifications conforming to the specification to conform to this document

  • Discussion:

  • Decision: Rejected We think that it is a somewhat rare edge case that doesn't warrant additional complication in the conformance section. In the general case, every specification should try to conform to the character model, whether it also conforms to some other specification or not. In many cases, conformance to the character model also will come naturally for a derived spec.

  • Comment (received 2004-11-01) -- Dissatisfied

  • Discussed Leave as dissatisfied.

LC079SRDBjörn Höhrmann-4.4.2Using "charset" should be prohibited
LC080SPNBjörn Höhrmann-6.1A string is a sequence of characters
  • Comment (received 2004-04-08) -- A string is a sequence of characters

    I think section 6.1 "String concepts" is flawed. A "string" is a sequence of characters, to have more notions of a string is confusing and does not help to understand the issues involved. The actual wording is confusing too, for example

    [...] Byte string: A string viewed as a sequence of bytes representing characters in a particular character encoding. This corresponds to a CES. [...]

    This suggests that a byte string is a character encoding scheme. In fact, I do not quite understand the difference between a code unit string and a byte string, a byte string appears to be an instance of a code unit string.

  • Discussion:

  • Discussion:

  • Discussion:

  • Decision: Partially accepted We revised the explanations of byte and other strings to clarify their utility.

    We added this whole section (the different kinds of string) in response to a comment on a previous version of Charmod. On the issue of whether distinguishing the different types of strings is a good idea: as we indicate in the first paragraph of section 6.1, these are existing notions. We feel it is important to formalize their definitions so we can label and describe appropriate and inappropriate practices. We added the example to the section on byte strings, to emphasize a bad practice.
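The difference between the string views discussed here shows up as soon as the same text is measured in each of them. The sketch below is an illustration only; the sample text is an assumption, and Python's `len` on a `str` counts code points, which is used here as a stand-in for the character-level view.

```python
# 'a', U+00DF (sharp s) and U+1D11E (musical G clef, outside the BMP)
text = "a\u00df\U0001d11e"

# Character view: Python strings are sequences of code points.
code_points = len(text)

# Code unit view: UTF-16 uses 16-bit code units; U+1D11E needs two.
utf16_code_units = len(text.encode("utf-16-le")) // 2

# Byte view: UTF-8 uses 1, 2 and 4 bytes for these three characters.
utf8_bytes = len(text.encode("utf-8"))
```

The three lengths differ for the same text, so an index or length is only meaningful once it is clear which view of the string it refers to.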

LC081EASBjörn Höhrmann-4.4.2C036 and C033 are duplicates and too obvious
  • Comment (received 2004-04-08) -- C036 and C033 are duplicates and too obvious

    C036 and C033 are duplicates, they should be merged or removed. They stress only that conforming software must be conforming which is already sufficiently obvious.

  • Discussion:

  • Decision: Accepted We moved C033 to where C036 was. We deleted C036 and added "and conflict resolution" to the end of C033.

  • Comment (received 2004-10-27) -- Satisfied

LC082EASBjörn Höhrmann-OverallAvoid inline conformance criteria
  • Comment (received 2004-04-13) -- Avoid inline conformance criteria

    For example in section 4.6, Character Escaping, conformance criteria are inside a list, one list item for each item, while in e.g. section 4.4.2, Character encoding identification, these stick all together in one overly long paragraph. This makes it difficult to scan the text (see http://www.useit.com/alertbox/whyscanning.html), please don't combine more than two (or one?) conformance criteria in one paragraph and use lists instead.

  • Discussion: See also LC060.

  • Decision: Accepted We have separated each conformance criterion into a separate paragraph, or in one case, a paragraph followed by a list.

  • Comment (received 2004-10-27) -- Satisfied

LC083EADBjörn Höhrmann-2Define "mandate"
  • Comment (received 2004-04-08) -- Define "mandate"

    Charmod uses the term "mandate" quite a number of times, yet it does not define what it means for a specification to "mandate" something. The RFC 2119 terms "MUST" and "SHOULD" both "mandate" something in some way, but they have a rather different meaning. If "mandate" consistently means "MUST" throughout the document, it should be defined that way, if it does not, you should use different terms to clarify this. It might be a good idea to rephrase "mandate" statements to e.g. "MUST REQUIRE", "SHOULD RECOMMEND", etc.

  • Discussion:

  • Decision: Accepted We have replaced 'mandate' with 'require' throughout the document. 'Require' is well defined in RFC 2119. Please note that we are not using upper-case in this case, because we are using 'require' descriptively, rather than normatively, in our spec.

  • Comment (received 2004-10-27) -- Dissatisfied

  • Discussed

  • Our response (sent 2004-10-28) -- Re: Your comments on Character Model Fundamentals [LC070, LC074, LC075, LC079, LC080, LC081, LC082, LC083, LC084, LC085, LC086, LC087, LC088, LC089]

    See also following mails with same subject.

LC084NaNNBjörn Höhrmann-4.4.2Clarify C034 in case of heuristics
  • Comment (received 2004-04-13) -- Clarify C034 in case of heuristics

    Section 4.4.2, Character encoding identification

    [...] C034 [C] Content MUST make use of available facilities for character encoding identification by always indicating character encoding; where the facilities offered for character encoding identification include defaults (e.g. in XML 1.0 [XML 1.0]), relying on such defaults is sufficient to satisfy this identification requirement. [...]

    This needs some clarification. Is this a requirement because otherwise the implementation does not know the encoding of the content? What if the specification requires heuristics, would content still be required to include such information? For example, would a CSS 2.1 style sheet be required to have either a charset parameter or the @charset rule (or maybe a BOM) in order to conform to C034? CSS 2.1 has a default but it applies only if the style sheet is loaded without a referring resource (editors or validators might do this, browsers typically not [1]), so it seems that in most cases style sheets would be required to have charset/@charset which would be most reasonable but I think there is not necessarily consensus to this effect in the CSS WG.

    [1] which raises an interesting question, would a style sheet considered by a "View Style Sheet Source" function in a browser be considered to have no referring document and thus show different content than what was applied to the document? ...

  • Discussion:

  • Decision: Noted We have decided to classify this comment as 'noted', which means that we think it raises a valid point, but does not merit changes to the current specification. With the example of XML, we have tried to make clear that rules that unambiguously lead to a determination of the character encoding to be used for decoding the document are not considered heuristics.

    Whether it is a good idea to make the used character encoding depend on the way the document is loaded is a different issue, not addressed by C034, but such cases already exist (e.g. loading a document from a file system vs. serving it over the Web including meta information in HTTP headers). The case you mention, loading from a link in an existing document vs. independent loading, is just an extension of the above case.

  • Comment (received 2004-10-27) -- Re: Your comments on Character Model Fundamentals [LC070, LC074, LC075, LC079, LC080, LC081, LC082, LC083, LC084, LC085, LC086, LC087, LC088, LC089]

    See also following mails with same subject.

  • Discussed
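One of the in-band identification mechanisms the commenter mentions is a byte order mark. The sketch below shows how a BOM check might look. It is not the CSS 2.1 or XML detection algorithm; the function name `encoding_from_bom` and the returned labels are assumptions made for illustration.

```python
import codecs

# UTF-32 BOMs are listed first because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def encoding_from_bom(data: bytes):
    """Return an encoding label if data starts with a known BOM,
    otherwise None (the caller must fall back to other rules)."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None
```

A result of `None` means only that no BOM was found; the content may still identify its encoding by other means, such as an `@charset` rule or a protocol-level charset parameter.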

LC085SRNBjörn Höhrmann-2Conformance vs. non-conformance
  • Comment (received 2004-04-13) -- Conformance vs. non-conformance

    May a specification that does not conform to Charmod require software (content) to conform to Charmod? If yes, would such software still have to conform to all Charmod conformance criteria?

    For example, it would be quite difficult to conform to C033 if the specification in question violates C028; I've got this problem as an implementer since CSS 2.1 violates C028 and I basically have no idea how to reasonably conform to C033; I can live with non-conformance with CSS 2.1 but additionally non-conformance with Charmod worries me.

    If CSS 3 continues to violate C028, would it nevertheless be possible to require CSS 3 implementations to conform to Charmod? This would not make much sense to me.

  • Discussion:

  • Decision: Rejected We have decided to reject this comment because, as you may be able to deduce from our response to your issue LC084, we do not think that CSS 2.1 violates C028. The specific example you give is therefore not applicable. Even if we accepted your point re. C028, we would like to note that it would still be possible to produce CSS that conformed to the character model, for example by always using the @charset rule.

    The question of what should be done with some implementations or content to try to conform to the character model even if the specification they use doesn't conform to the character model doesn't have an easy answer in general (as shown above, sometimes this may be easy; at other times, it may not be easy). Making any general statements about such cases therefore doesn't look like it will help at all.

LC086EASBjörn Höhrmann-OverallEditorial suggestions
  • Comment (received 2004-04-08) -- Editorial suggestions

    I think the use of abbreviations like CCS, CEF, CES, etc. reduces the readability of the document. While it might be convenient to use abbreviated forms in discussions, they make the document more difficult to read, especially because these look too similar. You cannot expect that someone not familiar with the issues involved could easily understand a paragraph like

    [...] A CES is a mapping of the code units of a CEF into well-defined sequences of bytes, taking into account the necessary specification of byte-order for multi-byte base datatypes and including in some cases switching schemes between the code units of multiple CESes (an example is ISO 2022). A CES, together with the CCSes ... [...]

    It is already quite difficult to differentiate between terms like "code point" and "code unit". Please spell these out more often. It might also be helpful to include (simplified) definitions of these terms for each occurrence, like

    <span title='a mapping from a repertoire of characters to a set of non-negative integers'>Coded Character Set</span> and/or make occurrences links to the definitions, for example XML Schema uses constructs like <a href="#dt-value-space" class="termref"><span class="arrow" >·</span>value space<span class="arrow">·</span></a>

    It would also be helpful to have a summary glossary of the terms used in the document, this would help to create the title attributes suggested above and simplifies lookup for these terms.

  • Discussion: see LC072 notes

  • Decision: Accepted We reworked many abbreviations.

  • Comment (received 2004-10-27) -- Satisfied

LC087ERSBjörn Höhrmann-OverallEditorial suggestions
  • Comment (received 2004-04-08) -- Editorial suggestions

    If there is any chance you could renumber the Cxxx codes to bring them back in order, please do. While it might be convenient not to break links and references to the document, having many of them out of order is quite confusing. The document split already breaks references, for example css3-selectors CR references the normalization part which is no longer included in the latest version of the document it references, hence it appears there is only a minimal additional cost here.

  • Discussion: see LC072 notes

  • Discussed

  • Decision: Rejected There are links to conformance criteria from existing mailnotes and the like, and these links depend on the numbers used. For this reason, though we would also prefer to maintain a sequential order, we do not want to renumber the criteria.

  • Comment (received 2004-10-27) -- Satisfied

LC088EPSBjörn Höhrmann-OverallEditorial suggestions
  • Comment (received 2004-04-08) -- Editorial suggestions

    Please make sure to publish checklists containing the conformance requirements (complete and by product) along with the CR. This would be a great help for your audience (specification and implementation reviews in particular).

  • Discussion: see LC072 notes

  • Discussed

  • Decision: Partially accepted We have produced a simple list of the conformance requirements in Appendix D. We may make this more sophisticated in future versions of the document.

  • Comment (received 2004-10-27) -- Satisfied

LC089EASBjörn Höhrmann-OverallEditorial suggestions
  • Comment (received 2004-04-08) -- Editorial suggestions

    Please make Cxxx identifiers links to that section, i.e., turn e.g. <a id="C013" name="C013"> into <a id="C013" name="C013" href="#C013"> so I can copy and paste pointers to specific guidelines more easily. That these are links could be hidden through style sheets, if you think this reduces readability.

  • Discussion: see LC072 notes

  • Decision: Accepted If you click on the number for a conformance criterion (e.g. "C067") you will link to that criterion.

  • Comment (received 2004-10-27) -- Satisfied

Key

The possible values of Impact are:

The possible values of Decision are:

The possible values of Status are:

Colours:


Addison Phillips, WG chair
Martin J. Dürst, Core Task Force Chair
Richard Ishida, W3C staff contact
last revised $Date: 2004/11/11 18:36:03 $ by $Author: rishida $
