These are comments on behalf of the I18N WG.
We have not yet reviewed in detail the following sections:
Comment 5 (5.5 Attribute types)
This comment has implications for XHTML Modularization spec: The term 'charset' does not really mean 'encoding'. Please change this term to 'encoding'.
Note that 'charset' may be more coherent with the HTTP usage, but this usage is typically only known about by geeks. Continuing to equate 'charset' with 'encoding' causes confusion for average users, and causes problems. Using the term 'encoding' should not confuse geeks, and is coherent with the use of 'encoding' in the XML declaration.
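The distinction can be illustrated with a short Python sketch (Python is used here purely for illustration and is not part of either spec). One character from a single character set, Unicode, produces different byte sequences under different encodings, which is why 'encoding' names what the attribute declares more accurately than 'charset'.

```python
# One character from a single character set (Unicode): U+00E8, "è".
ch = "\u00e8"

# The same character serialised under three different encodings.
utf8_bytes = ch.encode("utf-8")        # two bytes
utf16_bytes = ch.encode("utf-16-le")   # two bytes, different values
latin1_bytes = ch.encode("iso-8859-1") # one byte

print(utf8_bytes, utf16_bytes, latin1_bytes)
```

The character set stays the same in all three cases; only the encoding changes.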
Comment 7 (5.5 Attribute types)
This comment has implications for XHTML Modularization spec: LanguageCode: this should at least say "as per RFC3066 or its successor", and should include the possibility of an empty string. We already had lots of trouble getting people away from RFC1766 with earlier versions of HTML (and I think it's still not in the errata). Let's not go down that road again. (Note: there is a successor to RFC3066 in preparation.)
It is further recommended that you point to the xml:lang definition in the XML spec, rather than redefine it here.
Comment 7a (5.5 Attribute types)
Shouldn't the LanguageCodes definition point to section 14.5 of RFC2616?
Comment 8 (5.5 Attribute types)
This comment has implications for XHTML Modularization spec: Number: Does "one or more digits" mean any Unicode digit, or just 0-9? For example, there are Arabic digits, Farsi digits, Bengali digits, Thai digits, etc. Does it preclude decimal separators, then? Or other separators for that matter?
Also, "one or more digits" describes syntax, not semantics. It should say "integer", etc.
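The ambiguity is easy to demonstrate. The sketch below is a Python illustration only, and the two function names are hypothetical. It contrasts a reading of "one or more digits" restricted to 0-9 with a reading that admits any Unicode decimal digit. Neither reading accepts a decimal or grouping separator.

```python
import re

# ARABIC-INDIC DIGITS ONE, TWO, THREE (U+0661..U+0663).
arabic_indic = "\u0661\u0662\u0663"


def is_ascii_number(value: str) -> bool:
    """Accept only one or more of the ASCII digits 0-9."""
    return re.fullmatch(r"[0-9]+", value) is not None


def is_unicode_number(value: str) -> bool:
    """Accept one or more decimal digits from any script.

    For str patterns, Python's \\d matches every Unicode decimal digit.
    """
    return re.fullmatch(r"\d+", value) is not None


print(is_ascii_number(arabic_indic), is_unicode_number(arabic_indic))
```

Which of the two behaviours a conforming processor should implement is exactly what the definition needs to state.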
It would help to include a pointer to the XHTML Modularization spec when describing attribute types.
Comment 9 (5.5 Attribute types)
This comment has implications for XHTML Modularization spec: URI: This should be called IRI. At least we should indicate clearly in a note that this can contain non-ASCII text.
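As a rough illustration of what allowing non-ASCII text implies, the following Python sketch maps the path part of an IRI to its URI form by percent-encoding the UTF-8 bytes of each non-ASCII character. The function name and the sample path are hypothetical, and the sketch deliberately covers only the path; a host name would need separate treatment.

```python
from urllib.parse import quote


def iri_path_to_uri_path(path: str) -> str:
    """Percent-encode the UTF-8 bytes of non-ASCII characters in a path.

    "/" is kept as-is so that the path structure is preserved.
    """
    return quote(path, safe="/")


print(iri_path_to_uri_path("/Grèce/index.html"))
```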
Why are I18N and Bi-directional modules separate from Core? If you are to develop properly internationalised documents you will always need these items. Having them in separate modules suggests that internationalization is a feature, rather than part of the base, and we think that is a bad thing.
We would like you to require the xml:lang attribute on the <html> tag.
It is important for accessibility and for i18n to declare the text processing language at this point (and not others such as the body element). We need a higher awareness of the value of this and how it should be done for content authors. We also have a chicken and egg situation wrt the usefulness of language information. We need people to use it as a matter of course to enable future developments that make use of it. (See also the discussion of the difference between use of the HTTP header and html tag for declaring language, in comment #37.)
In the very rare event where a document is completely evenly divided into more than one language, or the XHTML document contains no language specific information, we recommend the use of the empty attribute with xml:lang. This is already defined to mean that the language is undefined. Of course, in this case authors should indicate the appropriate language on lower level elements as soon as language-sensitive information appears. (Note we have discussed in detail the possible use of a value such as 'mul', but have concluded that it would not be appropriate.)
We would like to see xml:lang used in all examples that include the html tag.
We are strongly against the use of the title attribute to provide the full or expanded form of an abbreviation, for the following reasons:
Please change the content model of abbr to include an element that expresses the full form.
"When necessary, authors should use style sheets to specify the pronunciation of an abbreviated form."
We have brought this up several times in the past. The use of style sheets only addresses a small part of the problem. It is totally inadequate for dealing with an abbreviation such as "CSAT", pronounced "see-sat", or "MA", pronounced "Massachusetts", etc. Please put in place a method to allow pronunciation to be dealt with properly.
If this is not done, please at least modify the text in the specification to recognise that stylesheets are only going to be useful in certain circumstances - it is not a real solution to the problem. (We think it would also be helpful to show an example of how stylesheets should be used, if you suggest that.)
It's bad enough that the US has recently appropriated such British classics as The Italian Job, Ladykillers, and Pooh Bear, but please don't try to claim that Gandalf speaks en-us ! ;-) In the interests of international harmony, please change the example to say "<quote xml:lang="en">".
We think it would greatly aid clarity and consistency of markup to offer some advice about when it is appropriate for document authors to add quotes directly in the text or via style sheets [note that this is spelled as a single word in the para starting "Visual user agents...", but spelt as two words in earlier parts of the text].
We would recommend that style sheets are used as the default method, and that therefore the XHTML processor support the necessary CSS. This is to facilitate localization. It is much faster and easier to adapt quotation marks in a style sheet than to change all instances in the markup. Manual insertion of quotation marks is appropriate when quoting a passage that will not be translated.
We think you should also recommend that the quotes appear outside the quote markup, since they are part of the surrounding text.
Note that the example given for quote does the wrong thing on both counts here, and we would say encourages bad practice.
Please add "dir attribute" to "... in conjunction with style sheets, the xml:lang attribute, etc...."
Please help promote good practice by adding a note that class names should be chosen to reflect semantic distinctions, not presentational ones, eg. use 'emph' rather than 'italic'. This also assists in ensuring localizability of markup, since presentational values do not necessarily map from one script to another.
We appreciate your work in eliminating attributes containing user readable text from the XHTML2 format. This will significantly aid localizability, since it reduces the number of places where unique ids, or language or bidi markup are unavailable.
The title attribute still sticks out like a sore thumb, though, in this regard. Can we not convert it to a common inline element, that can optionally appear as the first item in most other elements? Otherwise, there is a significant usability impact for international users, and XHTML will look Western-centric.
The initial text says "This attribute specifies the base language of the resource designated by href", which seems to hark back to the usage of hreflang in HTML4, ie. to declare the language of the target content, rather than to request it. Please check the wording for the section for ambiguities.
You should indicate in the second para that language values should conform to RFC 3066 or its successors.
It would be useful to add a note about usage of hreflang to the effect that a value such as en-GB will not match against a file such as filename.en.html. Users requesting en-GB should use the following hreflang value: "en-GB, en". (See the GEO FAQ.) People don't typically know this.
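The matching behaviour described here can be sketched in a few lines of Python. This is an illustration under the assumption that hreflang values are matched by the usual prefix rule for language ranges (a range matches a tag if it equals the tag, or is a prefix of it ending at a hyphen). The function names are hypothetical.

```python
def range_matches_tag(language_range: str, language_tag: str) -> bool:
    """Prefix matching: 'en' matches 'en' and 'en-GB'; 'en-GB' does not match 'en'."""
    r = language_range.lower()
    t = language_tag.lower()
    return t == r or t.startswith(r + "-")


def hreflang_matches(hreflang: str, resource_language: str) -> bool:
    """Return True if any range in a comma-separated hreflang value matches."""
    ranges = [item.strip() for item in hreflang.split(",") if item.strip()]
    return any(range_matches_tag(r, resource_language) for r in ranges)


# A resource labelled 'en' (for instance filename.en.html):
print(hreflang_matches("en-GB", "en"))      # the narrower range alone
print(hreflang_matches("en-GB, en", "en"))  # with the broader fallback added
```

Under this rule the narrower range never matches the broader tag, which is why the fallback value has to be listed explicitly.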
Given that the usage of hreflang is now different from that in HTML4, we feel that using the attribute name 'hreflang' may cause confusion for the user. We would prefer to see this attribute called 'accept-language', since this is both clearer in meaning, and avoids potential confusion.
The spec uses the term 'base language' without an apparent definition. It also uses the term 'primary language' in the example in this section. The i18n WG has put a lot of thought into matters of this kind lately and it would seem appropriate to step back and consider the usage of language in XHTML2.
The i18n WG has begun to use the terms 'text processing language' and 'primary language(s)' (see our definitions) to mean different things. The actual terminology is a secondary consideration here. The idea is that there are two ways in which one needs to declare the language of content: the first is to express the basic language of the document as a whole (this could be used for content negotiation, etc.); the second is to express the language of a specific run of text so that applications that manipulate the text (such as text-to-speech, etc) can correctly understand the text they are currently dealing with. The former declaration (what we call 'primary language') could involve declaring more than one language, eg. for documents containing parallel texts in multiple languages, but doesn't necessarily mention every language that appears in the document. The latter type of declaration (what we call 'text processing language') must, of necessity, refer to only a single language at a time, though that declaration can be overridden for a labelled fragment of the text, eg. an embedded French word in English text.
The rules governing the use of language values in HTTP headers and language attributes reinforce our view that the HTTP header should be used to declare the primary language, and that language attributes should be used to declare the text processing language. It is acceptable in our mind to say that, in the absence of language attributes, the first value of the HTTP header Content-Language field could be used to declare the default text processing language for the document, but it would always be better to declare that explicitly in the html tag.
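To make the proposed model concrete, here is a minimal Python sketch of how a processor might resolve the text processing language of each element. It assumes that xml:lang is inherited from the nearest ancestor, that an empty xml:lang means the language is undefined, and that the first Content-Language value serves only as a fallback when no attribute applies. The function name and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def text_processing_languages(root, content_language=None):
    """Return (tag, language) pairs for every element, in document order."""
    default = None
    if content_language:
        # Only the first value of the header is used as a fallback.
        default = content_language.split(",")[0].strip() or None

    result = []

    def walk(element, inherited):
        # An explicit xml:lang, even an empty one, overrides what is inherited.
        lang = element.get(XML_LANG, inherited)
        result.append((element.tag, lang))
        for child in element:
            walk(child, lang)

    walk(root, default)
    return result


declared = ET.fromstring(
    '<html xml:lang="en"><body><p>Say <span xml:lang="fr">bonjour</span></p></body></html>'
)
undeclared = ET.fromstring("<html><body/></html>")

print(text_processing_languages(declared, "de"))
print(text_processing_languages(undeclared, "fr, en"))
```

In the first document the attribute on the html tag wins over the header; in the second, the header's first value is all the processor has to go on.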
Based on the foregoing, we have the following recommendations for this section:
Other brainstorming thoughts related to language declarations, based on the philosophy introduced in comment 37:
Comment 38a (15.1, 1st para)
The sentence "This attribute specifies the base direction of the element's text content." should probably read "This attribute allows the user to specify the base direction of the element's text content.", since we believe that you do not intend to directly specify the user agent's behaviour wrt this markup, but rather have that behaviour be defined by the association of the appropriate CSS styling.
If this assumption is correct, then the wording of other parts of this section needs to be tweaked a little, too.
Comment 39 (15.1, 1st para)
Suggested rewording: "This direction overrides the inherent directionality of characters as defined in Unicode Standard" --> "This direction affects the display of characters as defined in Unicode Standard". It doesn't change the inherent directionality of the characters themselves, just the behaviour of those characters in context.
Also, we think that the directionality of the text would typically only be affected when the user uses CSS bidi-override with the lro and rlo value of dir to disable the bidi algorithm.
Also, the first sentence incorrectly refers to 'base direction' for inline elements.
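The point that inherent directionality belongs to the characters themselves can be checked against the Unicode Character Database. The Python sketch below is an illustration only, using the standard unicodedata module. It shows that each character carries a fixed bidirectional class, and that the override controls are separate characters with their own classes rather than something that rewrites the class of the text they surround.

```python
import unicodedata

alef = "\u05d0"   # HEBREW LETTER ALEF
latin_a = "a"     # LATIN SMALL LETTER A
lro = "\u202d"    # LEFT-TO-RIGHT OVERRIDE
rlo = "\u202e"    # RIGHT-TO-LEFT OVERRIDE

for ch in (alef, latin_a, lro, rlo):
    # bidirectional() returns the character's Bidi_Class property value.
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))

# Wrapping the Hebrew letter in an override changes how it is displayed in
# context, not the property stored for the character itself.
overridden = lro + alef
print(unicodedata.bidirectional(overridden[1]))
```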
Comment 39a (15.1, 1st para)
The text initially says that "The default value of this attribute is user-agent dependent, but must be either ltr or rtl." but then in the next subsection says "The default value of the dir attribute is "ltr" (left-to-right text)."
We feel that leaving the default to the user agent could have a seriously negative impact on documents that don't specify default directionality - eg. most of the existing documents written in English. Directionality should be predictably associated with the document, not the user agent.
You would only need to use the dir attribute on the p element in a rtl context. It would be better to omit it here and to simply state that the default directionality for this text is ltr. We encounter many people who add dir's to almost all the block elements in a file, when a single declaration on the html tag would suffice. I don't want to encourage such behaviour, since it is unnecessary and detrimental.
We propose you add a note to say that the paragraph is embedded in a rtl context, or show the html tag with the dir attribute (although it is not really needed there either), rather than put the attribute on the p tag.
[ Thank you for adding the note about setting the base direction for an entire doc in the html tag. Very helpful !]
The Hebrew looks exactly the same in both places in the example, which is potentially confusing for the reader.
It is not at all obvious to the uninitiated reader what the effect of the lro attribute would be unless you show the resulting displayed text. (This is a tricky area to show examples ;)
Note that the success of showing the resulting text would be unpredictable if that part of the example used text (because the reader's browser may not support bidi), so we recommend a graphic.
The attribute name datetime is rather unspecific, given that it can appear on any element. I would prefer editdatetime.
It would be useful to also have an attribute that indicated the name of the person making the change.
This comment also applies to the XHTML Modularization spec. The structure of the datetime attribute should be clearly indicated. It should also clearly either (a) be limited to the subtype of XML Schema's dateTime that includes a required timezone, or (b) limit the user to a single timezone (Z).
Because the Web is worldwide, times specified without timezone can be meaningless.
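A short Python sketch shows why the timezone matters. It assumes ISO 8601 style values of the kind XML Schema's dateTime uses; the function name and the sample values are hypothetical.

```python
from datetime import datetime, timedelta


def has_timezone(value: str) -> bool:
    """Return True if a date-time string carries an explicit UTC offset."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11,
    # so normalise it to the equivalent "+00:00" first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value).utcoffset() is not None


# The same wall-clock time in two different zones is two different instants.
east = datetime.fromisoformat("2005-02-04T10:46:27+09:00")
west = datetime.fromisoformat("2005-02-04T10:46:27-08:00")
gap = west - east

print(has_timezone("2005-02-04T10:46:27"), has_timezone("2005-02-04T10:46:27Z"), gap)
```

Without the offset, a reader has no way to tell which of those two instants an edit timestamp refers to.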
Please provide a link to the definitions in XHTML Modularization, and indicate clearly whether these definitions are identical or not.
Several problems with this example:
Again, the title is used for a role specific to this element, and contains user readable text that might need to be marked up for directionality, language, translation id, etc (especially given the context of usage). Does the link element have to be empty? We would prefer to see this text as content of the link element.
Please use the character è rather than the NCR in Grèce.
charset: Please call this 'encoding'. See also comment #5.
What about precedence? The target document or the HTTP header for that document might declare the encoding, and these should have higher precedence.
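One possible precedence order can be sketched as follows. This is an assumption for illustration: it borrows the order HTML 4.01 uses (HTTP header first, then the target document's own declaration, then the attribute on the linking element). The comment above only asks that the first two outrank the attribute. The function and parameter names are hypothetical.

```python
def effective_encoding(http_charset=None, document_declaration=None, link_hint=None):
    """Pick the encoding of a link target from the available declarations.

    Candidates are tried from highest to lowest assumed precedence; the
    attribute on the linking element is treated as a hint of last resort.
    """
    for candidate in (http_charset, document_declaration, link_hint):
        if candidate:
            return candidate
    return None


# The hint on the link is used only when nothing more authoritative exists.
print(effective_encoding(link_hint="iso-8859-1"))
print(effective_encoding(http_charset="utf-8", link_hint="iso-8859-1"))
```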
It says "Please consult the section on character encodings for more details.", but I wasn't sure which section that was referring to.
Please add xml:lang="en" to the html tag. See comment #14.
Where is the perl script?
Suggested rewording: "When set for the table element" --> "When set for or inherited by the table element".
Entities are extremely useful for disambiguation of invisible or identical characters, such as ‏ and . It is much easier to work with names in these cases than to use NCRs. (Please be driven by user needs rather than technological limitations where possible.)
We do not believe, however, that it is as useful to provide entity names for characters such as the Latin 1 accented characters, especially since this shows a Western bias.
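The usability argument for named entities can be seen in a small Python sketch. It takes the HTML 4 entity names rlm and lrm as examples chosen for this illustration: in source text the two characters are both invisible, whereas the names tell them apart at a glance.

```python
import html

# Two invisible directional marks, written with their entity names.
rlm = html.unescape("&rlm;")  # RIGHT-TO-LEFT MARK
lrm = html.unescape("&lrm;")  # LEFT-TO-RIGHT MARK

# A numeric character reference yields the same character as the name,
# but the author has to remember the code point to read or write it.
rlm_from_ncr = html.unescape("&#x200F;")

# ascii() makes the otherwise invisible characters inspectable.
print(ascii(rlm), ascii(lrm), rlm == rlm_from_ncr)
```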
Version: $Id: xhtml2-i18n-review.html,v 1.7 2005/02/04 10:46:27 rishida Exp $