I18n WG comments on XHTML2 WD [Draft]

Version reviewed



These are comments on behalf of the I18N WG.

We have not yet reviewed in detail the following sections:


ID | Location | Comment | Mail thread
5 | 5.5 Attribute types

This comment has implications for XHTML Modularization spec: The term 'charset' does not really mean 'encoding'. Please change this term to 'encoding'.

Note that 'charset' may be more coherent with the HTTP usage, but this usage is typically only known about by geeks. Continuing to equate 'charset' with 'encoding' causes confusion for average users, and causes problems. Using the term 'encoding' should not confuse geeks, and is coherent with the use of 'encoding' in the XML declaration.

7 | 5.5 Attribute types

This comment has implications for XHTML Modularization spec: LanguageCode: this should at least say "as per RFC3066 or its successor", and should include the possibility of an empty string. We already had lots of trouble getting people away from RFC1766 with earlier versions of HTML (and I think it's still not in the errata). Let's not go down that road again. (Note: there is a successor to RFC3066 in preparation.)

It is further recommended that you point to the xml:lang definition in the XML spec, rather than redefine it here.

7a | 5.5 Attribute types

Shouldn't the LanguageCodes definition point to section 14.5 of RFC2616?

8 | 5.5 Attribute types

This comment has implications for XHTML Modularization spec: Number: Does "one or more digits" mean any Unicode digit, or just 0-9? Eg, there are Arabic digits, Farsi digits, Bengali digits, Thai digits, etc. Does it preclude decimal separators, then? Or other separators, for that matter?

Also, "one or more digits" describes syntax, not semantics. It should say "integer", etc.
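To illustrate why the definition needs to be explicit, here is a minimal Python sketch contrasting the two readings of "one or more digits". The function names and sample strings are our own, chosen only for illustration; they are not taken from the specification.

```python
import re

# Reading 1: only the ASCII digits 0-9 are allowed.
ASCII_DIGITS = re.compile(r"[0-9]+")


def is_ascii_number(value: str) -> bool:
    """True only for one or more of the ASCII digits 0-9."""
    return ASCII_DIGITS.fullmatch(value) is not None


def is_unicode_number(value: str) -> bool:
    """True for one or more characters that Unicode classifies as decimal digits."""
    return value.isdecimal()


# Illustrative samples (hypothetical attribute values).
samples = {
    "ascii": "2005",
    "arabic-indic": "\u0662\u0660\u0660\u0665",  # Arabic-Indic digits for 2005
    "thai": "\u0E52\u0E50\u0E50\u0E55",          # Thai digits for 2005
    "decimal separator": "3.14",
}

# Each entry maps to (accepted under reading 1, accepted under reading 2).
results = {
    name: (is_ascii_number(text), is_unicode_number(text))
    for name, text in samples.items()
}
```

Under the first reading the Arabic-Indic and Thai values are rejected; under the second they are accepted. Both readings reject "3.14", which is why the question about separators also needs an answer.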


It would help to include a pointer to the XHTML Modularization spec when describing attribute types.

9 | 5.5 Attribute types

This comment has implications for XHTML Modularization spec: URI: This should be called IRI. At least we should indicate clearly in a note that this can contain non-ASCII text.

Note also that the XHTML Modularization spec points to an old version of the URI spec. You should point to RFC 3987, the IRI Proposed Standard, which references the latest URI Standard, RFC 3986.


Why are I18N and Bi-directional modules separate from Core? If you are to develop properly internationalised documents you will always need these items. Having them in separate modules suggests that internationalization is a feature, rather than part of the base, and we think that is a bad thing.


We would like you to require the xml:lang attribute on the <html> tag.

It is important for accessibility and for i18n to declare the text processing language at this point (and not others such as the body element). We need a higher awareness of the value of this and how it should be done for content authors. We also have a chicken and egg situation wrt the usefulness of language information. We need people to use it as a matter of course to enable future developments that make use of it. (See also the discussion of the difference between use of the HTTP header and html tag for declaring language, in comment #37.)

In the very rare event where a document is completely evenly divided into more than one language, or the XHTML document contains no language specific information, we recommend the use of the empty attribute with xml:lang. This is already defined to mean that the language is undefined. Of course, in this case authors should indicate the appropriate language on lower level elements as soon as language-sensitive information appears. (Note we have discussed in detail the possible use of a value such as 'mul', but have concluded that it would not be appropriate.)


We would like to see xml:lang used in all examples that include the html tag.


We are strongly against the use of the title attribute to provide the full or expanded form of an abbreviation, for the following reasons:

  1. title attributes do not allow for inline markup to express bidi or language needs - indeed, some inline styling or graphics may be appropriate for such full forms
  2. it may be appropriate to draw a distinction between the language used to pronounce the abbreviation and that used to pronounce the full form, eg. an acronym may be spelled out using English letter names, but the full form may be in French. The problem here being that you cannot label an attribute's language differently from that of the element using xml:lang.
  3. it precludes the use of title for other purposes
  4. the term 'title' does not describe the content of the attribute, and does not identify the semantics of the content - this goes against the philosophy of good structural markup

Please change the content model of abbr to include an element that expresses the full form.


"When necessary, authors should use style sheets to specify the pronunciation of an abbreviated form."

We have brought this up several times in the past. The use of style sheets only addresses a small part of the problem. It is totally inadequate for dealing with an abbreviation such as "CSAT", pronounced "see-sat", or "MA", pronounced "Massachusetts", etc. Please put in place a method to allow pronunciation to be dealt with properly.

If this is not done, please at least modify the text in the specification to recognise that stylesheets are only going to be useful in certain circumstances - they are not a real solution to the problem. (We think it would also be helpful to show an example of how stylesheets should be used, if you suggest that.)


It's bad enough that the US has recently appropriated such British classics as The Italian Job, Ladykillers, and Pooh Bear, but please don't try to claim that Gandalf speaks en-us ! ;-) In the interests of international harmony, please change the example to say "<quote xml:lang="en">".


We think it would greatly aid clarity and consistency of markup to offer some advice about when it is appropriate for document authors to add quotes directly in the text or via style sheets [note that this is spelled as a single word in the para starting "Visual user agents...", but spelt as two words in earlier parts of the text].

We would recommend that style sheets are used as the default method, and that therefore the XHTML processor support the necessary CSS. This is to facilitate localization. It is much faster and easier to adapt quotation marks in a style sheet than to change all instances in the markup. Manual insertion of quotation marks is appropriate when quoting a passage that will not be translated.

We think you should also recommend that the quotes appear outside the quote markup, since they are part of the surrounding text.

Note that the example given for quote does the wrong thing on both counts here, and we would say encourages bad practice.


Please add "dir attribute" to "... in conjunction with style sheets, the xml:lang attribute, etc...."


Please help promote good practice by adding a note that class names should be chosen to reflect semantic distinctions, not presentational ones, eg. use 'emph' rather than 'italic'. This also assists in ensuring localizability of markup, since presentational values do not necessarily map from one script to another.

34 | 12.1, title

We appreciate your work in eliminating attributes containing user readable text from the XHTML2 format. This will significantly aid localizability, since it reduces the number of places where unique ids, or language or bidi markup are unavailable.

The title attribute still sticks out like a sore thumb, though, in this regard. Can we not convert it to a common inline element, that can optionally appear as the first item in most other elements? Otherwise, there is a significant usability impact for international users, and XHTML will look Western-centric.

35 | 13.1, hreflang

The initial text says "This attribute specifies the base language of the resource designated by href", which seems to hark back to the usage of hreflang in HTML4, ie. to declare the language of the target content, rather than to request it. Please check the wording for the section for ambiguities.

35a | 13.1, hreflang

You should indicate in the second para that language values should conform to RFC 3066 or its successors.

35b | 13.1, hreflang

It would be useful to add a note about usage of hreflang to the effect that a value such as en-GB will not match against a file such as filename.en.html. Users requesting en-GB should use the following hreflang value: "en-GB, en". (See the GEO FAQ.) People don't typically know this.
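The matching behaviour behind this advice can be sketched in a few lines of Python. This is only an illustration, assuming the prefix-style matching of language ranges used for HTTP Accept-Language; the function names are ours, q-values are ignored, and real servers may differ in detail.

```python
def parse_ranges(value: str) -> list:
    """Split a comma-separated value such as "en-GB, en" into language ranges."""
    return [part.strip() for part in value.split(",") if part.strip()]


def range_matches(language_range: str, language_tag: str) -> bool:
    """A range matches a tag if they are equal, or the range is a prefix of the
    tag that is immediately followed by a hyphen (case-insensitive)."""
    r = language_range.lower()
    t = language_tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def first_match(ranges, available_tags):
    """Return the first available tag matched by any range, in range order."""
    for r in ranges:
        for tag in available_tags:
            if range_matches(r, tag):
                return tag
    return None


# Hypothetical server offering filename.en.html and filename.fr.html.
available = ["en", "fr"]

only_specific = first_match(parse_ranges("en-GB"), available)
with_fallback = first_match(parse_ranges("en-GB, en"), available)
```

Here only_specific is None, because the range en-GB is not a prefix of the tag en, while with_fallback finds the en resource.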

35c | 13.1, hreflang

Given that the usage of hreflang is now different from that in HTML4, we feel that using the attribute name 'hreflang' may cause confusion for the user. We would prefer to see this attribute called 'accept-language', since this is both clearer in meaning, and avoids potential confusion.


The spec uses the term 'base language' without an apparent definition. It also uses the term 'primary language' in the example in this section. The i18n WG has put a lot of thought into matters of this kind lately and it would seem appropriate to step back and consider the usage of language in XHTML2.

The i18n WG has begun to use the terms 'text processing language' and 'primary language(s)' (see our definitions) to mean different things. The actual terminology is a secondary consideration here. The idea is that there are two ways in which one needs to declare the language of content: the first is to express the basic language of the document as a whole (this could be used for content negotiation, etc.); the second is to express the language of a specific run of text so that applications that manipulate the text (such as text-to-speech, etc) can correctly understand the text they are currently dealing with. The former declaration (what we call 'primary language') could involve declaring more than one language, eg. for documents containing parallel texts in multiple languages, but doesn't necessarily mention every language that appears in the document. The latter type of declaration (what we call 'text processing language') must, of necessity, refer to only a single language at a time, though that declaration can be overridden for a labelled fragment of the text, eg. an embedded French word in English text.

The rules governing the use of language values in HTTP headers and language attributes reinforce our view that the HTTP header should be used to declare the primary language, and that language attributes should be used to declare the text processing language. It is acceptable in our mind to say that, in the absence of language attributes, the first value of the HTTP header Content-Language field could be used to declare the default text processing language for the document, but it would always be better to declare that explicitly in the html tag.

Based on the foregoing, we have the following recommendations for this section:

  • Suggested rewording: "This attribute specifies the base language of an element's attribute values and text content" -> "This attribute indicates the language of an element's attribute values and text content, and that of all elements it contains unless overridden". We also recommend using similar markup to XML (in the errata).
  • Add a paragraph to recommend that xml:lang be used with the html tag to set the default language of the document for text processing, unless there is a good reason not to (that might be the case for documents with multiple primary languages, though those are rare).
  • Also please say that the use of the HTTP language information for text processing purposes should only be considered a fallback solution
  • Define the expected behaviour if the HTTP Content-Language declaration contains more than one language.
  • Suggested rewording for the example: "In this example, the primary language of the document is..." -> "In this example, the default text processing language of the document is...after which the language returns to French...".
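The inheritance behaviour described in the suggested rewording can be sketched as follows. This is an illustrative Python fragment of our own, not text proposed for the specification; the sample document and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=""):
    """Return (tag, language) pairs in document order. An element uses its own
    xml:lang if present (even if empty), otherwise the value inherited from
    its parent."""
    lang = element.get(XML_LANG, inherited)
    found = [(element.tag, lang)]
    for child in element:
        found.extend(effective_languages(child, lang))
    return found


# Hypothetical document: French by default, one English span, and one
# paragraph whose language is explicitly declared as undefined.
doc = ET.fromstring(
    '<html xml:lang="fr"><body>'
    '<p>Bonjour <span xml:lang="en">hello</span></p>'
    '<p xml:lang="">???</p>'
    '</body></html>'
)

languages = effective_languages(doc)
```

The single declaration on the html tag sets the default text processing language for everything below it. The span overrides it with en, and the empty value resets the language to undefined for the last paragraph.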

Other brainstorming thoughts related to language declarations, based on the philosophy introduced in comment 37:

  • It would be appropriate to provide a way of declaring the primary language of the document in the document itself, because we understand that a meta element with Content-Type declaration cannot be used any more? This would retain such information even if the document was not pulled from a server, eg. read from CD. (Note that the xml:lang attribute cannot fill this function for multilingual docs because (a) it's possible to construct documents such that the primary languages are not readily discernable by inspection, (b) it only takes a single value at a time.)
  • If the title attribute is retained (and we hope that it won't be), it would make sense to have a common attribute called something like title-lang, which allows one to specify the language of the title attribute when that differs from the language of the element content.

38a | 15.1, 1st para

The sentence "This attribute specifies the base direction of the element's text content." should probably read "This attribute allows the user to specify the base direction of the element's text content.", since we believe that you do not intend to directly specify the user agent's behaviour wrt this markup, but rather have that behaviour be defined by the association of the appropriate CSS styling.

If this assumption is correct, then the wording of other parts of this section needs to be tweaked a little, too.

39 | 15.1, 1st para

Suggested rewording: "This direction overrides the inherent directionality of characters as defined in Unicode Standard" --> "This direction affects the display of characters as defined in Unicode Standard". It doesn't change the inherent directionality of the characters themselves, just the behaviour of those characters in context.

Also, we think that the directionality of the text would typically only be affected when the user uses CSS bidi-override with the lro and rlo value of dir to disable the bidi algorithm.

Also, the first sentence incorrectly refers to 'base direction' for inline elements.
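The point that an override leaves the inherent directionality of characters untouched can be illustrated with Python's unicodedata module, which reports the bidirectional class that Unicode assigns to each character. This is a sketch only; it uses the Unicode override control characters as a stand-in for the dir values, and the sample strings are our own.

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


latin = bidi_classes("ab")
hebrew = bidi_classes("\u05D0\u05D1")  # alef, bet

# The same Hebrew letters wrapped in LEFT-TO-RIGHT OVERRIDE (U+202D)
# and POP DIRECTIONAL FORMATTING (U+202C).
overridden = bidi_classes("\u202D\u05D0\u05D1\u202C")
```

The Hebrew letters are reported as class R both with and without the surrounding override characters. The override changes how a bidi-aware renderer orders them in context, not their inherent properties.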

39a | 15.1, 1st para

The text initially says that "The default value of this attribute is user-agent dependent, but must be either ltr or rtl." but then in the next subsection says "The default value of the dir attribute is "ltr" (left-to-right text)."

We feel that leaving the default to the user agent could have a seriously negative impact on documents that don't specify default directionality - eg. most of the existing documents written in English. Directionality should be predictably associated with the document, not the user agent.

40 | 15.1, example

You would only need to use the dir attribute on the p element in a rtl context. It would be better to omit it here and to simply state that the default directionality for this text is ltr. We encounter many people who add dir's to almost all the block elements in a file, when a single declaration on the html tag would suffice. I don't want to encourage such behaviour, since it is unnecessary and detrimental.

We propose you add a note to say that the paragraph is embedded in a rtl context, or show the html tag with the dir attribute (although it is not really needed there either), rather than put the attribute on the p tag.

[ Thank you for adding the note about setting the base direction for an entire doc in the html tag. Very helpful! ]

41 | 15.1, example

The Hebrew looks exactly the same in both places in the example, which is potentially confusing for the reader.

It is not at all obvious to the uninitiated reader what the effect of the lro attribute would be unless you show the resulting displayed text. (This is a tricky area to show examples ;)

Note that the success of showing the resulting text would be unpredictable if that part of the example used text (because the reader's browser may not support bidi), so we recommend a graphic.


The attribute name datetime is rather unspecific, given that it can appear on any element. I would prefer editdatetime.


It would be useful to also have an attribute that indicated the name of the person making the change.


This comment also applies to the XHTML Modularization spec. The structure of the datetime attribute should be clearly indicated. It should also clearly either [1] be limited to the subtype of XML Schema's datetime that includes a required timezone, or [2] limit the user to a single timezone (Z).

Because the Web is worldwide, times specified without timezone can be meaningless.
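As an illustration of option [1], the following Python sketch accepts only values that carry a timezone. It is a rough approximation under our own assumptions: datetime.fromisoformat does not accept exactly the same lexical forms as XML Schema's dateTime, and the function name is hypothetical.

```python
from datetime import datetime


def parse_with_required_timezone(value: str) -> datetime:
    """Parse an ISO 8601 style date and time, rejecting values without a timezone."""
    # Treat a trailing "Z" as UTC; older Python versions do not parse it directly.
    normalized = value[:-1] + "+00:00" if value.endswith("Z") else value
    parsed = datetime.fromisoformat(normalized)
    if parsed.tzinfo is None:
        raise ValueError("datetime value has no timezone: " + value)
    return parsed


utc_value = parse_with_required_timezone("2005-02-04T10:46:27Z")
tokyo_value = parse_with_required_timezone("2005-02-04T19:46:27+09:00")

# Both values name the same instant, which is only knowable because
# each one carries its timezone.
same_instant = utc_value == tokyo_value
```

A value such as "2005-02-04T10:46:27" raises ValueError in this sketch, since without a timezone it cannot be placed on a worldwide timeline.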


Please provide a link to the definitions in XHTML Modularization, and indicate clearly whether these definitions are identical or not.

43 | 20.5.2, example

Several problems with this example:

  1. The labels are inconsistent: some are in English and some are translated (French). This should not be the case. The question is which it should be. It depends to some extent on how the information will be displayed, and who the intended reader is. I think the labels should probably be in the foreign languages. Note also that xml:lang declares the label text to be in the foreign language; perhaps hreflang was intended.
  2. For the Dutch manual, xml:lang="nl" is wrong. Either translate the text to Dutch, or change xml:lang to hreflang. Or translate and add hreflang. Same goes for Portuguese and Arabic.
  3. The French should not use an entity to represent the ç

Again, the title is used for a role specific to this element, and contains user readable text that might need to be marked up for directionality, language, translation id, etc (especially given the context of usage). Does the link element have to be empty? We would prefer to see this text as content of the link element.

45 | 20.6.1, example

Please use the character è rather than the NCR in Grèce.


charset: Please call this 'encoding'. See also comment #5.

What about precedence? The target document or the HTTP header for that document might declare the encoding, and these should have higher precedence.


It says "Please consult the section on character encodings for more details.", but I wasn't sure which section that was referring to.


Please add xml:lang="en" to the html tag. See comment #14.


Where is the perl script?


Suggested rewording: "When set for the table element" --> "When set for or inherited by the table element".

51 | Appendix F

Entities are extremely useful for disambiguation of invisible or identical characters, such as &rlm; and &nbsp;. It is much easier to work with names in these cases than to use NCRs. (Please be driven by user needs rather than technological limitations where possible.)

We do not believe, however, that it is as useful to provide entity names for characters such as the Latin 1 accented characters, especially since this shows a Western bias.
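To show what is at stake for invisible characters, this small Python sketch resolves the named entities and the equivalent NCRs to the same characters. It relies on the HTML entity table in Python's standard html module, used here only as a convenient stand-in for the XHTML entity sets.

```python
import html

# Named entities: the names say what the invisible characters are.
named = html.unescape("&rlm;&nbsp;")

# Numeric character references for the same two characters.
numeric = html.unescape("&#x200F;&#xA0;")

same_characters = named == numeric
```

Both forms yield U+200F RIGHT-TO-LEFT MARK followed by U+00A0 NO-BREAK SPACE, but only the named form tells an author reading the source what those characters are.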

Version: $Id: xhtml2-i18n-review.html,v 1.7 2005/02/04 10:46:27 rishida Exp $