W3C Internationalization (I18n) Activity: Making the World Wide Web truly world wide!

Internationalization techniques:
Developing specifications

This page provides checklists for specification developers, editors and reviewers who want to take account of internationalization issues while developing their spec. Where a checklist item is followed by a link, click on that for more information. The page also lists links to useful resources on the W3C Internationalization Activity site and elsewhere that may help.

Using this checklist doesn't remove the need for a formal review, but it flags potential issues and requirements at an early stage that might otherwise be overlooked. The later review should therefore yield few, if any, nasty surprises. The checklist also gives reviewers a useful set of points to follow.

This is just one of several techniques indexes, each of which focuses on a particular type of user. There is also a dynamic version of this page that uses JavaScript to hide or expand information, to help you see more quickly what's available and drill down to the information you need.

Characters

Choosing a definition of 'character'

Best practices checklist
  • Specifications, software and content MUST NOT require or depend on a one-to-one correspondence between characters and the sounds of a language. more
  • Specifications, software and content MUST NOT require or depend on a one-to-one mapping between characters and units of displayed text. more
  • Protocols, data formats and APIs MUST store, interchange or process text data in logical order. more
  • Independent of whether some implementation uses logical selection or visual selection, characters selected MUST be kept in logical order in storage. more
  • Specifications of protocols and APIs that involve selection of ranges SHOULD provide for discontiguous logical selections, at least to the extent necessary to support implementation of visual selection on screen on top of those protocols and APIs. more
  • Specifications and software MUST NOT require nor depend on a single keystroke resulting in a single character, nor that a single character be input with a single keystroke (even with modifiers), nor that keyboards are the same all over the world. more
  • Specifications, software and content MUST NOT require or depend on a one-to-one relationship between characters and units of physical storage. more
  • When specifications use the term 'character' the specifications MUST define which meaning they intend. more
  • Specifications SHOULD use specific terms, when available, instead of the general term 'character'. more
How to's
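The checklist above warns against assuming one-to-one mappings between characters, displayed units and storage units. A minimal Python sketch of why (the sample string is an illustrative assumption, not taken from any specification): the same text yields a different count for each counting unit.

```python
import unicodedata

# "e" followed by a combining acute accent (a decomposed "é"),
# then an emoji from outside the Basic Multilingual Plane.
text = "e\u0301\U0001F600"

code_points = len(text)                           # Unicode code points
utf8_bytes = len(text.encode("utf-8"))            # bytes of physical storage in UTF-8
utf16_units = len(text.encode("utf-16-le")) // 2  # UTF-16 code units

print(code_points, utf8_bytes, utf16_units)  # prints: 3 7 4

# NFC normalization merges the letter and accent into a single code point,
# although the text the user sees is unchanged.
composed = unicodedata.normalize("NFC", text)
print(len(composed))  # prints: 2
```

A reader perceives two units of text here, which matches none of the first three counts. This is why a specification needs to say which meaning of 'character' it intends.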

Defining a Reference Processing Model

Best practices checklist
  • Textual data objects defined by protocol or format specifications MUST be in a single character encoding. more
  • All specifications that involve processing of text MUST specify the processing of text according to the Reference Processing Model described by the rest of the recommendations in this list. more
  • Specifications MUST define text in terms of Unicode characters, not bytes or glyphs. more
  • For their textual data objects specifications MAY allow use of any character encoding which can be transcoded to a Unicode encoding form. more
  • Specifications MAY choose to disallow or deprecate some character encodings and to make others mandatory. Independent of the actual character encoding, the specified behavior MUST be the same as if the processing happened as follows: (a) The character encoding of any textual data object received by the application implementing the specification MUST be determined and the data object MUST be interpreted as a sequence of Unicode characters - this MUST be equivalent to transcoding the data object to some Unicode encoding form, adjusting any character encoding label if necessary, and receiving it in that Unicode encoding form, (b) All processing MUST take place on this sequence of Unicode characters, (c) If text is output by the application, the sequence of Unicode characters MUST be encoded using a character encoding chosen among those allowed by the specification. more
  • If a specification is such that multiple textual data objects are involved (such as an XML document referring to external parsed entities), it MAY choose to allow these data objects to be in different character encodings. In all cases, the Reference Processing Model MUST be applied to all textual data objects. more
How to's
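The three steps of the Reference Processing Model listed above can be sketched as follows. This is an illustration only: the function name `process_text` and the choice of uppercasing as the "processing" step are assumptions made for the example.

```python
def process_text(data: bytes, declared_encoding: str,
                 output_encoding: str = "utf-8") -> bytes:
    """Hypothetical helper illustrating steps (a), (b) and (c)."""
    # (a) Interpret the received bytes as a sequence of Unicode characters.
    text = data.decode(declared_encoding)
    # (b) Process the Unicode characters, never the raw bytes.
    processed = text.upper()
    # (c) Encode the output in an encoding the specification allows.
    return processed.encode(output_encoding)


# The same text arriving in two different encodings gives the same result.
from_latin1 = process_text("café".encode("iso-8859-1"), "iso-8859-1")
from_utf16 = process_text("café".encode("utf-16"), "utf-16")
print(from_latin1 == from_utf16)  # prints: True
```

Because processing is defined on Unicode characters, the behavior does not depend on the encoding in which the data was received.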

Including and excluding character ranges

Best practices checklist
  • Specifications SHOULD NOT arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive. more
  • Specifications MUST NOT allow code points above U+10FFFF. more
  • Specifications SHOULD NOT allow the use of code points reserved by Unicode for internal use. more
  • Specifications MUST NOT allow the use of surrogate code points. more
  • Specifications SHOULD exclude compatibility characters in the syntactic elements (markup, delimiters, identifiers) of the formats they define. more
How to's
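A minimal sketch of a code point check in the spirit of the checklist above. The function name is hypothetical, and 'code points reserved by Unicode for internal use' is interpreted here as the Unicode noncharacters, which is an assumption about the intended meaning.

```python
def is_allowed_code_point(cp: int) -> bool:
    """Hypothetical validity check for a single code point."""
    if cp < 0 or cp > 0x10FFFF:    # outside the Unicode code space
        return False
    if 0xD800 <= cp <= 0xDFFF:     # surrogate code points
        return False
    if 0xFDD0 <= cp <= 0xFDEF:     # noncharacters U+FDD0..U+FDEF
        return False
    if (cp & 0xFFFE) == 0xFFFE:    # U+nFFFE and U+nFFFF in every plane
        return False
    return True


print(is_allowed_code_point(0x41))      # prints: True
print(is_allowed_code_point(0xD800))    # prints: False
print(is_allowed_code_point(0x110000))  # prints: False
```

Private use code points such as U+E000 pass this check, which is consistent with not arbitrarily excluding ranges.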

Using the Private Use Area

Best practices checklist
  • Specifications MUST NOT require the use of private use area characters with particular assignments. more
  • Specifications MUST NOT require the use of mechanisms for defining agreements of private use code points. more
  • Specifications and implementations SHOULD NOT disallow the use of private use code points by private agreement. more
  • Specifications MAY define markup to allow the transmission of symbols not in Unicode or to identify specific variants of Unicode characters. more
  • Specifications SHOULD allow the inclusion of or reference to pictures and graphics where appropriate, to eliminate the need to (mis)use character-oriented mechanisms for pictures or graphics. more
How to's

Choosing character encodings

Best practices checklist
  • Specifications MUST either specify a unique character encoding, or provide character encoding identification mechanisms such that the encoding of text can be reliably identified. more
  • When designing a new protocol, format or API, specifications SHOULD require a unique character encoding. more
  • When basing a protocol, format, or API on a protocol, format, or API that already has rules for character encoding, specifications SHOULD use rather than change these rules. more
  • When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32. more
  • Specifications SHOULD avoid using the terms 'character set' and 'charset' to refer to a character encoding, except when the latter is used to refer to the MIME charset parameter or its IANA-registered values. The term 'character encoding', or in specific cases the terms 'character encoding form' or 'character encoding scheme', are RECOMMENDED. more
  • If the unique encoding approach is not taken, specifications SHOULD require the use of the IANA charset registry names, and in particular the names identified in the registry as 'MIME preferred names', to designate character encodings in protocols, data formats and APIs. more
  • Character encodings that are not in the IANA registry SHOULD NOT be used, except by private agreement. more
  • If an unregistered character encoding is used, the convention of using 'x-' at the beginning of the name MUST be followed. more
  • If the unique encoding approach is not chosen, specifications MUST designate at least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible character encodings and SHOULD choose at least one of UTF-8 or UTF-16 as required encoding forms (encoding forms that MUST be supported by implementations of the specification). more
  • Specifications that require a default encoding MUST define either UTF-8 or UTF-16 as the default, or both if they define suitable means of distinguishing them. more
How to's
Background reading
  • What is the 'Document Character Set' for XML and HTML, and how does it relate to the encodings I use for my documents?

Identifying character encodings

Best practices checklist
  • Specifications MUST NOT propose the use of heuristics to determine the encoding of data. more
  • Specifications MUST define conflict-resolution mechanisms (e.g. priorities) for cases where there is multiple or conflicting information about character encoding. more
How to's
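One way to define conflict resolution is a fixed priority order among the sources of encoding information. The sketch below assumes a hypothetical order (byte order mark, then transport-level label, then in-document declaration, then a UTF-8 default); a real specification would choose and document its own order.

```python
import codecs
from typing import Optional

# Byte order marks checked first. UTF-32 is left out to keep the sketch short.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def resolve_encoding(data: bytes,
                     transport_label: Optional[str] = None,
                     in_document_label: Optional[str] = None) -> str:
    """Hypothetical resolver: returns the name of the encoding to use."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Labels are consulted in priority order; no content sniffing is done.
    for label in (transport_label, in_document_label):
        if label:
            # codecs.lookup raises LookupError for unknown labels.
            return codecs.lookup(label).name
    return "utf-8"


# The byte order mark outranks a conflicting transport label.
print(resolve_encoding(codecs.BOM_UTF8 + b"abc", transport_label="iso-8859-1"))
```

The resolver consults only explicit declarations and never guesses from the content, in line with the first item of the checklist.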

Designing character escapes

Best practices checklist
  • Specifications should provide a mechanism for escaping characters, particularly those which are invisible or ambiguous. more
  • Specifications SHOULD NOT invent a new escaping mechanism if an appropriate one already exists. more
  • The number of different ways to escape a character SHOULD be minimized (ideally to one). more
  • Escape syntax SHOULD require either explicit end delimiters or a fixed number of characters in each character escape. Escape syntaxes where the end is determined by any character outside the set of characters admissible in the character escape itself SHOULD be avoided. more
  • Whenever specifications define character escapes that allow the representation of characters using a number, the number MUST represent the Unicode code point of the character and SHOULD be in hexadecimal notation. more
  • Escaped characters SHOULD be acceptable wherever their unescaped forms are; this does not preclude that syntax-significant characters, when escaped, lose their significance in the syntax. In particular, if a character is acceptable in identifiers and comments, then its escaped form should also be acceptable. more
How to's
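A sketch of an escape mechanism with the properties listed above: an explicit end delimiter, and a hexadecimal number that is the Unicode code point. The `\u{...}` syntax and both function names are assumptions chosen for illustration; a specification should reuse an existing mechanism where an appropriate one exists.

```python
import re

# Hypothetical escape syntax: \u{HEX} with 1 to 6 hexadecimal digits.
ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")


def unescape(text: str) -> str:
    """Replace every \\u{HEX} escape with the character it names."""
    def replace(match):
        cp = int(match.group(1), 16)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("invalid code point in escape: " + match.group(0))
        return chr(cp)
    return ESCAPE.sub(replace, text)


def escape(ch: str) -> str:
    """Escape a single character as \\u{HEX}, using its code point."""
    return "\\u{%X}" % ord(ch)


# U+200F RIGHT-TO-LEFT MARK is invisible, so it is a good candidate for escaping.
print(escape("\u200f"))  # prints: \u{200F}
```

The closing brace marks the end of each escape, so a digit that follows the escape cannot be mistaken for part of the number.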

Storing text

Best practices checklist
  • Protocols, data formats and APIs MUST store, interchange or process text data in logical order. more
  • Specifications of protocols and APIs that involve selection of ranges SHOULD provide for discontiguous logical selections, at least to the extent necessary to support implementation of visual selection on screen on top of those protocols and APIs. more
How to's

Specifying sort and search functionality

Best practices checklist
  • Software that sorts or searches text for users SHOULD do so on the basis of appropriate collation units and ordering rules for the relevant language and/or application. more
  • Where searching or sorting is done dynamically, particularly in a multilingual environment, the 'relevant language' SHOULD be determined to be that of the current user, and may thus differ from user to user. more
  • Software that allows users to sort or search text SHOULD allow the user to select alternative rules for collation units and ordering. more
  • Specifications and implementations of sorting and searching algorithms SHOULD accommodate text that contains any character in Unicode. more
How to's
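To show why ordering rules depend on language, the sketch below compares plain code point order with a hand-built key for Swedish, where å, ä and ö sort after z. The alphabet table and function name are illustrative assumptions. Production software would normally rely on a library implementing the Unicode Collation Algorithm with language-specific tailorings rather than on a table like this.

```python
# Simplified Swedish alphabet: å, ä and ö follow z, in that order.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word: str) -> list:
    """Hypothetical sort key: rank each letter; unknown characters sort last."""
    return [SWEDISH_RANK.get(ch, len(SWEDISH_RANK)) for ch in word.lower()]


words = ["öra", "zebra", "ärm", "apa", "åka"]

by_code_point = sorted(words)
by_swedish_rules = sorted(words, key=swedish_key)

print(by_code_point)     # prints: ['apa', 'zebra', 'ärm', 'åka', 'öra']
print(by_swedish_rules)  # prints: ['apa', 'zebra', 'åka', 'ärm', 'öra']
```

A German-speaking user would expect yet another order for the same list, which is why the relevant language may differ from user to user.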

Defining 'string'

Best practices checklist
  • Specifications SHOULD NOT define a string as a 'byte string'. more
  • The 'character string' definition SHOULD be used by most specifications. more
How to's

Indexing strings

Best practices checklist
  • The character string is RECOMMENDED as a basis for string indexing. more
  • A code unit string MAY be used as a basis for string indexing if this results in a significant improvement in the efficiency of internal operations when compared to the use of character string. more
  • Grapheme clusters MAY be used as a basis for string indexing in applications where user interaction is the primary concern. more
  • Specifications that define indexing in terms of grapheme clusters MUST either: (a) define grapheme clusters in terms of default grapheme clusters as defined in Unicode Standard Annex #29, Text Boundaries [UTR #29], or (b) define specifically how tailoring is applied to the indexing operation. more
  • The use of byte strings for indexing is NOT RECOMMENDED. more
  • Specifications that need a way to identify substrings or point within a string SHOULD provide ways other than string indexing to perform this operation. more
  • Specifications SHOULD understand and process single characters as substrings, and treat indices as boundary positions between counting units, regardless of the choice of counting units. more
  • Specifications of APIs SHOULD NOT specify single characters or single 'units of encoding' as argument or return types. more
  • When the positions between the units are counted for string indexing, starting with an index of 0 for the position at the start of the string is the RECOMMENDED solution, with the last index then being equal to the number of counting units in the string. more
How to's
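The recommendation to treat indices as boundary positions, starting at 0 and ending at the number of counting units, can be sketched as follows. Python strings are indexed by code point, so this corresponds to character string indexing; the `substring` helper is a hypothetical name.

```python
def substring(text: str, start: int, end: int) -> str:
    """Return the text between two boundary positions (0 .. len(text))."""
    if not 0 <= start <= end <= len(text):
        raise IndexError("need 0 <= start <= end <= number of counting units")
    return text[start:end]


text = "a\U0001F600b"  # three characters, so four boundary positions: 0, 1, 2, 3

# A single character is simply a substring between two adjacent boundaries.
emoji = substring(text, 1, 2)
print(len(emoji))  # prints: 1

# The same text has a different length when counted in UTF-16 code units,
# so an index is only meaningful together with its counting unit.
utf16_length = len(text.encode("utf-16-le")) // 2
print(len(text), utf16_length)  # prints: 3 4
```

Returning a string rather than a dedicated single-character type also follows the advice that APIs avoid single characters as argument or return types.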

Referencing the Unicode Standard

Best practices checklist
  • Since specifications in general need both a definition for their characters and the semantics associated with these characters, specifications SHOULD include a reference to the Unicode Standard, whether or not they include a reference to ISO/IEC 10646. more
  • A generic reference to the Unicode Standard MUST be made if it is desired that characters allocated after a specification is published are usable with that specification. A specific reference to the Unicode Standard MAY be included to ensure that functionality depending on a particular version is available and will not change over time. more
  • All generic references to the Unicode Standard MUST refer to the latest version of the Unicode Standard available at the date of publication of the containing specification. more
  • All generic references to ISO/IEC 10646 MUST refer to the latest version of ISO/IEC 10646 available at the date of publication of the containing specification. more
How to's

Language

Establishing the language of a resource as a whole

Best practices checklist
  • It must be possible to indicate the default text-processing language for the resource as a whole. more
  • Consider whether it is necessary to have separate declarations for the text-processing language and metadata about the intended linguistic characteristics of the consumer. more
How to's

Establishing the language of blocks, paragraphs, or similar chunks of content

Best practices checklist
  • By default, blocks of content should inherit the text-processing language set for the resource as a whole. more
  • It should be possible to indicate a change in language for blocks of content where the language changes. more
How to's

Establishing the language of inline content spans

Best practices checklist
  • It should be possible to indicate language for spans of inline text where the language changes. more
How to's

Defining language values

Best practices checklist
  • Language values should be BCP47 language tags.
How to's

Providing for content negotiation based on language

Best practices checklist
  • In a multilingual environment it must be possible for the user to receive text in the language they prefer. This may depend on implicit user preferences based on the user's system or browser setup, or on user settings explicitly negotiated with the user.
How to's
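Where user preferences arrive as an HTTP Accept-Language header, negotiation could look roughly like the sketch below. The function names are hypothetical, and the matching is a simplified form of the lookup scheme for language tags (subtags are dropped from the end until a match is found), not a complete implementation.

```python
def parse_accept_language(header: str) -> list:
    """Return language tags from the header, most preferred first."""
    prefs = []
    for item in header.split(","):
        tag, _, params = item.strip().partition(";")
        if not tag:
            continue
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((tag.strip().lower(), quality))
    # Stable sort: tags with equal weight keep the order given by the user.
    prefs.sort(key=lambda pref: pref[1], reverse=True)
    return [tag for tag, quality in prefs if quality > 0]


def choose_language(header: str, available: list, default: str) -> str:
    """Pick the best available language for the user's preferences."""
    by_lowercase = {tag.lower(): tag for tag in available}
    for tag in parse_accept_language(header):
        candidate = tag
        while candidate:
            if candidate in by_lowercase:
                return by_lowercase[candidate]
            candidate = candidate.rpartition("-")[0]  # drop the last subtag
    return default


print(choose_language("fr-CH, fr;q=0.9, en;q=0.8", ["en", "fr", "de"], "en"))  # prints: fr
```

Settings explicitly negotiated with the user, where they exist, would normally take precedence over a header derived from the browser setup.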

Text direction

Setting the bidi direction for the resource as a whole

Best practices checklist
  • The content author must be able to indicate the RTL/LTR direction of the content as a whole, i.e. set the overall base direction.
  • The default text direction should be declared as LTR.

Establishing the bidi direction for blocks, paragraphs, or similar chunks of content

Best practices checklist
  • The content author must be able to indicate parts of the text where the base direction changes. This should be achieved using attributes or metadata at a block level, and not rely on Unicode control characters.
  • It must be possible to also set the direction for content fragments to auto. This means that the base direction will be determined by examining the content itself. A typical approach here would be to set the direction based on the first strong directional character outside of any markup, but this is not the only possible method. The algorithm used to determine directionality when direction is set to auto should match that expected by the receiver. more
  • If the overall base direction is set to auto for plain text, the direction of content paragraphs should be determined on a paragraph by paragraph basis.
  • To indicate the sides of a block of text relative to the start and end of its contained lines, you should use 'before' and 'after' (or perhaps block-start/block-end; the terminology is changing), rather than 'top' and 'bottom'.
  • To indicate the start/end of a line you should use 'start' and 'end' rather than 'left' and 'right'.
  • Provide dedicated attributes for control of base direction and bidirectional overrides; do not rely on the user applying style properties to arbitrary markup to achieve bidi control.
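
The first-strong approach mentioned in the checklist above can be sketched with Python's `unicodedata` module. The function names are hypothetical, and the sketch is simplified: it does not skip markup or isolated runs, which a full implementation would need to handle.

```python
import unicodedata


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Return 'ltr' or 'rtl' based on the first strong directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":           # strong left-to-right
            return "ltr"
        if bidi_class in ("R", "AL"):   # strong right-to-left (e.g. Hebrew, Arabic)
            return "rtl"
    return default  # no strong character found, e.g. digits only


def paragraph_directions(plain_text: str) -> list:
    """Resolve the base direction separately for each paragraph of plain text."""
    return [first_strong_direction(p) for p in plain_text.split("\n")]


print(paragraph_directions("Hello\n123 שלום"))  # prints: ['ltr', 'rtl']
```

Digits and spaces are not strong characters, so the second paragraph is resolved by its first Hebrew letter rather than by the leading number.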

Establishing the bidi direction for spans of inline content

Best practices checklist
  • It must be possible to indicate spans of inline text where the base direction changes. If markup is available, this is the preferred method. Otherwise your specification must require that Unicode control characters are recognized by the receiving application, and correctly implemented.
  • It must be possible to also set the direction for a span to auto. This means that the base direction will be determined by examining the content itself. A typical approach here would be to set the direction based on the first strong directional character outside of any markup. more
  • If users use Unicode bidirectional control characters, the RLI/LRI/FSI with PDI characters must be supported by the application and recommended (rather than RLE/LRE with PDF) by the spec.
  • Use of RLM/LRM should be appropriate, and expectations of what those controls can and cannot do should be clear in the spec. more
  • Provide dedicated attributes for control of base direction and bidirectional overrides; do not rely on the user applying style properties to arbitrary markup to achieve bidi control.
  • Allow bidi attributes on all inline elements in markup that contain text.
  • Provide attributes that allow the user to (a) create an embedded base direction or (b) override the bidirectional algorithm altogether; the attribute should allow the user to set the direction to LTR or RTL in either of these two scenarios.
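
Where markup is not available and Unicode control characters have to be used, the isolate characters recommended above can be applied as in this sketch. The `isolate` helper and its `direction` values are assumptions made for the example.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text so its direction does not affect the surrounding text."""
    initiator = {"ltr": LRI, "rtl": RLI, "auto": FSI}[direction]
    return initiator + text + PDI


# Inserting a user name of unknown direction into a left-to-right message.
message = "Posted by " + isolate("שלום") + " (3 replies)"
print(len(message))
```

Every initiator is closed by a matching PDI, so the inserted name cannot change the ordering of the text that follows it.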

Enabling vertical text display

Best practices checklist
  • It should be possible to render text vertically for languages such as Japanese, Chinese, Korean and Mongolian.
  • Vertical text must support line progression from LTR (e.g. Mongolian) and RTL (e.g. Japanese).

Setting box positioning coordinates when text direction varies

Best practices checklist
  • Box positioning coordinates must take into account whether the text is horizontal or vertical. more
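
One way to make box positioning independent of text direction is to define positions with logical sides and resolve them to physical sides per writing mode. The sketch below uses hypothetical mode names and the 'start'/'end'/'before'/'after' terms used earlier on this page; it illustrates the idea rather than any existing API.

```python
# Physical side for each logical side, per (hypothetical) writing mode.
PHYSICAL_SIDES = {
    "horizontal-ltr": {"start": "left", "end": "right", "before": "top", "after": "bottom"},
    "horizontal-rtl": {"start": "right", "end": "left", "before": "top", "after": "bottom"},
    # Vertical text with lines progressing right to left, e.g. Japanese.
    "vertical-rl": {"start": "top", "end": "bottom", "before": "right", "after": "left"},
    # Vertical text with lines progressing left to right, e.g. Mongolian.
    "vertical-lr": {"start": "top", "end": "bottom", "before": "left", "after": "right"},
}


def resolve_side(logical_side: str, writing_mode: str) -> str:
    """Map a logical side to the physical side used for layout."""
    return PHYSICAL_SIDES[writing_mode][logical_side]


print(resolve_side("before", "vertical-rl"))    # prints: right
print(resolve_side("start", "horizontal-rtl"))  # prints: right
```

A box anchored to the 'before' side then lands at the top in horizontal text and at the right edge in vertical Japanese text, without any change to the content.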

Typographic support

Miscellaneous

Best practices checklist
  • Line heights must allow for characters that are taller than those used for English.
  • Box sizes must allow for text expansion in translation.
  • Ruby text alongside base text should be supported for CJK text.
  • Line wrapping should take into account the special rules needed for non-Latin scripts. more
  • Avoid specifying presentational tags, such as b for bold, and i for italic. more

Plain text support

Miscellaneous

Best practices checklist
  • Avoid natural language text in elements that only allow for plain text and in attribute values.
  • Provide a span-like element that can be used for any text content to apply information needed for internationalization. more

Case distinctions

Miscellaneous

Best practices checklist
  • Identifiers should be case-sensitive.

Contact: ishida@w3.org.

Content last changed 2015-03-23 16:19 GMT