Internationalization Best Practices for Spec Developers

Abstract

This document provides a checklist of internationalization-related considerations when developing a specification. Most checklist items point to detailed supporting information in other documents. Where such information does not yet exist, it can be given a temporary home in this document. The dynamic page Internationalization Techniques: Developing specifications is automatically generated from this document. The current version is still an early draft, and it is expected that the information will change regularly as new content is added and existing content is modified in the light of experience and discussion.

2. Resources

Declaring language
Defining language values
Setting the default base direction
Defining resource identifiers

Here we are talking about an independent unit of data. Examples include a whole HTML page, an XML document, a JSON file, a WebVTT script, an annotation, etc.

2.1 Declaring language

Language information for a given resource can be used with two main objectives in mind: for text-processing, or as a statement of the intended use of the resource. We will explain the difference here.

2.1.1 Text-processing language information

The text-processing language is the language that is relevant for processing the content when it comes to spell-checking, styling, voice production, etc. For such operations, only one language should be identified at a time for a given range of content.

When specifying the language for text-processing purposes, you are declaring the language in which a specific range of text is actually written, so that user agents or applications that manipulate the text, such as voice browsers, spell checkers, or style processors, can apply the appropriate rules to the text in question. So we are, by necessity, talking about associating a single language with a specific range of text.

A language declaration that indicates the text-processing language for a range of text must be associated with a single language value. more

It is normal to express a text-processing language as the default language for processing the resource as a whole, but it may also be necessary to indicate where the language changes within the resource.

Use the HTML lang and XML xml:lang language attributes where appropriate, rather than creating a new attribute or mechanism. more

For example, XML provides xml:lang which can be used in all XML formats to identify the text-processing language for a range of text. It's useful to continue using that, since authors recognise it, as do XML processors.

2.1.2 Language metadata about the resource as a whole

It may also be useful to describe the language of a resource as a whole. This type of language declaration typically indicates the intended use of the resource. For example, such metadata may be used for searching, serving the right language version, classification, etc.

This type of language declaration differs from the text-processing declaration in that (a) the value for such declarations may be more than one language tag, and (b) the language value declared doesn't indicate which parts of a multilingual resource are in which language.

The language(s) describing the intended use of a resource do not necessarily include every language used in a document. Many documents on the Web contain embedded fragments of content in different languages, whereas the page is clearly aimed at speakers of one particular language. For example, a German city-guide for Beijing may contain useful phrases in Chinese, but it is aimed at a German-speaking audience, not a Chinese one.

On the other hand, it is also possible to imagine a situation where a document contains the same or parallel content in more than one language. For example, a web page may welcome Canadian readers with French content in the left column, and the same content in English in the right-hand column. Here the document is equally targeted at speakers of both languages, so there are two audience languages. Another use case is a blog or a news page aimed at a multilingual community, where some articles on a page are in one language and some in another. In this case, it may make sense to list more than one language tag as the value of the language declaration.

A metadata-type language declaration that indicates the intended use of the resource, rather than the language of a specific range of text, may be associated with multiple language values. more

2.1.3 Declaring language for the resource

The specification should indicate how to define the default text-processing language for the resource as a whole. more

It saves trouble to identify the language, or at least the default language, of the resource as a whole in one place. For example, in an HTML file, this is done by setting the lang attribute on the html element.

Content within the resource should inherit the text-processing language declared at the resource level, unless it is specifically overridden.
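
The inheritance rule above can be illustrated with a minimal sketch. It assumes an XML format that uses xml:lang; the function name effective_language and the sample document are hypothetical, chosen only for illustration. The function walks from an element up through its ancestors and returns the nearest declared language.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_language(root, target):
    """Return the nearest xml:lang value on target or its ancestors, or None."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        lang = node.get(XML_LANG)
        if lang is not None:
            return lang  # the closest declaration wins
        node = parents.get(node)
    return None


# Hypothetical sample: a German resource with one Chinese paragraph.
doc = ET.fromstring(
    '<guide xml:lang="de">'
    "<p>Willkommen in Peking</p>"
    '<p xml:lang="zh">你好</p>'
    "</guide>"
)
first, second = doc.findall("p")
```

Here the first paragraph inherits the resource-level language, while the second overrides it locally.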

Consider whether it is necessary to have separate declarations to indicate the text-processing language versus metadata about the expected use of the resource. more

In many cases a resource contains text in only one language, and in many more cases the language declared as the default language for text-processing is the same as the language that describes the metadata about the resource as a whole. In such cases it makes sense to have a single declaration.

It becomes problematic, however, to use a single declaration when it refers to more than one language unless there is a way to determine which one language should be used as the text-processing default.

If there is only one language declaration for a resource, and it has more than one language tag as a value, it must be possible to identify the default text-processing language for the resource. more
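
One possible way to satisfy this requirement is sketched below. It assumes, purely for illustration, that the declaration is a comma-separated list of language tags and that the specification designates the first tag as the text-processing default; neither the list syntax nor the first-tag rule comes from this document, and the function name is hypothetical.

```python
def default_text_processing_language(declaration):
    """Pick the default text-processing language from a comma-separated list.

    Assumption for this sketch: the first tag in the list is the default.
    """
    tags = [tag.strip() for tag in declaration.split(",") if tag.strip()]
    return tags[0] if tags else None
```

Whatever rule a specification chooses, the point is that the rule is stated explicitly so that all consumers pick the same language.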

2.1.4 Links

2.1.4.1 How to's

Declaring language

In Internationalization Best Practices for Spec Developers.
BCP 47

The IETF specification that indicates how to create language subtags and how to match them.

2.1.4.2 Background

Use cases for language information in web annotations

Description of use cases for annotations that illustrate the differences between text-processing and metadata types of language declaration.
Language tags in HTML and XML

An overview of how to create language tags using BCP 47.

2.2 Defining language values

Values for language declarations must use BCP 47. more

BCP 47 defines a method to combine subtags in order to create a much more powerful notation for language tags than that provided by the old ISO lists, but it is also backwards compatible with the ISO lists.

For an overview of the key features of BCP 47, see Language tags in HTML and XML.

Refer to BCP 47, not to RFC 5646. more

The link to and name of BCP 47 were created specifically so that there is an unchanging reference to the definition of Tags for the Identification of Languages. RFCs 3066, 4646 and 5646 are versions of BCP 47.

Be specific about what level of conformance you expect for language tags. The word "valid" has special meaning in BCP 47. Generally "well-formed" is a better choice.

Reference BCP 47 for language tag matching.
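
As an illustration of what language tag matching involves, the following is a minimal sketch of the basic filtering scheme from the matching part of BCP 47 (RFC 4647): a range matches a tag if the two are equal, ignoring case, or if the tag begins with the range followed by a hyphen. The function name basic_filter is hypothetical, and the sketch does not check that tags are well-formed.

```python
def basic_filter(language_range, tags):
    """Return the tags matched by a basic language range (basic filtering)."""
    wanted = language_range.lower()
    matches = []
    for tag in tags:
        candidate = tag.lower()
        # "*" matches everything; otherwise require equality or a
        # prefix match that ends exactly on a subtag boundary.
        if wanted == "*" or candidate == wanted or candidate.startswith(wanted + "-"):
            matches.append(tag)
    return matches
```

Note that the prefix test includes the hyphen, so the range "de" matches "de-CH" but not the unrelated tag "den".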

2.2.1 Links

2.2.1.1 How to's

Defining language values

In Internationalization Best Practices for Spec Developers.
BCP 47

The IETF specification that indicates how to create language subtags and how to match them.

2.2.1.2 Background

Language tags in HTML and XML

An overview of how to create language tags using BCP 47.

2.3 Setting the default base direction

In order to correctly display text written in a 'right-to-left' script, it is important to know the base direction that should be applied to that text. For more information see Unicode Bidirectional Algorithm basics.

Example 1

For example, the following annotation will not display correctly unless the application doing the display knows that the base direction needs to be right-to-left.

{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "id": "http://example.org/anno5",
  "type":"Annotation",
  "body": {
    "type" : "TextualBody",
    "text" : "<p>פעילות הבינאום, W3C</p>",
    "format" : "text/html",
    "language" : "he"
  },
  "target": "http://example.org/photo1"
}

You would expect this phrase to be displayed as

פעילות הבינאום, W3C

however, if there is no indication that the base direction should be right-to-left, the following incorrect display will be produced, with the Hebrew text to the left of W3C instead of to its right:

פעילות הבינאום, W3C

The spec should indicate how to define a default base direction for the resource as a whole, i.e. set the overall base direction. more

The default base direction, in the absence of other information, should be LTR. more

2.3.1 Base direction values

Values for the default base direction should include left-to-right, right-to-left, and auto. more

The auto value allows automatic detection of the base direction for a piece of text. For example, the auto value of dir in HTML looks for the first strong directional character in the text, ignoring certain items of markup, to guess the base direction of the text. Note that automatic detection algorithms are far from perfect. First-strong detection cannot correctly identify text that is really right-to-left but begins with a strong LTR character. Algorithms that attempt to judge the base direction from the contents of the text are also problematic. The best scenario is one where the base direction is known and declared.

Do not assume that direction can be determined from language information. more

The following are all reasons you cannot use language tags to provide information about base direction:

you can't produce the auto value with language tags.
some languages are written with either RTL or LTR scripts.
the only reliable part of the language tag that would indicate the base direction is the script subtag, but BCP 47 recommends that you suppress the script subtag for languages that don't usually need it, such as Hebrew (Suppress-Script: Hebr). Languages, such as Persian, that are usually written in an RTL script may be written in transcribed form, and it's not possible to guarantee that the necessary script subtag would be present to carry the directional information. In summary, you won't be able to rely on people supplying script subtags as part of the language information in order to influence direction.
the points at which language tags and base direction markers are applied don't necessarily coincide.
they are not semantically equivalent.

2.3.2 Links

2.3.2.1 How to's

Setting the default base direction

In Internationalization Best Practices for Spec Developers.
Unicode Bidirectional Algorithm

In Unicode® Standard Annex #9. Specifies the detail of how the bidirectional algorithm works.

2.3.2.2 Background

Unicode Bidirectional Algorithm basics

Article describing the basics about how the Unicode Bidirectional Algorithm works.

2.4 Defining resource identifiers

Resource identifiers must permit the use of characters outside those of plain ASCII. discussion

Specifications MUST define when the conversion from IRI references to URI references (or subsets thereof) takes place, in accordance with Internationalized Resource Identifiers (IRIs). more

Many current specifications already contain provisions in accordance with Internationalized Resource Identifiers (IRIs). For XML 1.0, see Section 4.2.2, External Entities. XML Schema Part 2: Datatypes provides the anyURI datatype (see Section 3.2.17). The XML Linking Language (XLink) provides the href attribute (see Section 5.4, Locator Attribute).

Document formats should allow IRIs to be used; handlers for protocols that do not currently support IRIs can convert the IRI to a URI when the IRI is dereferenced.
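
The core of the IRI-to-URI conversion can be sketched as follows: each non-ASCII character is encoded as UTF-8 and each resulting octet is percent-encoded. This is a simplified illustration under stated assumptions, not a complete implementation; the function name iri_to_uri is hypothetical, and the sketch treats the host like every other component instead of applying any IDNA-specific handling.

```python
def iri_to_uri(iri):
    """Percent-encode every non-ASCII character of an IRI as UTF-8 octets."""
    pieces = []
    for ch in iri:
        if ord(ch) < 0x80:
            pieces.append(ch)  # ASCII is passed through unchanged
        else:
            # One "%XX" triplet per UTF-8 octet of the character.
            pieces.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(pieces)
```

Because ASCII characters, including existing percent-escapes, pass through untouched, applying the function to a string that is already a URI leaves it unchanged.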

2.4.1 Links

2.4.1.1 How to's

Internationalized Resource Identifiers (IRIs)

3. Blocks, paragraphs, or similar chunks of content

Establishing the language
Setting the base direction
Defining language values

The words block and/or chunk are used here to refer to a structural component within the resource as a whole that groups content together and separates it from adjacent content. Boundaries between one block and another are equivalent to paragraph or section boundaries in text, or discrete data items inside a file.

For example, this could refer to a block or paragraph in XML or HTML, an object declaration in JSON, a cue in WebVTT, a line in a CSV file, etc. Contrast this with inline content, which describes a range within a paragraph, sentence, etc.

The interpretation of which structures defined in a spec are relevant to these requirements may require a little consideration, and will depend on the format of the data involved.

3.1 Establishing the language

By default, blocks of content should inherit the text-processing language set for the resource as a whole. more

See 2.1 Declaring language for guidance related to the default text-processing language information.

It should be possible to indicate a change in language for blocks of content where the language changes. more

3.1.1 Links

3.1.1.1 How to's

Establishing the language

In Internationalization Best Practices for Spec Developers.

3.2 Establishing the base direction

The content author must be able to indicate parts of the text where the base direction changes. At the block level, this should be achieved using attributes or metadata, and should not rely on Unicode control characters.

Relying on Unicode control characters to establish direction for every block is not feasible because line breaks terminate the effect of such control characters. It also makes the data much less stable, and unnecessarily difficult to manage if control characters have to appear at every point where they would be needed.

It must be possible to also set the direction for content fragments to auto. This means that the base direction will be determined by examining the content itself.

A typical approach here would be to set the direction based on the first strong directional character outside of any markup, but this is not the only possible method. The algorithm used to determine directionality when direction is set to auto should match that expected by the receiver.

The first-strong algorithm looks for the first character in the paragraph with a strong directional property according to the Unicode definitions. It then sets the base direction of the paragraph according to the direction of that character.

Note that the first-strong algorithm may incorrectly guess the direction of the paragraph when the first character is not typical of the rest of the paragraph, such as when a RTL paragraph or line starts with a LTR brand name or technical term.
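
A minimal sketch of first-strong detection follows, using the bidirectional character classes that Python's unicodedata module exposes. The function name and the LTR fallback are choices made for this illustration; the sketch does not skip markup or isolated sequences, which a production algorithm would need to consider.

```python
import unicodedata


def first_strong_direction(text, default="ltr"):
    """Guess the base direction from the first strong directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):  # Hebrew-type and Arabic-type letters
            return "rtl"
    # No strong character found (e.g. only digits or punctuation).
    return default
```

For the string "W3C פעילות הבינאום" this returns "ltr" even if the paragraph is meant to be right-to-left, which is exactly the failure mode described above.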

For additional information about algorithms for detecting direction, see Estimation algorithms in the document where this was discussed with reference to HTML.

If the overall base direction is set to auto for plain text, the direction of content paragraphs should be determined on a paragraph by paragraph basis.

To indicate the sides of a block of text relative to the start and end of its contained lines, you should use 'before' and 'after' (or possibly block-start/block-end; the terminology is changing), rather than 'top' and 'bottom'.

To indicate the start/end of a line you should use 'start' and 'end' rather than 'left' and 'right'.

Provide dedicated attributes for control of base direction and bidirectional overrides; do not rely on the user applying style properties to arbitrary markup to achieve bidi control.

For example, HTML has a dir attribute that is capable of managing base direction without assistance from CSS styling. XML formats should define dedicated markup to represent directional information, even if they need CSS to achieve the required display, since the text may be used in other ways.

Style sheets such as CSS may not always be used with the data, or carried with the data when it is syndicated, etc. Directional information is fundamentally important to correct display of the data, and should be associated more closely and more permanently with the markup or data.

3.2.1 Links

3.2.1.1 How to's

Estimation algorithms

In Additional Requirements for Bidi in HTML & CSS.

Here we refer to information that needs to be provided for a range of characters in the middle of a paragraph or string.

4.1 Establishing the language

It should be possible to indicate language for spans of inline text where the language changes. more

Where a switch in language can affect operations on the content, such as spell-checking, rendering, styling, voice production, translation, information retrieval, and so forth, it is necessary to indicate the range of text affected and identify the language of that content.

4.1.1 Links

4.1.1.1 How to's

Establishing the language

In Internationalization Best Practices for Spec Developers.

4.2 Setting base direction

It must be possible to indicate spans of inline text where the base direction changes. If markup is available, this is the preferred method. Otherwise your specification must require that Unicode control characters are recognized by the receiving application, and correctly implemented.

It must be possible to also set the direction for a span to auto. This means that the base direction will be determined by examining the content itself. A typical approach here would be to set the direction based on the first strong directional character outside of any markup. more

For additional information about algorithms for detecting direction, see Estimation algorithms in the document where this was discussed with reference to HTML.

If users use Unicode bidirectional control characters, the spec must require applications to support the isolate characters RLI, LRI, and FSI with PDI, and must recommend them in preference to the embedding characters RLE and LRE with PDF.
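
Where markup is not available, wrapping a span in isolate controls might look like the sketch below. The code points are those assigned by Unicode to LRI, RLI, FSI, and PDI; the function name isolate and the direction keywords are hypothetical.

```python
ISOLATE_INITIATORS = {
    "ltr": "\u2066",   # LEFT-TO-RIGHT ISOLATE (LRI)
    "rtl": "\u2067",   # RIGHT-TO-LEFT ISOLATE (RLI)
    "auto": "\u2068",  # FIRST STRONG ISOLATE (FSI)
}
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE, closes any of the three


def isolate(text, direction="auto"):
    """Wrap text in a directional isolate so it cannot affect its surroundings."""
    return ISOLATE_INITIATORS[direction] + text + PDI
```

Every initiator is paired with a PDI, so the range of the embedded text has an explicit start and end.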

Use of RLM/LRM should be appropriate, and expectations of what those controls can and cannot do should be clear in the spec. more

The Unicode bidirectional control characters U+200F RIGHT-TO-LEFT MARK and U+200E LEFT-TO-RIGHT MARK are not sufficient on their own to manage bidirectional text. They cannot produce a different base direction for embedded text. For that you need to be able to indicate the start and end of the range of the embedded text. This is best done by markup, if available, or failing that using the other Unicode bidirectional controls mentioned just above.

Provide dedicated attributes for control of base direction and bidirectional overrides; do not rely on the user applying style properties to arbitrary markup to achieve bidi control.

Allow bidi attributes on all inline elements in markup that contain text.

Provide attributes that allow the user to (a) create an embedded base direction or (b) override the bidirectional algorithm altogether; the attribute should allow the user to set the direction to LTR or RTL in either of these two scenarios.

5. Characters

Choosing a definition of 'character'
Defining a Reference Processing Model
Including and excluding character ranges
Using the Private Use Area
Choosing character encodings
Identifying character encodings
Designing character escapes
Storing text
Specifying sort and search functionality
Defining 'string'
Indexing strings
Referencing the Unicode Standard

See the Character Model for the World Wide Web: Fundamentals for basic guidelines related to the use of characters and encodings.

See the Encoding specification for further guidelines related to use of character encodings.

Another Character Model document is currently in development, entitled String Matching and Searching. It looks at issues that arise when you try to compare two strings, be it identifiers or authored content.

5.1 Choosing a definition of 'character'

Specifications, software and content MUST NOT require or depend on a one-to-one correspondence between characters and the sounds of a language. more

Specifications, software and content MUST NOT require or depend on a one-to-one mapping between characters and units of displayed text. more

Protocols, data formats and APIs MUST store, interchange or process text data in logical order. more

Independent of whether some implementation uses logical selection or visual selection, characters selected MUST be kept in logical order in storage. more

Specifications of protocols and APIs that involve selection of ranges SHOULD provide for discontiguous logical selections, at least to the extent necessary to support implementation of visual selection on screen on top of those protocols and APIs. more

Specifications and software MUST NOT require nor depend on a single keystroke resulting in a single character, nor that a single character be input with a single keystroke (even with modifiers), nor that keyboards are the same all over the world. more

Specifications, software and content MUST NOT require or depend on a one-to-one relationship between characters and units of physical storage. more

When specifications use the term 'character' the specifications MUST define which meaning they intend. more

Specifications SHOULD use specific terms, when available, instead of the general term 'character'. more
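
To see why a one-to-one relationship between characters and units of physical storage cannot be assumed, the following sketch counts the same text in three different units. The function name storage_profile is hypothetical.

```python
def storage_profile(text):
    """Count a string in code points, UTF-8 bytes, and UTF-16 code units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # Each UTF-16 code unit is two bytes; "-le" avoids adding a BOM.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


clef = "\U0001D11E"       # MUSICAL SYMBOL G CLEF, outside the BMP
e_acute = "e\u0301"       # "e" followed by COMBINING ACUTE ACCENT
```

The clef is one code point but four UTF-8 bytes and two UTF-16 code units, while the accented letter that a reader perceives as a single character is two code points. A specification that says 'character' therefore needs to state which of these units it means.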

5.1.1 Links

5.1.1.1 How to's

Perceptions of Characters

In W3C Recommendation, Character Model for the World Wide Web.

5.1.2 See also

Defining 'string'.

5.2 Defining a Reference Processing Model

Textual data objects defined by protocol or format specifications MUST be in a single character encoding. more

All specifications that involve processing of text MUST specify the processing of text according to the Reference Processing Model described by the rest of the recommendations in this list. more

Specifications MUST define text in terms of Unicode characters, not bytes or glyphs. more

For their textual data objects specifications MAY allow use of any character encoding which can be transcoded to a Unicode encoding form. more

Specifications MAY choose to disallow or deprecate some character encodings and to make others mandatory. Independent of the actual character encoding, the specified behavior MUST be the same as if the processing happened as follows: (a) The character encoding of any textual data object received by the application implementing the specification MUST be determined and the data object MUST be interpreted as a sequence of Unicode characters - this MUST be equivalent to transcoding the data object to some Unicode encoding form, adjusting any character encoding label if necessary, and receiving it in that Unicode encoding form, (b) All processing MUST take place on this sequence of Unicode characters, (c) If text is output by the application, the sequence of Unicode characters MUST be encoded using a character encoding chosen among those allowed by the specification. more

If a specification is such that multiple textual data objects are involved (such as an XML document referring to external parsed entities), it MAY choose to allow these data objects to be in different character encodings. In all cases, the Reference Processing Model MUST be applied to all textual data objects. more
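
The three steps of the Reference Processing Model can be sketched as follows. The function name process_text is hypothetical, and upper-casing stands in for whatever processing a real specification defines; the point is only that processing happens on Unicode characters, never on the incoming bytes.

```python
def process_text(data, source_encoding, target_encoding="utf-8"):
    """Decode, process as Unicode, then encode, mirroring steps (a) to (c)."""
    # (a) Interpret the received bytes as a sequence of Unicode characters.
    text = data.decode(source_encoding)
    # (b) All processing takes place on Unicode characters.
    #     Upper-casing is only a stand-in for real processing.
    processed = text.upper()
    # (c) Encode the output using an encoding the specification allows.
    return processed.encode(target_encoding)
```

The same input text yields the same output regardless of the encoding it arrived in, which is the behavior the model requires.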

5.2.1 Links

5.2.1.1 How to's

Digital Encoding of Characters

In W3C Recommendation, Character Model for the World Wide Web.

5.3 Including and excluding character ranges

Specifications SHOULD NOT arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive. more

Specifications MUST NOT allow code points above U+10FFFF. more

Specifications SHOULD NOT allow the use of codepoints reserved by Unicode for internal use. more

Specifications MUST NOT allow the use of surrogate code points. more

Specifications SHOULD exclude compatibility characters in the syntactic elements (markup, delimiters, identifiers) of the formats they define. more
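
The range restrictions above might be checked as in the sketch below. It assumes that "codepoints reserved by Unicode for internal use" refers to the Unicode noncharacters (U+FDD0 to U+FDEF and the last two code points of each plane); that reading, and the function names, are assumptions of this illustration. The compatibility-character guideline is not covered.

```python
def is_surrogate(cp):
    """True for surrogate code points, which never encode a character alone."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp):
    """True for Unicode noncharacters (assumed meaning of 'internal use')."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_acceptable_code_point(cp):
    """Accept U+0000..U+10FFFF minus surrogates and noncharacters."""
    if cp < 0 or cp > 0x10FFFF:
        return False
    return not is_surrogate(cp) and not is_noncharacter(cp)
```

Everything else in the Unicode range is accepted, in line with the guideline not to exclude code points arbitrarily.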

5.3.1 Links

5.3.1.1 How to's

Digital Encoding of Characters

In W3C Recommendation, Character Model for the World Wide Web.

5.3.2 See also

Using the Private Use Area.

5.4 Using the Private Use Area

Specifications MUST NOT require the use of private use area characters with particular assignments. more

Specifications MUST NOT require the use of mechanisms for defining agreements of private use code points. more

Specifications and implementations SHOULD NOT disallow the use of private use code points by private agreement. more

Specifications MAY define markup to allow the transmission of symbols not in Unicode or to identify specific variants of Unicode characters. more

Specifications SHOULD allow the inclusion of or reference to pictures and graphics where appropriate, to eliminate the need to (mis)use character-oriented mechanisms for pictures or graphics. more

5.4.1 Links

5.4.1.1 How to's

Private use code points

In W3C Recommendation, Character Model for the World Wide Web.

5.5 Choosing character encodings

Specifications MUST either specify a unique character encoding, or provide character encoding identification mechanisms such that the encoding of text can be reliably identified. more

When designing a new protocol, format or API, specifications SHOULD require a unique character encoding. more

When basing a protocol, format, or API on a protocol, format, or API that already has rules for character encoding, specifications SHOULD use rather than change these rules. more

When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32. more

This guideline needs further consideration: UTF-16 and UTF-32 are not recommended these days. UTF-8 is the recommended encoding.

Specifications SHOULD avoid using the terms 'character set' and 'charset' to refer to a character encoding, except when the latter is used to refer to the MIME charset parameter or its IANA-registered values. The term 'character encoding', or in specific cases the terms 'character encoding form' or 'character encoding scheme', are RECOMMENDED. more

If the unique encoding approach is not taken, specifications SHOULD require the use of the IANA charset registry names, and in particular the names identified in the registry as 'MIME preferred names', to designate character encodings in protocols, data formats and APIs. more

This guideline needs further consideration: the character encodings recommended for Web specifications are listed in the Encoding specification.

Character encodings that are not in the IANA registry SHOULD NOT be used, except by private agreement. more

If an unregistered character encoding is used, the convention of using 'x-' at the beginning of the name MUST be followed. more

If the unique encoding approach is not chosen, specifications MUST designate at least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible character encodings and SHOULD choose at least one of UTF-8 or UTF-16 as required encoding forms (encoding forms that MUST be supported by implementations of the specification). more

Specifications that require a default encoding MUST define either UTF-8 or UTF-16 as the default, or both if they define suitable means of distinguishing them. more

5.5.1 Links

5.5.1.1 How to's

Choice and identification of code points

In W3C Recommendation, Character Model for the World Wide Web.

5.5.1.2 Background reading

Document character set

What is the 'Document Character Set' for XML and HTML, and how does it relate to the encodings I use for my documents?

5.6 Identifying character encodings

Specifications MUST NOT propose the use of heuristics to determine the encoding of data. more

Specifications MUST define conflict-resolution mechanisms (e.g. priorities) for cases where there is multiple or conflicting information about character encoding. more
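
A conflict-resolution mechanism based on priorities could be as simple as the sketch below. The two information sources, their order (transport-level label first, in-document declaration second), and the UTF-8 fallback are all hypothetical choices made for illustration; a real specification would define its own sources and order.

```python
def resolve_encoding(transport_label=None, in_document_label=None, default="utf-8"):
    """Return the encoding label from the highest-priority source that has one."""
    # Assumed priority for this sketch: transport metadata, then the
    # declaration inside the document, then the specification's default.
    for label in (transport_label, in_document_label):
        if label:
            return label.strip().lower()
    return default
```

Because the order is fixed and no guessing from the content is involved, two implementations given the same inputs always reach the same answer.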

5.6.1 Links

5.6.1.1 How to's

Choice and identification of code points

In W3C Recommendation, Character Model for the World Wide Web.

5.7 Designing character escapes

Specifications should provide a mechanism for escaping characters, particularly those which are invisible or ambiguous. more

It is generally recommended that character escapes be provided so that sequences that are difficult to enter or edit can be introduced using a plain text editor. Escape sequences are particularly useful for invisible or ambiguous Unicode characters, including zero-width spaces, soft hyphens, various bidi controls, Mongolian vowel separators, etc.

For advice on use of escapes in markup, but which is mostly generalisable to other formats, see Using character escapes in markup and CSS.

Specifications SHOULD NOT invent a new escaping mechanism if an appropriate one already exists. more

The number of different ways to escape a character SHOULD be minimized (ideally to one). more

Escape syntax SHOULD require either explicit end delimiters or a fixed number of characters in each character escape. Escape syntaxes where the end is determined by any character outside the set of characters admissible in the character escape itself SHOULD be avoided. more

Whenever specifications define character escapes that allow the representation of characters using a number, the number MUST represent the Unicode code point of the character and SHOULD be in hexadecimal notation. more

Escaped characters SHOULD be acceptable wherever their unescaped forms are; this does not preclude that syntax-significant characters, when escaped, lose their significance in the syntax. In particular, if a character is acceptable in identifiers and comments, then its escaped form should also be acceptable. more
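
Several of the guidelines above can be seen together in a small sketch of an escape syntax that uses the hexadecimal code point with an explicit end delimiter. The \u{...} form is borrowed here only as an example of such a syntax, and the function names are hypothetical. The sketch escapes only format characters (general category Cf, which covers zero-width spaces, soft hyphens, and bidi controls) and does not escape literal backslashes, so it is not a complete round-trip scheme.

```python
import re
import unicodedata

# Hexadecimal code point, closed by an explicit "}" delimiter.
ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")


def escape_invisible(text):
    """Replace invisible format characters (category Cf) with \\u{HEX} escapes."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            out.append(f"\\u{{{ord(ch):X}}}")
        else:
            out.append(ch)
    return "".join(out)


def _decode(match):
    cp = int(match.group(1), 16)
    # Leave the escape untouched if it is beyond the Unicode range.
    return chr(cp) if cp <= 0x10FFFF else match.group(0)


def unescape(text):
    """Expand \\u{HEX} escapes back into the characters they denote."""
    return ESCAPE.sub(_decode, text)
```

The closing brace means an escape can be followed immediately by hexadecimal digits in the text without ambiguity.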

5.7.1 Links

5.7.1.1 How to's

Character escaping

In W3C Recommendation, Character Model for the World Wide Web.

5.8 Storing text

Protocols, data formats and APIs MUST store, interchange or process text data in logical order. more

5.8.1 Links

5.8.1.1 How to's

Visual rendering and logical order

In W3C Recommendation, Character Model for the World Wide Web.

5.9 Specifying sort and search functionality

Software that sorts or searches text for users SHOULD do so on the basis of appropriate collation units and ordering rules for the relevant language and/or application. more

Where searching or sorting is done dynamically, particularly in a multilingual environment, the 'relevant language' SHOULD be determined to be that of the current user, and may thus differ from user to user. more

Software that allows users to sort or search text SHOULD allow the user to select alternative rules for collation units and ordering. more

Specifications and implementations of sorting and searching algorithms SHOULD accommodate text that contains any character in Unicode. more

5.9.1 Links

5.9.1.1 How to's

Units of collation

In W3C Recommendation, Character Model for the World Wide Web.

5.10 Converting to a Common Unicode Form

Specifications of text-based formats and protocols MAY specify that all or part of the textual content of that format or protocol is normalized using Unicode Normalization Form C (NFC). more

Specifications that do not normalize MUST document or provide a health-warning if canonically equivalent but disjoint Unicode character sequences represent a security issue. more

Specifications and implementations MUST NOT assume that content is in any particular normalization form. more

Specifications MUST specify that string matching takes the form of "code point-by-code point" comparison of the Unicode character sequence, or, if a specific Unicode character encoding is specified, code unit-by-code unit comparison of the sequences. more

Specifications that define a regular expression syntax MUST provide at least Basic Unicode Level 1 support per Unicode Technical Standard #18: Unicode Regular Expressions and SHOULD provide Extended or Tailored (Levels 2 and 3) support. more

Specifications of text-based formats and protocols that, as part of their syntax definition, require that the text be in normalized form MUST define string matching in terms of normalized string comparison and MUST define the normalized form to be NFC. more

A normalizing text-processing component which receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed. more

Specifications of text-based languages and protocols SHOULD define precisely the construct boundaries necessary to obtain a complete definition of full-normalization. These definitions SHOULD include at least the boundaries between markup and character data as well as entity boundaries (if the language has any include mechanism), SHOULD include any other boundary that may create denormalization when instances of the language are processed, but SHOULD NOT include character escapes designed to express arbitrary characters. more

Where operations can produce denormalized output from normalized text input, specifications of API components (functions/methods) that implement these operations MUST define whether normalization is the responsibility of the caller or the callee. Specifications MAY state that performing normalization is optional for some API components; in this case the default SHOULD be that normalization is performed, and an explicit option SHOULD be used to switch normalization off. Specifications SHOULD NOT make the implementation of normalization optional. more

Specifications that define a mechanism (for example an API or a defining language) for producing textual data objects SHOULD require that the final output of this mechanism be normalized. more
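
The following sketch shows, under stated assumptions, how canonically equivalent sequences differ in a code point-by-code point comparison, how a component might confirm or restore NFC before a normalization-sensitive operation, and how concatenation can produce denormalized output from normalized input. It uses Python's standard unicodedata module (is_normalized requires Python 3.8 or later); the helper name ensure_nfc is hypothetical.

```python
import unicodedata

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Code point-by-code point comparison treats the two sequences as different.
raw_equal = precomposed == decomposed

# After conversion to NFC they compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)


def ensure_nfc(text: str) -> str:
    """Hypothetical helper: inspect first, re-normalize only if needed."""
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)


# Both operands are in NFC on their own, but their concatenation is not,
# so either the caller or the callee has to normalize the result.
left, right = "e", "\u0301"
joined = left + right
joined_is_nfc = unicodedata.is_normalized("NFC", joined)

print(raw_equal, nfc_equal, joined_is_nfc)
```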

5.10.1 Links

5.10.1.1 How to's

Converting to a Common Unicode Form

In W3C Working Draft, Character Model for the World Wide Web: String Matching and Searching.

5.11 Handling Case Folding

Case sensitive matching is RECOMMENDED as the default for new protocols and formats. more

Because the "simple" case-fold mapping removes information that can be important to forming an identity match, the "Common plus Full" (or "Unicode C+F") case fold mapping is RECOMMENDED for Unicode case-insensitive matching. more

ASCII case-insensitive matching MUST only be applied to vocabularies that are restricted to ASCII. Unicode case-insensitivity MUST be used for all other vocabularies. more

If the vocabulary is not restricted to ASCII or permits user-defined values that use a broader range of Unicode, ASCII case-insensitive matching MUST NOT be required. more

The Unicode C+F case-fold form is RECOMMENDED as the case-insensitive matching for vocabularies. The Unicode C+S form MUST NOT be used for string identity matching on the Web. more

Specifications and implementations that define string matching as part of the definition of a format, protocol, or formal language (which might include operations such as parsing, matching, tokenizing, etc.) MUST define the criteria and matching forms used. These MUST be one of: (a) Case-sensitive (b) Unicode case-insensitive using Unicode case-folding C+F (c) ASCII case-insensitive. more

Specifications SHOULD NOT specify case-insensitive comparison of strings. more

Specifications that specify case-insensitive comparison for non-ASCII vocabularies SHOULD specify Unicode case-folding C+F. more

Specifications MAY specify ASCII case-insensitive comparison for portions of a format or protocol that are restricted to an ASCII-only vocabulary. more

Specifications and implementations MUST NOT specify ASCII-only case-insensitive matching for values or constructs that permit non-ASCII characters. more
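
The three matching forms named above can be sketched as follows. The function names are assumptions made for this example. Python's str.casefold() is documented as implementing Unicode full case folding, which is used here to stand in for C+F; the sketch does not add normalization, which a complete caseless match for canonically equivalent text would also need.

```python
def case_sensitive_equal(a: str, b: str) -> bool:
    """(a) Case-sensitive: plain code point comparison."""
    return a == b


def ascii_lower(s: str) -> str:
    """Map only A-Z to a-z and leave every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def ascii_ci_equal(a: str, b: str) -> bool:
    """(c) ASCII case-insensitive: suitable only for ASCII-only vocabularies."""
    return ascii_lower(a) == ascii_lower(b)


def unicode_ci_equal(a: str, b: str) -> bool:
    """(b) Unicode case-insensitive using full case folding."""
    return a.casefold() == b.casefold()


print(ascii_ci_equal("Content-Type", "content-type"))
print(ascii_ci_equal("\u00C4rger", "\u00E4rger"))       # Ärger vs ärger
print(unicode_ci_equal("\u00C4rger", "\u00E4rger"))
print(unicode_ci_equal("Stra\u00DFe", "STRASSE"))       # Straße vs STRASSE
```

The last two comparisons show why ASCII-only folding is unsuitable once values can contain non-ASCII characters: it silently fails to match strings that differ only in the case of a non-ASCII letter.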

5.11.1 Links

5.11.1.1 How to's

Handling Case Folding

In W3C Working Draft, Character Model for the World Wide Web: String Matching and Searching.

5.12 Defining 'string'

Specifications SHOULD NOT define a string as a 'byte string'. more

The 'character string' definition SHOULD be used by most specifications. more
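
A short sketch of the difference between the two definitions, assuming UTF-8 as the encoding and a sample string chosen for this example: the byte string has a different length from the character string, and cutting it at an arbitrary offset can split a character.

```python
text = "na\u00EFve \U0001F600"   # "naïve" followed by a space and an emoji

char_count = len(text)            # counts Unicode code points
utf8_bytes = text.encode("utf-8")
byte_count = len(utf8_bytes)      # counts bytes in the UTF-8 encoding

# Cutting the byte string after three bytes splits the two-byte "ï".
truncated = utf8_bytes[:3]
try:
    truncated.decode("utf-8")
    byte_cut_is_valid = True
except UnicodeDecodeError:
    byte_cut_is_valid = False

# The same cut on the character string lands on a character boundary.
char_prefix = text[:3]

print(char_count, byte_count, byte_cut_is_valid, char_prefix)
```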

5.12.1 Links

5.12.1.1 How to's

String concepts

In W3C Recommendation, Character Model for the World Wide Web.

5.13 Indexing strings

The character string is RECOMMENDED as a basis for string indexing. more

A code unit string MAY be used as a basis for string indexing if this results in a significant improvement in the efficiency of internal operations when compared to the use of a character string. more

Grapheme clusters MAY be used as a basis for string indexing in applications where user interaction is the primary concern. more

Specifications that define indexing in terms of grapheme clusters MUST either: (a) define grapheme clusters in terms of default grapheme clusters as defined in Unicode Standard Annex #29, Text Boundaries [UAX #29], or (b) define specifically how tailoring is applied to the indexing operation. more

Need to check the above recommendation, since extended grapheme clusters are now recommended.

The use of byte strings for indexing is NOT RECOMMENDED. more

Specifications that need a way to identify substrings or a point within a string SHOULD provide ways other than string indexing to perform this operation. more

Specifications SHOULD understand and process single characters as substrings, and treat indices as boundary positions between counting units, regardless of the choice of counting units. more

Specifications of APIs SHOULD NOT specify single characters or single 'units of encoding' as argument or return types. more

When the positions between the units are counted for string indexing, starting with an index of 0 for the position at the start of the string is the RECOMMENDED solution, with the last index then being equal to the number of counting units in the string. more
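
The boundary-position model can be sketched as follows, assuming code points as the counting unit. The helper names substring and utf16_offset are hypothetical. Indices run from 0 at the start of the string to the number of counting units at the end, a single character is addressed as a one-unit substring, and the same boundary has a different index when UTF-16 code units are counted instead.

```python
def substring(s: str, start: int, end: int) -> str:
    """Return the text between two boundary positions (code point units)."""
    if not 0 <= start <= end <= len(s):
        raise IndexError("boundaries must satisfy 0 <= start <= end <= len(s)")
    return s[start:end]


def utf16_offset(s: str, boundary: int) -> int:
    """Convert a code point boundary to the equivalent UTF-16 code unit boundary."""
    return len(s[:boundary].encode("utf-16-le")) // 2


text = "a\U0001F600b"             # "a", an emoji outside the BMP, "b"

last_boundary = len(text)         # equals the number of counting units
single = substring(text, 1, 2)    # a single character is a one-unit substring
empty = substring(text, 3, 3)     # a boundary with nothing between is empty

print(last_boundary, utf16_offset(text, 2), utf16_offset(text, last_boundary))
```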

5.13.1 Links

5.13.1.1 How to's

String indexing

In W3C Recommendation, Character Model for the World Wide Web.

5.13.2 See also

Defining 'string'.

5.14 Referring to Unicode characters

Use U+XXXX syntax to represent Unicode code points in the specification. more

The U+XXXX format is well understood when referring to Unicode code points in a specification. Code points in a sequence are separated by spaces, and no additional decoration is needed. A code point may be written with four, five, or six hexadecimal digits; when fewer than four digits are needed, the number is zero-filled, e.g. U+0020.
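
As a small sketch of the notation, the hypothetical helper below renders each code point of a string in U+XXXX form, zero-filled to at least four uppercase hexadecimal digits and separated by spaces.

```python
def u_plus(text: str) -> str:
    """Render each code point as U+XXXX, space-separated."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print(u_plus(" "))                   # U+0020
print(u_plus("\u00E9\U0001F600"))    # U+00E9 U+1F600
```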

5.15 Referencing the Unicode Standard

Since specifications in general need both a definition for their characters and the semantics associated with these characters, specifications SHOULD include a reference to the Unicode Standard, whether or not they include a reference to ISO/IEC 10646. more

A generic reference to the Unicode Standard MUST be made if it is desired that characters allocated after a specification is published are usable with that specification. A specific reference to the Unicode Standard MAY be included to ensure that functionality depending on a particular version is available and will not change over time. more

All generic references to the Unicode Standard MUST refer to the latest version of the Unicode Standard available at the date of publication of the containing specification. more

All generic references to ISO/IEC 10646 MUST refer to the latest version of ISO/IEC 10646 available at the date of publication of the containing specification. more

5.15.1 Links

5.15.1.1 How to's

Referencing the Unicode Standard and ISO/IEC 10646

In W3C Recommendation, Character Model for the World Wide Web.
