
Missing characters and glyphs

Intended audience: XHTML/HTML coders and content authors (using editors or scripting).

Problem description

Web technology is based on the character repertoire of Unicode/ISO 10646 (see Character Model). Unicode contains a huge number of characters covering a wide range of scripts and languages. However, in some cases, there may be something missing:

  1. The character does not exist. Proposed solutions include encoding the character, markup for individual characters, and Private Use Codepoints.
  2. The character exists, but you want to select a particular glyph variant.
  3. A character exists, but the glyph to display it isn't available. This can be solved by technologies such as Web fonts and SVG fonts.
  4. The character exists in Unicode/ISO 10646, but not in the character encoding used for the document. In this case, use Numeric Character References (NCRs, example: &#x5678; for 噸).
  5. The character exists in Unicode/ISO 10646, but you want to use a name for it rather than code it directly. You can use named entities defined in a DTD (e.g. &eacute; for é in (X)HTML). Other solutions have been proposed, such as xmlchar, which uses an element per character and an XSLT to convert them.

Points 1 and 2 are often subsumed under the term 'gaiji problem'.
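
For case 4 in the list above, the following is a minimal sketch of how a script might fall back to NCRs when the target encoding cannot represent a character. The function name `to_ncr` and the sample string are illustrative assumptions, not part of any specification.

```python
def to_ncr(text: str, encoding: str) -> str:
    """Replace characters that `encoding` cannot represent with hexadecimal NCRs.

    `to_ncr` is a hypothetical helper written for this illustration.
    """
    out = []
    for ch in text:
        try:
            ch.encode(encoding)          # representable: keep the character as-is
            out.append(ch)
        except UnicodeEncodeError:       # not representable: emit &#xHHHH;
            out.append(f"&#x{ord(ch):X};")
    return "".join(out)


sample = "1 噸 café"
print(to_ncr(sample, "iso-8859-1"))
# Python's built-in error handler does the same job, but emits decimal NCRs
# and returns bytes in the target encoding.
print(sample.encode("iso-8859-1", errors="xmlcharrefreplace"))
```

Hexadecimal NCRs are used in the helper because they match the U+XXXX notation of the Unicode code charts; decimal NCRs are equally valid.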

Use Cases

East Asian ideographs (see also Gaiji), mathematical symbols, special ligatures, and so on.

Selecting Glyph Variants

Often it is important that a particular glyph is displayed for a certain character. Styling with CSS or XSL can take care of size, font style, and some other properties. But sometimes, there is a need for more specific glyph variants. There are various proposals to do this:

Unicode glyph variant selectors
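
Assuming this item refers to Unicode variation selectors, the sketch below shows how a base character and a selector combine into a variation sequence. The helper names are hypothetical; the code covers only the two main selector blocks (U+FE00-U+FE0F and U+E0100-U+E01EF) and does not check whether a given sequence is actually registered or supported by any font.

```python
def is_variation_selector(ch: str) -> bool:
    """True for VS1-VS16 (U+FE00-U+FE0F) and VS17-VS256 (U+E0100-U+E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def with_variant(base: str, selector_number: int) -> str:
    """Append variation selector number 1..256 to a base character."""
    if 1 <= selector_number <= 16:
        return base + chr(0xFE00 + selector_number - 1)
    if 17 <= selector_number <= 256:
        return base + chr(0xE0100 + selector_number - 17)
    raise ValueError("selector_number must be between 1 and 256")


def strip_variation_selectors(text: str) -> str:
    """Fall back to the plain characters by dropping all selectors."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


sequence = with_variant("葛", 17)           # base character followed by VS17
print([f"U+{ord(ch):04X}" for ch in sequence])
print(strip_variation_selectors(sequence))  # the base character alone
```

A useful property of this approach is that software unaware of the selector can still display the base character, so the text degrades to a readable fallback.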

Markup for Individual Characters and Symbols

The idea is to define a special element with attributes providing or pointing to the necessary information to process or render the character. This leads to an extremely localized, and therefore extremely flexible and stable solution. The actual markup may look very similar to the one used for selecting glyph variants, the main distinction being that there is no character content that serves as a fallback (in some cases, the element content may be a primitive fallback such as <html:img>, or a private use codepoint is used).
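
As a rough illustration of the idea, the sketch below builds such an element with Python's standard `xml.etree.ElementTree`. The element name `missing-char`, the `ref` attribute, the URL, and the image file name are all hypothetical; they are not taken from any of the proposals discussed here, and the `img` fallback is left without a namespace for brevity.

```python
import xml.etree.ElementTree as ET


def missing_char(ref: str, fallback_img: str, description: str) -> ET.Element:
    """Build a hypothetical element that stands in for one missing character.

    `ref` points to a description of the character; the only content is a
    primitive image fallback, so there is no character content at all.
    """
    el = ET.Element("missing-char", {"ref": ref})
    ET.SubElement(el, "img", {"src": fallback_img, "alt": description})
    return el


paragraph = ET.Element("p")
paragraph.text = "The name is written "
char = missing_char(
    "https://example.org/chars#c0042",   # assumed identifier, for illustration
    "c0042.png",
    "variant form of an ideograph",
)
char.tail = "."
paragraph.append(char)
print(ET.tostring(paragraph, encoding="unicode"))
```

Because everything needed to interpret the character travels with the element itself, copying the paragraph into another document does not lose information.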

Examples that define markup for individual characters:

Is there a need for a generic element or attribute that could be used widely? Does defining a common ancestor type for such elements help? There is also a need to describe character properties. See e.g. the CHISE project, which uses topic maps.

Encoding Characters

It is possible to submit a proposal for encoding some characters to the Unicode Technical Committee and ISO/IEC SC2 WG2. This requires careful preparation and takes time, but for many cases, it is the right thing to do. On the other hand, some things perceived as characters may not be suitable for encoding, or a character may already have been encoded, but you want a particular glyph variant.

Private Use Codepoints

Unicode/ISO 10646 reserves the Private Use Area in the BMP (U+E000-U+F8FF) and planes 15 and 16 for private use. These codepoints will never be assigned by the standard, but any two parties can use them by prior agreement.
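
A minimal sketch of how a script might detect private use codepoints in a piece of text follows. The helper names are illustrative; the plane 15 and 16 ranges exclude the last two codepoints of each plane, which are noncharacters.

```python
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),       # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),     # plane 15
    (0x100000, 0x10FFFD),   # plane 16
]


def is_private_use(ch: str) -> bool:
    """True if the character lies in one of the private use ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def find_private_use(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for every private use character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_private_use(ch)
    ]


print(find_private_use("a\uE001b"))
```

A check like this can only report that a private use codepoint is present. What the codepoint means still depends on the agreement between the parties.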

The main problem with private use codepoints is that there needs to be an understanding of what these codepoints are used for. But private agreements scale very badly on the Web. Various proposals have been made to associate additional information with a document type (DTD/XML Schema), with a document, or with some part of a document.

However, in all cases, editing and otherwise processing documents with such associated information will become very complicated. Also, character information is only preserved if every operation that processes the document preserves the associated information correctly. Because missing characters are not a very frequent problem, it is unreasonable to expect that, for example, every single Perl script dealing with XML will do the right thing. Using markup for individual missing characters is much more stable.

Gaiji

Gaiji (外字, foreign/outside characters) is a term often used in Japan to refer to both unencoded characters and missing glyph variants.

History

Talk at the 12th International Unicode Conference in Tokyo, Japan, April 1998: Exploring the Potentials of Web Technologies for the Handling of Rare Ideographs and Ideograph Variants.

By: Martin J. Dürst.

Content first published 2003-10-24. Last substantive update 2003-10-24 GMT. This version 2011-05-03 18:57 GMT

For the history of document changes, search for O-misscharglyph in the i18n blog.