Missing characters and glyphs

Problem description

Web technology is based on the character repertoire of Unicode (see Character Model). Unicode contains a huge number of characters covering a wide range of scripts and languages. However, in some cases, there may be something missing:

  1. The character does not exist in Unicode. Proposed solutions include encoding the character, markup for individual characters, and Private Use Codepoints.
  2. The character exists, but you want to select a particular glyph variant.
  3. The character exists, but no glyph to display it is available. This can be solved by technologies such as Web fonts.
  4. The character exists in Unicode, but not in the character encoding used for the document. In this case, use Numeric Character References (NCRs, example: &#x5678; for 噸).
  5. The character exists in Unicode, but you want to use a name for it rather than enter it directly. You can use named character references (e.g. &eacute; for é in HTML). Other solutions have been proposed, such as xmlchar, which uses one element per character and an XSLT stylesheet to convert them.

Points 1 and 2 are often subsumed under the term 'gaiji problem'.
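Cases 4 and 5 above can be illustrated with a minimal Python sketch. It assumes only the standard library; the helper name to_ncr is made up for this example. The first part replaces characters that the target encoding cannot represent with numeric character references, and the second part resolves numeric and named references back to characters.

```python
import html


def to_ncr(text, encoding="ascii"):
    """Encode text, writing characters the encoding lacks as decimal NCRs.

    to_ncr is a hypothetical helper name; the work is done by the standard
    'xmlcharrefreplace' error handler of str.encode.
    """
    return text.encode(encoding, errors="xmlcharrefreplace").decode(encoding)


# Case 4: the character is in Unicode but not in ISO-8859-1,
# so it is written as a numeric character reference.
encoded = to_ncr("1 噸", "iso-8859-1")

# Cases 4 and 5 in reverse: resolve a hexadecimal NCR and a named
# HTML character reference back to the characters they stand for.
decoded = html.unescape("&#x5678; caf&eacute;")

print(encoded)
print(decoded)
```

Note that the error handler emits decimal references, while the example in the list uses the hexadecimal form; both refer to the same code point.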

Use Cases

East Asian ideographs (see also Gaiji), mathematical symbols, special ligatures, and so on.

Selecting Glyph Variants

Often it is important that a particular glyph is displayed for a certain character. Styling with CSS or XSL can take care of size, font style, and some other properties. But sometimes, there is a need for more specific glyph variants. There are various proposals to do this:

Markup for Individual Characters and Symbols

The idea is to define a special element whose attributes provide, or point to, the information needed to process or render the character. This yields a highly localized, and therefore flexible and stable, solution. The actual markup may look very similar to that used for selecting glyph variants; the main distinction is that there is no character content to serve as a fallback (in some cases, the element content may be a primitive fallback such as html:img, or a private use codepoint may be used).
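As a rough illustration of why such markup is easy to process, the following Python sketch reads a document that uses an invented element for a missing character. Everything about the markup is an assumption made for this example: the element name missingchar, its name and src attributes, and the file path are hypothetical and are not taken from any of the proposals. The placeholder is U+3013 GETA MARK, which is commonly used in Japanese typesetting to stand in for an unavailable glyph.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <missingchar> and its attributes are invented here.
SAMPLE = (
    '<p>The family name is written '
    '<missingchar name="rare-ideograph-1" src="glyphs/rare-ideograph-1.svg"/>'
    '田.</p>'
)


def flatten(elem, placeholder="\u3013"):
    """Return the plain text of elem, with a placeholder per <missingchar>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "missingchar":
            parts.append(placeholder)
        else:
            parts.append(flatten(child, placeholder))
        # Text that follows the child element belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
plain = flatten(root)
names = [e.get("name") for e in root.iter("missingchar")]

print(plain)
print(names)
```

Because all information about the character travels with the element itself, a tool that does not understand it can still copy it through unchanged or substitute a placeholder, as the sketch does.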

Examples that define markup for individual characters:

Encoding Characters

It is possible to submit a proposal for encoding some characters to the Unicode Technical Committee and ISO/IEC JTC1/SC2/WG2. This requires careful preparation and takes time, but for many cases, it is the right thing to do. On the other hand, some things perceived as characters may not be suitable for encoding, or a character may already have been encoded, but you want a particular glyph variant.

Private Use Codepoints

Unicode reserves the Private Use Area in the BMP (U+E000-U+F8FF) and planes 15 and 16 (U+F0000-U+FFFFD and U+100000-U+10FFFD) for private use. This means that these codepoints will never be assigned a standard meaning, but can be used between any two parties by prior agreement.
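A tool that processes documents can at least detect private use codepoints, even though it cannot know what they mean. The following Python sketch shows one way to do so; the function names are made up for this example. It checks the three private use ranges directly and, as a cross-check, uses the Unicode general category Co (private use) from the standard unicodedata module.

```python
import unicodedata

# Private use ranges: BMP Private Use Area, plane 15, plane 16.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]


def is_private_use(ch):
    """True if the single character ch lies in a private use range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def is_private_use_by_category(ch):
    """Same check, based on the general category 'Co' (private use)."""
    return unicodedata.category(ch) == "Co"


def find_private_use(text):
    """List (index, 'U+XXXX') for each private use codepoint in text."""
    return [
        (i, f"U+{ord(c):04X}")
        for i, c in enumerate(text)
        if is_private_use(c)
    ]


print(find_private_use("a\ue000b\U000F0000"))
```

Detection is the easy part; what the codepoints stand for still depends on the private agreement between the parties.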

The main problem with private use codepoints is that there needs to be an understanding of what these codepoints are used for. But private agreements scale very badly on the Web. Various proposals have been made to associate additional information with a document type (DTD/XML Schema), with a document, or with some part of a document.

However, in all cases, editing and otherwise processing documents with such associated information becomes very complicated. Also, character information is only preserved if every operation that processes the document preserves the associated information correctly. Because missing characters are not a very frequent problem, it is unreasonable to expect that, for example, every single Perl script dealing with XML will do the right thing. Using markup for individual missing characters is much more stable.

Gaiji

Gaiji (外字, foreign/outside characters) is a term often used in Japan to refer to both unencoded characters and missing glyph variants.

History

Talk at the 12th International Unicode Conference in Tokyo, Japan, April 1998: Exploring the Potentials of Web Technologies for the Handling of Rare Ideographs and Ideograph Variants.