Missing characters and glyphs

Problem description

Web technology is based on the character repertoire of Unicode (see Character Model). Unicode contains a huge number of characters covering a wide range of scripts and languages. However, in some cases, there may be something missing:

1. The character does not exist in Unicode. Proposed solutions include encoding the character, markup for individual characters, and Private Use Codepoints.
2. The character exists, but you want to select a particular glyph variant.
3. The character exists, but the glyph to display it isn't available. This can be solved by technologies such as Web fonts.
4. The character exists in Unicode, but not in the character encoding used for the document. In this case, use Numeric Character References (NCRs, example: &#x5678; for 噸).
5. The character exists in Unicode, but you want to use a name for it rather than code it directly. You can use named character references (e.g. &eacute; for é in HTML). Other solutions have been proposed, such as xmlchar, which uses an element per character and an XSLT to convert them.

Points 1 and 2 are often subsumed under the term 'gaiji problem'.
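
The NCR fallback for characters missing from the document encoding can be sketched as follows. This is a minimal illustration, assuming Python and its standard `xmlcharrefreplace` error handler, which emits decimal rather than hexadecimal NCRs. The function name `encode_with_ncrs` and the sample text are made up for this example.

```python
import html


def encode_with_ncrs(text: str, encoding: str) -> bytes:
    """Encode text; characters the target encoding lacks become decimal NCRs."""
    return text.encode(encoding, errors="xmlcharrefreplace")


# Sample text (an assumption for illustration): "café" followed by U+5678.
sample = "caf\u00e9 \u5678"

# U+00E9 exists in ISO-8859-1 and stays a single byte;
# U+5678 does not exist there and is written as an NCR instead.
latin1_bytes = encode_with_ncrs(sample, "iso-8859-1")

# An HTML-aware consumer resolves the NCR back to the original character.
roundtrip = html.unescape(latin1_bytes.decode("iso-8859-1"))
```

Here `latin1_bytes` contains the NCR `&#22136;` (decimal for U+5678), and `roundtrip` equals the original string.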

Selecting Glyph Variants

Often it is important that a particular glyph is displayed for a certain character. Styling with CSS or XSL can take care of size, font style, and some other properties. But sometimes, there is a need for more specific glyph variants. There are various proposals to do this:

Unicode provides variation selectors to select a specific variant glyph (or a restricted range of glyphs) for a character. Check the list of standardized variants and the list of ideographic variants to see if the variant you need is available. Otherwise, propose your variant for addition. Please note that all variation selectors are reserved for standard use.
JIS X 4052:2000 (日本語文書の組版指定交換形式, Exchange format for Japanese documents with composition markup; in Japanese): The ch element references an image file containing a glyph image. Attributes are used for exact positioning. The content of the element is the character itself, which can serve as a fallback.
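
To make the variation selector mechanism concrete: a variation sequence is simply the base character followed by a selector code point, so software that does not know the variant can drop the selector and fall back to the base character. The sketch below assumes Python. It covers only the two general variation selector blocks (U+FE00-U+FE0F and U+E0100-U+E01EF), and the helper names are hypothetical. The sample sequence is illustrative only; whether a given sequence is actually defined must be checked against the lists mentioned above.

```python
# Variation selector blocks: VS1-VS16 and VS17-VS256.
VARIATION_SELECTOR_RANGES = ((0xFE00, 0xFE0F), (0xE0100, 0xE01EF))


def is_variation_selector(ch: str) -> bool:
    """Return True if ch lies in one of the variation selector blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VARIATION_SELECTOR_RANGES)


def strip_variation_selectors(text: str) -> str:
    """Fall back to base characters by removing all variation selectors."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


base = "\u845b"          # a base ideograph (illustrative choice)
selector = "\U000e0100"  # VARIATION SELECTOR-17
sequence = base + selector  # two code points, rendered as one glyph if supported
```

Calling `strip_variation_selectors(sequence)` returns the base character alone, which is the behavior a renderer without the variant glyph would approximate.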

Markup for Individual Characters and Symbols

The idea is to define a special element with attributes that provide, or point to, the information needed to process or render the character. This leads to an extremely localized, and therefore extremely flexible and stable, solution. The actual markup may look very similar to the markup used for selecting glyph variants. The main distinction is that there is no character content that serves as a fallback (in some cases, the element content may be a primitive fallback such as html:img, or a private use codepoint is used).

Examples that define markup for individual characters:

SVG 1.1 (a W3C Recommendation): The altGlyph element provides detailed control over the glyphs used to render particular character data.
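MathML 2.0 (a W3C Recommendation): The mglyph element has an alt attribute for fallback text, a fontfamily attribute to indicate a font, and an index attribute to indicate the position of a glyph in a font. MathML 3.0 deprecates fontfamily and index in favor of a src attribute that references an image of the glyph.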
In Japan, the INSTAC XML WG2 has defined some markup to reference characters (JIS TR X 0047:2001, Picture Reference Exchange by XML).
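
As an illustration of per-character markup, the following sketch builds an mglyph element carrying the alt, fontfamily, and index attributes described above. It assumes Python's standard `xml.etree.ElementTree` module. The helper name `make_mglyph`, the font name, and the glyph index are hypothetical values chosen for the example, and the MathML namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def make_mglyph(alt: str, fontfamily: str, index: int) -> ET.Element:
    """Build an mglyph element that points at one glyph in a named font."""
    return ET.Element(
        "mglyph",
        {"alt": alt, "fontfamily": fontfamily, "index": str(index)},
    )


# Hypothetical font name and glyph position, for illustration only.
glyph = make_mglyph(alt="my-symbol", fontfamily="ExampleGaijiFont", index=65)

# Serialize to a string that could be embedded in a document.
markup = ET.tostring(glyph, encoding="unicode")
```

Because all the information travels inside the element itself, any tool that copies the element unchanged also preserves the character reference.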

Encoding Characters

It is possible to submit a proposal for encoding some characters to the Unicode Technical Committee and ISO/IEC JTC1/SC2/WG2. This requires careful preparation and takes time, but for many cases, it is the right thing to do. On the other hand, some things perceived as characters may not be suitable for encoding, or a character may already have been encoded, but you want a particular glyph variant.

Private Use Codepoints

Unicode reserves the Private Use Area in the BMP (U+E000-U+F8FF) and planes 15 and 16 (U+F0000-U+FFFFD and U+100000-U+10FFFD) for private use. This means that Unicode will never assign a meaning to these codepoints, but they can be used between any parties that have a prior agreement.
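
A processing tool can at least detect private use codepoints so that they are not passed along silently. The sketch below assumes Python. The function names are hypothetical, and the ranges are the BMP Private Use Area plus the private use ranges of planes 15 and 16 (each plane minus its last two codepoints, which are noncharacters).

```python
import unicodedata

# BMP Private Use Area, plane 15, plane 16.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(ch: str) -> bool:
    """Return True if ch is a private use codepoint."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def find_private_use(text: str) -> list:
    """List (position, codepoint label) pairs for private use characters."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_private_use(ch)
    ]


# Cross-check: the Unicode database labels these codepoints as category "Co".
range_starts_are_co = all(
    unicodedata.category(chr(lo)) == "Co" for lo, _ in PRIVATE_USE_RANGES
)
```

Detection only reports where such codepoints occur; what they mean still depends on the private agreement between the parties.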

The main problem with private use codepoints is that there needs to be an understanding of what these codepoints are used for. But private agreements scale very badly on the Web. Various proposals have been made to associate additional information with a document type (DTD/XML Schema), with a document, or with some part of a document.

However, in all cases, editing and otherwise processing documents with such associated information becomes very complicated. Also, character information is only preserved if every operation that processes it preserves the associated information correctly. Because missing characters are not a very frequent problem, it is unreasonable to expect that, for example, every single Perl script dealing with XML will do the right thing. Using markup for individual missing characters is much more stable.
