Exploring the Potentials of Web Technologies for the Handling of Rare Ideographs and Ideograph Variants

Martin J. Dürst
Keio University/W3C

Overview

Rare Ideographs and Ideograph Variants
Character Set Technology
WWW Technology
Tradeoffs
Components available today
Components newly needed
Application scenarios
References

Goal of this talk

Present some ideas
Explain potential of the WWW
Create interest for cooperation
No final solutions
No commitment

Ideographs

Concepts, words, morphemes
Very large number
Very long history
Wide geographical distribution
Common core => Well-known Ideographs
Very wide variation of frequency => Rare Ideographs
Shape variations => Ideograph Variants

Ideograph Variants (Itaiji)

Official Simplifications
Regional and minority characters and variants
Time-specific variants
National typographic variants
Typeface-specific variants
Misprints
Unofficial simplifications
Calligraphic and handwriting variants

Character Set Technology

Linear sequence, no structure
Central registration (semidistributed registration failed)
Auxiliary information implied
High efficiency (storage/transmission/implementation)
Limited flexibility (one codepoint or two codepoints)
Updating is expensive
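
Illustration (not from the talk): the "one codepoint or two codepoints" limit can be seen with a variant pair that Unicode happens to encode separately, 学 (U+5B66) and 學 (U+5B78). A minimal Python sketch follows; the helper name codepoints is made up for this example.

```python
# A coded character set can only say "same codepoint" or "different codepoint".
# 学 and 學 are a simplified/traditional pair that Unicode encodes separately.
simplified = "\u5b66"   # 学
traditional = "\u5b78"  # 學

def codepoints(text):
    """Return the codepoint of each character in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

print(codepoints(simplified + traditional))
```

Variants that are unified under a single codepoint cannot be told apart in plain text at all. That is the gap the rest of the talk addresses.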

WWW Technology

Structure (SGML/HTML/XML)
Embedding (GIF,...)
Linking (Hypertext)
Indirection
Distributed Authority
Explicit auxiliary information
High flexibility
Higher vulnerability

Tradeoffs

Better handled by character set technology

Frequent characters
Main, clearly distinguished variants

Better handled by WWW technology

Rare characters
Rare or minor variants

Core Idea

Use structured data to describe ideographs
Use links to refer to descriptions of ideographs
Software knows about description format
- Browsers display, and provide access to auxiliary information
- Authoring systems give better control of variants,...
- Search engines still find characters (based

Existing Technology

Inline images (HTML)
Inline objects (applets,...; HTML) with fallbacks
Links
Format for structured information (XML)
Applications of inline structured information (MathML)
Font downloading (one-glyph-fonts are still too heavy)
Main new thing needed: Ideograph description format (KanjiML?)
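
Illustration (not from the talk): the existing pieces listed above, an inline image with a text fallback wrapped in a link, can already be combined to reference a rare ideograph. The Python sketch below generates such an HTML fragment. The function name, the example.org URLs, and the fallback text are assumptions made for this example.

```python
from html import escape

def ideograph_reference(description_url, glyph_url, fallback_text):
    """Build an HTML fragment: a link to the ideograph's description,
    displayed as an inline image, with alt text as the fallback."""
    return (
        f'<a href="{escape(description_url, quote=True)}">'
        f'<img src="{escape(glyph_url, quote=True)}" '
        f'alt="{escape(fallback_text, quote=True)}">'
        f'</a>'
    )

# Hypothetical URLs, for illustration only.
fragment = ideograph_reference(
    "http://example.org/ideographs/0001.xml",
    "http://example.org/ideographs/0001.gif",
    "[rare ideograph 0001]",
)
print(fragment)
```

In this sketch the link target is where a description in the proposed format would live. The image and alt text only cover display and fallback.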

Information about Ideographs

Pronunciations (language-dependent)
Meanings (language-dependent)
Shape (structure, example glyphs)
Variants, related characters
History
References to dictionaries, other web character collections
Negative information (not this shape,...)
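
Illustration (not from the talk): the talk does not define a description format, so the element and attribute names below are invented to show how some of the listed kinds of information might be written in XML and read by software. The sample describes a hypothetical variant of 高 (U+9AD8). Parsing uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical description record; all element and attribute names are assumptions.
SAMPLE = """\
<ideograph id="example-0001">
  <pronunciation lang="ja">kō</pronunciation>
  <pronunciation lang="zh">gāo</pronunciation>
  <meaning lang="en">high, tall</meaning>
  <shape glyph="http://example.org/glyphs/example-0001.gif"/>
  <variant-of codepoint="U+9AD8"/>
</ideograph>
"""

def load_description(xml_text):
    """Parse one description record into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        # Pronunciations and meanings are language-dependent, so key them by language.
        "pronunciations": {p.get("lang"): p.text for p in root.findall("pronunciation")},
        "meanings": {m.get("lang"): m.text for m in root.findall("meaning")},
        "glyphs": [s.get("glyph") for s in root.findall("shape")],
        "variant_of": [v.get("codepoint") for v in root.findall("variant-of")],
    }

info = load_description(SAMPLE)
print(info["variant_of"])
```

A search engine could use the variant-of field to match the rare variant when a user searches for the well-known character. History, dictionary references, and negative information would need further elements that are not sketched here.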

Application Scenarios

Characters in proper names
Printing industry
Research (literature, history)

Characters in Proper Names

Some people are very particular about how their name is written
In Taiwan, some parents create new characters to name their children
Single-character fonts as birthday presents for the child
Accessible on the web, everybody can use the right character/glyph

Printing Industry

Complex interaction between author (and editor) and printer
Computers have made printing easier, but do not yet address this problem
Character codes do not provide enough information
Web technology can provide the necessary information, and the means for accessing it

Research

Different research fields:
- Different character concepts
- Different collections of rare characters
- Different needs for distinctions
Many web sites already up
Cross-linking has begun
Information only accessible by hand

References

A. Charles Muller, World Wide Web CJK-English Dictionary/Database, 1997 (and the many references provided there).

Technical Committee for Windows NT Extended Kanji Processing Council, Windows NT Extended Kanji Processing Specification - An OLE Solution for Extended Kanji Processing, Version 2.0, July 1996 (http://www.piedey.co.jp/xkp/XKP20-E.doc; see also http://www.xkp.or.jp/ and http://www.microsoft.com/japan/PARTNERS/industry/xkphome.htm).

Unicode Unihan Code Charts.

Rick Harbaugh, Chinese Character Genealogy - Web-Based Etymological Dictionary for Learning Chinese Characters.

Dave Raggett, Arnaud Le Hors, and Ian Jacobs, Eds., HTML 4.0 Specification, W3C Recommendation 18-Dec-1997.

Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen, Eds., Extensible Markup Language (XML) 1.0, W3C Recommendation 10-February-1998.