Exploring the Potentials of Web Technologies for the Handling of Rare Ideographs and Ideograph Variants
© 1998 Unicode/W3C/Keio University
Overview
-
Rare Ideographs and Ideograph Variants
-
Character Set Technology
-
WWW Technology
-
Tradeoffs
-
Components available today
-
Components newly needed
-
Application scenarios
-
References
Goal of this talk
-
Present some ideas
-
Explain potential of the WWW
-
Create interest for cooperation
-
No final solutions
-
No commitment
Ideographs
-
Concepts, words, morphemes
-
Very large number
-
Very long history
-
Wide geographical distribution
-
Common core => Well-known Ideographs
-
Very wide variation of frequency => Rare Ideographs
-
Shape variations => Ideograph Variants
Ideograph Variants (Itaiji)
-
Official Simplifications
-
Regional and minority characters and variants
-
Time-specific variants
-
National typographic variants
-
Typeface-specific variants
-
Misprints
-
Unofficial simplifications
-
Calligraphic and handwriting variants
Character Set Technology
-
Linear sequence, no structure
-
Central registration (semidistributed registration failed)
-
Auxiliary information implied
-
High efficiency (storage/transmission/implementation)
-
Limited flexibility (one codepoint or two codepoints)
-
Updating is expensive
WWW Technology
-
Structure (SGML/HTML/XML)
-
Embedding (GIF,...)
-
Linking (Hypertext)
-
Indirection
-
Distributed Authority
-
Explicit auxiliary information
-
High flexibility
-
Higher vulnerability
Tradeoffs
Better handled by character set technology
-
Frequent characters
-
Main, clearly distinguished variants
Better handled by WWW technology
-
Rare characters
-
Rare or minor variants
Core Idea
-
Use structured data to describe ideographs
-
Use links to refer to descriptions of ideographs
-
Software knows about description format
-
Browsers display, and provide access to auxiliary information
-
Authoring systems give better control of variants,...
-
Search engines still find characters (based
Existing Technology
-
Inline images (HTML)
-
Inline objects (applets,...; HTML) with fallbacks
-
Links
-
Format for structured information (XML)
-
Applications of inline structured information (MathML)
-
Font downloading (even one-glyph fonts are still too heavy)
-
Main new thing needed: Ideograph description format (KanjiML?)
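
As a rough illustration of how the existing pieces listed above (inline images, fallbacks, links) might be combined today, the following Python sketch generates an HTML fragment for one rare ideograph. The function name `inline_ideograph` and the example.org URLs are hypothetical and not part of the talk; the fragment simply wraps an inline glyph image, with alternative text as the fallback, in a link to a description of the character.

```python
from html import escape


def inline_ideograph(description_url: str, glyph_url: str, fallback: str) -> str:
    """Return an HTML fragment for one rare ideograph.

    The link points at a description of the character, the inline image
    shows an example glyph, and the alt text is the fallback for
    software that cannot display the image.
    """
    return (
        f'<a href="{escape(description_url, quote=True)}">'
        f'<img src="{escape(glyph_url, quote=True)}" '
        f'alt="{escape(fallback, quote=True)}"></a>'
    )


# Hypothetical URLs, used only for illustration.
fragment = inline_ideograph(
    "http://example.org/ideographs/rare-0001.xml",
    "http://example.org/ideographs/rare-0001.gif",
    "[rare ideograph 0001]",
)
print(fragment)
```

Software that knows nothing about the description format still shows the glyph and follows the link; software that does know the format could fetch the linked description and offer the auxiliary information directly.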
Information about Ideographs
-
Pronunciations (language-dependent)
-
Meanings (language-dependent)
-
Shape (structure, example glyphs)
-
Variants, related characters
-
History
-
References to dictionaries, other web character collections
-
Negative information (not this shape,...)
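
To make the list above more concrete, here is a minimal Python sketch of what one structured description could look like when serialized as XML. No ideograph description format is defined in this talk, so every element and attribute name (`ideograph`, `pronunciation`, `meaning`, `shape`, `variant`, `not`), the function `describe_ideograph`, and all sample values and URLs are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET


def describe_ideograph(ident, pronunciations, meanings, shape, variants, not_shapes):
    """Build a hypothetical XML description of one ideograph.

    pronunciations, meanings: lists of (language, text) pairs
    shape: free-text note on structure (placeholder for a real model)
    variants, not_shapes: lists of URLs of related descriptions
    """
    root = ET.Element("ideograph", {"id": ident})
    for lang, text in pronunciations:
        # Pronunciations are language-dependent, so each carries a language tag.
        ET.SubElement(root, "pronunciation", {"lang": lang}).text = text
    for lang, text in meanings:
        ET.SubElement(root, "meaning", {"lang": lang}).text = text
    ET.SubElement(root, "shape").text = shape
    for href in variants:
        # Variants and related characters are given as links.
        ET.SubElement(root, "variant", {"href": href})
    for href in not_shapes:
        # Negative information: shapes this character must not be confused with.
        ET.SubElement(root, "not", {"href": href})
    return root


# Placeholder values; none of this describes a real character.
record = describe_ideograph(
    ident="rare-0001",
    pronunciations=[("ja", "placeholder reading")],
    meanings=[("en", "placeholder meaning")],
    shape="left-right composition (placeholder)",
    variants=["http://example.org/ideographs/rare-0002.xml"],
    not_shapes=["http://example.org/ideographs/rare-0003.xml"],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

History and references to dictionaries or other web character collections could be added the same way, as further child elements carrying links.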
Application Scenarios
-
Characters in proper names
-
Printing industry
-
Research (literature, history)
Characters in Proper Names
-
Some people are very particular about how their name is written
-
In Taiwan, some parents create new characters to name their children
-
Single-character fonts as birthday presents for the child
-
If the font is accessible on the web, everybody can use the right character/glyph
Printing Industry
-
Complex interaction between author (and editor) and printer
-
Computers have made printing easier, but do not yet address this problem
-
Character codes do not provide enough information
-
Web technology can provide the necessary information, and the means for accessing it
Research
-
Different research fields:
-
Different character concepts
-
Different collections of rare characters
-
Different needs for distinctions
-
Many web sites already up
-
Cross-linking is being started
-
Information only accessible by hand
References
A. Charles Muller, World Wide Web CJK-English Dictionary/Database, 1997
(and the many references provided there).
Technical Committee for Windows NT Extended Kanji Processing Council, Windows
NT Extended Kanji Processing Specification - An OLE Solution for Extended
Kanji Processing, Version 2.0, July 1996
(http://www.piedey.co.jp/xkp/XKP20-E.doc;
see also http://www.xkp.or.jp/ and
http://www.microsoft.com/japan/PARTNERS/industry/xkphome.htm).
Unicode Unihan Code Charts.
Rick Harbaugh, Chinese Character Genealogy - Web-Based Etymological
Dictionary for Learning Chinese Characters.
Dave Raggett, Arnaud Le Hors, and Ian Jacobs, Eds., HTML 4.0 Specification,
W3C Recommendation 18-Dec-1997.
Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen, Eds., Extensible Markup
Language (XML) 1.0, W3C Recommendation 10-February-1998.