Exploring the Potentials of Web Technologies for the Handling of Rare Ideographs and Ideograph Variants
© 1998 Unicode/W3C/Keio University
  Overview
  - Rare Ideographs and Ideograph Variants
  - Character Set Technology
  - WWW Technology
  - Tradeoffs
  - Components available today
  - Components newly needed
  - Application scenarios
  - References
  Goal of this talk
  - Present some ideas
  - Explain potential of the WWW
  - Create interest for cooperation
  - No final solutions
  - No commitment
  Ideographs
  - Concepts, words, morphemes
  - Very large number
  - Very long history
  - Wide geographical distribution
  - Common core => Well-known Ideographs
  - Very wide variation of frequency => Rare Ideographs
  - Shape variations => Ideograph Variants
  Ideograph Variants (Itaiji)
  - Official simplifications
  - Regional and minority characters and variants
  - Time-specific variants
  - National typographic variants
  - Typeface-specific variants
  - Misprints
  - Unofficial simplifications
  - Calligraphic and handwriting variants
  Character Set Technology
  - Linear sequence, no structure
  - Central registration (semi-distributed registration failed)
  - Auxiliary information implied
  - High efficiency (storage/transmission/implementation)
  - Limited flexibility (one codepoint or two codepoints)
  - Updating is expensive
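As a small illustration of these points, the following Python sketch treats text as a linear sequence of code points. The two characters used (U+570B and its simplified form U+56FD) are examples chosen for this sketch, not taken from the talk. They show the "two codepoints" case: each variant has its own code, and nothing in the sequence itself records that the two are related.

```python
import unicodedata

# Example characters chosen for this sketch (an assumption, not from the talk):
# U+570B and its simplified form U+56FD, which are coded separately.
text = "\u570b\u56fd"

# Coded text is a flat sequence of numbers; there is no place for structure.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Storage is compact: two bytes per character in UTF-16.
utf16_bytes = text.encode("utf-16-be")

# Auxiliary information is implied: the character name only repeats the code,
# so readings, meanings, and variant relations must come from external tables.
names = [unicodedata.name(ch) for ch in text]

print(code_points)
print(len(utf16_bytes))
print(names)
```

A variant that has no code point at all cannot appear in such a sequence, which is the limitation the following slides address.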
  WWW Technology
  - Structure (SGML/HTML/XML)
  - Embedding (GIF,...)
  - Linking (Hypertext)
  - Indirection
  - Distributed authority
  - Explicit auxiliary information
  - High flexibility
  - Higher vulnerability
  Tradeoffs
Better handled by character set technology
  - Frequent characters
  - Main, clearly distinguished variants
Better handled by WWW technology
  - Rare characters
  - Rare or minor variants
  Core Idea
  - Use structured data to describe ideographs
  - Use links to refer to descriptions of ideographs
  - Software knows about the description format
      - Browsers display, and provide access to auxiliary information
      - Authoring systems give better control of variants,...
      - Search engines still find characters (based
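The idea above might be sketched as follows in Python. This is only an illustration under stated assumptions: the `Ref` and `Description` types, the `example.org` URLs, and the in-memory `DESCRIPTIONS` table (standing in for descriptions fetched over the web) are all hypothetical. In particular, indexing a linked ideograph under a nearest coded character is an assumption of this sketch, not something the talk specifies.

```python
from dataclasses import dataclass
from html import escape
from typing import Dict, List, Union


@dataclass(frozen=True)
class Ref:
    """A link from running text to an ideograph description."""
    url: str


@dataclass(frozen=True)
class Description:
    glyph_url: str       # image showing the exact shape
    nearest_coded: str   # closest character that has a code point (assumed field)


# Stand-in for descriptions published on the web (hypothetical URLs and data).
DESCRIPTIONS: Dict[str, Description] = {
    "http://example.org/ideographs/guo-variant": Description(
        glyph_url="http://example.org/glyphs/guo-variant.gif",
        nearest_coded="\u570b",
    ),
}

Item = Union[str, Ref]


def display(items: List[Item]) -> str:
    """Browser view: show the glyph image, linked to the full description."""
    parts = []
    for item in items:
        if isinstance(item, Ref):
            desc = DESCRIPTIONS[item.url]
            parts.append(
                '<a href="%s"><img src="%s" alt="%s"></a>'
                % (escape(item.url), escape(desc.glyph_url),
                   escape(desc.nearest_coded))
            )
        else:
            parts.append(escape(item))
    return "".join(parts)


def index_text(items: List[Item]) -> str:
    """Search-engine view: substitute the nearest coded character."""
    return "".join(
        DESCRIPTIONS[item.url].nearest_coded if isinstance(item, Ref) else item
        for item in items
    )


# Running text: one ordinary coded character followed by a linked ideograph.
document = ["\u4e2d", Ref("http://example.org/ideographs/guo-variant")]
print(display(document))
print(index_text(document))
```

The same document model serves both consumers: a browser follows the link for display and auxiliary information, while an indexer reduces the link to plain coded text.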
    
 
  Existing Technology
  - Inline images (HTML)
  - Inline objects (applets,...; HTML) with fallbacks
  - Links
  - Format for structured information (XML)
  - Applications of inline structured information (MathML)
  - Font downloading (one-glyph fonts are still too heavy)
  - Main new thing needed: Ideograph description format (KanjiML?)
  Information about Ideographs
  - Pronunciations (language-dependent)
  - Meanings (language-dependent)
  - Shape (structure, example glyphs)
  - Variants, related characters
  - History
  - References to dictionaries, other web character collections
  - Negative information (not this shape,...)
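To make the list above more concrete, here is a minimal Python sketch of what one such description might look like in XML and how software could read it. No description format exists yet, so every element and attribute name below is a hypothetical choice made for this sketch. The sample character (U+570B, with components U+56D7 and U+6216 and the simplified form U+56FD) is likewise an example chosen here. History, dictionary references, and negative information are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical description format: element and attribute names are
# illustrative assumptions, not part of any published specification.
DESCRIPTION = """
<ideograph id="example-guo">
  <pronunciation lang="zh">guo2</pronunciation>
  <pronunciation lang="ja">koku</pronunciation>
  <meaning lang="en">country</meaning>
  <shape>
    <component>&#x56D7;</component>
    <component>&#x6216;</component>
  </shape>
  <variant type="simplified" code="U+56FD"/>
</ideograph>
"""


def parse_description(xml_text):
    """Collect the language-dependent and structural fields into a dict."""
    root = ET.fromstring(xml_text)
    info = {
        "id": root.get("id"),
        "pronunciations": {},
        "meanings": {},
        "components": [c.text for c in root.findall("shape/component")],
        "variants": [(v.get("type"), v.get("code"))
                     for v in root.findall("variant")],
    }
    # Pronunciations and meanings are grouped by language code.
    for p in root.findall("pronunciation"):
        info["pronunciations"].setdefault(p.get("lang"), []).append(p.text)
    for m in root.findall("meaning"):
        info["meanings"].setdefault(m.get("lang"), []).append(m.text)
    return info


info = parse_description(DESCRIPTION)
print(info["pronunciations"])
print(info["variants"])
```

Grouping pronunciations and meanings by language reflects the "language-dependent" note on the slide; a real format would need to settle many more details.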
  Application Scenarios
  - Characters in proper names
  - Printing industry
  - Research (literature, history)
  Characters in Proper Names
  - Some people are very particular about how their name is written
  - In Taiwan, some parents create new characters to name their children
  - Single-character fonts as birthday presents for the child
  - Accessible on the web, everybody can use the right character/glyph
  Printing Industry
  - Complex interaction between author (and editor) and printer
  - Computers have made printing easier, but do not yet address this problem
  - Character codes do not provide enough information
  - Web technology can provide the necessary information, and the means for accessing it
  Research
  - Different research fields:
      - Different character concepts
      - Different collections of rare characters
      - Different needs for distinctions
  - Many web sites already up
  - Cross-linking is being started
  - Information only accessible by hand
  References
A. Charles Muller, World Wide Web CJK-English Dictionary/Database, 1997 (and the many references provided there).
Technical Committee for Windows NT Extended Kanji Processing Council, Windows NT Extended Kanji Processing Specification - An OLE Solution for Extended Kanji Processing, Version 2.0, July 1996 (http://www.piedey.co.jp/xkp/XKP20-E.doc; see also http://www.xkp.or.jp/ and http://www.microsoft.com/japan/PARTNERS/industry/xkphome.htm).
Unicode Unihan Code Charts.
Rick Harbaugh, Chinese Character Genealogy - Web-Based Etymological Dictionary for Learning Chinese Characters.
Dave Raggett, Arnaud Le Hors, and Ian Jacobs, Eds., HTML 4.0 Specification, W3C Recommendation 18-Dec-1997.
Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen, Eds., Extensible Markup Language (XML) 1.0, W3C Recommendation 10-February-1998.