This appendix is an informative, not a normative, part of the Level 2 DOM specification.
Characters are represented in Unicode by numbers called code points (also called scalar values). These numbers can range from 0 up to 1,114,111 = 10FFFF₁₆ (although some of these values are illegal). Each code point can be directly encoded with a 32-bit code unit. This encoding is termed UCS-4 (or UTF-32). The DOM specification, however, uses UTF-16, in which the most frequent characters (which have values up to FFFF₁₆) are represented by a single 16-bit code unit, while characters above FFFF₁₆ use a special pair of code units called a surrogate pair. For more information, see [Unicode] or the Unicode Web site.
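The surrogate pair arithmetic can be illustrated with a short, non-normative Java sketch. The class name SurrogatePairSketch and the method name toUtf16 are hypothetical and are not part of the DOM specification; rejecting code points in the surrogate range is an assumption of this sketch.

```java
final class SurrogatePairSketch {
    /**
     * Encodes one code point as one or two UTF-16 code units.
     * Hypothetical helper for illustration only.
     */
    static char[] toUtf16(int codePoint) {
        // Assumption: values outside 0..10FFFF and the surrogate range
        // D800..DFFF are treated as illegal input.
        if (codePoint < 0 || codePoint > 0x10FFFF
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException("illegal code point: " + codePoint);
        }
        if (codePoint <= 0xFFFF) {
            // Fits in a single 16-bit code unit.
            return new char[] { (char) codePoint };
        }
        // Above FFFF: split the remaining 20 bits across two code units.
        int v = codePoint - 0x10000;
        char high = (char) (0xD800 + (v >> 10));   // upper 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // lower 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] units = toUtf16(0x1D11E); // MUSICAL SYMBOL G CLEF
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]);
    }
}
```

In Java itself, the standard method Character.toChars performs the same conversion; the manual version is shown only to make the bit layout visible.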
While indexing by code points as opposed to code units is not common in programs, some specifications such as [XPath 1.0] (and therefore XSLT and [XPointer]) use code point indices. For interfacing with such formats it is recommended that the programming language provide string processing methods for converting code point indices to code unit indices and back. Some languages do not provide these functions natively; for these it is recommended that the native String type that is bound to DOMString be extended to enable this conversion. An example of how such an API might look is supplied below.
Note: Since these methods are supplied as an illustrative example of the type of functionality that is required, the names of the methods, exceptions, and interface may differ from those given here.
Extensions to a language's native String class or interface
interface StringExtend {
  int findOffset16(in int offset32)
      raises(StringIndexOutOfBoundsException);
  int findOffset32(in int offset16)
      raises(StringIndexOutOfBoundsException);
};
findOffset16
Note: You can always round-trip from a UTF-32 offset to a UTF-16 offset and back. You can round-trip from a UTF-16 offset to a UTF-32 offset and back if and only if the offset16 is not in the middle of a surrogate pair. Unmatched surrogates count as a single UTF-16 value.
Parameter: offset32 of type int.
Return value: int, the UTF-16 offset.
Exception: StringIndexOutOfBoundsException if offset32 is out of bounds.
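As a non-normative illustration, findOffset16 might be sketched in Java as follows. The class name FindOffset16Sketch is hypothetical, and passing the string as an explicit source argument (rather than extending the String type) is an assumption made so that the example is self-contained.

```java
final class FindOffset16Sketch {
    /**
     * Converts a UTF-32 (code point) offset into a UTF-16 (code unit) offset.
     * Unmatched surrogates are counted as a single value.
     */
    static int findOffset16(String source, int offset32) {
        if (offset32 < 0) {
            throw new StringIndexOutOfBoundsException(offset32);
        }
        int offset16 = 0;
        for (int seen = 0; seen < offset32; seen++) {
            if (offset16 >= source.length()) {
                // Fewer code points in the string than requested.
                throw new StringIndexOutOfBoundsException(offset32);
            }
            // A high surrogate followed by a low surrogate is one code point
            // that occupies two code units.
            boolean pair = Character.isHighSurrogate(source.charAt(offset16))
                    && offset16 + 1 < source.length()
                    && Character.isLowSurrogate(source.charAt(offset16 + 1));
            offset16 += pair ? 2 : 1;
        }
        return offset16;
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // 'a', U+1D11E, 'b': 3 code points, 4 code units
        for (int i = 0; i <= 3; i++) {
            System.out.println(i + " -> " + findOffset16(s, i));
        }
    }
}
```

Java's standard String.offsetByCodePoints(0, offset32) computes the same value; the explicit loop is shown only to make the surrogate handling visible.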
findOffset32
To find the UTF-32 length of a string, use:
  len32 = findOffset32(source, source.length());
Note: If the UTF-16 offset is into the middle of a surrogate pair, then the UTF-32 offset of the end of the pair is returned; that is, the index of the char after the end of the pair. You can always round-trip from a UTF-32 offset to a UTF-16 offset and back. You can round-trip from a UTF-16 offset to a UTF-32 offset and back if and only if the offset16 is not in the middle of a surrogate pair. Unmatched surrogates count as a single UTF-16 value.
Parameter: offset16 of type int.
Return value: int, the UTF-32 offset.
Exception: StringIndexOutOfBoundsException if offset16 is out of bounds.
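As a non-normative illustration, findOffset32 might be sketched in Java as follows. The class name FindOffset32Sketch is hypothetical, and the explicit source argument is an assumption that mirrors the len32 usage shown for this method.

```java
final class FindOffset32Sketch {
    /**
     * Converts a UTF-16 (code unit) offset into a UTF-32 (code point) offset.
     * An offset in the middle of a surrogate pair yields the offset of the
     * end of the pair. Unmatched surrogates are counted as a single value.
     */
    static int findOffset32(String source, int offset16) {
        if (offset16 < 0 || offset16 > source.length()) {
            throw new StringIndexOutOfBoundsException(offset16);
        }
        int offset32 = 0;
        int i = 0;
        while (i < offset16) {
            // Step over a whole surrogate pair at once, so an offset that
            // points into the middle of a pair is rounded up to its end.
            boolean pair = Character.isHighSurrogate(source.charAt(i))
                    && i + 1 < source.length()
                    && Character.isLowSurrogate(source.charAt(i + 1));
            i += pair ? 2 : 1;
            offset32++;
        }
        return offset32;
    }

    public static void main(String[] args) {
        String source = "a\uD834\uDD1Eb"; // 'a', U+1D11E, 'b'
        int len32 = findOffset32(source, source.length());
        System.out.println("UTF-16 length: " + source.length());
        System.out.println("UTF-32 length: " + len32);
        System.out.println("mid-pair offset 2 maps to: " + findOffset32(source, 2));
    }
}
```

For this sample string, UTF-16 offset 2 lies inside the surrogate pair, so the sketch returns the UTF-32 offset just past the pair, which is why a round trip from that UTF-16 offset does not return to 2.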