Efficient implementation of Normalization checking for XML 1.1

We now have two versions, one for UTF-16 and one for UTF-8.

Modified charlint to generate the necessary data file in C format (option -nf16check): See version 0.53 or higher
Files: The source files have been moved to public CVS
- nf16check.h: Include file for programs that want to use checking for UTF-16
- nf16check.c: File that implements checking object/algorithm for UTF-16
- nf8check.h: Include file for programs that want to use checking for UTF-8
- nf8check.c: File that implements checking object/algorithm for UTF-8
- nf16data.h: Include file for checking data structure (no need to include in application programs). NOTE: both nf8check and nf16check currently use nf16data.
- nf16data.c: Actual data file. The C source is huge (700K), but compiled, this should be less than 50k. This can be linked statically; no data in this file is ever changed.
NOTE: The code compiles, but has not really been tested.
How to use:
- For each thread/entity/whatever that you want to check concurrently, create a checker with nf8checkNew or nf16checkNew
- If you want to check just for NFC (without the special W3C Character Model starting conditions), call nf8checkInit or nf16checkInit.
- To check data, pass it as 0-delimited blocks to nf8check or nf16check. For the first block, set startP to check W3C Character Model starting conditions. Block boundaries can be at any point (e.g. between high and low surrogate, or between base and combining character or between two combining characters). However, for UTF-16, do not pass in a string with only a high surrogate on the first call (i.e. when startP is set); this will give wrong results. For UTF-8, there are no restrictions on block boundaries at all. Also, it is assumed that the data is clean, i.e. for UTF-16, it is assumed that surrogates always appear in high-low pairs, and for UTF-8, it is assumed that the leading/trailing patterns are correct, or are checked elsewhere.
- You can reuse a checker to check something completely different as soon as it is no longer needed. If you want to change whether the checker checks for existing characters, call nf8checkExists or nf16checkExists.
- Once a checker is no longer needed, delete with nf16checkDelete or nf16checkDelete
Data used:
- flags (in a table that can be indexed directly):
  - does not exist
  - combining class >0
  - combining class=0, but composing
  - Not allowed (composition exclusion and singletons)
  - high/low surrogate
  - Hangul initial/medial/final
  - Hangul initial-medial syllable
  - plain old normal, nothing special
  this takes 4 bits, and assuming we only have to cover the BMP, we need 32kBytes
- sorted lists (to be searched with binary search):
  - combining class
  - non-allowed sequences (includes pairs that would be precomposed, as well as pairs of a precomposed with a combining that would lead to recomposition; there are roughly 2700 such pairs)
Overall, on a Win2000 box with cygwin gcc, this takes 45604 bytes. Storage allocation is straightforward, so this shouldn't differ much on other systems.
Algorithm sketch:
- At the start of a stretch, check whether first character is allowed. If not, stop.
- For each character, check whether allowed. If not, stop. If yes, store one character in a buffer.
- If 'trailer' (combining) is detected, go into extended mode, using a list of characters including:
  - starter/base of combining sequence
  - last previously found combining character
- For each new trailer (including the first one):
  - compare for adequate combining class sorting with previous combining character
  - Check for potential combination with starter. The list of recombinations is calculated so that any combining character that would lead to a change (full combination, recombination so that the combining character gets combined in but another combining character is separated, complete decomposition,...) is listed. 2176 such pairs have been found.
Ideas for higher efficiency:
- Combine the code with some other code, e.g. code checking correct UTF-8/UTF-16, or doing some lookups.
- Use 'register' declarations in nfXcheck, pretty much in the order the variables appear.
- Create versions of nfXcheck that deal with special calling conditions only.
- Create more inner loops, similar to the inner loop for c <= noPROBLEMmax. This can eliminate assignements for startP, lastclass, and starterflag.
- Change binary sorts to register arithmetic.
- Change flag table to use one byte per flag to speed up lookup.

Started 2003-06-06 by Martin Dürst. Additions 2003-06-09 (completed UTF-16 code) and 2003-06-14 (completed UTF-8 code).

$Id: Overview.html,v 1.2 2003/08/29 21:23:24 duerst Exp $