Efficient implementation of Normalization checking for XML 1.1
We now have two versions, one for UTF-16 and one for UTF-8.
- Modified charlint
to generate the necessary data file in C format (option -nf16check): See
version
0.53 or higher
- Files: The source files have been moved to public
CVS
- nf16check.h: Include file for programs
that want to use checking for UTF-16
- nf16check.c: File that implements
checking object/algorithm for UTF-16
- nf8check.h: Include file for programs that
want to use checking for UTF-8
- nf8check.c: File that implements checking
object/algorithm for UTF-8
- nf16data.h: Include file for checking data
structure (no need to include in application programs). NOTE:
both nf8check and nf16check currently use nf16data.
- nf16data.c: Actual data file. The C source
is huge (700K), but compiled, this should be less than 50k. This can
be linked statically; no data in this file is ever changed.
NOTE: The code compiles, but has not really been tested.
- How to use:
- For each thread/entity/whatever that you want to check
concurrently, create a checker with
nf8checkNew
or
nf16checkNew
- If you want to check just for NFC (without the special W3C Character Model starting
conditions), call
nf8checkInit
or
nf16checkInit
.
- To check data, pass it as 0-delimited blocks to
nf8check
or nf16check
. For the first block,
set startP to check W3C Character Model starting conditions. Block
boundaries can be at any point (e.g. between high and low surrogate,
or between base and combining character or between two combining
characters). However, for UTF-16, do not pass in a string with only a
high surrogate on the first call (i.e. when startP is set); this will
give wrong results. For UTF-8, there are no restrictions on block
boundaries at all. Also, it is assumed that the data is clean, i.e.
for UTF-16, it is assumed that surrogates always appear in high-low
pairs, and for UTF-8, it is assumed that the leading/trailing
patterns are correct, or are checked elsewhere.
- You can reuse a checker to check something completely different as
soon as it is no longer needed. If you want to change whether the
checker checks for existing characters, call
nf8checkExists
or nf16checkExists
.
- Once a checker is no longer needed, delete with
nf16checkDelete
or nf16checkDelete
- Data used:
Overall, on a Win2000 box with cygwin gcc, this takes 45604 bytes.
Storage allocation is straightforward, so this shouldn't differ much on
other systems.
- Algorithm sketch:
- At the start of a stretch, check whether first character is
allowed. If not, stop.
- For each character, check whether allowed. If not, stop. If yes,
store one character in a buffer.
- If 'trailer' (combining) is detected, go into extended mode, using
a list of characters including:
- starter/base of combining sequence
- last previously found combining character
- For each new trailer (including the first one):
- compare for adequate combining class sorting with previous
combining character
- Check for potential combination with starter. The list of
recombinations is calculated so that any combining character that
would lead to a change (full combination, recombination so that
the combining character gets combined in but another combining
character is separated, complete decomposition,...) is listed.
2176 such pairs have been found.
- Ideas for higher efficiency:
- Combine the code with some other code, e.g. code checking correct
UTF-8/UTF-16, or doing some lookups.
- Use 'register' declarations in
nfXcheck
, pretty much
in the order the variables appear.
- Create versions of
nfXcheck
that deal with special
calling conditions only.
- Create more inner loops, similar to the inner loop for
c
<= noPROBLEMmax
. This can eliminate assignements for
startP
, lastclass
, and
starterflag
.
- Change binary sorts to register arithmetic.
- Change flag table to use one byte per flag to speed up lookup.
Started 2003-06-06 by Martin Dürst. Additions 2003-06-09 (completed UTF-16
code) and 2003-06-14 (completed UTF-8 code).
$Id: Overview.html,v 1.2 2003/08/29 21:23:24 duerst Exp $