W3C International

Charlint - A Character Normalization Tool


NEW (2009-11-30): Charlint updated for Unicode 5.2.0.

Charlint is a character normalization and checking tool written in Perl. Among other things, it implements Normalization Form C (NFC) of Unicode TR 15 and serves as a test platform for Early Uniform Normalization in the W3C Character Model.

Perl Source and Installation

Charlint, also known as 'Charlie', is written in Perl 5 and is mostly independent of minor Perl versions. You can get the Charlint source from anonymous CVS. For the initial checkout, use:

prompt$ cvs -d :pserver:anonymous@dev.w3.org:/sources/public login
Logging in to :pserver:anonymous@dev.w3.org:2401/sources/public
CVS password: anonymous
prompt$ cvs -d :pserver:anonymous@dev.w3.org:/sources/public get charlint
cvs server: Updating charlint
U charlint/Overview.html
U charlint/README.cvs
U charlint/charlint.pl

If you don't have CVS, use the Web to download the newest version, or start with the charlint CVS overview. Charlint is covered by the W3C software licence. To install charlint, make sure that you have installed Perl 5, downloaded an appropriate character data file, and downloaded the Perl source. Please send error reports or comments to duerst@it.aoyama.ac.jp; for announcements and public discussion, please see the Winter mailing list (www-international@w3.org).

You'll also need the Perl module Storable, which charlint uses for the -s and -S options. The best way to install it may depend on your system.

Recommended Character Data Files

Reading in a data file can take some time. To speed up later runs, use -S to store a preprocessed file and -s to load it back quickly.

Charlint needs information on characters in order to work correctly. To indicate the file you want to use, please use the -f option. The recommended character data file is available from ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt. Charlint is updated when needed to work with new data files and to bring the composition exclusions (which are hand-coded) up to date.
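To give an idea of what the character data file contains, here is a minimal Python sketch that reads a few lines in the semicolon-separated UnicodeData.txt format and extracts the fields most relevant to normalization (combining class and decomposition mapping). It is not charlint's Perl code; the names CharRecord and parse_unicode_data_line and the three sample lines are chosen here for illustration only.

```python
from typing import NamedTuple, Optional, Tuple


class CharRecord(NamedTuple):
    """Hypothetical record type; not taken from charlint."""
    codepoint: int
    name: str
    combining_class: int
    decomposition: Tuple[int, ...]  # code points of the mapping, may be empty
    compat_tag: Optional[str]       # e.g. "<noBreak>" marks a compatibility mapping


def parse_unicode_data_line(line: str) -> CharRecord:
    # Field 0: code point, 1: name, 3: canonical combining class,
    # 5: decomposition mapping (optionally prefixed by a <tag>).
    fields = line.rstrip("\n").split(";")
    decomp = fields[5].split()
    tag = None
    if decomp and decomp[0].startswith("<"):
        tag = decomp.pop(0)
    return CharRecord(
        codepoint=int(fields[0], 16),
        name=fields[1],
        combining_class=int(fields[3]),
        decomposition=tuple(int(cp, 16) for cp in decomp),
        compat_tag=tag,
    )


# Illustrative sample lines in UnicodeData.txt format.
SAMPLE = [
    "00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;",
    "00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;",
    "0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;",
]

records = {r.codepoint: r for r in map(parse_unicode_data_line, SAMPLE)}
```

A mapping without a tag (as for U+00C5) is a canonical decomposition; a tagged mapping (as for U+00A0) is a compatibility decomposition.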

How to use charlint

Charlint is a Perl script that works as a simple filter. It uses UTF-8 both for input and for output. Behavior can be fine-tuned with various options. Without options, charlint converts input text to NFC (Normalization Form C, Canonical Composition). In addition, charlint checks for correct UTF-8, makes sure that there is no initial BOM, and warns about undefined and private-use area codepoints.
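The default checks described above can be approximated in a few lines of Python. This is only a sketch of the idea, not a port of charlint: the function name check_and_normalize and the warning texts are invented here, and Python's unicodedata module uses the Unicode version bundled with the interpreter rather than a data file given with -f.

```python
import codecs
import unicodedata


def check_and_normalize(data: bytes):
    """Return (NFC text or None, list of warnings) for UTF-8 input bytes."""
    warnings = []

    # An initial BOM is reported but, in this sketch, left in place.
    if data.startswith(codecs.BOM_UTF8):
        warnings.append("initial BOM found")

    # Strict decoding rejects malformed UTF-8 sequences.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        warnings.append(f"invalid UTF-8 at byte {err.start}")
        return None, warnings

    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cn":    # unassigned in Python's Unicode tables
            warnings.append(f"undefined codepoint U+{ord(ch):04X}")
        elif category == "Co":  # private use
            warnings.append(f"private-use codepoint U+{ord(ch):04X}")

    return unicodedata.normalize("NFC", text), warnings
```

For example, the bytes for "A" followed by U+030A COMBINING RING ABOVE come back as the single precomposed character U+00C5, with no warnings.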

To preprocess the original data file, use -f originalDatafile -S storageFile -d -D. This saves all the data necessary for later processing for all normalization forms. To specify the storage file on later runs, use -s storageFile.
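The store-and-reload idea behind -S and -s can be illustrated with Python's pickle module, which plays a role similar to Perl's Storable. The sketch below is an analogy under that assumption; the function names, the table layout, and the file name charlint.store are hypothetical and do not reflect charlint's actual storage format.

```python
import os
import pickle
import tempfile


def store_tables(tables: dict, storage_file: str) -> None:
    # Analogous to -S: write the preprocessed tables in a binary format.
    with open(storage_file, "wb") as fh:
        pickle.dump(tables, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_tables(storage_file: str) -> dict:
    # Analogous to -s: reload the tables without reparsing the data file.
    with open(storage_file, "rb") as fh:
        return pickle.load(fh)


# Hypothetical preprocessed tables, as they might look after parsing.
tables = {
    "combining_class": {0x0301: 230},
    "decomposition": {0x00C5: (0x0041, 0x030A)},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "charlint.store")
    store_tables(tables, path)
    reloaded = load_tables(path)
```

The benefit is the same in both languages: parsing and checking the text data file happens once, and later runs only deserialize the result.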

For NFC, no special option is necessary. For NFD, use -x; for NFKC, use -K; for NFKD, use -x -K. Each option has to be given separately. To avoid any normalization, but still do the various checks, use -C. Charlint checks the input from the Unicode database file very carefully. To get the output of these operations, use -d.
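The relation between the -x and -K options and the four normalization forms can be summarized in a short Python sketch. The function form_for_options is a name made up for this illustration, and the normalization itself is done by Python's standard unicodedata module, not by charlint.

```python
import unicodedata


def form_for_options(decompose_only: bool = False,
                     compatibility: bool = False) -> str:
    """Map the presence of -x (decompose_only) and -K (compatibility)
    to the name of the resulting normalization form."""
    return "NF" + ("K" if compatibility else "") + ("D" if decompose_only else "C")


sample = "\ufb01\u00c5"  # LATIN SMALL LIGATURE FI, then A WITH RING ABOVE

nfc = unicodedata.normalize(form_for_options(), sample)
nfd = unicodedata.normalize(form_for_options(decompose_only=True), sample)
nfkc = unicodedata.normalize(form_for_options(compatibility=True), sample)
nfkd = unicodedata.normalize(
    form_for_options(decompose_only=True, compatibility=True), sample
)
```

Here the canonical forms keep the ligature and differ only in whether U+00C5 is decomposed, while the compatibility forms additionally replace the ligature with the two letters "f" and "i".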

A list of all options, such as the one below, can be obtained by running charlint -h.

(options prefixed by # are currently not available)
-b: Remove initial 'Byte Order Mark'
-B: Supress warning about initial 'Byte Order Mark'
-c: Detect non-normalized data (but do not normalize)
-C: Do not normalize
-d: Debug: Thoroughly check character data table input
-D: Leave after reading in character data
-e: # remove undefined codepoints
-E: Do not warn about undefined codepoints
-f file: Read data from file (no default anymore)
         (please use newest V3.2.0 datafiles)
-F951: Use old (wrong) mapping for U+F951 (use this option
  if you really need 3.1.0 behaviour)
-h: Prints out this short description
-k: # Warn about compatibility codepoints
-K: Normalize out (i.e. decompose) compatibility codepoints
-n: Accept &#ddddd; and &#xhhhh; on input
        (beware of <![CDATA[, <SCRIPT>, <STYLE>)
-nX: same as -n, plus &#Xhhhh; (use for HTML only!)
-N: Produce &#xhhhh; on output
-o: Print out 'unprintable' bytes as \\octal
-p: # Remove stuff in private use areas
-P: Supress checking private use areas
-q: Quiet, don't output progress messages
-s file: Read data from file produced with -S
-S file: Write data to file for fast reload with -s
-u: # Fix UTF-8 (convert or remove)
-U: Supress checking correctness of UTF-8
-v: Print version
-x: Do decomposition only
-X: Don't do decomposition (assume input is decomposed)
-YWH: Treat YOD WITH HIRIQ as precomposed (use this option
  if you really need 3.0.0 behaviour)
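As an illustration of what the -n, -nX, and -N options deal with, the following Python sketch converts numeric character references to characters and back. It is an approximation for explanatory purposes: the function names are invented, and the choice to escape only non-ASCII characters on output is an assumption, not something the option list states.

```python
import re


def decode_ncrs(text: str, html_mode: bool = False) -> str:
    """Replace &#ddddd; and &#xhhhh; by the characters they denote.
    With html_mode=True (compare -nX), &#Xhhhh; is accepted as well."""
    x = "[xX]" if html_mode else "x"
    pattern = re.compile(r"&#(?:" + x + r"([0-9A-Fa-f]+)|([0-9]+));")

    def replace(match):
        if match.group(1) is not None:
            value = int(match.group(1), 16)
        else:
            value = int(match.group(2))
        # Leave references beyond the Unicode range untouched.
        return chr(value) if value <= 0x10FFFF else match.group(0)

    return pattern.sub(replace, text)


def encode_ncrs(text: str) -> str:
    """Write non-ASCII characters as &#xhhhh; (assumed behaviour, compare -N)."""
    return "".join(ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in text)
```

Note that a plain text substitution like this one has the same weakness the option list warns about: it does not know about CDATA sections or SCRIPT and STYLE elements, where such sequences are not character references.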

Version History

# 2009/11/28: 0.55, updated to Unicode Version 5.2.0                 MJD
# 2002/06/24: 0.54, improving -nf16check (compiler warnings, speed)  MJD
# 2002/06/08: 0.53, added -nf16check data file production            MJD
# 2002/08/23: 0.52, changed default file to UnicodeData.txt          MJD
# 2002/05/21: 0.51, added option -nX (use for HTML only!)            MJD
# 2002/04/03: 0.50, updated for 3.2.0; added -F951; added -c         MJD
# 2001/10/03: 0.49, code cleanup for use strict and -w               MJD
# 2001/04/01: 0.48, updated for 3.1.0 (final)                        MJD
# 2001/03/07: 0.47, YOD WITH HIRIQ corrigendum                       MJD
# 2000/12/19: 0.46, updated for 3.1.0 (beta)                         MJD
# 2000/11/12: 0.45, bug fix for CJK extension A                      MJD
# 2000/11/09: 0.44, implemented -s/-S (Storable data)                MJD
# 2000/10/05: 0.43, implemented -K (kompatibility decomposition)     MJD
# 2000/10/05: 0.42, updated for 3.0.1, fixed line ends               MJD
# 2000/10/05: 0.41, added 2000 to copyright, tested CVS commit       MJD
# 2000/08/03: 0.40, added Hangul support and did quite some testing  MJD
# 2000/08/02: 0.37, added -x and -X for decomposition                MJD
# 2000/07/27: 0.36, fixed a bug for non-starter decompositions       MJD
# 2000/07/24: 0.35, adapted exclusions to 3.0.0 final (+Tibetan)     MJD
# 2000/07/24: 0.34, $chClass = $CombClass{ch}; should read $chClass = $CombClass{$ch};
#                   implemented -C                                   MJD
# 1999/08/16: 0.33, updated for second version of 3.0.0.beta         MJD
# 1999/07/01: 0.32, adapted surrogates/exclusions to 3.0.0.beta      MJD
# 1999/06/25: 0.31, fixed reordering bug, going public               MJD
# 1999/06/23: 0.30, preparation for W3C member test, without Hangul  MJD

Background

Future Plans

Charlint is still being maintained, so bug reports and patches are welcome.

There are many additional features that we have thought about adding, such as:

However, we would probably rewrite charlint in Ruby before adding major new features.


Martin Dürst
Webmaster
last revised $Date: 2009/12/03 08:38:05 $ by $Author: duerst $

Copyright © 1997 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.