The standard lexical tokenizer, lex (or flex), does not accept multi-byte characters, and is thusly impractical for many modern languages. This document describes a mapping from regular expressions describing UTF-8 multi-byte characters to a regular expressions of single bytes.
This document describes a system deployed at W3C. It is not endorsed by the W3C members, team, or any working group.
For many years, the standard way to build a parser was to use lex and yacc. The lexical tokenizer tool, lex, takes as input, a set of regular expressions, and outputs macros that traverse a set of DFAs, tables of bytes to match against the input stream and actions to perform when they are matched. Every character encoded in UTF-8 [UTF8] has a defined byte pattern. Thus, any character can be mapped to a series of bytes. The following text develops an algorithm for transforming regular character expressions to regular byte expressions assuming a UTF-8 encoding. This also lays the groundwork for performing analogous mappings for other encodings (UTF-16, iso-2022, ...). Examples demonstrate that the resulting regular byte expressions for sequences, ranges, and excluded ranges are practical in size and complexity.
The procedure for UTF-8 encoding is amply available so this introduction is relatively terse.
Unicode defines UCS codes for most symbols used in human communications. This code is a number between 0 and 0x10FFFF. UTF-8 encoding defines a way to put these numbers in a byte stream in a way that one can start at an arbitrary place in the file and scan backwards or forwards 4 characters to synchronize with the beginning of a character number. ASCII characters are 7 bit and are encoded the same was in UTF-8. This leaves the MSB to indicate a multi-byte character. When the MSB is set, the number of leading 1 bits indicates the width in bytes of the UTF-8 character. For example, 1110 0010
would indicate a three byte character. The pattern 10xx·xxxx
is not a legal leading byte, and that pattern is used to flag the trailing bytes. Thus, a three byte UTF-8 character would match 1110·xxxx 10xx·xxxx 10xx·xxxx
. The x
bits are available for recording the UCS code. Appendix A includes several examples that are used in developing the lex-U algorithm.
A hole in the unicode (called the surrogates) and a a desire to reject characters that are encoded in a larger than necessary UTF-8 sequence (overlongs) complicate the mapping somewhat. The following table shows a simplistic byte pattern for UTF-8 sequences between 1 and 4 bytes long. Surrogates and overlongs are trimmed from the pattern, resulting in a reasonably validating pattern. This pattern does not restrict to the set of defined UCS characters, instead to the set that is permitted by UTF-8 encoding.
UCS | plane | UTF-8 byte | reason | |||
---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | |||
[00-7F] | ||||||
00 7F | 00 7F | |||||
[C0-DF] | [00-FF] | |||||
[C0-C1] | * | exclude 2-byte overlongs | ||||
* | [00-7F] | 10xx·xxxx trailing byte marker must be set | ||||
* | [C0-FF] | |||||
80 7FF | C2 DF | 80 BF | ||||
[E0-EF] | [00-FF] | [00-FF] | ||||
E0 | [80-9F] | * | exclude 3-byte overlongs | |||
800 FFF | E0 | A0 BF | 80 BF | |||
1000 CFFF | E1 EC | 80 BF | 80 BF | |||
D000 D7FF | ED | 80 9F | 80 BF | |||
ED | [A0-BF] | * | surrogates d800-d8ff would encode as ED A0 80-ED BF BF | |||
E000 FFFF | EE EF | 80 BF | 80 BF | |||
1-16 | [F0-F7] | [00-FF] | [00-FF] | [00-FF] | ||
F0 | [80-8F] | * | * | exclude 4-byte overlongs | ||
10000 3FFFF | 1-3 | F0 | 90 BF | 80 BF | 80 BF | |
40000 FFFFF | 4-15 | F1 F3 | 80 BF | 80 BF | 80 BF | |
100000 10FFFF | 16 | F4 | 80 8F | 80 BF | 80 BF | |
F4 | [90-BF] | * | * | no unicode above 10 FF FF | ||
[F5-FF] | * | * | * |
This lex pattern that validates UTF-8 serves as a template for transcoding ranges (explained below).
1 ['\u0000'-'\u007F'] 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF']) 3 | ( '\u00E0' ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF']) 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF']) 5 | ( '\u00ED' ['\u0080'-'\u009F'] ['\u0080'-'\u00BF']) 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF']) 7 | ( '\u00F0' ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF']) 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF']) 9 | ( '\u00F4' ['\u0080'-'\u008F'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
Lines 4 and 6 can be expressed in a more compact syntax when the character range does not start or end in either character class:
4b | (['\u00E1'-'\u00EC','\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
This is also implemented as a perl regexp in W3C Internationalization FAQ: Multilingual Forms.
I haven't worked out a good way to describe the algorithm. Basically, any valid characters in the regular character expression will correspond to a point in the validating pattern. The trick is to expand the rules where your boundry codepoing is not at the edge of a rule. The most expressive form of this seems to be the compileUTF8 perl script.
Transcoding a single character involves trivially substituting the UTF-8 byte sequence for the original character. The substitution may require grouping. For instance, a wide character that is part of a disjunction
'n' | 'メ' | 'N'
will transcode into a multi-byte sequence taking the same place in the disjunction.
'n' | ('\u00E3', '\u0083', '\u008A') | 'N'
A character range corresponds to a sublanguage of that defined by the validating lex template. Find the validating pattern productions that correspond to the start at the start and end of the range. Each production between them may be substituted directly, but care must be taken with the bottom and top of the range to match exactly the valid bytes only.
For instance, to find the range E3 85 B0 ('ㅰ') - F3 B0 A1 99 ('') following ranges must be included:
byte | reason | |||
---|---|---|---|---|
0 | 1 | 2 | 3 | |
E3 | 85 | [B0-BF] | bottom of range - complete byte 2 | |
[86-BF] | [80-BF] | bottom of range - complete byte 1 | ||
[E4-EC] | [80-BF] | [80-BF] | bottom of range - complete byte 0 | |
ED | [80-9F] | [80-BF] | included production 5 | |
[EE-EF] | [80-BF] | [80-BF] | included production 6 | |
F0 | [90-BF] | [80-BF] | [80-BF] | included production 7 |
[F1-F2] | [80-BF] | [80-BF] | [80-BF] | top of range - complete byte 0 |
F3 | [80-AF] | [80-BF] | [80-BF] | top of range - complete byte 1 |
B0 | [80-A0] | [80-BF] | top of range - complete byte 2 | |
A1 | [80-99] | top of range - complete byte 3 |
( '\u00E3' '\u0085' ['\u00B0'-'\u00BF']) | ( '\u00E3' ['\u0086'-'\u00BF'] ['\u0080'-'\u00BF']) | (['\u00E4'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF']) 5 | ( '\u00ED' ['\u0080'-'\u009F'] ['\u0080'-'\u00BF']) 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF']) 7 | ( '\u00F0' ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF']) | (['\u00F1'-'\u00F2'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF']) | ( '\u00F3' ['\u0080'-'\u0084'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF']) | ( '\u00F3' '\u0085' ['\u0080'-'\u00A0'] ['\u0080'-'\u00BF']) | ( '\u00F3' '\u0085' '\u00A1' ['\u0080'-'\u0099'])
or when grouped:
( '\u00E3' ( ( '\u0085' ['\u00B0'-'\u00BF']) | |(['\u0086'-'\u00BF'] ['\u0080'-'\u00BF']))) | (['\u00E4'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF']) 5 | ( '\u00ED' ['\u0080'-'\u009F'] ['\u0080'-'\u00BF']) 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF']) 7 | ( '\u00F0' ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF']) | (['\u00F1'-'\u00F2'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF']) | ( '\u00F3' ( (['\u0080'-'\u0084'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF']) | |( '\u0085' ( (['\u0080'-'\u00A0'] ['\u0080'-'\u00BF']) | |( '\u00A1' ['\u0080'-'\u0099'])))))
...
This section demonstrates the process for transcoding example regular character expressions into regular byte expressions.
A simple first example explores a small range in some 3-byte UTF-8 characters where only the last byte varies.
chars: [ 'ロ', 'ナ'- 'ノ' ] UCS: ['\u30ED','\u30CA'-'\u30CE' ] bytes: [E3 83 AD, E3 83 8A-E3 83 8E] group => ('\xE3', '\x83', ['\xAD', '\x8A'-'\x8E'])
Here, the last two bytes vary.
chars: [ 'ナ'- 'ㄅ' ] UCS: ['\u30CA'-'\u3105' ] bytes: [E3 83 8A-E3 84 85] group => '\xE3', ('\x83', ['\x8A'-'\xBF']) | ('\x84", ['\x80'-'\x85'])
This example varies more than just the last byte.
chars: [ '丁'- '龥' ] UCS: ['\u4E01'-'\u9fA5 ] bytes: [E4 B8 81-E9 BE A5] atomize => ('\xE4', (('\xB8', ['\x81'-'\xFF']) | (['\xB9'-'\xFF'],['\x00'-'\xFF']))) | (['\xE5'-'\xE8'], ['\x00'-'\xFF'], ['\x00'-'\xFF']) | ('\xE9', (['\x00'-'\xBD'], ['\x00'-'\xFF']) | ('\xBE', ['\x00'-'\xA5'])) validate => ('\xE4', (('\xB8', ['\x81'-'\xBF']) | (['\xB9'-'\xBF'],['\x80'-'\xBF']))) | (['\xE5'-'\xE8'], ['\x80'-'\xBF'], ['\x80'-'\xBF']) | ('\xE9', (['\x80'-'\xBD'], ['\x80'-'\xBF']) | ('\xBE', ['\x80'-'\xA5'])) group => ...
This example varies the number of bytes encoding the characters. It defines a probably impractical range from LATIN SMALL LETTER Q to an arbitrary place in supplementary private use area B called here PRIVATE USE MY Q:
chars: [ 'q''- '|q|' ] UCS: ['\u0071'-'\u9fA5 ] bytes: [ 71-F4 8D 90 AA] atomize => ['\x71'-'\x7F'] | (['\xC0'-'\xDF'], ['\x00'-'\xFF']) | (['\xE0'-'\xEF'], ['\x00'-'\xFF'], ['\x00'-'\xFF']) | (['\xF0'-'\xF7'], ['\x00'-'\xFF'], ['\x00'-'\xFF'], ['\x00'-'\xFF']) ...
This technique is implemented by an impressive list of one:
piece of cake
Thanks to Martin Dürst who taught me about UTF-8 validation and worked through some of the problems with me. He also suggested that lex worked well with byte tables and that changing that would be more difficult than describing the encoded bytes.
'ロ' '\u30ED' 0011 0000 1110 1101 encode as utf-8 => 1110 10 10 0011 00 0011 10 1101 1110 0011 1000 0011 1010 1101 ('\xe3' '\x83' '\xad')
'メ' '\u30CA' 0011 0000 1100 1010 1110 10 10 0011 00 0011 00 1010 1110 0011 1000 0011 1000 1010 ('\xe3' '\x83' '\x8a')
'ノ' '\u30CE' 0011 0000 1100 1110 1110 10 10 0011 00 0011 00 1110 1110 0011 1000 0011 1000 1110 ('\xe3' '\x83' '\x8e')
'ㅰ' '\u3170' 0011 0000 1110 1101 encode as utf-8 => 1110 10 10 0011 00 0011 10 1101 1110 0011 1000 0011 1010 1101 ('\xe3' '\x83' '\xad')
'丁' '\u4E01' 0100 1110 0000 0001 1110 10 10 0100 11 1000 00 0001 1110 0100 1011 1000 1000 0001 ('\xe4' '\xb8' '\x81')
'龥' '\u9FA5' 1001 1111 1010 0101 1110 10 10 1001 11 1110 10 0101 1110 1001 1011 1110 1010 0101 ('\xe9' '\xbe' '\xa5')
'q' '\u0071' 0111 0001 0 111 0001 0111 0001 ('\x71')
'|q|' '\u10d42a' 0001 0000 1101 0100 0010 1010 1111 0 10 10 10 100 00 1101 01 0000 10 1010 1111 0100 1000 1101 1001 0000 1010 1010 ('\xf4' '\x8d' '\x90' '\xaa')