Adding utf-8 Encoding to Lex

Abstract

The standard lexical tokenizer, lex (or flex), does not accept multi-byte characters, and is thusly impractical for many modern languages. This document describes a mapping from regular expressions describing UTF-8 multi-byte characters to a regular expressions of single bytes.

Status of this Document

This document describes a system deployed at W3C. It is not endorsed by the W3C members, team, or any working group.

Introduction

For many years, the standard way to build a parser was to use lex and yacc. The lexical tokenizer tool, lex, takes as input, a set of regular expressions, and outputs macros that traverse a set of DFAs, tables of bytes to match against the input stream and actions to perform when they are matched. Every character encoded in UTF-8 [UTF8] has a defined byte pattern. Thus, any character can be mapped to a series of bytes. The following text develops an algorithm for transforming regular character expressions to regular byte expressions assuming a UTF-8 encoding. This also lays the groundwork for performing analogous mappings for other encodings (UTF-16, iso-2022, ...). Examples demonstrate that the resulting regular byte expressions for sequences, ranges, and excluded ranges are practical in size and complexity.

UTF-8 Encoding

The procedure for UTF-8 encoding is amply available so this introduction is relatively terse.

Unicode defines UCS codes for most symbols used in human communications. This code is a number between 0 and 0x10FFFF. UTF-8 encoding defines a way to put these numbers in a byte stream in a way that one can start at an arbitrary place in the file and scan backwards or forwards 4 characters to synchronize with the beginning of a character number. ASCII characters are 7 bit and are encoded the same was in UTF-8. This leaves the MSB to indicate a multi-byte character. When the MSB is set, the number of leading 1 bits indicates the width in bytes of the UTF-8 character. For example, 1110 0010 would indicate a three byte character. The pattern 10xx·xxxx is not a legal leading byte, and that pattern is used to flag the trailing bytes. Thus, a three byte UTF-8 character would match 1110·xxxx 10xx·xxxx 10xx·xxxx. The x bits are available for recording the UCS code. Appendix A includes several examples that are used in developing the lex-U algorithm.

Validating Pattern

A hole in the unicode (called the surrogates) and a a desire to reject characters that are encoded in a larger than necessary UTF-8 sequence (overlongs) complicate the mapping somewhat. The following table shows a simplistic byte pattern for UTF-8 sequences between 1 and 4 bytes long. Surrogates and overlongs are trimmed from the pattern, resulting in a reasonably validating pattern. This pattern does not restrict to the set of defined UCS characters, instead to the set that is permitted by UTF-8 encoding.

Validating lex Template

UCS	plane	UTF-8 byte	reason
		[00-7F]
00 7F		00 7F
		[C0-DF]	[00-FF]
	[C0-C1]	*			exclude 2-byte overlongs
	*	[00-7F]			10xx·xxxx trailing byte marker must be set
	*	[C0-FF]
80 7FF		C2 DF	80 BF
		[E0-EF]	[00-FF]	[00-FF]
	E0	[80-9F]	*		exclude 3-byte overlongs
800 FFF		E0	A0 BF	80 BF
1000 CFFF		E1 EC	80 BF	80 BF
D000 D7FF		ED	80 9F	80 BF
		ED	[A0-BF]	*		surrogates d800-d8ff would encode as ED A0 80-ED BF BF
E000 FFFF		EE EF	80 BF	80 BF
	1-16	[F0-F7]	[00-FF]	[00-FF]	[00-FF]
	F0	[80-8F]	*	*	exclude 4-byte overlongs
10000 3FFFF	1-3	F0	90 BF	80 BF	80 BF
40000 FFFFF	4-15	F1 F3	80 BF	80 BF	80 BF
100000 10FFFF	16	F4	80 8F	80 BF	80 BF
		F4	[90-BF]	*	*	no unicode above 10 FF FF
	[F5-FF]	*	*	*

This lex pattern that validates UTF-8 serves as a template for transcoding ranges (explained below).

1    ['\u0000'-'\u007F']
2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
3 | ( '\u00E0'           ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF'])
4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
5 | ( '\u00ED'           ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
7 | ( '\u00F0'           ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
9 | ( '\u00F4'           ['\u0080'-'\u008F'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])

Lines 4 and 6 can be expressed in a more compact syntax when the character range does not start or end in either character class:

4b | (['\u00E1'-'\u00EC','\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])

lex-U Algorithm

I haven't worked out a good way to describe the algorithm. Basically, any valid characters in the regular character expression will correspond to a point in the validating pattern. The trick is to expand the rules where your boundry codepoing is not at the edge of a rule. The most expressive form of this seems to be the compileUTF8 perl script.

Single Character

Transcoding a single character involves trivially substituting the UTF-8 byte sequence for the original character. The substitution may require grouping. For instance, a wide character that is part of a disjunction

will transcode into a multi-byte sequence taking the same place in the disjunction.

'n' | ('\u00E3', '\u0083', '\u008A') | 'N'

Character Ranges

A character range corresponds to a sublanguage of that defined by the validating lex template. Find the validating pattern productions that correspond to the start at the start and end of the range. Each production between them may be substituted directly, but care must be taken with the bottom and top of the range to match exactly the valid bytes only.

For instance, to find the range E3 85 B0 ('ㅰ') - F3 B0 A1 99 ('󰡙') following ranges must be included:

byte	reason
E3	85	[B0-BF]		bottom of range - complete byte 2
[86-BF]	[80-BF]		bottom of range - complete byte 1
[E4-EC]	[80-BF]	[80-BF]		bottom of range - complete byte 0
ED	[80-9F]	[80-BF]		included production 5
[EE-EF]	[80-BF]	[80-BF]		included production 6
F0	[90-BF]	[80-BF]	[80-BF]	included production 7
[F1-F2]	[80-BF]	[80-BF]	[80-BF]	top of range - complete byte 0
F3	[80-AF]	[80-BF]	[80-BF]	top of range - complete byte 1
B0	[80-A0]	[80-BF]	top of range - complete byte 2
A1	[80-99]	top of range - complete byte 3

    ( '\u00E3'            '\u0085'           ['\u00B0'-'\u00BF'])
  | ( '\u00E3'           ['\u0086'-'\u00BF'] ['\u0080'-'\u00BF'])
  | (['\u00E4'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
5 | ( '\u00ED'           ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
7 | ( '\u00F0'           ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  | (['\u00F1'-'\u00F2'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  | ( '\u00F3'           ['\u0080'-'\u0084'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  | ( '\u00F3'            '\u0085'           ['\u0080'-'\u00A0'] ['\u0080'-'\u00BF'])
  | ( '\u00F3'            '\u0085'            '\u00A1'           ['\u0080'-'\u0099'])

    ( '\u00E3'        ( ( '\u0085'           ['\u00B0'-'\u00BF'])
  |                    |(['\u0086'-'\u00BF'] ['\u0080'-'\u00BF'])))
  | (['\u00E4'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
5 | ( '\u00ED'           ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
7 | ( '\u00F0'           ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  | (['\u00F1'-'\u00F2'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  | ( '\u00F3'        ( (['\u0080'-'\u0084'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  |                    |( '\u0085'        ( (['\u0080'-'\u00A0'] ['\u0080'-'\u00BF'])
  |                                        |( '\u00A1'           ['\u0080'-'\u0099'])))))

Excluded Character Ranges

Regular Expression Transcoding Examples

This section demonstrates the process for transcoding example regular character expressions into regular byte expressions.

Simple 3-byte Sequence, Range

A simple first example explores a small range in some 3-byte UTF-8 characters where only the last byte varies.

chars: [    'ロ',    'ナ'-    'ノ' ]
UCS:   ['\u30ED','\u30CA'-'\u30CE' ]
bytes: [E3 83 AD, E3 83 8A-E3 83 8E]

group =>
('\xE3', '\x83', ['\xAD', '\x8A'-'\x8E'])

Semi-distributed 3-byte Sequence, Range

chars: [    'ナ'-    'ㄅ' ]
UCS:   ['\u30CA'-'\u3105' ]
bytes: [E3 83 8A-E3 84 85]

group =>
'\xE3', ('\x83', ['\x8A'-'\xBF']) | ('\x84", ['\x80'-'\x85'])

Distributed 3-byte Range

chars: [    '丁'-    '龥' ]
UCS:   ['\u4E01'-'\u9fA5 ]
bytes: [E4 B8 81-E9 BE A5]

atomize =>
  ('\xE4', (('\xB8', ['\x81'-'\xFF'])
            | (['\xB9'-'\xFF'],['\x00'-'\xFF']))) 
| (['\xE5'-'\xE8'], ['\x00'-'\xFF'], ['\x00'-'\xFF'])
| ('\xE9', (['\x00'-'\xBD'], ['\x00'-'\xFF'])
            | ('\xBE', ['\x00'-'\xA5']))

validate =>

  ('\xE4', (('\xB8', ['\x81'-'\xBF'])
            | (['\xB9'-'\xBF'],['\x80'-'\xBF']))) 
| (['\xE5'-'\xE8'], ['\x80'-'\xBF'], ['\x80'-'\xBF'])
| ('\xE9', (['\x80'-'\xBD'], ['\x80'-'\xBF'])
            | ('\xBE', ['\x80'-'\xA5']))
group =>
...

Varied Byte Range

This example varies the number of bytes encoding the characters. It defines a probably impractical range from LATIN SMALL LETTER Q to an arbitrary place in supplementary private use area B called here PRIVATE USE MY Q:

chars: [     'q''-   '|q|'    ]
UCS:   ['\u0071'-'\u9fA5    ]
bytes: [      71-F4 8D 90 AA]

atomize =>
    ['\x71'-'\x7F']
 | (['\xC0'-'\xDF'], ['\x00'-'\xFF'])
 | (['\xE0'-'\xEF'], ['\x00'-'\xFF'], ['\x00'-'\xFF'])
 | (['\xF0'-'\xF7'], ['\x00'-'\xFF'], ['\x00'-'\xFF'], ['\x00'-'\xFF'])

...

Implementations

Conclusions

Acknowledgments

Thanks to Martin Dürst who taught me about UTF-8 validation and worked through some of the problems with me. He also suggested that lex worked well with byte tables and that changing that would be more difficult than describing the encoded bytes.

Appendix A - UTF-8 Encoding Examples

KATAKANA LETTER RO

                         'ロ'
                     '\u30ED'
          0011 0000 1110 1101
encode as utf-8 =>	      
1110      10        10       
     0011   00 0011   10 1101
1110 0011 1000 0011 1010 1101
 ('\xe3'    '\x83'    '\xad')

KATAKANA LETTER NA

                         'メ'
                     '\u30CA'
          0011 0000 1100 1010
				
1110      10        10       
     0011   00 0011   00 1010
1110 0011 1000 0011 1000 1010
 ('\xe3'    '\x83'    '\x8a')

KATAKANA LETTER NO

                        'ノ' 
                    '\u30CE' 
          0011 0000 1100 1110
				   
1110      10        10       
     0011   00 0011   00 1110
1110 0011 1000 0011 1000 1110
 ('\xe3'    '\x83'    '\x8e')

HANGUL LETTER MIEUM-PANSIOS

                         'ㅰ'
                     '\u3170'
          0011 0000 1110 1101
encode as utf-8 =>	      
1110      10        10       
     0011   00 0011   10 1101
1110 0011 1000 0011 1010 1101
 ('\xe3'    '\x83'    '\xad')

CJK UNIFIED IDIOGRAPH 4E01 (CHYOU)

                         '丁'
                     '\u4E01'
          0100 1110 0000 0001
					     
1110      10        10       
     0100   11 1000   00 0001
1110 0100 1011 1000 1000 0001
 ('\xe4'    '\xb8'    '\x81')

CJK UNIFIED IDIOGRAPH 9FA5 (YAKU)

                        '龥' 
                    '\u9FA5' 
          1001 1111 1010 0101
		     		   
1110      10        10       
     1001   11 1110   10 0101
1110 1001 1011 1110 1010 0101
 ('\xe9'    '\xbe'    '\xa5')

LATIN SMALL LETTER Q

PRIVATE USE MY Q

                                 '|q|' 
                            '\u10d42a' 
          0001 0000 1101 0100 0010 1010
   	 			   	  
1111 0    10        10        10       
      100   00 1101   01 0000   10 1010
1111 0100 1000 1101 1001 0000 1010 1010
 ('\xf4'    '\x8d'    '\x90'    '\xaa')