Adding utf-8 Encoding to Lex

CVS Version:
$Id: 23-lex-U.html,v 1.19 2005/06/27 05:09:29 eric Exp $
Author:
Eric Prud'hommeaux, W3C

Abstract

The standard lexical tokenizer, lex (or flex), does not accept multi-byte characters and is thus impractical for many modern languages. This document describes a mapping from regular expressions describing UTF-8 multi-byte characters to regular expressions of single bytes.

Status of this Document

This document describes a system deployed at W3C. It is not endorsed by the W3C members, team, or any working group.

Table of Contents

    Introduction
    UTF-8 Encoding
    Validating Pattern
    Validating lex Template
    lex-U Algorithm
        Single Character
        Character Ranges
        Excluded Character Ranges
    Regular Expression Transcoding Examples
        Simple 3-byte Sequence, Range
        Semi-distributed 3-byte Sequence, Range
        Distributed 3-byte Range
        Varied Byte Range
    Implementations
    Conclusions
    Acknowledgments
    Appendix A - UTF-8 Encoding Examples
    References

Introduction

For many years, the standard way to build a parser was to use lex and yacc. The lexical tokenizer tool, lex, takes as input a set of regular expressions and outputs macros that traverse a set of DFAs: tables of bytes to match against the input stream, with actions to perform when they are matched. Every character encoded in UTF-8 [UTF8] has a defined byte pattern. Thus, any character can be mapped to a series of bytes. The following text develops an algorithm for transforming regular character expressions to regular byte expressions, assuming a UTF-8 encoding. This also lays the groundwork for performing analogous mappings for other encodings (UTF-16, iso-2022, ...). Examples demonstrate that the resulting regular byte expressions for sequences, ranges, and excluded ranges are practical in size and complexity.

UTF-8 Encoding

The procedure for UTF-8 encoding is amply documented elsewhere, so this introduction is relatively terse.

Unicode defines UCS codes for most symbols used in human communication. A code is a number between 0 and 0x10FFFF. UTF-8 encoding defines a way to put these numbers in a byte stream such that one can start at an arbitrary place in the file and scan backwards or forwards at most four bytes to synchronize with the beginning of a character. ASCII characters are 7 bits and are encoded the same way in UTF-8. This leaves the MSB free to indicate a multi-byte character. When the MSB is set, the number of leading 1 bits indicates the width in bytes of the UTF-8 character. For example, 1110 0010 would indicate a three byte character. The pattern 10xx·xxxx is not a legal leading byte, so that pattern is used to flag the trailing bytes. Thus, a three byte UTF-8 character matches 1110·xxxx 10xx·xxxx 10xx·xxxx. The x bits are available for recording the UCS code. Appendix A includes several examples that are used in developing the lex-U algorithm.
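The bit layout just described can be written out directly as a sanity check. The following Python sketch (not part of the original document) packs a UCS code into UTF-8 bytes by hand and compares an Appendix A example against it:

```python
def utf8_bytes(cp: int) -> bytes:
    """Encode a UCS code point by hand, following the bit layout above."""
    if cp < 0x80:                      # 0xxx.xxxx - plain ASCII
        return bytes([cp])
    if cp < 0x800:                     # 110x.xxxx 10xx.xxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                   # 1110.xxxx 10xx.xxxx 10xx.xxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    # 1111.0xxx 10xx.xxxx 10xx.xxxx 10xx.xxxx
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

# KATAKANA LETTER RO (U+30ED) from Appendix A encodes as E3 83 AD:
assert utf8_bytes(0x30ED) == b'\xe3\x83\xad'
```

Every example worked through in Appendix A agrees with this packing, as does Python's built-in encoder.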

Validating Pattern

A hole in Unicode (the surrogates) and a desire to reject characters encoded in a longer-than-necessary UTF-8 sequence (overlongs) complicate the mapping somewhat. The following table shows a simplistic byte pattern for UTF-8 sequences between 1 and 4 bytes long. Surrogates and overlongs are trimmed from the pattern, resulting in a reasonably validating pattern. This pattern does not restrict input to the set of defined UCS characters, only to the set permitted by UTF-8 encoding.

  UCS             plane  byte 0   byte 1   byte 2   byte 3   reason
  -----------------------------------------------------------------------------
  one byte: [00-7F]
  00-7F                  [00-7F]
  two bytes: [C0-DF][00-FF]
                         [C0-C1]  *                          exclude 2-byte overlongs
                         *        [00-7F]                    trailing byte marker 10xx·xxxx must be set
                         *        [C0-FF]                    trailing byte marker 10xx·xxxx must be set
  80-7FF                 [C2-DF]  [80-BF]
  three bytes: [E0-EF][00-FF][00-FF]
                         E0       [80-9F]  *                 exclude 3-byte overlongs
  800-FFF                E0       [A0-BF]  [80-BF]
  1000-CFFF              [E1-EC]  [80-BF]  [80-BF]
  D000-D7FF              ED       [80-9F]  [80-BF]
                         ED       [A0-BF]  *                 surrogates D800-DFFF would encode as ED A0 80 - ED BF BF
  E000-FFFF              [EE-EF]  [80-BF]  [80-BF]
  four bytes: [F0-F7][00-FF][00-FF][00-FF], planes 1-16
                         F0       [80-8F]  *        *        exclude 4-byte overlongs
  10000-3FFFF     1-3    F0       [90-BF]  [80-BF]  [80-BF]
  40000-FFFFF     4-15   [F1-F3]  [80-BF]  [80-BF]  [80-BF]
  100000-10FFFF   16     F4       [80-8F]  [80-BF]  [80-BF]
                         F4       [90-BF]  *        *        no Unicode above 10FFFF
                         [F5-FF]  *        *        *

Validating lex Template

The following lex pattern validates UTF-8 and serves as a template for transcoding ranges (explained below).

1    ['\u0000'-'\u007F']
2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
3 | ( '\u00E0'           ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF'])
4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
5 | ( '\u00ED'           ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
7 | ( '\u00F0'           ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
9 | ( '\u00F4'           ['\u0080'-'\u008F'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])

Lines 4 and 6 can be expressed in a more compact syntax when the character range does not start or end in either character class:

4b | (['\u00E1'-'\u00EC','\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])

This is also implemented as a perl regexp in W3C Internationalization FAQ: Multilingual Forms.
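The template also transliterates directly into a bytes regular expression, which makes its exclusions easy to exercise. This Python sketch (an illustration, not part of the original document) is the same nine productions with each '\u00XX' lex literal written as a \xXX byte:

```python
import re

# The validating lex template as a Python bytes regexp.
UTF8_CHAR = re.compile(
    b'[\x00-\x7F]'
    b'|[\xC2-\xDF][\x80-\xBF]'
    b'|\xE0[\xA0-\xBF][\x80-\xBF]'
    b'|[\xE1-\xEC][\x80-\xBF][\x80-\xBF]'
    b'|\xED[\x80-\x9F][\x80-\xBF]'
    b'|[\xEE-\xEF][\x80-\xBF][\x80-\xBF]'
    b'|\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]'
    b'|[\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]'
    b'|\xF4[\x80-\x8F][\x80-\xBF][\x80-\xBF]'
)

assert UTF8_CHAR.fullmatch('ロ'.encode('utf-8'))      # E3 83 AD: accepted
assert not UTF8_CHAR.fullmatch(b'\xC0\x80')           # 2-byte overlong: rejected
assert not UTF8_CHAR.fullmatch(b'\xED\xA0\x80')       # surrogate D800: rejected
assert not UTF8_CHAR.fullmatch(b'\xF4\x90\x80\x80')   # above 10FFFF: rejected
```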

lex-U Algorithm

I haven't worked out a good way to describe the algorithm. Basically, any valid character in the regular character expression corresponds to a point in the validating pattern. The trick is to expand the rules where a boundary codepoint is not at the edge of a rule. The most expressive form of this seems to be the compileUTF8 perl script.
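One way to make the expansion concrete is the recursion below: a Python sketch of the same idea, written independently of the compileUTF8 perl script (nothing here is taken from it). It splits a codepoint range at the UTF-8 length boundaries and the surrogate gap, then expands each equal-length piece into per-byte ranges, completing trailing bytes at the bottom and top of the range:

```python
def _same_len(lo: bytes, hi: bytes) -> list:
    """Expand two equal-length UTF-8 sequences into byte-range alternatives."""
    if len(lo) == 1:
        return [[(lo[0], hi[0])]]
    if lo[0] == hi[0]:                       # shared lead byte: recurse on the tail
        return [[(lo[0], lo[0])] + t for t in _same_len(lo[1:], hi[1:])]
    out, n = [], len(lo) - 1
    first, last = lo[0], hi[0]
    if lo[1:] != b'\x80' * n:                # bottom of range: complete trailing bytes
        out += [[(lo[0], lo[0])] + t for t in _same_len(lo[1:], b'\xBF' * n)]
        first += 1
    if hi[1:] != b'\xBF' * n:                # top of range: complete trailing bytes
        out += [[(hi[0], hi[0])] + t for t in _same_len(b'\x80' * n, hi[1:])]
        last -= 1
    if first <= last:                        # whole lead bytes in between
        out.append([(first, last)] + [(0x80, 0xBF)] * n)
    return out

def expand(lo: int, hi: int) -> list:
    """Split [lo, hi] at the UTF-8 length boundaries and the surrogate
    gap D800-DFFF, then expand each piece."""
    out = []
    for a, b in [(0x0, 0x7F), (0x80, 0x7FF), (0x800, 0xD7FF),
                 (0xE000, 0xFFFF), (0x10000, 0x10FFFF)]:
        p, q = max(lo, a), min(hi, b)
        if p <= q:
            out += _same_len(chr(p).encode('utf-8'), chr(q).encode('utf-8'))
    return out

# The full codepoint space reproduces the nine productions of the
# validating template; the Character Ranges example below yields ten.
assert len(expand(0x0, 0x10FFFF)) == 9
assert len(expand(0x3170, 0xF0859)) == 10
```

Each alternative comes out as a list of (min, max) byte pairs, which maps one-for-one onto a lex production.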

Single Character

Transcoding a single character involves trivially substituting the UTF-8 byte sequence for the original character. The substitution may require grouping. For instance, a wide character that is part of a disjunction

'n' | 'ナ' | 'N'

will transcode into a multi-byte sequence taking the same place in the disjunction.

'n' | ('\u00E3', '\u0083', '\u008A') | 'N'
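The substituted disjunction behaves the same as a bytes regular expression. A small Python illustration (not part of the original; 'ナ' is U+30CA, whose UTF-8 encoding E3 83 8A is worked through in Appendix A):

```python
import re

# The disjunction above, with the wide character replaced by its bytes.
pattern = re.compile(b'n|\xE3\x83\x8A|N')

assert pattern.fullmatch('ナ'.encode('utf-8'))
assert pattern.fullmatch(b'n') and pattern.fullmatch(b'N')
assert not pattern.fullmatch(b'\xE3')   # a lone lead byte is not a match
```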

Character Ranges

A character range corresponds to a sublanguage of the one defined by the validating lex template. Find the validating pattern productions that correspond to the start and end of the range. Each production between them may be substituted directly, but care must be taken at the bottom and top of the range to match exactly the valid bytes only.

For instance, to cover the range E3 85 B0 ('ㅰ') - F3 B0 A1 99 ('󰡙'), the following ranges must be included:

  byte 0    byte 1    byte 2    byte 3    reason
  -------------------------------------------------------------------
  E3        85        [B0-BF]             bottom of range - complete byte 2
  E3        [86-BF]   [80-BF]             bottom of range - complete byte 1
  [E4-EC]   [80-BF]   [80-BF]             bottom of range - complete byte 0
  ED        [80-9F]   [80-BF]             included production 5
  [EE-EF]   [80-BF]   [80-BF]             included production 6
  F0        [90-BF]   [80-BF]   [80-BF]   included production 7
  [F1-F2]   [80-BF]   [80-BF]   [80-BF]   top of range - complete byte 0
  F3        [80-AF]   [80-BF]   [80-BF]   top of range - complete byte 1
  F3        B0        [80-A0]   [80-BF]   top of range - complete byte 2
  F3        B0        A1        [80-99]   top of range - complete byte 3
    ( '\u00E3'            '\u0085'           ['\u00B0'-'\u00BF'])
  | ( '\u00E3'           ['\u0086'-'\u00BF'] ['\u0080'-'\u00BF'])
  | (['\u00E4'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
5 | ( '\u00ED'           ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
7 | ( '\u00F0'           ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  | (['\u00F1'-'\u00F2'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  | ( '\u00F3'           ['\u0080'-'\u00AF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  | ( '\u00F3'            '\u00B0'           ['\u0080'-'\u00A0'] ['\u0080'-'\u00BF'])
  | ( '\u00F3'            '\u00B0'            '\u00A1'           ['\u0080'-'\u0099'])

or when grouped:

    ( '\u00E3'        ( ( '\u0085'           ['\u00B0'-'\u00BF'])
                       |(['\u0086'-'\u00BF'] ['\u0080'-'\u00BF'])))
  | (['\u00E4'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
5 | ( '\u00ED'           ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
7 | ( '\u00F0'           ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  | (['\u00F1'-'\u00F2'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  | ( '\u00F3'        ( (['\u0080'-'\u00AF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
                       |( '\u00B0'        ( (['\u0080'-'\u00A0'] ['\u0080'-'\u00BF'])
                                          |( '\u00A1'           ['\u0080'-'\u0099'])))))
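The derivation can be spot-checked mechanically. This Python sketch (an illustration, not from the original) transliterates the ten byte ranges tabulated above into one bytes regexp and tests codepoints on both sides of each boundary:

```python
import re

# The ten byte-range productions for E3 85 B0 - F3 B0 A1 99.
RANGE = re.compile(
    b'\xE3\x85[\xB0-\xBF]'
    b'|\xE3[\x86-\xBF][\x80-\xBF]'
    b'|[\xE4-\xEC][\x80-\xBF][\x80-\xBF]'
    b'|\xED[\x80-\x9F][\x80-\xBF]'
    b'|[\xEE-\xEF][\x80-\xBF][\x80-\xBF]'
    b'|\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]'
    b'|[\xF1-\xF2][\x80-\xBF][\x80-\xBF][\x80-\xBF]'
    b'|\xF3[\x80-\xAF][\x80-\xBF][\x80-\xBF]'
    b'|\xF3\xB0[\x80-\xA0][\x80-\xBF]'
    b'|\xF3\xB0\xA1[\x80-\x99]'
)

# Accept exactly the scalar values between U+3170 and U+F0859.
for cp, inside in [(0x316F, False), (0x3170, True), (0xD7FF, True),
                   (0xE000, True), (0xF0859, True), (0xF085A, False),
                   (0x41, False), (0x10FFFF, False)]:
    got = RANGE.fullmatch(chr(cp).encode('utf-8')) is not None
    assert got == inside, hex(cp)
```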

Excluded Character Ranges

~['ナ'-    'ノ' ]

...

Regular Expression Transcoding Examples

This section demonstrates the process for transcoding example regular character expressions into regular byte expressions.

Simple 3-byte Sequence, Range

A simple first example explores a small range in some 3-byte UTF-8 characters where only the last byte varies.

chars: [    'ロ',    'ナ'-    'ノ' ]
UCS:   ['\u30ED','\u30CA'-'\u30CE' ]
bytes: [E3 83 AD, E3 83 8A-E3 83 8E]

group =>
('\xE3', '\x83', ['\xAD', '\x8A'-'\x8E'])
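Because only the last byte varies, the grouped form reduces to one lead pair and a final-byte class. A quick check as a Python bytes regexp (illustrative; 'ハ', U+30CF, is an assumed out-of-range probe):

```python
import re

# Lead bytes E3 83 followed by AD or the range 8A-8E.
RO_NA_NO = re.compile(b'\xE3\x83[\xAD\x8A-\x8E]')

for ch, inside in [('ロ', True), ('ナ', True), ('ノ', True), ('ハ', False)]:
    assert bool(RO_NA_NO.fullmatch(ch.encode('utf-8'))) == inside
```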

Semi-distributed 3-byte Sequence, Range

Here, the last two bytes vary.

chars: [    'ナ'-    'ㄅ' ]
UCS:   ['\u30CA'-'\u3105' ]
bytes: [E3 83 8A-E3 84 85]

group =>
('\xE3', (('\x83', ['\x8A'-'\xBF']) | ('\x84', ['\x80'-'\x85'])))
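The same grouping, checked as a Python bytes regexp (illustration only):

```python
import re

# Lead byte E3, then either 83 [8A-BF] or 84 [80-85].
NA_TO_3105 = re.compile(b'\xE3(?:\x83[\x8A-\xBF]|\x84[\x80-\x85])')

for cp, inside in [(0x30C9, False), (0x30CA, True), (0x30FF, True),
                   (0x3105, True), (0x3106, False)]:
    assert bool(NA_TO_3105.fullmatch(chr(cp).encode('utf-8'))) == inside
```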

Distributed 3-byte Range

This example varies more than just the last byte.

chars: [    '丁'-    '龥' ]
UCS:   ['\u4E01'-'\u9FA5' ]
bytes: [E4 B8 81-E9 BE A5]

atomize =>
  ('\xE4', (('\xB8', ['\x81'-'\xFF'])
            | (['\xB9'-'\xFF'],['\x00'-'\xFF']))) 
| (['\xE5'-'\xE8'], ['\x00'-'\xFF'], ['\x00'-'\xFF'])
| ('\xE9', ((['\x00'-'\xBD'], ['\x00'-'\xFF'])
            | ('\xBE', ['\x00'-'\xA5'])))

validate =>

  ('\xE4', (('\xB8', ['\x81'-'\xBF'])
            | (['\xB9'-'\xBF'],['\x80'-'\xBF']))) 
| (['\xE5'-'\xE8'], ['\x80'-'\xBF'], ['\x80'-'\xBF'])
| ('\xE9', ((['\x80'-'\xBD'], ['\x80'-'\xBF'])
            | ('\xBE', ['\x80'-'\xA5'])))
group =>
...
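Before the elided grouping step, the validated expression can already be spot-checked as a Python bytes regexp (illustration only; probe codepoints are my choices):

```python
import re

# The validated byte expression for U+4E01 - U+9FA5.
CJK_4E01_9FA5 = re.compile(
    b'\xE4(?:\xB8[\x81-\xBF]|[\xB9-\xBF][\x80-\xBF])'
    b'|[\xE5-\xE8][\x80-\xBF][\x80-\xBF]'
    b'|\xE9(?:[\x80-\xBD][\x80-\xBF]|\xBE[\x80-\xA5])'
)

for cp, inside in [(0x4E00, False), (0x4E01, True), (0x7000, True),
                   (0x9FA5, True), (0x9FA6, False)]:
    assert bool(CJK_4E01_9FA5.fullmatch(chr(cp).encode('utf-8'))) == inside
```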

Varied Byte Range

This example varies the number of bytes encoding the characters. It defines a probably impractical range from LATIN SMALL LETTER Q to an arbitrary place in supplementary private use area B, called here PRIVATE USE MY Q:

chars: [     'q'-     '|q|'   ]
UCS:   ['\u0071'-'\u10D42A'   ]
bytes: [      71-F4 8D 90 AA]

atomize =>
    ['\x71'-'\x7F']
 | (['\xC0'-'\xDF'], ['\x00'-'\xFF'])
 | (['\xE0'-'\xEF'], ['\x00'-'\xFF'], ['\x00'-'\xFF'])
 | (['\xF0'-'\xF3'], ['\x00'-'\xFF'], ['\x00'-'\xFF'], ['\x00'-'\xFF'])
 | ('\xF4', ((['\x00'-'\x8C'], ['\x00'-'\xFF'], ['\x00'-'\xFF'])
             | ('\x8D', ((['\x00'-'\x8F'], ['\x00'-'\xFF'])
                         | ('\x90', ['\x00'-'\xAA'])))))

...
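The validate and group steps are elided above. One plausible result, derived by me following the earlier examples rather than taken from the document, can still be spot-checked as a Python bytes regexp: the single-byte bottom is trimmed to [71-7F], the middle lengths take the validating template's productions whole, and the F4 top is subdivided at each boundary byte:

```python
import re

# A derived (not the document's) validated expression for U+0071 - U+10D42A.
Q_RANGE = re.compile(
    b'[\x71-\x7F]'
    b'|[\xC2-\xDF][\x80-\xBF]'
    b'|\xE0[\xA0-\xBF][\x80-\xBF]'
    b'|[\xE1-\xEC][\x80-\xBF][\x80-\xBF]'
    b'|\xED[\x80-\x9F][\x80-\xBF]'
    b'|[\xEE-\xEF][\x80-\xBF][\x80-\xBF]'
    b'|\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]'
    b'|[\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]'
    b'|\xF4(?:[\x80-\x8C][\x80-\xBF][\x80-\xBF]'
    b'|\x8D(?:[\x80-\x8F][\x80-\xBF]|\x90[\x80-\xAA]))'
)

for cp, inside in [(0x70, False), (0x71, True), (0x30CA, True),
                   (0x10D42A, True), (0x10D42B, False), (0x10FFFF, False)]:
    assert bool(Q_RANGE.fullmatch(chr(cp).encode('utf-8'))) == inside
```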

Implementations

This technique is implemented by an impressive list of one: the compileUTF8 perl script.

Conclusions

piece of cake

Acknowledgments

Thanks to Martin Dürst, who taught me about UTF-8 validation and worked through some of the problems with me. He also suggested that lex worked well with byte tables and that changing that would be more difficult than describing the encoded bytes.

Appendix A - UTF-8 Encoding Examples

KATAKANA LETTER RO

                         'ロ'
                     '\u30ED'
          0011 0000 1110 1101
encode as utf-8 =>	      
1110      10        10       
     0011   00 0011   10 1101
1110 0011 1000 0011 1010 1101
 ('\xe3'    '\x83'    '\xad')

KATAKANA LETTER NA

                         'ナ'
                     '\u30CA'
          0011 0000 1100 1010
encode as utf-8 =>
1110      10        10       
     0011   00 0011   00 1010
1110 0011 1000 0011 1000 1010
 ('\xe3'    '\x83'    '\x8a')

KATAKANA LETTER NO

                         'ノ'
                     '\u30CE'
          0011 0000 1100 1110
encode as utf-8 =>
1110      10        10       
     0011   00 0011   00 1110
1110 0011 1000 0011 1000 1110
 ('\xe3'    '\x83'    '\x8e')

HANGUL LETTER MIEUM-PANSIOS

                         'ㅰ'
                     '\u3170'
          0011 0001 0111 0000
encode as utf-8 =>
1110      10        10       
     0011   00 0101   11 0000
1110 0011 1000 0101 1011 0000
 ('\xe3'    '\x85'    '\xb0')

CJK UNIFIED IDEOGRAPH 4E01 (CHYOU)

                         '丁'
                     '\u4E01'
          0100 1110 0000 0001
encode as utf-8 =>
1110      10        10       
     0100   11 1000   00 0001
1110 0100 1011 1000 1000 0001
 ('\xe4'    '\xb8'    '\x81')

CJK UNIFIED IDEOGRAPH 9FA5 (YAKU)

                        '龥' 
                    '\u9FA5' 
          1001 1111 1010 0101
encode as utf-8 =>
1110      10        10       
     1001   11 1110   10 0101
1110 1001 1011 1110 1010 0101
 ('\xe9'    '\xbe'    '\xa5')

LATIN SMALL LETTER Q

      'q'
 '\u0071'
0111 0001
encode as utf-8 =>
0        
 111 0001
0111 0001
 ('\x71')

PRIVATE USE MY Q

                                 '|q|' 
                            '\u10D42A'
          0001 0000 1101 0100 0010 1010
encode as utf-8 =>
1111 0    10        10        10       
      100   00 1101   01 0000   10 1010
1111 0100 1000 1101 1001 0000 1010 1010
 ('\xf4'    '\x8d'    '\x90'    '\xaa')

References

[UTF8] RFC3629 - UTF-8, a transformation format of ISO 10646, F. Yergeau (See http://ietfreport.isoc.org/idref/rfc3629/ .)
[UTF8PROPS] Properties and Promises of UTF-8, M. Dürst (See http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf .)
