Abstract

While encodings have been defined to some extent, implementations have not always implemented them in the same way, have not always used the same labels, and often differ in dealing with undefined and former proprietary areas of encodings. This specification attempts to fill those gaps so that new implementations do not have to reverse engineer encoding implementations of the market leaders and existing implementations can converge.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document was published by the Internationalization Working Group as a First Public Working Draft. This document is intended to become a W3C Recommendation.

Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This is a snapshot of the editor's document, as of the date shown on the title page, published after discussion with the WHATWG editors. No changes have been made in the body of the W3C draft other than to align with W3C house styles. The primary reason that W3C is publishing this document is so that HTML5 and other specifications may normatively refer to a stable W3C Recommendation.

Send feedback to www-international@w3.org (archives) or file a bug (open bugs). IRC: #whatwg on Freenode. All comments are welcome. The editors will manage comments in their draft. Once stable, changes will appear in future W3C Working Drafts until the document becomes a Recommendation.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

1 Preface

While encodings have been defined to some extent, implementations have not always implemented them in the same way, have not always used the same labels, and often differ in dealing with undefined and former proprietary areas of encodings. This specification attempts to fill those gaps so that new implementations do not have to reverse engineer encoding implementations of the market leaders and existing implementations can converge.

This specification is primarily intended for dealing with legacy content; it requires new content and formats to use the utf-8 encoding exclusively.

2 Conformance

All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]

Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)

User agents may impose implementation-specific limits on otherwise unconstrained inputs, e.g. to prevent denial of service attacks, to guard against running out of memory, or to work around platform-specific limitations.

3 Terminology

Hexadecimal numbers are prefixed with "0x".

In equations, all numbers are integers, addition is represented by "+", subtraction by "−", multiplication by "×", division by "/", calculating the remainder of a division (also known as modulo) by "%", exponentiation by "b^n" (b raised to the power n), arithmetic left shifts by "<<", arithmetic right shifts by ">>", bitwise AND by "&", and bitwise OR by "|".

A byte is a sequence of eight bits, represented as a double-digit hexadecimal number in the range 0x00 to 0xFF.

A code point is a Unicode code point and is represented as a four-to-six digit hexadecimal number, typically prefixed with "U+". In equations and indexes code points are prefixed with "0x". [UNICODE]

The ASCII whitespace are code points U+0009, U+000A, U+000C, U+000D, and U+0020.

The ASCII digits are code points in the range U+0030 to U+0039.

A string is a sequence of code points.

Comparing two strings in an ASCII case-insensitive manner means comparing them exactly, code point for code point, except that the characters in the range U+0041 to U+005A (i.e. LATIN CAPITAL LETTER A to LATIN CAPITAL LETTER Z) and the corresponding characters in the range U+0061 to U+007A (i.e. LATIN SMALL LETTER A to LATIN SMALL LETTER Z) are considered to also match.

4 Encodings

An encoding defines a mapping from a code point sequence to a byte sequence (and vice versa). Each encoding has a name, and one or more labels.

Each encoding also has a decoder and encoder algorithm.

A decoder algorithm takes a byte stream and emits a code point stream. The byte pointer is initially zero, pointing to the first byte in the stream. It cannot be negative. It can be increased and decreased to point to other bytes in the stream. The EOF byte is a conceptual byte representing the end of the stream. The byte pointer cannot point beyond the EOF byte. The EOF code point is a conceptual code point that is emitted once the byte stream is handled in its entirety. A decoder must be invoked again when the word continue is used or when one or more code points are emitted of which none is the EOF code point.

An encoder algorithm takes a code point stream and emits a byte stream. It fails when a code point is passed for which it does not have a corresponding byte (sequence). Analogously to a decoder, it has a code point pointer. An encoder must be invoked again when the word continue is used or when one or more bytes are emitted of which none is the EOF byte.

A decoder and encoder both have an associated error handling mode as well as an error algorithm. For a decoder the error handling mode is either replacement (default) or fatal. For an encoder the error handling mode is one of fatal (default), URL, or <form>.

A decoder decoder's error algorithm is as follows:

  1. If decoder's error handling mode is replacement, emit code point U+FFFD.

  2. Otherwise, terminate decoder with failure.

An XML processor would set its decoder's error handling mode to fatal. [XML]

An encoder encoder's error algorithm takes a code point c and is as follows:

  1. If encoder's error handling mode is fatal, terminate encoder with failure.

  2. Otherwise, if encoder's error handling mode is URL, emit byte 0x3F.

  3. Otherwise, emit the result of running utf-8 encode on U+0026, U+0023, followed by the shortest sequence of ASCII digits representing c in base ten, followed by U+003B.

The encoder's error handling modes URL and <form> exist because URLs and HTML forms require non-terminating encoders and have legacy handling whenever an error is reached. [URL] [HTML]
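
This example is non-normative. It is a minimal sketch of the two error algorithms above, assuming errors are reported by throwing and emitted values are returned as plain numbers. The function names decoderError and encoderError are illustrative and not defined by this specification.

```javascript
// Decoder error: emit U+FFFD in replacement mode, otherwise fail.
function decoderError(mode) {
  if (mode === "replacement") {
    return 0xFFFD;
  }
  throw new Error("decoder terminated with failure");
}

// Encoder error for code point c. mode is "fatal", "URL", or "<form>".
function encoderError(mode, c) {
  if (mode === "fatal") {
    throw new Error("encoder terminated with failure at code point " + c);
  }
  if (mode === "URL") {
    return [0x3F]; // "?"
  }
  // "<form>" mode: U+0026 U+0023, the decimal digits of c, then U+003B.
  // Every character is ASCII, so its utf-8 byte equals its code point.
  const text = "&#" + c.toString(10) + ";";
  return Array.from(text, (ch) => ch.charCodeAt(0));
}
```

For U+20AC in <form> mode this yields the bytes of "&#8364;".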


The table below lists all encodings and their labels user agents must support. User agents must not support any other encodings or labels.

Authors must use the utf-8 encoding and must use the "utf-8" label to identify it.

New protocols and formats must use the utf-8 encoding exclusively. If these protocols and formats need to expose the encoding's label, they must expose it as "utf-8".

To get an encoding from a string label, run these steps:

  1. Remove any leading and trailing ASCII whitespace from label.

  2. If label is an ASCII case-insensitive match for any of the labels listed in the table below, return the corresponding encoding, and failure otherwise.

In violation of section 1.4 of Unicode Technical Standard #22 this is a much simpler and more restrictive matching algorithm, as that is found to be necessary to be compatible with deployed content.
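
This example is non-normative. It sketches the get an encoding steps above, assuming a lookup map that holds only a handful of the labels from the table below. The names LABELS and getEncoding are illustrative, and null stands in for failure.

```javascript
// Illustrative excerpt only: the full table lists many more labels.
const LABELS = new Map([
  ["unicode-1-1-utf-8", "utf-8"],
  ["utf-8", "utf-8"],
  ["utf8", "utf-8"],
  ["latin1", "windows-1252"],
  ["utf-16", "utf-16le"],
]);

function getEncoding(label) {
  // Step 1: strip leading and trailing ASCII whitespace
  // (U+0009, U+000A, U+000C, U+000D, U+0020) and nothing else.
  const trimmed = label.replace(/^[\t\n\f\r ]+|[\t\n\f\r ]+$/g, "");
  // Step 2: ASCII case-insensitive match. Only A-Z are folded, so
  // non-ASCII characters are never case-mapped onto ASCII letters.
  const folded = trimmed.replace(/[A-Z]/g, (ch) => ch.toLowerCase());
  return LABELS.has(folded) ? LABELS.get(folded) : null;
}
```

Folding only A-Z (instead of calling toLowerCase on the whole string) keeps the match restricted to the ASCII range, as the terminology section requires.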

Name Labels
The Encoding
utf-8 "unicode-1-1-utf-8"
"utf-8"
"utf8"
Legacy single-byte encodings
ibm866 "866"
"cp866"
"csibm866"
"ibm866"
iso-8859-2 "csisolatin2"
"iso-8859-2"
"iso-ir-101"
"iso8859-2"
"iso88592"
"iso_8859-2"
"iso_8859-2:1987"
"l2"
"latin2"
iso-8859-3 "csisolatin3"
"iso-8859-3"
"iso-ir-109"
"iso8859-3"
"iso88593"
"iso_8859-3"
"iso_8859-3:1988"
"l3"
"latin3"
iso-8859-4 "csisolatin4"
"iso-8859-4"
"iso-ir-110"
"iso8859-4"
"iso88594"
"iso_8859-4"
"iso_8859-4:1988"
"l4"
"latin4"
iso-8859-5 "csisolatincyrillic"
"cyrillic"
"iso-8859-5"
"iso-ir-144"
"iso8859-5"
"iso88595"
"iso_8859-5"
"iso_8859-5:1988"
iso-8859-6 "arabic"
"asmo-708"
"csiso88596e"
"csiso88596i"
"csisolatinarabic"
"ecma-114"
"iso-8859-6"
"iso-8859-6-e"
"iso-8859-6-i"
"iso-ir-127"
"iso8859-6"
"iso88596"
"iso_8859-6"
"iso_8859-6:1987"
iso-8859-7 "csisolatingreek"
"ecma-118"
"elot_928"
"greek"
"greek8"
"iso-8859-7"
"iso-ir-126"
"iso8859-7"
"iso88597"
"iso_8859-7"
"iso_8859-7:1987"
"sun_eu_greek"
iso-8859-8 "csiso88598e"
"csisolatinhebrew"
"hebrew"
"iso-8859-8"
"iso-8859-8-e"
"iso-ir-138"
"iso8859-8"
"iso88598"
"iso_8859-8"
"iso_8859-8:1988"
"visual"
iso-8859-8-i "csiso88598i"
"iso-8859-8-i"
"logical"
iso-8859-10 "csisolatin6"
"iso-8859-10"
"iso-ir-157"
"iso8859-10"
"iso885910"
"l6"
"latin6"
iso-8859-13 "iso-8859-13"
"iso8859-13"
"iso885913"
iso-8859-14 "iso-8859-14"
"iso8859-14"
"iso885914"
iso-8859-15 "csisolatin9"
"iso-8859-15"
"iso8859-15"
"iso885915"
"iso_8859-15"
"l9"
iso-8859-16 "iso-8859-16"
koi8-r "cskoi8r"
"koi"
"koi8"
"koi8-r"
"koi8_r"
koi8-u "koi8-u"
macintosh "csmacintosh"
"mac"
"macintosh"
"x-mac-roman"
windows-874 "dos-874"
"iso-8859-11"
"iso8859-11"
"iso885911"
"tis-620"
"windows-874"
windows-1250 "cp1250"
"windows-1250"
"x-cp1250"
windows-1251 "cp1251"
"windows-1251"
"x-cp1251"
windows-1252 "ansi_x3.4-1968"
"ascii"
"cp1252"
"cp819"
"csisolatin1"
"ibm819"
"iso-8859-1"
"iso-ir-100"
"iso8859-1"
"iso88591"
"iso_8859-1"
"iso_8859-1:1987"
"l1"
"latin1"
"us-ascii"
"windows-1252"
"x-cp1252"
windows-1253 "cp1253"
"windows-1253"
"x-cp1253"
windows-1254 "cp1254"
"csisolatin5"
"iso-8859-9"
"iso-ir-148"
"iso8859-9"
"iso88599"
"iso_8859-9"
"iso_8859-9:1989"
"l5"
"latin5"
"windows-1254"
"x-cp1254"
windows-1255 "cp1255"
"windows-1255"
"x-cp1255"
windows-1256 "cp1256"
"windows-1256"
"x-cp1256"
windows-1257 "cp1257"
"windows-1257"
"x-cp1257"
windows-1258 "cp1258"
"windows-1258"
"x-cp1258"
x-mac-cyrillic "x-mac-cyrillic"
"x-mac-ukrainian"
Legacy multi-byte Chinese (simplified) encodings
gb18030 "chinese"
"csgb2312"
"csiso58gb231280"
"gb18030"
"gb2312"
"gb_2312"
"gb_2312-80"
"gbk"
"iso-ir-58"
"x-gbk"
hz-gb-2312 "hz-gb-2312"
Legacy multi-byte Chinese (traditional) encodings
big5 "big5"
"big5-hkscs"
"cn-big5"
"csbig5"
"x-x-big5"
Legacy multi-byte Japanese encodings
euc-jp "cseucpkdfmtjapanese"
"euc-jp"
"x-euc-jp"
iso-2022-jp "csiso2022jp"
"iso-2022-jp"
shift_jis "csshiftjis"
"ms_kanji"
"shift-jis"
"shift_jis"
"sjis"
"windows-31j"
"x-sjis"
Legacy multi-byte Korean encodings
euc-kr "cseuckr"
"csksc56011987"
"euc-kr"
"iso-ir-149"
"korean"
"ks_c_5601-1987"
"ks_c_5601-1989"
"ksc5601"
"ksc_5601"
"windows-949"
Legacy miscellaneous encodings
replacement "csiso2022kr"
"iso-2022-cn"
"iso-2022-cn-ext"
"iso-2022-kr"
utf-16be "utf-16be"
utf-16le "utf-16"
"utf-16le"
x-user-defined "x-user-defined"

All encodings and their labels are also available as a non-normative encodings.json resource.

5 Indexes

Most legacy encodings make use of an index. An index is an ordered list of pointers and corresponding code points. Within an index pointers are unique and code points can be duplicated.

To find the pointers and their corresponding code points in an index, let lines be the result of splitting the resource's contents on U+000A. Then remove each item in lines that is the empty string or starts with U+0023. Then the pointers and their corresponding code points are found by splitting each item in lines on U+0009. The first subitem is the pointer (as a decimal number) and the second is the corresponding code point (as a hexadecimal number). Other subitems are not relevant.

The index code point for pointer in index is the code point corresponding to pointer in index, or null if pointer is not in index.

The index pointer for code point in index is the first pointer corresponding to code point in index, or null if code point is not in index.
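
This example is non-normative. It sketches how an index resource could be parsed and queried according to the three paragraphs above, assuming the resource text is already available as a string. The function names parseIndex, indexCodePoint, and indexPointer are illustrative.

```javascript
// Parse the text of an index resource into a Map from pointer to code point.
function parseIndex(text) {
  const index = new Map();
  for (const line of text.split("\n")) {
    // Skip empty lines and lines starting with U+0023 (#).
    if (line === "" || line.startsWith("#")) continue;
    const [pointer, codePoint] = line.split("\t");
    // Pointer is decimal; code point is hexadecimal (a "0x" prefix is accepted).
    index.set(parseInt(pointer, 10), parseInt(codePoint, 16));
  }
  return index;
}

// The index code point for pointer in index, or null.
function indexCodePoint(index, pointer) {
  return index.has(pointer) ? index.get(pointer) : null;
}

// The index pointer for code point in index: the first match, or null.
function indexPointer(index, codePoint) {
  for (const [pointer, cp] of index) {
    if (cp === codePoint) return pointer;
  }
  return null;
}
```

A Map iterates in insertion order, so as long as the resource lists pointers in ascending order the first matching pointer is returned for a duplicated code point.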

These are the indexes defined by this specification, excluding index single-byte:

Index Notes
index big5 index-big5.txt This matches the Big5 standard in combination with the Hong Kong Supplementary Character Set and other common extensions.
index euc-kr index-euc-kr.txt This matches the KS X 1001 standard and the Unified Hangul Code, more commonly known together as Windows Codepage 949.
index gb18030 index-gb18030.txt This matches the GB18030 standard for code points encoded as two bytes.
index gb18030 ranges index-gb18030-ranges.txt This index works differently from all others. Listing all code points would result in over a million items whereas they can be represented neatly in 207 ranges combined with trivial limit checks. It therefore only superficially matches the GB18030 standard for code points encoded as four bytes. See also index gb18030 ranges code point and index gb18030 ranges pointer below.
index jis0208 index-jis0208.txt This is the JIS X 0208 standard including formerly proprietary extensions from IBM and NEC.
index jis0212 index-jis0212.txt This is the JIS X 0212 standard.

The index gb18030 ranges code point for pointer is the return value of these steps:

  1. If pointer is greater than 39419 and less than 189000, or pointer is greater than 1237575, return null.

  2. Let offset be the last pointer in index gb18030 ranges that is equal to or less than pointer and let code point offset be its corresponding code point.

  3. Return a code point whose value is code point offset + pointer − offset.

The index gb18030 ranges pointer for code point is the return value of these steps:

  1. Let offset be the last code point in index gb18030 ranges that is equal to or less than code point and let pointer offset be its corresponding pointer.

  2. Return a pointer whose value is pointer offset + code point − offset.
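
This example is non-normative. It sketches the two range lookups above, assuming the ranges are passed in as an array of [pointer, code point] pairs sorted in ascending order. The parameter name ranges and both function names are illustrative.

```javascript
// ranges: array of [pointer, codePoint] pairs in ascending order.
function gb18030RangesCodePoint(ranges, pointer) {
  if ((pointer > 39419 && pointer < 189000) || pointer > 1237575) {
    return null;
  }
  let offset = 0;
  let codePointOffset = 0;
  // Keep the last entry whose pointer is equal to or less than pointer.
  for (const [p, cp] of ranges) {
    if (p <= pointer) {
      offset = p;
      codePointOffset = cp;
    }
  }
  return codePointOffset + pointer - offset;
}

function gb18030RangesPointer(ranges, codePoint) {
  let offset = 0;
  let pointerOffset = 0;
  // Keep the last entry whose code point is equal to or less than codePoint.
  for (const [p, cp] of ranges) {
    if (cp <= codePoint) {
      offset = cp;
      pointerOffset = p;
    }
  }
  return pointerOffset + codePoint - offset;
}
```

With a two-entry excerpt such as [[0, 0x0080], [189000, 0x10000]], pointer 189001 maps to U+10001 and back again. The real index has many more entries between those two.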

All indexes are also available as a non-normative indexes.json resource. (index gb18030 ranges has a slightly different format here, to be able to represent ranges.)

6 Decode and encode

The algorithms decode, utf-8 decode, and encode are intended for usage by other specifications. utf-8 decode is to be used by new formats. The get an encoding algorithm can be used first to turn a label into an encoding.

To decode a byte stream stream using fallback encoding encoding, run these steps:

  1. Let offset be 0.

  2. For each of the rows in the following table, starting with the first one and going down, if the first bytes of stream match all the bytes given in the first column (ergo stream contains at least two or three bytes), then set encoding to the encoding given in the cell in the second column of that row, and set offset to the offset given in the cell in the third column of that row.

    Byte order mark   Encoding   Offset
    0xEF 0xBB 0xBF    utf-8      3
    0xFE 0xFF         utf-16be   2
    0xFF 0xFE         utf-16le   2

    For compatibility with deployed content, the byte order mark (also known as BOM) is considered more authoritative than anything else.

  3. Return the result of running encoding's decoder with byte pointer set to offset, on stream.
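
This example is non-normative. It sketches steps 1 and 2 of the decode algorithm above, assuming the stream is available as an array or Uint8Array of bytes. The function name sniffBOM and the shape of its return value are illustrative.

```javascript
// Returns the encoding to use and the offset at which decoding starts.
function sniffBOM(bytes, fallbackEncoding) {
  const table = [
    { bom: [0xEF, 0xBB, 0xBF], encoding: "utf-8" },
    { bom: [0xFE, 0xFF], encoding: "utf-16be" },
    { bom: [0xFF, 0xFE], encoding: "utf-16le" },
  ];
  for (const { bom, encoding } of table) {
    // The stream must be long enough and match every byte of the mark.
    if (bytes.length >= bom.length && bom.every((b, i) => bytes[i] === b)) {
      return { encoding, offset: bom.length };
    }
  }
  // No byte order mark: keep the fallback encoding and start at 0.
  return { encoding: fallbackEncoding, offset: 0 };
}
```

A byte order mark overrides the fallback encoding, so sniffBOM(Uint8Array.of(0xFF, 0xFE, 0x41, 0x00), "windows-1252") selects utf-16le with offset 2.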

To utf-8 decode a byte stream stream, run these steps:

  1. Let offset be 0.

  2. If stream contains at least three bytes and its first three bytes match 0xEF 0xBB 0xBF, set offset to 3.

  3. Return the result of running the utf-8 decoder with byte pointer set to offset, on stream.


To encode a code point stream stream using encoding encoding, return the result of running encoding's encoder on stream.

To utf-8 encode a code point stream stream, return the result of encoding stream using encoding utf-8.

If the input to this algorithm stems from a DOMString, the "convert a DOMString to a sequence of Unicode characters" algorithm from Web IDL is to be used first.

7 API

This section uses terminology from the DOM, Typed Arrays, and Web IDL. Non-browser implementations are not required to implement this API. [DOM] [TYPEDARRAY] [WEBIDL]

The following example uses the TextEncoder object to encode an array of strings into an ArrayBuffer. The buffer holds the number of strings (as a 32-bit unsigned integer), followed by the byte length of the first encoded string (as a 32-bit unsigned integer), the encoded string data, the byte length of the second encoded string, its data, and so on. The integers are written in DataView's default big-endian byte order.

function encodeArrayOfStrings(strings, encoding) {
  var encoder, encoded, len, i, bytes, view, offset;

  encoder = new TextEncoder(encoding);
  encoded = [];

  len = Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < strings.length; i += 1) {
    len += Uint32Array.BYTES_PER_ELEMENT;
    encoded[i] = encoder.encode(strings[i]);
    len += encoded[i].byteLength;
  }

  bytes = new Uint8Array(len);
  view = new DataView(bytes.buffer);
  offset = 0;

  view.setUint32(offset, strings.length);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < encoded.length; i += 1) {
    len = encoded[i].byteLength;
    view.setUint32(offset, len);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    bytes.set(encoded[i], offset);
    offset += len;
  }
  return bytes.buffer;
}

The following example decodes an ArrayBuffer containing data encoded in the format produced by the previous example back into an array of strings.

function decodeArrayOfStrings(buffer, encoding) {
  var decoder, view, offset, num_strings, strings, i, len;

  decoder = new TextDecoder(encoding);
  view = new DataView(buffer);
  offset = 0;
  strings = [];

  num_strings = view.getUint32(offset);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (i = 0; i < num_strings; i += 1) {
    len = view.getUint32(offset);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    strings[i] = decoder.decode(
      new DataView(view.buffer, offset, len));
    offset += len;
  }
  return strings;
}

7.1 Interface TextDecoder

dictionary TextDecoderOptions {
  boolean fatal = false;
};

dictionary TextDecodeOptions {
  boolean stream = false;
};

[Constructor(optional DOMString label = "utf-8", optional TextDecoderOptions options)]
interface TextDecoder {
  readonly attribute DOMString encoding;
  DOMString decode();
  DOMString decode(ArrayBufferView input, optional TextDecodeOptions options);
};

A TextDecoder object has an associated encoding, encoding state, stream, BOM seen flag (initially unset), fatal flag (initially unset) and streaming flag (initially unset).

decoder = new TextDecoder([label = "utf-8" [, options]])

Returns a new TextDecoder object.

If label is either not a label or is a label for replacement, throws a TypeError.

decoder . encoding

Returns encoding's name.

decoder . decode([input [, options]])

Returns the result of running encoding's decoder. If options's stream is set to true the method can be invoked multiple times to process a fragmented stream.

If the fatal flag is set and encoding's decoder terminates with failure, throws an "EncodingError".

The TextDecoder(label, options) constructor must run these steps:

  1. Let encoding be the result of getting an encoding from label.

  2. If encoding is failure or replacement, throw a TypeError.

  3. Let dec be a new TextDecoder object.

  4. Set dec's encoding to encoding.

  5. Set dec's encoding state to the default values of dec's encoding's decoder's associated variables.

  6. If options's fatal member is true, set dec's fatal flag.

  7. Return dec.

The encoding attribute must return encoding's name.

The decode(input, options) method must run these steps:

  1. If the streaming flag is unset, set the encoding state to the default values of the encoding's decoder's associated variables, unset the BOM seen flag, and empty the stream.

  2. If options's stream is true, set the streaming flag, and unset the streaming flag otherwise.

  3. If input is given, then given input's buffer, byteOffset, and byteLength, append byteLength bytes from buffer, starting at byteOffset, to the stream.

  4. If the BOM seen flag is unset, and the stream either holds at least two bytes, or at least three bytes if the encoding is utf-8, then set the BOM seen flag, and for each of the rows in the following table, starting with the first one and going down, if the first bytes of the stream match all the bytes given in the first column, and the encoding matches the encoding given in the cell in the second column of that row, then remove those bytes at the start of the stream.

    Byte order mark   Encoding
    0xEF 0xBB 0xBF    utf-8
    0xFE 0xFF         utf-16be
    0xFF 0xFE         utf-16le

    This algorithm is intentionally different from the decode algorithm used by the rest of the platform to give API users more control.

  5. If the streaming flag is unset, append the EOF byte to the stream.

  6. Return the output of running encoding's decoder, with its error handling mode set to fatal if the fatal flag is set, on the stream. If encoding's decoder terminates with failure, throw an "EncodingError".

    In addition to the reason given above with respect to the byte order mark, this does not use the decode algorithm because that algorithm assumes a continuous stream rather than one delivered in fragments.
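
This example is non-normative. It sketches how the stream option can be used to decode a byte sequence that arrives in fragments, assuming a runtime that exposes a global TextDecoder. The function name decodeChunks is illustrative.

```javascript
// Decode an array of Uint8Array chunks as one continuous stream.
function decodeChunks(chunks, label = "utf-8") {
  const decoder = new TextDecoder(label);
  let text = "";
  for (const chunk of chunks) {
    // stream: true keeps incomplete byte sequences for the next call.
    text += decoder.decode(chunk, { stream: true });
  }
  // A final call without the stream option ends the stream.
  text += decoder.decode();
  return text;
}
```

With the three utf-8 bytes of U+20AC split across two chunks, the first call returns the empty string and the code point appears once the last byte arrives.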

7.2 Interface TextEncoder

dictionary TextEncodeOptions {
  boolean stream = false;
};

[Constructor(optional DOMString utfLabel = "utf-8")]
interface TextEncoder {
  readonly attribute DOMString encoding;
  Uint8Array encode(optional [EnsureUTF16] DOMString input = "", optional TextEncodeOptions options);
};

A TextEncoder object has an associated encoding, encoding state, stream, and streaming flag (initially unset).

encoder = new TextEncoder([utfLabel = "utf-8"])

Returns a new TextEncoder object.

If utfLabel is not a label for utf-8, utf-16be, or utf-16le, throws a TypeError.

encoder . encoding

Returns encoding's name.

encoder . encode([input [, options]])

Returns the result of running encoding's encoder. If options's stream is set to true, the method can be invoked multiple times to process a fragmented stream.

The TextEncoder(utfLabel) constructor must run these steps:

  1. Let encoding be the result of getting an encoding from utfLabel.

  2. If encoding is failure, or is none of utf-8, utf-16be, and utf-16le, throw a TypeError.

  3. Let enc be a new TextEncoder object.

  4. Set enc's encoding to encoding.

  5. Set enc's encoding state to the default values of enc's encoding's encoder's associated variables.

  6. Return enc.

The encoding attribute must return encoding's name.

The encode(input, options) method must run these steps:

  1. If the streaming flag is unset, then set the encoding state to the default values of the encoding's encoder's associated variables, and empty the stream.

  2. If options's stream is true, set the streaming flag, and unset the streaming flag otherwise.

  3. Append input to the stream.

  4. If the streaming flag is unset, append the EOF code point to the stream.

  5. Let bytes be the output of running encoding's encoder on the stream.

    This does not use the encode algorithm because that algorithm assumes a continuous stream rather than one delivered in fragments.

  6. Return a Uint8Array object wrapping an ArrayBuffer containing bytes.
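
This example is non-normative. It sketches the simplest use of the interface, assuming a runtime that exposes a global TextEncoder. Only the default utf-8 label is used; the stream option and the utf-16 labels described above are not exercised, since a given implementation might not support them. The function name encodeUtf8 is illustrative.

```javascript
// Encode a string with the default (utf-8) encoder and
// return the bytes as a plain array for easy inspection.
function encodeUtf8(text) {
  const encoder = new TextEncoder();
  return Array.from(encoder.encode(text));
}
```

For example, encodeUtf8("\u20AC") gives the three bytes 0xE2, 0x82, 0xAC.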

8 The encoding

8.1 utf-8

The utf-8 code point, utf-8 bytes seen, and utf-8 bytes needed concepts are all initially 0. The utf-8 lower boundary is initially 0x80 and the utf-8 upper boundary is initially 0xBF.

The utf-8 decoder (decoder for utf-8) is:

  1. Let byte be the value at byte pointer.

  2. If byte is the EOF byte and utf-8 bytes needed is not 0, set utf-8 bytes needed to 0 and run error.

  3. If byte is the EOF byte, emit the EOF code point.

  4. Increase the byte pointer by one.

  5. If utf-8 bytes needed is 0, based on byte:

    0x00 to 0x7F

    Emit a code point whose value is byte.

    0xC2 to 0xDF

    Set utf-8 bytes needed to 1 and utf-8 code point to byte − 0xC0.

    0xE0 to 0xEF
    1. If byte is 0xE0, set utf-8 lower boundary to 0xA0.

    2. If byte is 0xED, set utf-8 upper boundary to 0x9F.

    3. Set utf-8 bytes needed to 2 and utf-8 code point to byte − 0xE0.

    0xF0 to 0xF4
    1. If byte is 0xF0, set utf-8 lower boundary to 0x90.

    2. If byte is 0xF4, set utf-8 upper boundary to 0x8F.

    3. Set utf-8 bytes needed to 3 and utf-8 code point to byte − 0xF0.

    Otherwise

    Run error.

    Then (byte is in the range 0xC2 to 0xF4) set utf-8 code point to utf-8 code point × 64^(utf-8 bytes needed) and continue.

  6. If byte is not in the range utf-8 lower boundary to utf-8 upper boundary, run these substeps:

    1. Set utf-8 code point, utf-8 bytes needed, and utf-8 bytes seen to 0, set utf-8 lower boundary to 0x80, and set utf-8 upper boundary to 0xBF.

    2. Decrease the byte pointer by one.

    3. Run error.

  7. Set utf-8 lower boundary to 0x80 and utf-8 upper boundary to 0xBF.

  8. Increase utf-8 bytes seen by one and set utf-8 code point to utf-8 code point + (byte − 0x80) × 64^(utf-8 bytes needed − utf-8 bytes seen).

  9. If utf-8 bytes seen is not equal to utf-8 bytes needed, continue.

  10. Let code point be utf-8 code point.

  11. Set utf-8 code point, utf-8 bytes needed, and utf-8 bytes seen to 0.

  12. Emit a code point whose value is code point.

The constraints in the utf-8 decoder above match “Best Practices for Using U+FFFD” from the Unicode standard. No other behavior is permitted per the Encoding Standard (other algorithms that achieve the same result are obviously fine, even encouraged).
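
This example is non-normative. It sketches the utf-8 decoder above as a single loop over a complete byte array, assuming the end of the array plays the role of the EOF byte and that fatal mode is reported by throwing. The function name utf8Decode is illustrative.

```javascript
// Decode an array of bytes into an array of code points.
function utf8Decode(bytes, fatal = false) {
  const out = [];
  let codePoint = 0, seen = 0, needed = 0, lower = 0x80, upper = 0xBF;
  const error = () => {
    if (fatal) throw new Error("invalid utf-8");
    out.push(0xFFFD);
  };
  let i = 0;
  while (true) {
    if (i >= bytes.length) {
      // EOF: an unfinished sequence is an error.
      if (needed !== 0) { needed = 0; error(); }
      return out;
    }
    const byte = bytes[i++];
    if (needed === 0) {
      if (byte <= 0x7F) { out.push(byte); continue; }
      if (byte >= 0xC2 && byte <= 0xDF) {
        needed = 1; codePoint = byte - 0xC0;
      } else if (byte >= 0xE0 && byte <= 0xEF) {
        if (byte === 0xE0) lower = 0xA0; // rejects overlong forms
        if (byte === 0xED) upper = 0x9F; // rejects surrogates
        needed = 2; codePoint = byte - 0xE0;
      } else if (byte >= 0xF0 && byte <= 0xF4) {
        if (byte === 0xF0) lower = 0x90; // rejects overlong forms
        if (byte === 0xF4) upper = 0x8F; // rejects values above U+10FFFF
        needed = 3; codePoint = byte - 0xF0;
      } else { error(); continue; }
      codePoint = codePoint * 64 ** needed;
      continue;
    }
    if (byte < lower || byte > upper) {
      // Reset state and reprocess this byte as the start of a sequence.
      codePoint = needed = seen = 0; lower = 0x80; upper = 0xBF;
      i--; error(); continue;
    }
    lower = 0x80; upper = 0xBF;
    seen++;
    codePoint += (byte - 0x80) * 64 ** (needed - seen);
    if (seen !== needed) continue;
    out.push(codePoint);
    codePoint = needed = seen = 0;
  }
}
```

The lower and upper boundaries only ever constrain the first continuation byte, which is why they are reset as soon as one byte has been accepted.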

The utf-8 encoder (encoder for utf-8) is:

  1. Let code point be the value at code point pointer.

  2. If code point is in the range 0xD800 to 0xDFFF, run error for code point.

  3. If code point is the EOF code point, emit the EOF byte.

  4. Increase code point pointer by one.

  5. If code point is in the range U+0000 to U+007F, emit a byte whose value is code point.

  6. Set count and offset based on the range code point is in:

    U+0080 to U+07FF

    1 and 0xC0

    U+0800 to U+FFFF

    2 and 0xE0

    U+10000 to U+10FFFF

    3 and 0xF0

  7. Let bytes be a list of bytes whose first byte is code point / 64^count + offset.

  8. Run these substeps while count is greater than 0:

    1. Set temp to code point / 64^(count − 1).

    2. Append to bytes 0x80 + (temp % 64).

    3. Decrease count by one.

  9. Emit bytes bytes, in list order.
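
This example is non-normative. It sketches the utf-8 encoder above for a single code point, assuming the error algorithm is in fatal mode and is reported by throwing. The function name utf8EncodeCodePoint is illustrative.

```javascript
// Encode one code point into its utf-8 bytes.
function utf8EncodeCodePoint(codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a code point");
  }
  if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
    throw new RangeError("surrogates cannot be encoded");
  }
  if (codePoint <= 0x7F) return [codePoint];
  let count, offset;
  if (codePoint <= 0x7FF) { count = 1; offset = 0xC0; }
  else if (codePoint <= 0xFFFF) { count = 2; offset = 0xE0; }
  else { count = 3; offset = 0xF0; }
  // Lead byte: the top bits of the code point plus the length marker.
  const bytes = [Math.floor(codePoint / 64 ** count) + offset];
  while (count > 0) {
    // Each continuation byte carries the next six bits.
    const temp = Math.floor(codePoint / 64 ** (count - 1));
    bytes.push(0x80 + (temp % 64));
    count--;
  }
  return bytes;
}
```

Math.floor stands in for the integer division used in the steps above.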

9 Legacy single-byte encodings

An encoding where each byte is either a single code point or nothing, is a single-byte encoding. Single-byte encodings share the decoder and encoder. Index single-byte, as referenced by the single-byte decoder and single-byte encoder, is defined by the following table, and depends on the single-byte encoding in use. All but two single-byte encodings have a unique index.

Name Index
ibm866 index-ibm866.txt
iso-8859-2 index-iso-8859-2.txt
iso-8859-3 index-iso-8859-3.txt
iso-8859-4 index-iso-8859-4.txt
iso-8859-5 index-iso-8859-5.txt
iso-8859-6 index-iso-8859-6.txt
iso-8859-7 index-iso-8859-7.txt
iso-8859-8 index-iso-8859-8.txt
iso-8859-8-i index-iso-8859-8.txt
iso-8859-10 index-iso-8859-10.txt
iso-8859-13 index-iso-8859-13.txt
iso-8859-14 index-iso-8859-14.txt
iso-8859-15 index-iso-8859-15.txt
iso-8859-16 index-iso-8859-16.txt
koi8-r index-koi8-r.txt
koi8-u index-koi8-u.txt
macintosh index-macintosh.txt
windows-874 index-windows-874.txt
windows-1250 index-windows-1250.txt
windows-1251 index-windows-1251.txt
windows-1252 index-windows-1252.txt
windows-1253 index-windows-1253.txt
windows-1254 index-windows-1254.txt
windows-1255 index-windows-1255.txt
windows-1256 index-windows-1256.txt
windows-1257 index-windows-1257.txt
windows-1258 index-windows-1258.txt
x-mac-cyrillic index-x-mac-cyrillic.txt

iso-8859-8 and iso-8859-8-i are distinct encoding names, because iso-8859-8 has influence on the layout direction. And although historically this might have been the case for iso-8859-6 and "iso-8859-6-i" as well, that is no longer true.


The single-byte decoder (decoder for single-byte encodings) is:

  1. Let byte be the value at byte pointer.

  2. If byte is the EOF byte, emit the EOF code point.

  3. Increase the byte pointer by one.

  4. If byte is in the range 0x00 to 0x7F, emit a code point whose value is byte.

  5. Let code point be the index code point for byte − 0x80 in index single-byte.

  6. If code point is null, run error.

  7. Emit a code point whose value is code point.

The single-byte encoder (encoder for single-byte encodings) is:

  1. Let code point be the value at code point pointer.

  2. If code point is the EOF code point, emit the EOF byte.

  3. Increase code point pointer by one.

  4. If code point is in the range U+0000 to U+007F, emit a byte whose value is code point.

  5. Let pointer be the index pointer for code point in index single-byte.

  6. If pointer is null, run error for code point.

  7. Emit a byte whose value is pointer + 0x80.
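
This example is non-normative. It sketches the single-byte decoder and encoder above, assuming index single-byte is supplied as a Map from pointer to code point, the decoder runs in replacement mode, and the encoder runs in fatal mode (reported by throwing). Both function names are illustrative.

```javascript
// index: Map from pointer (byte minus 0x80) to code point.
function singleByteDecode(bytes, index) {
  const out = [];
  for (const byte of bytes) {
    if (byte <= 0x7F) { out.push(byte); continue; }
    const pointer = byte - 0x80;
    // A missing pointer is an error; replacement mode emits U+FFFD.
    out.push(index.has(pointer) ? index.get(pointer) : 0xFFFD);
  }
  return out;
}

function singleByteEncode(codePoints, index) {
  const out = [];
  for (const codePoint of codePoints) {
    if (codePoint <= 0x7F) { out.push(codePoint); continue; }
    let pointer = null;
    // The first pointer that maps to this code point wins.
    for (const [p, cp] of index) {
      if (cp === codePoint) { pointer = p; break; }
    }
    if (pointer === null) {
      throw new Error("no byte for code point " + codePoint);
    }
    out.push(pointer + 0x80);
  }
  return out;
}
```

With a two-entry excerpt of the windows-1252 index, new Map([[0, 0x20AC], [105, 0x00E9]]), byte 0x80 decodes to U+20AC and U+00E9 encodes to byte 0xE9. The full index covers all 128 pointers.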

10 Legacy multi-byte Chinese (simplified) encodings

10.1 gb18030

The gb18030 first, gb18030 second, and gb18030 third, are all initially 0x00.

The gb18030 decoder (decoder for gb18030) is:

  1. Let byte be the value at byte pointer.

  2. If byte is the EOF byte and gb18030 first, gb18030 second, and gb18030 third are 0x00, emit the EOF code point.

  3. If byte is the EOF byte, and gb18030 first, gb18030 second, or gb18030 third is not 0x00, set gb18030 first, gb18030 second, and gb18030 third to 0x00, and run error.

  4. Increase the byte pointer by one.

  5. If gb18030 third is not 0x00, run these substeps:

    1. Let code point be null.

    2. If byte is in the range 0x30 to 0x39, set code point to the index gb18030 ranges code point for (((gb18030 first − 0x81) × 10 + gb18030 second − 0x30) × 126 + gb18030 third − 0x81) × 10 + byte − 0x30.

    3. Set gb18030 first, gb18030 second, and gb18030 third to 0x00.

    4. If code point is null, decrease the byte pointer by three and run error.

    5. Emit a code point whose value is code point.

  6. If gb18030 second is not 0x00, run these substeps:

    1. If byte is in the range 0x81 to 0xFE, set gb18030 third to byte and continue.

    2. Decrease the byte pointer by two, set gb18030 first and gb18030 second to 0x00, and run error.

  7. If gb18030 first is not 0x00, run these substeps:

    1. If byte is in the range 0x30 to 0x39, set gb18030 second to byte and continue.

    2. Let lead be gb18030 first, let pointer be null, and set gb18030 first to 0x00.

    3. Let offset be 0x40 if byte is less than 0x7F and 0x41 otherwise.

    4. If byte is in the range 0x40 to 0x7E or 0x80 to 0xFE, set pointer to (lead − 0x81) × 190 + (byte − offset).

    5. Let code point be null if pointer is null and the index code point for pointer in index gb18030 otherwise.

    6. If pointer is null, decrease the byte pointer by one.

    7. If code point is null, run error.

    8. Emit a code point whose value is code point.

  8. If byte is in the range 0x00 to 0x7F, emit a code point whose value is byte.

  9. If byte is 0x80, emit code point U+20AC.

  10. If byte is in the range 0x81 to 0xFE, set gb18030 first to byte and continue.

  11. Run error.

The gb18030 encoder (encoder for gb18030) is:

  1. Let code point be the value at code point pointer.

  2. If code point is the EOF code point, emit the EOF byte.

  3. Increase code point pointer by one.

  4. If code point is in the range U+0000 to U+007F, emit a byte whose value is code point.

  5. Let pointer be the index pointer for code point in index gb18030.

  6. If pointer is not null, run these substeps:

    1. Let lead be pointer / 190 + 0x81.

    2. Let trail be pointer % 190.

    3. Let offset be 0x40 if trail is less than 0x3F and 0x41 otherwise.

    4. Emit two bytes whose values are lead and trail + offset.

  7. Set pointer to the index gb18030 ranges pointer for code point.

  8. Let byte1 be pointer / 10 / 126 / 10.

  9. Set pointer to pointer − byte1 × 10 × 126 × 10.

  10. Let byte2 be pointer / 10 / 126.

  11. Set pointer to pointer − byte2 × 10 × 126.

  12. Let byte3 be pointer / 10.

  13. Let byte4 be pointer − byte3 × 10.

  14. Emit four bytes whose values are byte1 + 0x81, byte2 + 0x30, byte3 + 0x81, byte4 + 0x30.
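As a rough sketch of the byte arithmetic in the encoder steps, the two helpers below turn a pointer into its two-byte or four-byte form. The function names are hypothetical, the index lookups that produce the pointer are assumed to have happened already, and division is taken to be integer division.

```python
def gb18030_two_bytes(pointer):
    """Two-byte form for a pointer found in index gb18030."""
    lead = pointer // 190 + 0x81
    trail = pointer % 190
    offset = 0x40 if trail < 0x3F else 0x41
    return bytes([lead, trail + offset])


def gb18030_four_bytes(pointer):
    """Four-byte form for a pointer found in index gb18030 ranges."""
    byte1 = pointer // 10 // 126 // 10
    pointer -= byte1 * 10 * 126 * 10
    byte2 = pointer // 10 // 126
    pointer -= byte2 * 10 * 126
    byte3 = pointer // 10
    byte4 = pointer - byte3 * 10
    return bytes([byte1 + 0x81, byte2 + 0x30, byte3 + 0x81, byte4 + 0x30])
```

For instance, `gb18030_four_bytes(0)` yields the bytes 0x81 0x30 0x81 0x30, the lowest four-byte sequence.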

10.2 hz-gb-2312

The hz-gb-2312 flag is initially unset. The hz-gb-2312 lead is initially 0x00.

The hz-gb-2312 decoder (decoder for hz-gb-2312) is:

  1. Let byte be the value at byte pointer.

  2. If byte is the EOF byte and hz-gb-2312 lead is 0x00, emit the EOF code point.

  3. If byte is the EOF byte and hz-gb-2312 lead is not 0x00, set hz-gb-2312 lead to 0x00 and run error.

  4. Increase the byte pointer by one.

  5. If hz-gb-2312 lead is 0x7E, set hz-gb-2312 lead to 0x00, and based on byte:

    0x7B

    Set the hz-gb-2312 flag and continue.

    0x7D

    Unset the hz-gb-2312 flag and continue.

    0x7E

    Emit code point U+007E.

    0x0A

    Continue.

    Otherwise

    Decrease the byte pointer by one and run error.

  6. If hz-gb-2312 lead is not 0x00, let lead be hz-gb-2312 lead, set hz-gb-2312 lead to 0x00, and then run these substeps:

    1. If byte is in the range 0x21 to 0x7E, let code point be the index code point for (lead − 1) × 190 + (byte + 0x3F) in index gb18030.

    2. If byte is 0x0A, unset the hz-gb-2312 flag.

    3. If code point is null, run error.

    4. Emit a code point whose value is code point.

  7. If byte is 0x7E, set hz-gb-2312 lead to 0x7E and continue.

  8. If the hz-gb-2312 flag is set:

    1. If byte is in the range 0x20 to 0x7F, set hz-gb-2312 lead to byte and continue.

    2. If byte is 0x0A, unset the hz-gb-2312 flag.

    3. Run error.

  9. If byte is in the range 0x00 to 0x7F, emit a code point whose value is byte.

  10. Run error.

The hz-gb-2312 encoder (encoder for hz-gb-2312) is:

  1. Let code point be the value at code point pointer.

  2. If code point is the EOF code point, emit the EOF byte.

  3. Increase code point pointer by one.

  4. If code point is in the range U+0000 to U+007F and the hz-gb-2312 flag is set, decrease code point pointer by one, unset the hz-gb-2312 flag, and emit two bytes 0x7E 0x7D.

  5. If code point is U+007E, emit two bytes 0x7E 0x7E.

  6. If code point is in the range U+0000 to U+007F, emit a byte whose value is code point.

  7. Let pointer be the index pointer for code point in index gb18030.

  8. If pointer is null, run error for code point.

  9. If the hz-gb-2312 flag is unset, decrease code point pointer by one, set the hz-gb-2312 flag, and emit two bytes 0x7E 0x7B.

  10. Let lead be pointer / 190 + 1.

  11. Let trail be pointer % 190 − 0x3F.

  12. If either lead or trail is less than 0x21, run error for code point.

  13. Emit two bytes whose values are lead and trail.
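The encoder steps above can be sketched in Python as a loop over a string. This is an illustration under stated assumptions: `hz_gb_2312_encode` and its `gb18030_pointers` argument (a stand-in dictionary from code point to index gb18030 pointer) are hypothetical, and "run error for code point" is simplified to raising `ValueError`.

```python
def hz_gb_2312_encode(text, gb18030_pointers):
    flag = False              # the hz-gb-2312 flag
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0x7F:
            if flag:          # leave two-byte mode with "~}"
                flag = False
                out += b"~}"
            out += b"~~" if code_point == 0x7E else bytes([code_point])
            continue
        pointer = gb18030_pointers.get(code_point)
        if pointer is None:
            raise ValueError("not encodable: U+%04X" % code_point)
        if not flag:          # enter two-byte mode with "~{"
            flag = True
            out += b"~{"
        lead = pointer // 190 + 1
        trail = pointer % 190 - 0x3F
        if lead < 0x21 or trail < 0x21:
            raise ValueError("not encodable: U+%04X" % code_point)
        out += bytes([lead, trail])
    return bytes(out)
```

With a stand-in index holding only U+4E2D at pointer 16293 (the pointer that corresponds to the gb18030 byte pair 0xD6 0xD0), `hz_gb_2312_encode("a\u4e2db", {0x4E2D: 16293})` produces `b"a~{VP~}b"`.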

11 Legacy multi-byte Chinese (traditional) encodings

11.1 big5

The big5 lead is initially 0x00.

The big5 decoder (decoder for big5) is:

  1. Let byte be the value at byte pointer.

  2. If byte is the EOF byte and big5 lead is 0x00, emit the EOF code point.

  3. If byte is the EOF byte and big5 lead is not 0x00, set big5 lead to 0x00 and run error.

  4. Increase the byte pointer by one.

  5. If big5 lead is not 0x00, let lead be big5 lead, let pointer be null, set big5 lead to 0x00, and then run these substeps:

    1. Let offset be 0x40 if byte is less than 0x7F and 0x62 otherwise.

    2. If byte is in the range 0x40 to 0x7E or 0xA1 to 0xFE, set pointer to (lead − 0x81) × 157 + (byte − offset).

    3. If there is a row in the table below whose first column is pointer, emit the two code points listed in its second column (the third column is irrelevant):

      Pointer  Code points    Notes
      1133     U+00CA U+0304  Ê̄ (LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND MACRON)
      1135     U+00CA U+030C  Ê̌ (LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND CARON)
      1164     U+00EA U+0304  ê̄ (LATIN SMALL LETTER E WITH CIRCUMFLEX AND MACRON)
      1166     U+00EA U+030C  ê̌ (LATIN SMALL LETTER E WITH CIRCUMFLEX AND CARON)

      Since indexes are limited to single code points this table is used for these pointers.

    4. Let code point be null if pointer is null and the index code point for pointer in index big5 otherwise.

    5. If pointer is null and byte is in the range 0x00 to 0x7F, decrease byte pointer by one.

    6. If code point is null, run error.

    7. Emit a code point whose value is code point.

  6. If byte is in the range 0x00 to 0x7F, emit a code point whose value is byte.

  7. If byte is in the range 0x81 to 0xFE, set big5 lead to byte and continue.

  8. Run error.
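A minimal Python sketch of the pointer computation and the two-code-point special cases in the decoder above follows. The names `big5_pointer` and `BIG5_TWO_CODE_POINT_POINTERS` are hypothetical, and the lookup in index big5 is left out.

```python
# Pointers that decode to two code points instead of an index entry.
BIG5_TWO_CODE_POINT_POINTERS = {
    1133: (0x00CA, 0x0304),
    1135: (0x00CA, 0x030C),
    1164: (0x00EA, 0x0304),
    1166: (0x00EA, 0x030C),
}


def big5_pointer(lead, byte):
    """Pointer for a lead/trail pair, or None when the trail byte is invalid."""
    offset = 0x40 if byte < 0x7F else 0x62
    if 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE:
        return (lead - 0x81) * 157 + (byte - offset)
    return None
```

Under this arithmetic the byte pair 0x88 0x62 maps to pointer 1133, the first row of the table.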

The big5 encoder (encoder for big5) is:

  1. Let code point be the value at code point pointer.

  2. If code point is the EOF code point, emit the EOF byte.

  3. Increase code point pointer by one.

  4. If code point is in the range U+0000 to U+007F, emit a byte whose value is code point.

  5. Let pointer be the index pointer for code point in index big5.

  6. If pointer is null, run error for code point.

  7. Let lead be pointer / 157 + 0x81.

  8. If lead is less than 0xA1, run error for code point.

    Avoid emitting Hong Kong Supplementary Character Set extensions literally.

  9. Let trail be pointer % 157.

  10. Let offset be 0x40 if trail is less than 0x3F and 0x62 otherwise.

  11. Emit two bytes whose values are lead and trail + offset.
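The lead and trail computation in the encoder can be sketched as below. `big5_bytes` is a hypothetical helper that takes a pointer already found in index big5 and returns `None` where the steps above run error.

```python
def big5_bytes(pointer):
    lead = pointer // 157 + 0x81
    if lead < 0xA1:
        # Lead bytes below 0xA1 are not emitted (extension area).
        return None
    trail = pointer % 157
    offset = 0x40 if trail < 0x3F else 0x62
    return bytes([lead, trail + offset])
```

The first pointer that can be emitted is (0xA1 − 0x81) × 157 = 5024, which encodes as 0xA1 0x40.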

12 Legacy multi-byte Japanese encodings

12.1 euc-jp

The euc-jp jis0212 flag is initially unset.

The euc-jp lead is initially 0x00.

The euc-jp decoder (decoder for euc-jp) is:

  1. Let byte be the value at byte pointer.

  2. If byte is the EOF byte and euc-jp lead is 0x00, emit the EOF code point.

  3. If byte is the EOF byte and euc-jp lead is not 0x00, set euc-jp lead to 0x00, and run error.

  4. Increase byte pointer by one.

  5. If euc-jp lead is 0x8E and byte is in the range 0xA1 to 0xDF, set euc-jp lead to 0x00 and emit a code point whose value is 0xFF61 + byte − 0xA1.

  6. If euc-jp lead is 0x8F and byte is in the range 0xA1 to 0xFE, set the euc-jp jis0212 flag, set euc-jp lead to byte, and continue.

  7. If euc-jp lead is not 0x00, let lead be euc-jp lead, set euc-jp lead to 0x00, and run these substeps:

    1. Let code point be null.

    2. If lead and byte are both in the range 0xA1 to 0xFE, set code point to the index code point for (lead − 0xA1) × 94 + byte − 0xA1 in index jis0208 if the euc-jp jis0212 flag is unset and in index jis0212 otherwise.

    3. Unset the euc-jp jis0212 flag.

    4. If byte is not in the range 0xA1 to 0xFE, decrease byte pointer by one.

    5. If code point is null, run error.

    6. Emit a code point whose value is code point.

  8. If byte is in the range 0x00 to 0x7F, emit a code point whose value is byte.

  9. If byte is 0x8E, 0x8F, or in the range 0xA1 to 0xFE, set euc-jp lead to byte and continue.

  10. Run error.
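The two computations in the decoder above (the half-width Katakana mapping after a 0x8E lead, and the 94 × 94 pointer) might look as follows in Python. Both function names are hypothetical, and the choice between index jis0208 and index jis0212 is not shown.

```python
def euc_jp_halfwidth_katakana(byte):
    """Code point for the byte following a 0x8E lead, or None."""
    if 0xA1 <= byte <= 0xDF:
        return 0xFF61 + byte - 0xA1
    return None


def euc_jp_pointer(lead, byte):
    """Pointer into index jis0208 or index jis0212, or None."""
    if 0xA1 <= lead <= 0xFE and 0xA1 <= byte <= 0xFE:
        return (lead - 0xA1) * 94 + byte - 0xA1
    return None
```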

The euc-jp encoder (encoder for euc-jp) is:

  1. Let code point be the value at code point pointer.

  2. If code point is the EOF code point, emit the EOF byte.

  3. Increase code point pointer by one.

  4. If code point is in the range U+0000 to U+007F, emit a byte whose value is code point.

  5. If code point is U+00A5, emit byte 0x5C.

  6. If code point is U+203E, emit byte 0x7E.

  7. If code point is in the range U+FF61 to U+FF9F, emit two bytes whose values are 0x8E and code point − 0xFF61 + 0xA1.

  8. Let pointer be the index pointer for code point in index jis0208.

  9. If pointer is null, run error for code point.

  10. Let lead be pointer / 94 + 0xA1.

  11. Let trail be pointer % 94 + 0xA1.

  12. Emit two bytes whose values are lead and trail.

The index jis0212 is not used by the euc-jp encoder due to lack of widespread support.
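For illustration, the encoder steps for a single code point can be sketched as follows. `euc_jp_encode` and its `jis0208_pointers` argument (a stand-in dictionary from code point to index jis0208 pointer) are hypothetical, and `None` stands in for "run error for code point".

```python
def euc_jp_encode(code_point, jis0208_pointers):
    if code_point <= 0x7F:
        return bytes([code_point])
    if code_point == 0x00A5:      # YEN SIGN
        return bytes([0x5C])
    if code_point == 0x203E:      # OVERLINE
        return bytes([0x7E])
    if 0xFF61 <= code_point <= 0xFF9F:
        return bytes([0x8E, code_point - 0xFF61 + 0xA1])
    pointer = jis0208_pointers.get(code_point)
    if pointer is None:
        return None
    return bytes([pointer // 94 + 0xA1, pointer % 94 + 0xA1])
```

With a stand-in index that maps U+3042 to pointer 283, the result is the byte pair 0xA4 0xA2.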

12.2 iso-2022-jp

The iso-2022-jp state is initially ASCII state.

The iso-2022-jp jis0212 flag is initially unset.

The iso-2022-jp lead is initially 0x00.

The iso-2022-jp decoder (decoder for iso-2022-jp) is:

  1. Let byte be the value at byte pointer.

  2. If byte is not the EOF byte, increase byte pointer by one.

  3. Based on iso-2022-jp state:

    ASCII state

    Based on byte:

    0x1B

    Set iso-2022-jp state to escape start state and continue.

    0x00 to 0x7F

    Emit a code point whose value is byte.

    EOF byte

    Emit the EOF code point.

    Otherwise

    Run error.

    Escape start state
    1. If byte is either 0x24 or 0x28, set iso-2022-jp lead to byte, iso-2022-jp state to escape middle state, and continue.

    2. If byte is not the EOF byte, decrease byte pointer by one.

    3. Set iso-2022-jp state to ASCII state and run error.

    Escape middle state
    1. Let lead be iso-2022-jp lead and set iso-2022-jp lead to 0x00.

    2. If lead is 0x24 and byte is either 0x40 or 0x42, unset the iso-2022-jp jis0212 flag, set iso-2022-jp state to lead state, and continue.

    3. If lead is 0x24 and byte is 0x28, set iso-2022-jp state to escape final state and continue.

    4. If lead is 0x28 and byte is either 0x42 or 0x4A, set iso-2022-jp state to ASCII state and continue.

    5. If lead is 0x28 and byte is 0x49, set iso-2022-jp state to Katakana state and continue.

    6. If byte is the EOF byte, decrease byte pointer by one, and decrease it by two otherwise.

    7. Set iso-2022-jp state to ASCII state and run error.

    Escape final state
    1. If byte is 0x44, set the iso-2022-jp jis0212 flag, set iso-2022-jp state to lead state, and continue.

    2. If byte is the EOF byte, decrease byte pointer by two, and decrease it by three otherwise.

    3. Set iso-2022-jp state to ASCII state and run error.

    Lead state

    Based on byte:

    0x0A

    Set iso-2022-jp state to ASCII state and emit code point U+000A.

    0x1B

    Set iso-2022-jp state to escape start state and continue.

    EOF byte

    Emit the EOF code point.

    Otherwise

    Set iso-2022-jp lead to byte, iso-2022-jp state to trail state, and continue.

    Trail state
    1. Set the iso-2022-jp state to lead state.

    2. If byte is the EOF byte, run error.

    3. Let code point be null and let pointer be (iso-2022-jp lead − 0x21) × 94 + byte − 0x21.

    4. If iso-2022-jp lead and byte are both in the range 0x21 to 0x7E, set code point to the index code point for pointer in index jis0208 if the iso-2022-jp jis0212 flag is unset and in index jis0212 otherwise.

    5. If code point is null, run error.

    6. Emit a code point whose value is code point.

    Katakana state

    Based on byte:

    0x1B

    Set iso-2022-jp state to escape start state and continue.

    0x21 to 0x5F

    Emit a code point whose value is 0xFF61 + byte − 0x21.

    EOF byte

    Emit the EOF code point.

    Otherwise

    Run error.
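The escape sequences recognised by the state machine above can be summarised as a lookup table. The sketch below is only a reading aid with hypothetical names; it does not reproduce the byte pointer adjustments or the error paths.

```python
# Bytes following 0x1B -> (next state, effect on the iso-2022-jp jis0212 flag)
ISO_2022_JP_ESCAPES = {
    (0x24, 0x40): ("lead state", "unset"),
    (0x24, 0x42): ("lead state", "unset"),
    (0x24, 0x28, 0x44): ("lead state", "set"),
    (0x28, 0x42): ("ASCII state", "unchanged"),
    (0x28, 0x4A): ("ASCII state", "unchanged"),
    (0x28, 0x49): ("Katakana state", "unchanged"),
}


def iso_2022_jp_katakana(byte):
    """Code point for a byte seen in Katakana state, or None."""
    if 0x21 <= byte <= 0x5F:
        return 0xFF61 + byte - 0x21
    return None


def iso_2022_jp_pointer(lead, byte):
    """Pointer computed in trail state, or None if either byte is out of range."""
    if 0x21 <= lead <= 0x7E and 0x21 <= byte <= 0x7E:
        return (lead - 0x21) * 94 + byte - 0x21
    return None
```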

The iso-2022-jp encoder (encoder for iso-2022-jp) is:

  1. Let code point be the value at code point pointer.

  2. If code point is the EOF code point, emit the EOF byte.

  3. Increase code point pointer by one.

  4. If code point is in the range U+0000 to U+007F, or is U+00A5 or U+203E, and iso-2022-jp state is not ASCII state, decrease code point pointer by one, set iso-2022-jp state to ASCII state, and emit three bytes 0x1B 0x28 0x42.

  5. If code point is in the range U+0000 to U+007F, emit a byte whose value is code point.

  6. If code point is U+00A5, emit byte 0x5C.

  7. If code point is U+203E, emit byte 0x7E.

  8. If code point is in the range U+FF61 to U+FF9F and iso-2022-jp state is not Katakana state, decrease code point pointer by one, set iso-2022-jp state to Katakana state, and emit three bytes 0x1B 0x28 0x49.

  9. If code point is in the range U+FF61 to U+FF9F, emit a byte whose value is code point − 0xFF61 + 0x21.

  10. Let pointer be the index pointer for code point in index jis0208.

  11. If pointer is null, run error for code point.

  12. If iso-2022-jp state is not lead state, decrease code point pointer by one, set iso-2022-jp state to lead state, and emit three bytes 0x1B 0x24 0x42.

  13. Let lead be pointer / 94 + 0x21.

  14. Let trail be pointer % 94 + 0x21.

  15. Emit two bytes whose values are lead and trail.

The index jis0212 is not used by the iso-2022-jp encoder due to lack of widespread support.
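The way the encoder switches state before emitting can be sketched in Python as follows. This is a simplified illustration: `iso_2022_jp_encode` and the `jis0208_pointers` stand-in dictionary are hypothetical, "run error for code point" is reduced to raising `ValueError`, and, like the steps above, the sketch does not switch back to ASCII state at the end of input.

```python
def iso_2022_jp_encode(text, jis0208_pointers):
    state = "ASCII"
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0x7F or cp in (0x00A5, 0x203E):
            if state != "ASCII":
                state = "ASCII"
                out += bytes([0x1B, 0x28, 0x42])
            out.append(0x5C if cp == 0x00A5 else 0x7E if cp == 0x203E else cp)
        elif 0xFF61 <= cp <= 0xFF9F:
            if state != "Katakana":
                state = "Katakana"
                out += bytes([0x1B, 0x28, 0x49])
            out.append(cp - 0xFF61 + 0x21)
        else:
            pointer = jis0208_pointers.get(cp)
            if pointer is None:
                raise ValueError("not encodable: U+%04X" % cp)
            if state != "lead":
                state = "lead"
                out += bytes([0x1B, 0x24, 0x42])
            out += bytes([pointer // 94 + 0x21, pointer % 94 + 0x21])
    return bytes(out)
```

The escape sequence is emitted only when the state actually changes, so a run of characters from the same set shares one escape.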

12.3 shift_jis

The shift_jis lead is initially 0x00.

The shift_jis decoder (decoder for shift_jis) is:

  1. Let byte be the value at byte pointer.

  2. If byte is the EOF byte and shift_jis lead is 0x00, emit the EOF code point.

  3. If byte is the EOF byte and shift_jis lead is not 0x00, set shift_jis lead to 0x00 and run error.

  4. Increase byte pointer by one.

  5. If shift_jis lead is not 0x00, let lead be shift_jis lead, let pointer be null, set shift_jis lead to 0x00, and then run these substeps:

    1. Let offset be 0x40 if byte is less than 0x7F and 0x41 otherwise.

    2. Let lead offset be 0x81 if lead is less than 0xA0 and 0xC1 otherwise.

    3. If byte is in the range 0x40 to 0x7E or 0x80 to 0xFC, set pointer to (lead − lead offset) × 188 + byte − offset.

    4. Let code point be null if pointer is null and the index code point for pointer in index jis0208 otherwise.

    5. If pointer is null, decrease byte pointer by one.

    6. If code point is null, run error.

    7. Emit a code point whose value is code point.

  6. If byte is in the range 0x00 to 0x80, emit a code point whose value is byte.

  7. If byte is in the range 0xA1 to 0xDF, emit a code point whose value is 0xFF61 + byte − 0xA1.

  8. If byte is in the range 0x81 to 0x9F or 0xE0 to 0xFC, set shift_jis lead to byte and continue.

  9. Run error.
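A short Python sketch of the two-byte pointer computation in the decoder above is given below. The function name is hypothetical and the lookup in index jis0208 is omitted.

```python
def shift_jis_pointer(lead, byte):
    """Pointer into index jis0208 for a lead/trail pair, or None."""
    offset = 0x40 if byte < 0x7F else 0x41
    # Lead bytes come from two ranges (0x81-0x9F and 0xE0-0xFC);
    # the lead offset closes the gap between them.
    lead_offset = 0x81 if lead < 0xA0 else 0xC1
    if 0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFC:
        return (lead - lead_offset) * 188 + byte - offset
    return None
```

The two lead ranges join without a gap: 0x9F 0xFC gives pointer 5827 and 0xE0 0x40 gives pointer 5828.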

The shift_jis encoder (encoder for shift_jis) is:

  1. Let code point be the value at code point pointer.

  2. If code point is the EOF code point, emit the EOF byte.

  3. Increase code point pointer by one.

  4. If code point is in the range U+0000 to U+0080, emit a byte whose value is code point.

  5. If code point is U+00A5, emit byte 0x5C.

  6. If code point is U+203E, emit byte 0x7E.

  7. If code point is in the range U+FF61 to U+FF9F, emit a byte whose value is code point − 0xFF61 + 0xA1.

  8. Let pointer be the index pointer for code point in index jis0208.

  9. If pointer is null, run error for code point.

  10. Let lead be pointer / 188.

  11. Let lead offset be 0x81 if lead is less than 0x1F and 0xC1 otherwise.

  12. Let trail be pointer % 188.

  13. Let offset be 0x40 if trail is less than 0x3F and 0x41 otherwise.

  14. Emit two bytes whose values are lead + lead offset and trail + offset.
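The final steps of the encoder, which turn a pointer into a lead and trail byte, can be sketched as follows. `shift_jis_bytes` is a hypothetical helper that assumes the pointer has already been found in index jis0208.

```python
def shift_jis_bytes(pointer):
    lead = pointer // 188
    lead_offset = 0x81 if lead < 0x1F else 0xC1
    trail = pointer % 188
    offset = 0x40 if trail < 0x3F else 0x41
    return bytes([lead + lead_offset, trail + offset])
```

For pointer 283 this yields 0x82 0xA0.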

13 Legacy multi-byte Korean encodings

13.1 euc-kr

The euc-kr lead is initially 0x00.

The euc-kr decoder (decoder for euc-kr) is:

  1. Let byte be the value at byte pointer.

  2. If byte is the EOF byte and euc-kr lead is 0x00, emit the EOF code point.

  3. If byte is the EOF byte and euc-kr lead is not 0x00, set euc-kr lead to 0x00 and run error.

  4. Increase byte pointer by one.

  5. If euc-kr lead is not 0x00, let lead be euc-kr lead, let pointer be null, set euc-kr lead to 0x00, and then run these substeps:

    1. If lead is in the range 0x81 to 0xC6, let temp be (26 + 26 + 126) × (lead − 0x81), and then set pointer to the result of the equation below, depending on byte:

      0x41 to 0x5A

      temp + byte − 0x41

      0x61 to 0x7A

      temp + 26 + byte − 0x61

      0x81 to 0xFE

      temp + 26 + 26 + byte − 0x81

    2. If lead is in the range 0xC7 to 0xFE and byte is in the range 0xA1 to 0xFE, set pointer to (26 + 26 + 126) × (0xC7 − 0x81) + (lead − 0xC7) × 94 + (byte − 0xA1).

    3. Let code point be null if pointer is null and the index code point for pointer in index euc-kr otherwise.

    4. If pointer is null, decrease byte pointer by one.

    5. If code point is null, run error.

    6. Emit a code point whose value is code point.

  6. If byte is in the range 0x00 to 0x7F, emit a code point whose value is byte.

  7. If byte is in the range 0x81 to 0xFE, set euc-kr lead to byte and continue.

  8. Run error.
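The pointer computation in the decoder above has two layouts, one for lead bytes 0x81 to 0xC6 and one for 0xC7 to 0xFE. A minimal Python sketch, with a hypothetical function name and without the lookup in index euc-kr, could read:

```python
ROW = 26 + 26 + 126   # pointers per lead byte in the 0x81-0xC6 area


def euc_kr_pointer(lead, byte):
    pointer = None
    if 0x81 <= lead <= 0xC6:
        temp = ROW * (lead - 0x81)
        if 0x41 <= byte <= 0x5A:
            pointer = temp + byte - 0x41
        elif 0x61 <= byte <= 0x7A:
            pointer = temp + 26 + byte - 0x61
        elif 0x81 <= byte <= 0xFE:
            pointer = temp + 26 + 26 + byte - 0x81
    if 0xC7 <= lead <= 0xFE and 0xA1 <= byte <= 0xFE:
        pointer = ROW * (0xC7 - 0x81) + (lead - 0xC7) * 94 + (byte - 0xA1)
    return pointer
```

The second layout starts directly after the first: 0xC6 0xFE gives pointer 12459 and 0xC7 0xA1 gives pointer 12460.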

The euc-kr encoder (encoder for euc-kr) is:

  1. Let code point be the value at code point pointer.

  2. If code point is the EOF code point, emit the EOF byte.

  3. Increase code point pointer by one.

  4. If code point is in the range U+0000 to U+007F, emit a byte whose value is code point.

  5. Let pointer be the index pointer for code point in index euc-kr.

  6. If pointer is null, run error for code point.

  7. If pointer is less than (26 + 26 + 126) × (0xC7 − 0x81), run these substeps:

    1. Let lead be pointer / (26 + 26 + 126) + 0x81.

    2. Let trail be pointer % (26 + 26 + 126).

    3. Let offset be 0x41 if trail is less than 26, 0x47 if trail is less than 26 + 26, and 0x4D otherwise.

    4. Emit two bytes whose values are lead and trail + offset.

  8. Set pointer to pointer − (26 + 26 + 126) × (0xC7 − 0x81).

  9. Let lead be pointer / 94 + 0xC7.

  10. Let trail be pointer % 94 + 0xA1.

  11. Emit two bytes whose values are lead and trail.
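The byte computation in the encoder can be sketched in Python as below. `euc_kr_bytes` is a hypothetical helper that takes a pointer already found in index euc-kr.

```python
def euc_kr_bytes(pointer):
    row = 26 + 26 + 126
    boundary = row * (0xC7 - 0x81)
    if pointer < boundary:
        lead = pointer // row + 0x81
        trail = pointer % row
        # The three offsets map trail onto 0x41-0x5A, 0x61-0x7A and 0x81-0xFE.
        offset = 0x41 if trail < 26 else 0x47 if trail < 26 + 26 else 0x4D
        return bytes([lead, trail + offset])
    pointer -= boundary
    return bytes([pointer // 94 + 0xC7, pointer % 94 + 0xA1])
```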

14 Legacy miscellaneous encodings

14.1 replacement

The replacement encoding exists to prevent certain attacks that abuse a mismatch between encodings supported on the server and the client.

The replacement decoder (decoder for replacement) is to run error and then, if the decoder is not terminated, emit the EOF code point.

The replacement encoder (encoder for replacement) is the utf-8 encoder.

14.2 Common infrastructure for utf-16be and utf-16le

In violation of the Unicode standard, which does not allow for handling a byte order mark in its definition of utf-16be and utf-16le, checking and using a byte order mark happens before an encoding to decode a byte stream is chosen, as seen in the decode algorithm.

The utf-16 lead byte and utf-16 lead surrogate are initially null and the utf-16be flag is initially unset.

The utf-16 decoder is:

  1. Let byte be the value at byte pointer.

  2. If byte is the EOF byte and utf-16 lead byte and utf-16 lead surrogate are null, emit the EOF code point.

  3. If byte is the EOF byte and either utf-16 lead byte or utf-16 lead surrogate is not null, set utf-16 lead byte and utf-16 lead surrogate to null, and run error.

  4. Increase byte pointer by one.

  5. If utf-16 lead byte is null, set utf-16 lead byte to byte and continue.

  6. Let code point be the result of:

    utf-16be flag is set

    (utf-16 lead byte << 8) + byte.

    utf-16be flag is unset

    (byte << 8) + utf-16 lead byte.

    Then set utf-16 lead byte to null.

  7. If utf-16 lead surrogate is not null, let lead surrogate be utf-16 lead surrogate, set utf-16 lead surrogate to null, and then run these substeps:

    1. If code point is in the range U+DC00 to U+DFFF, emit a code point whose value is 0x10000 + (lead surrogate − 0xD800) × 0x400 + (code point − 0xDC00).

    2. Decrease byte pointer by two and run error.

  8. If code point is in the range U+D800 to U+DBFF, set utf-16 lead surrogate to code point and continue.

  9. If code point is in the range U+DC00 to U+DFFF, run error.

  10. Emit a code point whose value is code point.
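Since the utf-16 decoder needs no index, it can be sketched in full. The Python below is an illustration under one assumption: "run error" is taken to emit U+FFFD and continue. The function name and the `big_endian` parameter (standing in for the utf-16be flag) are hypothetical.

```python
def utf16_decode(data, big_endian=False):
    out = []
    lead_byte = None
    lead_surrogate = None
    i = 0                                   # the byte pointer
    while True:
        if i >= len(data):                  # EOF byte
            if lead_byte is not None or lead_surrogate is not None:
                out.append("\uFFFD")        # truncated input
            return "".join(out)
        byte = data[i]
        i += 1
        if lead_byte is None:
            lead_byte = byte
            continue
        if big_endian:
            code_point = (lead_byte << 8) + byte
        else:
            code_point = (byte << 8) + lead_byte
        lead_byte = None
        if lead_surrogate is not None:
            lead, lead_surrogate = lead_surrogate, None
            if 0xDC00 <= code_point <= 0xDFFF:
                out.append(chr(0x10000 + (lead - 0xD800) * 0x400
                               + (code_point - 0xDC00)))
                continue
            i -= 2                          # reprocess this code unit
            out.append("\uFFFD")
            continue
        if 0xD800 <= code_point <= 0xDBFF:
            lead_surrogate = code_point
            continue
        if 0xDC00 <= code_point <= 0xDFFF:
            out.append("\uFFFD")            # lone trail surrogate
            continue
        out.append(chr(code_point))
```

Decreasing the byte pointer by two after an unpaired lead surrogate means the following code unit is decoded normally rather than being swallowed by the error.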

To convert a code unit to bytes run these steps:

  1. Let byte1 be code unit >> 8.

  2. Let byte2 be code unit & 0x00FF.

  3. Then return the bytes in order:

    utf-16be flag is set

    byte1, then byte2.

    utf-16be flag is unset

    byte2, then byte1.

The utf-16 encoder is:

  1. Let code point be the value at code point pointer.

  2. If code point is in the range 0xD800 to 0xDFFF, run error for code point.

  3. If code point is the EOF code point, emit the EOF byte.

  4. Increase code point pointer by one.

  5. If code point is in the range 0x00 to 0xFFFF, emit the sequence resulting from converting code point to bytes.

  6. Let lead be (code point − 0x10000) / 0x400 + 0xD800, converted to bytes.

  7. Let trail be (code point − 0x10000) % 0x400 + 0xDC00, converted to bytes.

  8. Emit a sequence of bytes that consists of lead followed by trail.
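The code unit conversion and the encoder steps can be sketched together in Python. The names are hypothetical, `big_endian` stands in for the utf-16be flag, and "run error for code point" is simplified to raising `ValueError`.

```python
def code_unit_to_bytes(code_unit, big_endian=False):
    byte1 = code_unit >> 8
    byte2 = code_unit & 0x00FF
    return bytes([byte1, byte2]) if big_endian else bytes([byte2, byte1])


def utf16_encode(text, big_endian=False):
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("surrogate code point U+%04X" % code_point)
        if code_point <= 0xFFFF:
            out += code_unit_to_bytes(code_point, big_endian)
            continue
        # Split a supplementary code point into a surrogate pair.
        lead = (code_point - 0x10000) // 0x400 + 0xD800
        trail = (code_point - 0x10000) % 0x400 + 0xDC00
        out += code_unit_to_bytes(lead, big_endian)
        out += code_unit_to_bytes(trail, big_endian)
    return bytes(out)
```

For U+1F600 the surrogate pair is 0xD83D 0xDE00, so the little-endian output is 0x3D 0xD8 0x00 0xDE.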

14.3 utf-16be

The utf-16be decoder (decoder for utf-16be) is the utf-16 decoder with the utf-16be flag set.

The utf-16be encoder (encoder for utf-16be) is the utf-16 encoder with the utf-16be flag set.

14.4 utf-16le

In violation of the Unicode standard, "utf-16" is a label for utf-16le rather than its own standalone encoding.

The utf-16le decoder (decoder for utf-16le) is the utf-16 decoder.

The utf-16le encoder (encoder for utf-16le) is the utf-16 encoder.

14.5 x-user-defined

While technically this is a single-byte encoding, it is defined separately as it can be implemented algorithmically.

The x-user-defined decoder (decoder for x-user-defined) is:

  1. Let byte be the value at byte pointer.

  2. If byte is the EOF byte, emit the EOF code point.

  3. Increase byte pointer by one.

  4. If byte is in the range 0x00 to 0x7F, emit a code point whose value is byte.

  5. Emit a code point whose value is 0xF780 + byte − 0x80.
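Because the mapping is purely arithmetic, the decoder can be sketched in a few lines of Python. The function name is hypothetical.

```python
def x_user_defined_decode(data):
    # Bytes 0x00-0x7F map to themselves; 0x80-0xFF map to U+F780-U+F7FF.
    return "".join(
        chr(byte) if byte <= 0x7F else chr(0xF780 + byte - 0x80)
        for byte in data
    )
```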

The x-user-defined encoder (encoder for x-user-defined) is:

  1. Let code point be the value at code point pointer.

  2. If code point is the EOF code point, emit the EOF byte.

  3. Increase code point pointer by one.

  4. If code point is in the range U+0000 to U+007F, emit a byte whose value is code point.

  5. If code point is in the range U+F780 to U+F7FF, emit a byte whose value is code point − 0xF780 + 0x80.

  6. Run error for code point.
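The encoder is the inverse mapping and can be sketched the same way. The function name is hypothetical, and "run error for code point" is simplified to raising `ValueError`.

```python
def x_user_defined_encode(text):
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0x7F:
            out.append(code_point)
        elif 0xF780 <= code_point <= 0xF7FF:
            out.append(code_point - 0xF780 + 0x80)
        else:
            raise ValueError("not encodable: U+%04X" % code_point)
    return bytes(out)
```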

References

[DOM]
DOM, Anne van Kesteren, Aryeh Gregor and Ms2ger. WHATWG.
[HTML]
(Non-normative) HTML, Ian Hickson. WHATWG.
[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels, Scott Bradner. IETF.
[TYPEDARRAY]
Typed Array, David Herman and Kenneth Russell. Khronos.
[UNICODE]
Unicode Standard. Unicode Consortium.
[URL]
(Non-normative) URL, Anne van Kesteren. WHATWG.
[WEBIDL]
Web IDL, Cameron McCormack. W3C.
[XML]
(Non-normative) Extensible Markup Language, Tim Bray, Jean Paoli, C. M. Sperberg-McQueen et al. W3C.

Acknowledgments

Many people have helped make encodings more interoperable over the years and thereby furthered the goals of this standard. Likewise, many people have helped make this standard what it is today.

Ideally they are all listed here so please contact the editor with any omissions.

With that, many thanks to Alan Chaney, Allen Wirfs-Brock, Ben Noordhuis, Boris Zbarsky, Cameron McCormack, Charles McCathieNeville, David Carlisle, Doug Ewell, Erik van der Poel, 譚永鋒 (Frank Yung-Fong Tang), Glenn Maynard, Gordon P. Hemsley, Henri Sivonen, Ian Hickson, James Graham, John Tamplin, Joshua Bell, 신정식 (Jungshik Shin), 川幡太一 (Kawabata Taichi), Ken Lunde, Kenneth Russell, Leif Halvard Silli, Makoto Kato, Mark Callow, Mark Davis, Martin Dürst, Masatoshi Kimura, Ms2ger, Nigel Tao, Norbert Lindenberg, Øistein E. Andersen, Peter Krefting, Philip Jägenstedt, Philip Taylor, Robbert Broersma, Robert Mustacchi, Ryan Dahl, Shawn Steele, Simon Montagu, Simon Pieters, Simon Sapin, and 成瀬ゆい (Yui Naruse) for being awesome.