8.2.4 Tokenization
Implementations must act as if they used the following state
machine to tokenize HTML. The state machine must start in the
data state. Most states consume a single character,
which may have various side-effects, and either switches the state
machine to a new state to reconsume the same character, or
switches it to a new state to consume the next character, or stays
in the same state to consume the next character. Some states have
more complicated behavior and can consume several characters before
switching to another state. In some cases, the tokenizer state is
also changed by the tree construction stage.
The exact behavior of certain states depends on the
insertion mode and the stack of open
elements. Certain states also use a temporary
buffer to track progress.
The output of the tokenization step is a series of zero or more
of the following tokens: DOCTYPE, start tag, end tag, comment,
character, end-of-file. DOCTYPE tokens have a name, a public
identifier, a system identifier, and a force-quirks
flag. When a DOCTYPE token is created, its name, public
identifier, and system identifier must be marked as missing (which
is a distinct state from the empty string), and the force-quirks
flag must be set to off (its other state is
on). Start and end tag tokens have a tag name, a
self-closing flag, and a list of attributes, each of which
has a name and a value. When a start or end tag token is created,
its self-closing flag must be unset (its other state is that
it be set), and its attributes list must be empty. Comment and
character tokens have data.
When a token is emitted, it must immediately be handled by the
tree construction stage. The tree construction stage
can affect the state of the tokenization stage, and can insert
additional characters into the stream. (For example, the
script
element can result in scripts executing and
using the dynamic markup insertion APIs to insert
characters into the stream being tokenized.)
When a start tag token is emitted with its self-closing
flag set, if the flag is not acknowledged when it is processed by the
tree construction stage, that is a parse error.
When an end tag token is emitted with attributes, that is a
parse error.
When an end tag token is emitted with its self-closing
flag set, that is a parse error.
An appropriate end tag token is an end tag token whose
tag name matches the tag name of the last start tag to have been
emitted from this tokenizer, if any. If no start tag has been
emitted from this tokenizer, then no end tag token is
appropriate.
Before each step of the tokenizer, the user agent must first
check the parser pause flag. If it is true, then the
tokenizer must abort the processing of any nested invocations of the
tokenizer, yielding control back to the caller.
The tokenizer state machine consists of the states defined in the
following subsections.
8.2.4.1 Data state
Consume the next input character:
- U+0026 AMPERSAND (&)
- Switch to the character reference in data
state.
- U+003C LESS-THAN SIGN (<)
- Switch to the tag open state.
- U+0000 NULL
- Parse error. Emit the current input
character as a character token.
- EOF
- Emit an end-of-file token.
- Anything else
- Emit the current input character as a character
token.
8.2.4.2 Character reference in data state
Switch to the data state.
Attempt to consume a character reference, with no
additional allowed character.
If nothing is returned, emit a U+0026 AMPERSAND character (&)
token.
Otherwise, emit the character tokens that were returned.
8.2.4.3 RCDATA state
Consume the next input character:
- U+0026 AMPERSAND (&)
- Switch to the character reference in RCDATA
state.
- U+003C LESS-THAN SIGN (<)
- Switch to the RCDATA less-than sign state.
- U+0000 NULL
- Parse error. Emit a U+FFFD REPLACEMENT CHARACTER
character token.
- EOF
- Emit an end-of-file token.
- Anything else
- Emit the current input character as a character
token.
8.2.4.4 Character reference in RCDATA state
Switch to the RCDATA state.
Attempt to consume a character reference, with no
additional allowed character.
If nothing is returned, emit a U+0026 AMPERSAND character (&)
token.
Otherwise, emit the character tokens that were returned.
8.2.4.5 RAWTEXT state
Consume the next input character:
- U+003C LESS-THAN SIGN (<)
- Switch to the RAWTEXT less-than sign state.
- U+0000 NULL
- Parse error. Emit a U+FFFD REPLACEMENT CHARACTER
character token.
- EOF
- Emit an end-of-file token.
- Anything else
- Emit the current input character as a character
token.
8.2.4.6 Script data state
Consume the next input character:
- U+003C LESS-THAN SIGN (<)
- Switch to the script data less-than sign state.
- U+0000 NULL
- Parse error. Emit a U+FFFD REPLACEMENT CHARACTER
character token.
- EOF
- Emit an end-of-file token.
- Anything else
- Emit the current input character as a character
token.
8.2.4.7 PLAINTEXT state
Consume the next input character:
- U+0000 NULL
- Parse error. Emit a U+FFFD REPLACEMENT CHARACTER
character token.
- EOF
- Emit an end-of-file token.
- Anything else
- Emit the current input character as a character
token.
8.2.4.8 Tag open state
Consume the next input character:
- "!" (U+0021)
- Switch to the markup declaration open state.
- "/" (U+002F)
- Switch to the end tag open state.
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Create a new start tag token, set its tag name to the
lowercase version of the current input character (add 0x0020 to the
character's code point), then switch to the tag name
state. (Don't emit the token yet; further details will
be filled in before it is emitted.)
- U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
- Create a new start tag token, set its tag name to the
current input character, then switch to the tag
name state. (Don't emit the token yet; further details will
be filled in before it is emitted.)
- "?" (U+003F)
- Parse error. Switch to the bogus
comment state.
- Anything else
- Parse error. Switch to the data
state. Emit a U+003C LESS-THAN SIGN character token.
Reconsume the current input character.
8.2.4.9 End tag open state
Consume the next input character:
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Create a new end tag token, set its tag name to the lowercase
version of the current input character (add 0x0020 to
the character's code point), then switch to the tag name
state. (Don't emit the token yet; further details will be
filled in before it is emitted.)
- U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
- Create a new end tag token, set its tag name to the
current input character, then switch to the tag
name state. (Don't emit the token yet; further details will
be filled in before it is emitted.)
- U+003E GREATER-THAN SIGN (>)
- Parse error. Switch to the data
state.
- EOF
- Parse error. Switch to the data
state. Emit a U+003C LESS-THAN SIGN character token and a
U+002F SOLIDUS character token. Reconsume the EOF character.
- Anything else
- Parse error. Switch to the bogus
comment state.
8.2.4.10 Tag name state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Switch to the before attribute name state.
- "/" (U+002F)
- Switch to the self-closing start tag state.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current tag
token.
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Append the lowercase version of the current input
character (add 0x0020 to the character's code point) to the
current tag token's tag name.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER
character to the current tag token's tag name.
- EOF
- Parse error. Switch to the data
state. Reconsume the EOF character.
- Anything else
- Append the current input character to the current
tag token's tag name.
8.2.4.11 RCDATA less-than sign state
Consume the next input character:
- "/" (U+002F)
- Set the temporary buffer to the empty string. Switch
to the RCDATA end tag open state.
- Anything else
- Switch to the RCDATA state. Emit a U+003C
LESS-THAN SIGN character token. Reconsume the current
input character.
8.2.4.12 RCDATA end tag open state
Consume the next input character:
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Create a new end tag token, and set its tag name to the
lowercase version of the current input character (add
0x0020 to the character's code point). Append the current
input character to the temporary buffer. Finally,
switch to the RCDATA end tag name state. (Don't emit
the token yet; further details will be filled in before it is
emitted.)
- U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
- Create a new end tag token, and set its tag name to the
current input character. Append the current
input character to the temporary buffer. Finally,
switch to the RCDATA end tag name state. (Don't emit
the token yet; further details will be filled in before it is
emitted.)
- Anything else
- Switch to the RCDATA state. Emit a U+003C
LESS-THAN SIGN character token and a U+002F SOLIDUS character token.
Reconsume the current input character.
8.2.4.13 RCDATA end tag name state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- If the current end tag token is an appropriate end tag
token, then switch to the before attribute name
state. Otherwise, treat it as per the "anything else" entry
below.
- "/" (U+002F)
- If the current end tag token is an appropriate end tag
token, then switch to the self-closing start tag
state. Otherwise, treat it as per the "anything else" entry
below.
- U+003E GREATER-THAN SIGN (>)
- If the current end tag token is an appropriate end tag
token, then switch to the data state and emit
the current tag token. Otherwise, treat it as per the "anything
else" entry below.
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Append the lowercase version of the current input
character (add 0x0020 to the character's code point) to the
current tag token's tag name. Append the current input
character to the temporary buffer.
- U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
- Append the current input character to the current
tag token's tag name. Append the current input
character to the temporary buffer.
- Anything else
- Switch to the RCDATA state. Emit a U+003C
LESS-THAN SIGN character token, a U+002F SOLIDUS character token,
and a character token for each of the characters in the
temporary buffer (in the order they were added to the
buffer). Reconsume the current input character.
8.2.4.14 RAWTEXT less-than sign state
Consume the next input character:
- "/" (U+002F)
- Set the temporary buffer to the empty string. Switch
to the RAWTEXT end tag open state.
- Anything else
- Switch to the RAWTEXT state. Emit a U+003C
LESS-THAN SIGN character token. Reconsume the current
input character.
8.2.4.15 RAWTEXT end tag open state
Consume the next input character:
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Create a new end tag token, and set its tag name to the
lowercase version of the current input character (add
0x0020 to the character's code point). Append the current
input character to the temporary buffer. Finally,
switch to the RAWTEXT end tag name state. (Don't emit
the token yet; further details will be filled in before it is
emitted.)
- U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
- Create a new end tag token, and set its tag name to the
current input character. Append the current
input character to the temporary buffer. Finally,
switch to the RAWTEXT end tag name state. (Don't emit
the token yet; further details will be filled in before it is
emitted.)
- Anything else
- Switch to the RAWTEXT state. Emit a U+003C
LESS-THAN SIGN character token and a U+002F SOLIDUS character
token. Reconsume the current input character.
8.2.4.16 RAWTEXT end tag name state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- If the current end tag token is an appropriate end tag
token, then switch to the before attribute name
state. Otherwise, treat it as per the "anything else" entry
below.
- "/" (U+002F)
- If the current end tag token is an appropriate end tag
token, then switch to the self-closing start tag
state. Otherwise, treat it as per the "anything else" entry
below.
- U+003E GREATER-THAN SIGN (>)
- If the current end tag token is an appropriate end tag
token, then switch to the data state and emit
the current tag token. Otherwise, treat it as per the "anything
else" entry below.
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Append the lowercase version of the current input
character (add 0x0020 to the character's code point) to the
current tag token's tag name. Append the current input
character to the temporary buffer.
- U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
- Append the current input character to the current
tag token's tag name. Append the current input
character to the temporary buffer.
- Anything else
- Switch to the RAWTEXT state. Emit a U+003C
LESS-THAN SIGN character token, a U+002F SOLIDUS character token,
and a character token for each of the characters in the
temporary buffer (in the order they were added to the
buffer). Reconsume the current input character.
8.2.4.17 Script data less-than sign state
Consume the next input character:
- "/" (U+002F)
- Set the temporary buffer to the empty string. Switch
to the script data end tag open state.
- "!" (U+0021)
- Switch to the script data escape start state. Emit
a U+003C LESS-THAN SIGN character token and a U+0021 EXCLAMATION
MARK character token.
- Anything else
- Switch to the script data state. Emit a U+003C
LESS-THAN SIGN character token. Reconsume the current
input character.
8.2.4.18 Script data end tag open state
Consume the next input character:
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Create a new end tag token, and set its tag name to the
lowercase version of the current input character (add
0x0020 to the character's code point). Append the current
input character to the temporary buffer. Finally,
switch to the script data end tag name state. (Don't emit
the token yet; further details will be filled in before it is
emitted.)
- U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
- Create a new end tag token, and set its tag name to the
current input character. Append the current
input character to the temporary buffer. Finally,
switch to the script data end tag name state. (Don't emit
the token yet; further details will be filled in before it is
emitted.)
- Anything else
- Switch to the script data state. Emit a U+003C
LESS-THAN SIGN character token and a U+002F SOLIDUS character token.
Reconsume the current input character.
8.2.4.19 Script data end tag name state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- If the current end tag token is an appropriate end tag
token, then switch to the before attribute name
state. Otherwise, treat it as per the "anything else" entry
below.
- "/" (U+002F)
- If the current end tag token is an appropriate end tag
token, then switch to the self-closing start tag
state. Otherwise, treat it as per the "anything else" entry
below.
- U+003E GREATER-THAN SIGN (>)
- If the current end tag token is an appropriate end tag
token, then switch to the data state and emit
the current tag token. Otherwise, treat it as per the "anything
else" entry below.
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Append the lowercase version of the current input
character (add 0x0020 to the character's code point) to the
current tag token's tag name. Append the current input
character to the temporary buffer.
- U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
- Append the current input character to the current
tag token's tag name. Append the current input
character to the temporary buffer.
- Anything else
- Switch to the script data state. Emit a U+003C
LESS-THAN SIGN character token, a U+002F SOLIDUS character token,
and a character token for each of the characters in the
temporary buffer (in the order they were added to the
buffer). Reconsume the current input character.
8.2.4.20 Script data escape start state
Consume the next input character:
- "-" (U+002D)
- Switch to the script data escape start dash
state. Emit a U+002D HYPHEN-MINUS character token.
- Anything else
- Switch to the script data state. Reconsume the
current input character.
8.2.4.21 Script data escape start dash state
Consume the next input character:
- "-" (U+002D)
- Switch to the script data escaped dash dash
state. Emit a U+002D HYPHEN-MINUS character token.
- Anything else
- Switch to the script data state. Reconsume the
current input character.
8.2.4.22 Script data escaped state
Consume the next input character:
- "-" (U+002D)
- Switch to the script data escaped dash state. Emit
a U+002D HYPHEN-MINUS character token.
- U+003C LESS-THAN SIGN (<)
- Switch to the script data escaped less-than sign
state.
- U+0000 NULL
- Parse error. Emit a U+FFFD REPLACEMENT CHARACTER
character token.
- EOF
- Switch to the data state. Parse
error. Reconsume the EOF character.
- Anything else
- Emit the current input character as a character
token.
8.2.4.23 Script data escaped dash state
Consume the next input character:
- "-" (U+002D)
- Switch to the script data escaped dash dash
state. Emit a U+002D HYPHEN-MINUS character token.
- U+003C LESS-THAN SIGN (<)
- Switch to the script data escaped less-than sign
state.
- U+0000 NULL
- Parse error. Switch to the script data
escaped state. Emit a U+FFFD REPLACEMENT CHARACTER character
token.
- EOF
- Parse error. Switch to the data
state. Reconsume the EOF character.
- Anything else
- Switch to the script data escaped state. Emit the
current input character as a character token.
8.2.4.24 Script data escaped dash dash state
Consume the next input character:
- "-" (U+002D)
- Emit a U+002D HYPHEN-MINUS character token.
- U+003C LESS-THAN SIGN (<)
- Switch to the script data escaped less-than sign
state.
- U+003E GREATER-THAN SIGN (>)
- Switch to the script data state. Emit a U+003E
GREATER-THAN SIGN character token.
- U+0000 NULL
- Parse error. Switch to the script data
escaped state. Emit a U+FFFD REPLACEMENT CHARACTER character
token.
- EOF
- Parse error. Switch to the data
state. Reconsume the EOF character.
- Anything else
- Switch to the script data escaped state. Emit the
current input character as a character token.
8.2.4.25 Script data escaped less-than sign state
Consume the next input character:
- "/" (U+002F)
- Set the temporary buffer to the empty string. Switch
to the script data escaped end tag open state.
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Set the temporary buffer to the empty string. Append
the lowercase version of the current input character
(add 0x0020 to the character's code point) to the temporary
buffer. Switch to the script data double escape start
state. Emit a U+003C LESS-THAN SIGN character token and the
current input character as a character token.
- U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
- Set the temporary buffer to the empty string. Append
the current input character to the temporary
buffer. Switch to the script data double escape start
state. Emit a U+003C LESS-THAN SIGN character token and the
current input character as a character token.
- Anything else
- Switch to the script data state. Emit a U+003C
LESS-THAN SIGN character token. Reconsume the current
input character.
8.2.4.26 Script data escaped end tag open state
Consume the next input character:
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Create a new end tag token, and set its tag name to the
lowercase version of the current input character (add
0x0020 to the character's code point). Append the current
input character to the temporary buffer. Finally,
switch to the script data escaped end tag name
state. (Don't emit the token yet; further details will be
filled in before it is emitted.)
- U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
- Create a new end tag token, and set its tag name to the
current input character. Append the current
input character to the temporary buffer. Finally,
switch to the script data escaped end tag name
state. (Don't emit the token yet; further details will be
filled in before it is emitted.)
- Anything else
- Switch to the script data escaped state. Emit a
U+003C LESS-THAN SIGN character token and a U+002F SOLIDUS
character token. Reconsume the current input
character.
8.2.4.27 Script data escaped end tag name state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- If the current end tag token is an appropriate end tag
token, then switch to the before attribute name
state. Otherwise, treat it as per the "anything else" entry
below.
- "/" (U+002F)
- If the current end tag token is an appropriate end tag
token, then switch to the self-closing start tag
state. Otherwise, treat it as per the "anything else" entry
below.
- U+003E GREATER-THAN SIGN (>)
- If the current end tag token is an appropriate end tag
token, then switch to the data state and emit
the current tag token. Otherwise, treat it as per the "anything
else" entry below.
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Append the lowercase version of the current input
character (add 0x0020 to the character's code point) to the
current tag token's tag name. Append the current input
character to the temporary buffer.
- U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
- Append the current input character to the current
tag token's tag name. Append the current input
character to the temporary buffer.
- Anything else
- Switch to the script data escaped state. Emit a
U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS character
token, and a character token for each of the characters in the
temporary buffer (in the order they were added to the
buffer). Reconsume the current input character.
8.2.4.28 Script data double escape start state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- "/" (U+002F)
- U+003E GREATER-THAN SIGN (>)
- If the temporary buffer is the string "
script
", then switch to the script data
double escaped state. Otherwise, switch to the script
data escaped state. Emit the current input
character as a character token.
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Append the lowercase version of the current input
character (add 0x0020 to the character's code point) to the
temporary buffer. Emit the current input
character as a character token.
- U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
- Append the current input character to the
temporary buffer. Emit the current input
character as a character token.
- Anything else
- Switch to the script data escaped state. Reconsume
the current input character.
8.2.4.29 Script data double escaped state
Consume the next input character:
- "-" (U+002D)
- Switch to the script data double escaped dash
state. Emit a U+002D HYPHEN-MINUS character token.
- U+003C LESS-THAN SIGN (<)
- Switch to the script data double escaped less-than
sign state. Emit a U+003C LESS-THAN SIGN character
token.
- U+0000 NULL
- Parse error. Emit a U+FFFD REPLACEMENT CHARACTER
character token.
- EOF
- Parse error. Switch to the data
state. Reconsume the EOF character.
- Anything else
- Emit the current input character as a character
token.
8.2.4.30 Script data double escaped dash state
Consume the next input character:
- "-" (U+002D)
- Switch to the script data double escaped dash dash
state. Emit a U+002D HYPHEN-MINUS character token.
- U+003C LESS-THAN SIGN (<)
- Switch to the script data double escaped less-than
sign state. Emit a U+003C LESS-THAN SIGN character
token.
- U+0000 NULL
- Parse error. Switch to the script data
double escaped state. Emit a U+FFFD REPLACEMENT CHARACTER
character token.
- EOF
- Parse error. Switch to the data
state. Reconsume the EOF character.
- Anything else
- Switch to the script data double escaped
state. Emit the current input character as a
character token.
8.2.4.31 Script data double escaped dash dash state
Consume the next input character:
- "-" (U+002D)
- Emit a U+002D HYPHEN-MINUS character token.
- U+003C LESS-THAN SIGN (<)
- Switch to the script data double escaped less-than
sign state. Emit a U+003C LESS-THAN SIGN character
token.
- U+003E GREATER-THAN SIGN (>)
- Switch to the script data state. Emit a U+003E
GREATER-THAN SIGN character token.
- U+0000 NULL
- Parse error. Switch to the script data
double escaped state. Emit a U+FFFD REPLACEMENT CHARACTER
character token.
- EOF
- Parse error. Switch to the data
state. Reconsume the EOF character.
- Anything else
- Switch to the script data double escaped
state. Emit the current input character as a
character token.
8.2.4.32 Script data double escaped less-than sign state
Consume the next input character:
- "/" (U+002F)
- Set the temporary buffer to the empty string. Switch
to the script data double escape end state. Emit a
U+002F SOLIDUS character token.
- Anything else
- Switch to the script data double escaped state.
Reconsume the current input character.
8.2.4.33 Script data double escape end state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- "/" (U+002F)
- U+003E GREATER-THAN SIGN (>)
- If the temporary buffer is the string "
script
", then switch to the script data
escaped state. Otherwise, switch to the script data
double escaped state. Emit the current input
character as a character token.
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Append the lowercase version of the current input
character (add 0x0020 to the character's code point) to the
temporary buffer. Emit the current input
character as a character token.
- U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
- Append the current input character to the
temporary buffer. Emit the current input
character as a character token.
- Anything else
- Switch to the script data double escaped state.
Reconsume the current input character.
8.2.4.34 Before attribute name state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Ignore the character.
- "/" (U+002F)
- Switch to the self-closing start tag state.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current tag
token.
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Start a new attribute in the current tag token. Set that
attribute's name to the lowercase version of the current input
character (add 0x0020 to the character's code point), and its
value to the empty string. Switch to the attribute name
state.
- U+0000 NULL
- Parse error. Start a new attribute in the current
tag token. Set that attribute's name to a U+FFFD REPLACEMENT
CHARACTER character, and its value to the empty string. Switch to
the attribute name state.
- """ (U+0022)
- "'" (U+0027)
- U+003C LESS-THAN SIGN (<)
- "=" (U+003D)
- Parse error. Treat it as per the "anything else"
entry below.
- EOF
- Parse error. Switch to the data
state. Reconsume the EOF character.
- Anything else
- Start a new attribute in the current tag token. Set that
attribute's name to the current input character, and
its value to the empty string. Switch to the attribute name
state.
8.2.4.35 Attribute name state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Switch to the after attribute name state.
- "/" (U+002F)
- Switch to the self-closing start tag state.
- "=" (U+003D)
- Switch to the before attribute value state.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current tag
token.
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Append the lowercase version of the current input
character (add 0x0020 to the character's code point) to the
current attribute's name.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER
character to the current attribute's name.
- """ (U+0022)
- "'" (U+0027)
- U+003C LESS-THAN SIGN (<)
- Parse error. Treat it as per the "anything else"
entry below.
- EOF
- Parse error. Switch to the data
state. Reconsume the EOF character.
- Anything else
- Append the current input character to the current
attribute's name.
When the user agent leaves the attribute name state (and before
emitting the tag token, if appropriate), the complete attribute's
name must be compared to the other attributes on the same token;
if there is already an attribute on the token with the exact same
name, then this is a parse error and the new
attribute must be dropped, along with the value that gets
associated with it (if any).
8.2.4.36 After attribute name state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Ignore the character.
- "/" (U+002F)
- Switch to the self-closing start tag state.
- "=" (U+003D)
- Switch to the before attribute value state.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current tag
token.
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Start a new attribute in the current tag token. Set that
attribute's name to the lowercase version of the current
input character (add 0x0020 to the character's code point),
and its value to the empty string. Switch to the attribute
name state.
- U+0000 NULL
- Parse error. Start a new attribute in the current
tag token. Set that attribute's name to a U+FFFD REPLACEMENT
CHARACTER character, and its value to the empty string. Switch to
the attribute name state.
- """ (U+0022)
- "'" (U+0027)
- U+003C LESS-THAN SIGN (<)
- Parse error. Treat it as per the "anything else"
entry below.
- EOF
- Parse error. Switch to the data
state. Reconsume the EOF character.
- Anything else
- Start a new attribute in the current tag token. Set that
attribute's name to the current input character, and
its value to the empty string. Switch to the attribute name
state.
8.2.4.37 Before attribute value state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Ignore the character.
- """ (U+0022)
- Switch to the attribute value (double-quoted) state.
- U+0026 AMPERSAND (&)
- Switch to the attribute value (unquoted) state.
Reconsume the current input character.
- "'" (U+0027)
- Switch to the attribute value (single-quoted) state.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER
character to the current attribute's value. Switch to the
attribute value (unquoted) state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Switch to the data
state. Emit the current tag token.
- U+003C LESS-THAN SIGN (<)
- "=" (U+003D)
- "`" (U+0060)
- Parse error. Treat it as per the "anything else"
entry below.
- EOF
- Parse error. Switch to the data
state. Reconsume the EOF character.
- Anything else
- Append the current input character to the current
attribute's value. Switch to the attribute value (unquoted)
state.
8.2.4.38 Attribute value (double-quoted) state
Consume the next input character:
- """ (U+0022)
- Switch to the after attribute value (quoted)
state.
- U+0026 AMPERSAND (&)
- Switch to the character reference in attribute value
state, with the additional allowed character
being """ (U+0022).
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER
character to the current attribute's value.
- EOF
- Parse error. Switch to the data
state. Reconsume the EOF character.
- Anything else
- Append the current input character to the current
attribute's value.
8.2.4.39 Attribute value (single-quoted) state
Consume the next input character:
- "'" (U+0027)
- Switch to the after attribute value (quoted)
state.
- U+0026 AMPERSAND (&)
- Switch to the character reference in attribute value
state, with the additional allowed character
being "'" (U+0027).
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER
character to the current attribute's value.
- EOF
- Parse error. Switch to the data
state. Reconsume the EOF character.
- Anything else
- Append the current input character to the current
attribute's value.
8.2.4.40 Attribute value (unquoted) state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Switch to the before attribute name state.
- U+0026 AMPERSAND (&)
- Switch to the character reference in attribute value
state, with the additional allowed character
being U+003E GREATER-THAN SIGN (>).
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current tag
token.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER
character to the current attribute's value.
- """ (U+0022)
- "'" (U+0027)
- U+003C LESS-THAN SIGN (<)
- "=" (U+003D)
- "`" (U+0060)
- Parse error. Treat it as per the "anything else"
entry below.
- EOF
- Parse error. Switch to the data
state. Reconsume the EOF character.
- Anything else
- Append the current input character to the current
attribute's value.
8.2.4.41 Character reference in attribute value state
Attempt to consume a character reference.
If nothing is returned, append a U+0026 AMPERSAND character
(&) to the current attribute's value.
Otherwise, append the returned character tokens to the current
attribute's value.
Finally, switch back to the attribute value state that switched
into this state.
8.2.4.42 After attribute value (quoted) state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Switch to the before attribute name state.
- "/" (U+002F)
- Switch to the self-closing start tag state.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current tag
token.
- EOF
- Parse error. Switch to the data
state. Reconsume the EOF character.
- Anything else
- Parse error. Switch to the before attribute
name state. Reconsume the character.
8.2.4.43 Self-closing start tag state
Consume the next input character:
- U+003E GREATER-THAN SIGN (>)
- Set the self-closing flag of the current tag
token. Switch to the data state. Emit the current tag
token.
- EOF
- Parse error. Switch to the data
state. Reconsume the EOF character.
- Anything else
- Parse error. Switch to the before attribute
name state. Reconsume the character.
Consume every character up to and including the first U+003E
GREATER-THAN SIGN character (>) or the end of the file (EOF),
whichever comes first. Emit a comment token whose data is the
concatenation of all the characters starting from and including the
character that caused the state machine to switch into the bogus
comment state, up to and including the character immediately before
the last consumed character (i.e. up to the character just before
the U+003E or EOF character), but with any U+0000 NULL characters
replaced by U+FFFD REPLACEMENT CHARACTER characters. (If the comment
was started by the end of the file (EOF), the token is empty.)
Switch to the data state.
If the end of the file was reached, reconsume the EOF
character.
8.2.4.45 Markup declaration open state
If the next two characters are both "-" (U+002D) characters, consume those two characters, create a comment token
whose data is the empty string, and switch to the comment
start state.
Otherwise, if the next seven characters are an ASCII
case-insensitive match for the word "DOCTYPE", then consume
those characters and switch to the DOCTYPE state.
Otherwise, if there is a current node and it is not
an element in the HTML namespace and the next seven
characters are a case-sensitive match for the string
"[CDATA[" (the five uppercase letters "CDATA" with a U+005B LEFT
SQUARE BRACKET character before and after), then consume those
characters and switch to the CDATA section state.
Otherwise, this is a parse error. Switch to the
bogus comment state. The next character that is
consumed, if any, is the first character that will be in the
comment.
Consume the next input character:
- "-" (U+002D)
- Switch to the comment start dash state.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER
character to the comment token's data. Switch to the comment
state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Switch to the data
state. Emit the comment token.
- EOF
- Parse error. Switch to the data
state. Emit the comment token. Reconsume the EOF character.
- Anything else
- Append the current input character to the comment
token's data. Switch to the comment state.
Consume the next input character:
- "-" (U+002D)
- Switch to the comment end state
- U+0000 NULL
- Parse error. Append a "-" (U+002D) character and a U+FFFD REPLACEMENT CHARACTER character to the
comment token's data. Switch to the comment
state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Switch to the data
state. Emit the comment token.
- EOF
- Parse error. Switch to the data
state. Emit the comment token. Reconsume the EOF
character.
- Anything else
- Append a "-" (U+002D) character and the
current input character to the comment token's
data. Switch to the comment state.
Consume the next input character:
- "-" (U+002D)
- Switch to the comment end dash state
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER
character to the comment token's data.
- EOF
- Parse error. Switch to the data
state. Emit the comment token. Reconsume the EOF
character.
- Anything else
- Append the current input character to the comment
token's data.
Consume the next input character:
- "-" (U+002D)
- Switch to the comment end state
- U+0000 NULL
- Parse error. Append a "-" (U+002D) character and a U+FFFD REPLACEMENT CHARACTER character to the
comment token's data. Switch to the comment
state.
- EOF
- Parse error. Switch to the data
state. Emit the comment token. Reconsume the EOF
character.
- Anything else
- Append a "-" (U+002D) character and the
current input character to the comment token's
data. Switch to the comment state.
Consume the next input character:
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the comment
token.
- U+0000 NULL
- Parse error. Append two "-" (U+002D) characters and a U+FFFD REPLACEMENT CHARACTER character to the
comment token's data. Switch to the comment
state.
- "!" (U+0021)
- Parse error. Switch to the comment end bang
state.
- "-" (U+002D)
- Parse error. Append a "-" (U+002D) character to the comment token's data.
- EOF
- Parse error. Switch to the data
state. Emit the comment token. Reconsume the EOF
character.
- Anything else
- Parse error. Append two "-" (U+002D) characters and the current input character to the
comment token's data. Switch to the comment
state.
Consume the next input character:
- "-" (U+002D)
- Append two "-" (U+002D) characters and a "!" (U+0021) character to the comment token's data. Switch
to the comment end dash state.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the comment
token.
- U+0000 NULL
- Parse error. Append two "-" (U+002D) characters, a "!" (U+0021) character, and a
U+FFFD REPLACEMENT CHARACTER character to the comment token's data.
Switch to the comment state.
- EOF
- Parse error. Switch to the data
state. Emit the comment token. Reconsume the EOF
character.
- Anything else
- Append two "-" (U+002D) characters, a "!" (U+0021) character, and the current input
character to the comment token's data. Switch to the
comment state.
8.2.4.52 DOCTYPE state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Switch to the before DOCTYPE name state.
- EOF
- Parse error. Switch to the data
state. Create a new DOCTYPE token. Set its force-quirks
flag to on. Emit the token. Reconsume the EOF
character.
- Anything else
- Parse error. Switch to the before DOCTYPE
name state. Reconsume the character.
8.2.4.53 Before DOCTYPE name state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Ignore the character.
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Create a new DOCTYPE token. Set the token's name to the
lowercase version of the current input character (add 0x0020 to the
character's code point). Switch to the DOCTYPE name
state.
- U+0000 NULL
- Parse error. Create a new DOCTYPE token. Set the
token's name to a U+FFFD REPLACEMENT CHARACTER character. Switch to
the DOCTYPE name state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Create a new DOCTYPE token. Set its
force-quirks flag to on. Switch to the data
state. Emit the token.
- EOF
- Parse error. Switch to the data
state. Create a new DOCTYPE token. Set its force-quirks
flag to on. Emit the token. Reconsume the EOF
character.
- Anything else
- Create a new DOCTYPE token. Set the token's name to the
current input character. Switch to the DOCTYPE name
state.
8.2.4.54 DOCTYPE name state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Switch to the after DOCTYPE name state.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current DOCTYPE
token.
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
- Append the lowercase version of the current input
character (add 0x0020 to the character's code point) to the
current DOCTYPE token's name.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER
character to the current DOCTYPE token's name.
- EOF
- Parse error. Switch to the data
state. Set the DOCTYPE token's force-quirks flag to
on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Append the current input character to the current
DOCTYPE token's name.
8.2.4.55 After DOCTYPE name state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Ignore the character.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current DOCTYPE
token.
- EOF
- Parse error. Switch to the data
state. Set the DOCTYPE token's force-quirks flag to
on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
-
If the six characters starting from the current input
character are an ASCII case-insensitive match
for the word "PUBLIC", then consume those characters and switch to
the after DOCTYPE public keyword state.
Otherwise, if the six characters starting from the
current input character are an ASCII
case-insensitive match for the word "SYSTEM", then consume
those characters and switch to the after DOCTYPE system
keyword state.
Otherwise, this is the parse error. Set the
DOCTYPE token's force-quirks flag to on. Switch to
the bogus DOCTYPE state.
8.2.4.56 After DOCTYPE public keyword state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Switch to the before DOCTYPE public identifier
state.
- """ (U+0022)
- Parse error. Set the DOCTYPE token's public
identifier to the empty string (not missing), then switch to the
DOCTYPE public identifier (double-quoted) state.
- "'" (U+0027)
- Parse error. Set the DOCTYPE token's public
identifier to the empty string (not missing), then switch to the
DOCTYPE public identifier (single-quoted) state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's
force-quirks flag to on. Switch to the data
state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data
state. Set the DOCTYPE token's force-quirks flag to
on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Parse error. Set the DOCTYPE token's
force-quirks flag to on. Switch to the bogus
DOCTYPE state.
8.2.4.57 Before DOCTYPE public identifier state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Ignore the character.
- """ (U+0022)
- Set the DOCTYPE token's public identifier to the empty string
(not missing), then switch to the DOCTYPE public identifier
(double-quoted) state.
- "'" (U+0027)
- Set the DOCTYPE token's public identifier to the empty string
(not missing), then switch to the DOCTYPE public identifier
(single-quoted) state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's
force-quirks flag to on. Switch to the data
state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data
state. Set the DOCTYPE token's force-quirks flag to
on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Parse error. Set the DOCTYPE token's
force-quirks flag to on. Switch to the bogus
DOCTYPE state.
8.2.4.58 DOCTYPE public identifier (double-quoted) state
Consume the next input character:
- """ (U+0022)
- Switch to the after DOCTYPE public identifier state.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER
character to the current DOCTYPE token's public identifier.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's
force-quirks flag to on. Switch to the data
state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data
state. Set the DOCTYPE token's force-quirks flag to
on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Append the current input character to the current
DOCTYPE token's public identifier.
8.2.4.59 DOCTYPE public identifier (single-quoted) state
Consume the next input character:
- "'" (U+0027)
- Switch to the after DOCTYPE public identifier state.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER
character to the current DOCTYPE token's public identifier.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's
force-quirks flag to on. Switch to the data
state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data
state. Set the DOCTYPE token's force-quirks flag to
on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Append the current input character to the current
DOCTYPE token's public identifier.
8.2.4.60 After DOCTYPE public identifier state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Switch to the between DOCTYPE public and system
identifiers state.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current DOCTYPE
token.
- """ (U+0022)
- Parse error. Set the DOCTYPE token's system
identifier to the empty string (not missing), then switch to the
DOCTYPE system identifier (double-quoted) state.
- "'" (U+0027)
- Parse error. Set the DOCTYPE token's system
identifier to the empty string (not missing), then switch to the
DOCTYPE system identifier (single-quoted) state.
- EOF
- Parse error. Switch to the data
state. Set the DOCTYPE token's force-quirks flag to
on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Parse error. Set the DOCTYPE token's
force-quirks flag to on. Switch to the bogus
DOCTYPE state.
8.2.4.61 Between DOCTYPE public and system identifiers state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Ignore the character.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current DOCTYPE
token.
- """ (U+0022)
- Set the DOCTYPE token's system identifier to the empty string
(not missing), then switch to the DOCTYPE system identifier
(double-quoted) state.
- "'" (U+0027)
- Set the DOCTYPE token's system identifier to the empty string
(not missing), then switch to the DOCTYPE system identifier
(single-quoted) state.
- EOF
- Parse error. Switch to the data
state. Set the DOCTYPE token's force-quirks flag to
on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Parse error. Set the DOCTYPE token's
force-quirks flag to on. Switch to the bogus
DOCTYPE state.
8.2.4.62 After DOCTYPE system keyword state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Switch to the before DOCTYPE system identifier
state.
- """ (U+0022)
- Parse error. Set the DOCTYPE token's system
identifier to the empty string (not missing), then switch to the
DOCTYPE system identifier (double-quoted) state.
- "'" (U+0027)
- Parse error. Set the DOCTYPE token's system
identifier to the empty string (not missing), then switch to the
DOCTYPE system identifier (single-quoted) state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's
force-quirks flag to on. Switch to the data
state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data
state. Set the DOCTYPE token's force-quirks flag to
on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Parse error. Set the DOCTYPE token's
force-quirks flag to on. Switch to the bogus
DOCTYPE state.
8.2.4.63 Before DOCTYPE system identifier state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Ignore the character.
- """ (U+0022)
- Set the DOCTYPE token's system identifier to the empty string
(not missing), then switch to the DOCTYPE system identifier
(double-quoted) state.
- "'" (U+0027)
- Set the DOCTYPE token's system identifier to the empty string
(not missing), then switch to the DOCTYPE system identifier
(single-quoted) state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's
force-quirks flag to on. Switch to the data
state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data
state. Set the DOCTYPE token's force-quirks flag to
on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Parse error. Set the DOCTYPE token's
force-quirks flag to on. Switch to the bogus
DOCTYPE state.
8.2.4.64 DOCTYPE system identifier (double-quoted) state
Consume the next input character:
- """ (U+0022)
- Switch to the after DOCTYPE system identifier
state.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER
character to the current DOCTYPE token's system identifier.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's
force-quirks flag to on. Switch to the data
state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data
state. Set the DOCTYPE token's force-quirks flag to
on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Append the current input character to the current
DOCTYPE token's system identifier.
8.2.4.65 DOCTYPE system identifier (single-quoted) state
Consume the next input character:
- "'" (U+0027)
- Switch to the after DOCTYPE system identifier
state.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER
character to the current DOCTYPE token's system identifier.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's
force-quirks flag to on. Switch to the data
state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data
state. Set the DOCTYPE token's force-quirks flag to
on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Append the current input character to the current
DOCTYPE token's system identifier.
8.2.4.66 After DOCTYPE system identifier state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Ignore the character.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current DOCTYPE
token.
- EOF
- Parse error. Switch to the data
state. Set the DOCTYPE token's force-quirks flag to
on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Parse error. Switch to the bogus DOCTYPE
state. (This does not set the DOCTYPE token's
force-quirks flag to on.)
8.2.4.67 Bogus DOCTYPE state
Consume the next input character:
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the DOCTYPE
token.
- EOF
- Switch to the data state. Emit the DOCTYPE token.
Reconsume the EOF character.
- Anything else
- Ignore the character.
8.2.4.68 CDATA section state
Switch to the data state.
Consume every character up to the next occurrence of the three
character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE
BRACKET U+003E GREATER-THAN SIGN (]]>
), or the
end of the file (EOF), whichever comes first. Emit a series of
character tokens consisting of all the characters consumed except
the matching three character sequence at the end (if one was found
before the end of the file).
If the end of the file was reached, reconsume the EOF
character.
8.2.4.69 Tokenizing character references
This section defines how to consume a character
reference. This definition is used when parsing character
references in
text and in attributes.
The behavior depends on the identity of the next character (the
one immediately after the U+0026 AMPERSAND character):
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- U+003C LESS-THAN SIGN
- U+0026 AMPERSAND
- EOF
- The additional allowed character, if there is one
- Not a character reference. No characters are consumed, and
nothing is returned. (This is not an error, either.)
- "#" (U+0023)
-
Consume the U+0023 NUMBER SIGN.
The behavior further depends on the character after the U+0023
NUMBER SIGN:
- U+0078 LATIN SMALL LETTER X
- U+0058 LATIN CAPITAL LETTER X
-
Consume the X.
Follow the steps below, but using the range of characters
ASCII digits, U+0061 LATIN
SMALL LETTER A to U+0066 LATIN SMALL LETTER F, and U+0041 LATIN
CAPITAL LETTER A to U+0046 LATIN CAPITAL LETTER F (in other
words, 0-9, A-F, a-f).
When it comes to interpreting the number, interpret it as a
hexadecimal number.
- Anything else
-
Follow the steps below, but using the range of characters
ASCII digits.
When it comes to interpreting the number, interpret it as a
decimal number.
Consume as many characters as match the range of characters
given above.
If no characters match the range, then don't consume any
characters (and unconsume the U+0023 NUMBER SIGN character and, if
appropriate, the X character). This is a parse
error; nothing is returned.
Otherwise, if the next character is a U+003B SEMICOLON, consume
that too. If it isn't, there is a parse
error.
If one or more characters match the range, then take them all
and interpret the string of characters as a number (either
hexadecimal or decimal as appropriate).
If that number is one of the numbers in the first column of the
following table, then this is a parse error. Find the
row with that number in the first column, and return a character
token for the Unicode character given in the second column of that
row.
Number | Unicode character
|
---|
0x00 | U+FFFD | REPLACEMENT CHARACTER
|
0x0D | U+000D | CARRIAGE RETURN (CR)
|
0x80 | U+20AC | EURO SIGN (€)
|
0x81 | U+0081 | <control>
|
0x82 | U+201A | SINGLE LOW-9 QUOTATION MARK (‚)
|
0x83 | U+0192 | LATIN SMALL LETTER F WITH HOOK (ƒ)
|
0x84 | U+201E | DOUBLE LOW-9 QUOTATION MARK („)
|
0x85 | U+2026 | HORIZONTAL ELLIPSIS (…)
|
0x86 | U+2020 | DAGGER (†)
|
0x87 | U+2021 | DOUBLE DAGGER (‡)
|
0x88 | U+02C6 | MODIFIER LETTER CIRCUMFLEX ACCENT (ˆ)
|
0x89 | U+2030 | PER MILLE SIGN (‰)
|
0x8A | U+0160 | LATIN CAPITAL LETTER S WITH CARON (Š)
|
0x8B | U+2039 | SINGLE LEFT-POINTING ANGLE QUOTATION MARK (‹)
|
0x8C | U+0152 | LATIN CAPITAL LIGATURE OE (Œ)
|
0x8D | U+008D | <control>
|
0x8E | U+017D | LATIN CAPITAL LETTER Z WITH CARON (Ž)
|
0x8F | U+008F | <control>
|
0x90 | U+0090 | <control>
|
0x91 | U+2018 | LEFT SINGLE QUOTATION MARK (‘)
|
0x92 | U+2019 | RIGHT SINGLE QUOTATION MARK (’)
|
0x93 | U+201C | LEFT DOUBLE QUOTATION MARK (“)
|
0x94 | U+201D | RIGHT DOUBLE QUOTATION MARK (”)
|
0x95 | U+2022 | BULLET (•)
|
0x96 | U+2013 | EN DASH (–)
|
0x97 | U+2014 | EM DASH (—)
|
0x98 | U+02DC | SMALL TILDE (˜)
|
0x99 | U+2122 | TRADE MARK SIGN (™)
|
0x9A | U+0161 | LATIN SMALL LETTER S WITH CARON (š)
|
0x9B | U+203A | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK (›)
|
0x9C | U+0153 | LATIN SMALL LIGATURE OE (œ)
|
0x9D | U+009D | <control>
|
0x9E | U+017E | LATIN SMALL LETTER Z WITH CARON (ž)
|
0x9F | U+0178 | LATIN CAPITAL LETTER Y WITH DIAERESIS (Ÿ)
|
Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater than 0x10FFFF, then this is a
parse error. Return a U+FFFD REPLACEMENT
CHARACTER.
Otherwise, return a character token for the Unicode character
whose code point is that number.
If the number is in the range 0x0001 to 0x0008, 0x000E to 0x001F, 0x007F to 0x009F, 0xFDD0 to
0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF,
0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF,
0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE,
0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF,
0x10FFFE, or 0x10FFFF, then this is a parse
error.
- Anything else
-
Consume the maximum number of characters possible, with the
consumed characters matching one of the identifiers in the first
column of the named character references table (in a
case-sensitive manner).
If no match can be made, then no characters are consumed, and
nothing is returned. In this case, if the characters after the
U+0026 AMPERSAND character (&) consist of a sequence of one or
more characters in the range ASCII digits, lowercase ASCII letters, and uppercase ASCII letters, followed by a ";" (U+003B) character, then this
is a parse error.
If the character reference is being consumed as part of an
attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or in the range ASCII digits, uppercase ASCII letters, or lowercase ASCII letters, then, for historical reasons, all the
characters that were matched after the U+0026 AMPERSAND character
(&) must be unconsumed, and nothing is returned.
Otherwise, a character reference is parsed. If the last
character matched is not a ";" (U+003B) character, there
is a parse error.
Return one or two character tokens for the character(s)
corresponding to the character reference name (as given by the
second column of the named character references
table).
If the markup contains (not in an attribute) the string I'm ¬it; I tell you
, the character
reference is parsed as "not", as in, I'm ¬it;
I tell you
(and this is a parse error). But if the markup
was I'm ∉ I tell you
, the
character reference would be parsed as "notin;", resulting in
I'm ∉ I tell you
(and no parse
error).