
HTML 5
A vocabulary and
associated APIs for HTML and XHTML
8.2.4. Tokenisation
Implementations must act as if they used the following state
machine to tokenise HTML. The state machine must start in the
data state. Most states consume a single
character, which may have various side-effects, and either switch
the state machine to a new state to reconsume the same
character, or switch it to a new state (to consume the next
character), or repeat the same state (to consume the next
character). Some states have more complicated behavior and
can consume several characters before switching to another
state.
The exact behavior of certain states depends on a content model flag that is set after certain
tokens are emitted. The flag has several states: PCDATA, RCDATA, CDATA,
and PLAINTEXT. Initially it must be in the
PCDATA state. In the RCDATA and CDATA states, a further escape flag is used to control the behavior of
the tokeniser. It is either true or false, and initially must be
set to the false state. The
insertion mode and the stack of open elements also affect tokenisation.
The output of the tokenisation step is a series of zero or more
of the following tokens: DOCTYPE, start tag, end tag, comment,
character, end-of-file. DOCTYPE tokens have a name, a public
identifier, a system identifier, and a force-quirks flag. When a DOCTYPE token is
created, its name, public identifier, and system identifier must be
marked as missing (which is a distinct state from the empty
string), and the force-quirks flag must be set to
off (its other state is on).
Start and end tag tokens have a tag name, a
self-closing flag, and a list
of attributes, each of which has a name and a value. When a start or end tag token is created, its
self-closing flag must be unset (its other state is that it be set), and
its attributes list must be empty. Comment and character
tokens have data.
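As an illustration only, the token kinds described above could be modelled as follows. This is a minimal sketch: the class and field names are assumptions made for the example, and `None` is used to stand for the "missing" state so that it stays distinct from the empty string.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DoctypeToken:
    # None models "missing", which is distinct from the empty string.
    name: Optional[str] = None
    public_id: Optional[str] = None
    system_id: Optional[str] = None
    force_quirks: bool = False  # False models "off", True models "on"


@dataclass
class Attribute:
    name: str
    value: str = ""


@dataclass
class TagToken:
    kind: str  # "start" or "end"
    tag_name: str = ""
    self_closing: bool = False  # unset on creation
    attributes: List[Attribute] = field(default_factory=list)  # empty on creation


@dataclass
class CommentToken:
    data: str = ""


@dataclass
class CharacterToken:
    data: str


@dataclass
class EndOfFileToken:
    pass
```

A freshly created `DoctypeToken()` therefore has all three identifiers missing and its force-quirks flag off, and a new `TagToken("start")` has no attributes and an unset self-closing flag.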
When a token is emitted, it must immediately be handled by the
tree construction stage. The tree
construction stage can affect the state of the content model flag , and can insert additional
characters into the stream. (For example, the script
element can result in scripts
executing and using the dynamic markup
insertion APIs to insert characters into the stream being
tokenised.)
When a start tag token is emitted with
its self-closing flag
set, if the flag is not acknowledged when it
is processed by the tree construction stage, that is a
parse error .
When an end tag token is emitted,
the content model flag must be switched to
the PCDATA state.
When an end tag token is emitted with attributes, that is a
parse error .
When an end tag token is emitted with
its self-closing flag
set, that is a
parse error.
Before each step of the tokeniser, the user agent may check to
see if either one of the scripts in the list of
scripts that will execute as soon as possible or the first
script in the list of scripts that will execute
asynchronously , has completed loading . If one
has, then it must be executed and removed from its
list.
The tokeniser state machine is as follows:
- Data state
-
Consume the next input character :
- U+0026 AMPERSAND (&)
- When the content model flag is set to
one of the PCDATA or RCDATA
states and the escape flag is false: switch to the character reference data state.
- Otherwise: treat it as per the "anything else" entry
below.
- U+002D HYPHEN-MINUS (-)
-
If the content model flag is set to
either the RCDATA state or the CDATA state, and the escape flag is false, and there are at least three
characters before this one in the input stream, and the last four
characters in the input stream, including this one, are U+003C
LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, and
U+002D HYPHEN-MINUS ("<!--"), then set the escape flag to true.
In any case, emit the input character as a character token. Stay
in the data state .
- U+003C LESS-THAN SIGN (<)
- When the content model flag is set to
the PCDATA state: switch to the tag open
state .
- When the content model flag is set to
either the RCDATA state or the CDATA state and the escape flag is false: switch to the tag open state .
- Otherwise: treat it as per the "anything else" entry
below.
- U+003E GREATER-THAN SIGN (>)
-
If the content model flag is set to
either the RCDATA state or the CDATA state, and the escape flag is true, and the last three characters in
the input stream including this one are U+002D HYPHEN-MINUS, U+002D
HYPHEN-MINUS, U+003E GREATER-THAN SIGN ("-->"), set the escape flag to false.
In any case, emit the input character as a character token. Stay
in the data state .
- EOF
- Emit an end-of-file token.
- Anything else
- Emit the input character as a character token. Stay in the
data state .
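The escape flag handling in the data state can be sketched as below. This is a simplified illustration rather than a tokeniser: the function name `track_escape_flag` and its string-based content model argument are assumptions, and the sketch only records the flag value after each character instead of emitting tokens.

```python
def track_escape_flag(text, content_model="RCDATA"):
    """Return the escape flag value after each character of `text`."""
    escape = False
    history = []
    for i, ch in enumerate(text):
        seen = text[: i + 1]  # the input stream up to and including this character
        if content_model in ("RCDATA", "CDATA"):
            if ch == "-" and not escape and seen.endswith("<!--"):
                escape = True
            elif ch == ">" and escape and seen.endswith("-->"):
                escape = False
        history.append(escape)
    return history
```

For example, for the input `a<!--b-->c` the flag becomes true on the second hyphen of `<!--` and returns to false on the `>` of `-->`.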
- Character reference data state
-
(This cannot happen if the content model
flag is set to the CDATA state.)
Attempt to consume a character reference, with no additional allowed
character.
If nothing is returned, emit a U+0026 AMPERSAND character
token.
Otherwise, emit the character token that was returned.
Finally, switch to the data state.
- Tag open state
-
The behavior of this state depends on the content model flag.
- If the content model flag is set to the
RCDATA or CDATA states
-
Consume the next input character . If
it is a U+002F SOLIDUS (/) character, switch to the close tag open state . Otherwise, emit a U+003C
LESS-THAN SIGN character token and reconsume the current input
character in the data state .
- If the content model flag is set to the
PCDATA state
-
Consume the next input character :
- U+0021 EXCLAMATION MARK (!)
- Switch to the markup declaration open
state .
- U+002F SOLIDUS (/)
- Switch to the close tag open state .
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
LETTER Z
- Create a new start tag token, set its tag name to the lowercase
version of the input character (add 0x0020 to the character's code
point), then switch to the tag name state
. (Don't emit the token yet; further details will be filled in
before it is emitted.)
- U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL
LETTER Z
- Create a new start tag token, set its tag name to the input
character, then switch to the tag name
state . (Don't emit the token yet; further details will be
filled in before it is emitted.)
- U+003E GREATER-THAN SIGN (>)
- Parse error . Emit a U+003C LESS-THAN
SIGN character token and a U+003E GREATER-THAN SIGN character
token. Switch to the data state .
- U+003F QUESTION MARK (?)
- Parse error . Switch to the bogus comment state .
- Anything else
- Parse error . Emit a U+003C LESS-THAN
SIGN character token and reconsume the current input character in
the data state .
- Close tag open state
-
If the content model flag is set to the
RCDATA or CDATA states but no start tag token has ever been emitted
by this instance of the tokeniser ( fragment
case ), or, if the content model flag
is set to the RCDATA or CDATA states and the next few characters do
not match the tag name of the last start tag token emitted (case
insensitively), or if they do but they are not immediately followed
by one of the following characters:
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- U+003E GREATER-THAN SIGN (>)
- U+002F SOLIDUS (/)
- EOF
...then emit a U+003C LESS-THAN SIGN character token, a U+002F
SOLIDUS character token, and switch to the data state to process the next input character .
Otherwise, if the content model flag is
set to the PCDATA state, or if the next few characters do
match that tag name, consume the next input
character :
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
LETTER Z
- Create a new end tag token, set its tag name to the lowercase
version of the input character (add 0x0020 to the character's code
point), then switch to the tag name state
. (Don't emit the token yet; further details will be filled in
before it is emitted.)
- U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL
LETTER Z
- Create a new end tag token, set its tag name to the input
character, then switch to the tag name
state . (Don't emit the token yet; further details will be
filled in before it is emitted.)
- U+003E GREATER-THAN SIGN (>)
- Parse error . Switch to the data state .
- EOF
- Parse error . Emit a U+003C LESS-THAN
SIGN character token and a U+002F SOLIDUS character token.
Reconsume the EOF character in the data
state .
- Anything else
- Parse error . Switch to the bogus comment state .
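The RCDATA/CDATA check at the start of the close tag open state can be sketched as follows. The helper name `end_tag_matches` is hypothetical, `pos` is assumed to point just after the `</`, and passing `None` for the last start tag name models the fragment case. Case-insensitive comparison is approximated with `str.lower()`, which is adequate for ASCII tag names.

```python
# Characters that may follow the tag name; reaching the end of input (EOF) also counts.
TERMINATORS = "\t\n\x0b\x0c >/"


def end_tag_matches(text, pos, last_start_tag):
    """Return True if the characters at `pos` name the last start tag emitted."""
    if last_start_tag is None:
        return False  # fragment case: no start tag has been emitted
    end = pos + len(last_start_tag)
    if text[pos:end].lower() != last_start_tag.lower():
        return False
    # The name must be immediately followed by whitespace, '>', '/', or EOF.
    return end >= len(text) or text[end] in TERMINATORS
```

When this returns false, the tokeniser emits `<` and `/` as character tokens and returns to the data state. Otherwise it goes on to consume the tag name.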
- Tag name state
-
Consume the next input character :
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- Switch to the before attribute name state
.
- U+003E GREATER-THAN SIGN (>)
- Emit the current tag token. Switch to the data state .
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
LETTER Z
- Append the lowercase version of the current input character
(add 0x0020 to the character's code point) to the current tag
token's tag name. Stay in the tag name
state .
- EOF
- Parse error . Emit the current tag token.
Reconsume the EOF character in the data
state .
- U+002F SOLIDUS (/)
- Switch to the self-closing start tag state.
- Anything else
- Append the current input character to the current tag token's
tag name. Stay in the tag name state
.
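One step of the tag name state can be sketched as a small function. The names here are assumptions made for the example: the token is a plain dictionary, `None` stands for EOF, and the function returns a tuple of (next state, whether to emit the token, whether a parse error occurred). For U+002F SOLIDUS the sketch switches to the self-closing start tag state.

```python
WHITESPACE = "\t\n\x0b\x0c "
EOF = None  # sentinel assumed for this sketch


def tag_name_state(token, ch):
    """Handle one input character; return (next_state, emit, parse_error)."""
    if ch is EOF:
        return "data", True, True  # EOF is then reconsumed in the data state
    if ch in WHITESPACE:
        return "before attribute name", False, False
    if ch == ">":
        return "data", True, False
    if ch == "/":
        return "self-closing start tag", False, False
    if "A" <= ch <= "Z":
        ch = chr(ord(ch) + 0x0020)  # lowercase by adding 0x0020 to the code point
    token["tag_name"] += ch
    return "tag name", False, False
```

Feeding the characters `I`, `v` to a token whose name is already `d` leaves the name as `div`, since the uppercase letter is lowercased on the way in.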
- Before attribute name state
-
Consume the next input character :
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- Stay in the before attribute name state
.
- U+003E GREATER-THAN SIGN (>)
- Emit the current tag token. Switch to the data state .
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
LETTER Z
- Start a new attribute in the current tag token. Set that
attribute's name to the lowercase version of the current input
character (add 0x0020 to the character's code point), and its value
to the empty string. Switch to the attribute
name state .
- U+002F SOLIDUS (/)
- Switch to the self-closing start tag
state .
- U+0022 QUOTATION MARK (")
- U+0027 APOSTROPHE (')
- U+003D EQUALS SIGN (=)
- Parse error. Treat it as per the
"anything else" entry below.
- EOF
- Parse error . Emit the current tag token.
Reconsume the EOF character in the data
state .
- Anything else
- Start a new attribute in the current tag token. Set that
attribute's name to the current input character, and its value to
the empty string. Switch to the attribute
name state .
- Attribute name state
-
Consume the next input character :
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- Switch to the after attribute name state
.
- U+003D EQUALS SIGN (=)
- Switch to the before attribute value
state .
- U+003E GREATER-THAN SIGN (>)
- Emit the current tag token. Switch to the data state .
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
LETTER Z
- Append the lowercase version of the current input character
(add 0x0020 to the character's code point) to the current
attribute's name. Stay in the attribute name
state .
- U+002F SOLIDUS (/)
- Switch to the self-closing start tag state.
- U+0022 QUOTATION MARK (")
- U+0027 APOSTROPHE (')
- Parse error. Treat it as per the "anything else" entry
below.
- EOF
- Parse error . Emit the current tag token.
Reconsume the EOF character in the data
state .
- Anything else
- Append the current input character to the current attribute's
name. Stay in the attribute name state
.
When the user agent leaves the attribute name state (and before
emitting the tag token, if appropriate), the complete attribute's
name must be compared to the other attributes on the same token; if
there is already an attribute on the token with the exact same
name, then this is a parse error and the new
attribute must be dropped, along with the value that gets
associated with it (if any).
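The duplicate-attribute rule can be illustrated with a batch version of the check. This is a simplification: the text above performs the comparison each time the attribute name state is left, whereas this hypothetical helper `drop_duplicate_attributes` takes the already tokenised (name, value) pairs in source order. Names are assumed to have been lowercased by the tokeniser.

```python
def drop_duplicate_attributes(pairs):
    """Keep the first attribute for each name; count one parse error per duplicate."""
    kept = {}
    parse_errors = 0
    for name, value in pairs:
        if name in kept:
            # Duplicate name: the new attribute and its value are dropped.
            parse_errors += 1
            continue
        kept[name] = value
    return kept, parse_errors
```

The first occurrence always wins, so a later `id="b"` never overrides an earlier `id="a"` on the same tag.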
- After attribute name state
-
Consume the next input character :
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- Stay in the after attribute name state
.
- U+003D EQUALS SIGN (=)
- Switch to the before attribute value
state .
- U+003E GREATER-THAN SIGN (>)
- Emit the current tag token. Switch to the data state .
- U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
LETTER Z
- Start a new attribute in the current tag token. Set that
attribute's name to the lowercase version of the current input
character (add 0x0020 to the character's code point), and its value
to the empty string. Switch to the attribute
name state .
- U+002F SOLIDUS (/)
- Switch to the self-closing start tag state.
- EOF
- Parse error . Emit the current tag token.
Reconsume the EOF character in the data
state .
- Anything else
- Start a new attribute in the current tag token. Set that
attribute's name to the current input character, and its value to
the empty string. Switch to the attribute
name state .
- Before attribute value state
-
Consume the next input character :
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- Stay in the before attribute value state
.
- U+0022 QUOTATION MARK (")
- Switch to the attribute value
(double-quoted) state .
- U+0026 AMPERSAND (&)
- Switch to the attribute value (unquoted)
state and reconsume this input character.
- U+0027 APOSTROPHE (')
- Switch to the attribute value
(single-quoted) state .
- U+003E GREATER-THAN SIGN (>)
- Emit the current tag token. Switch to the data state .
- U+003D EQUALS SIGN (=)
- Parse error. Treat it as per the "anything else" entry
below.
- EOF
- Parse error . Emit the current tag token.
Reconsume the character in the data state
.
- Anything else
- Append the current input character to the current attribute's
value. Switch to the attribute value
(unquoted) state .
- Attribute value (double-quoted)
state
-
Consume the next input character :
- U+0022 QUOTATION MARK (")
- Switch to the after attribute value (quoted)
state.
- U+0026 AMPERSAND (&)
- Switch to the character reference in
attribute value state,
with the additional allowed
character being U+0022 QUOTATION
MARK (").
- EOF
- Parse error . Emit the current tag token.
Reconsume the character in the data state
.
- Anything else
- Append the current input character to the current attribute's
value. Stay in the attribute value
(double-quoted) state .
- Attribute value (single-quoted)
state
-
Consume the next input character :
- U+0027 APOSTROPHE (')
- Switch to the after attribute value (quoted)
state.
- U+0026 AMPERSAND (&)
- Switch to the character reference in
attribute value state,
with the additional allowed
character being U+0027 APOSTROPHE
(').
- EOF
- Parse error . Emit the current tag token.
Reconsume the character in the data state
.
- Anything else
- Append the current input character to the current attribute's
value. Stay in the attribute value
(single-quoted) state .
- Attribute value (unquoted)
state
-
Consume the next input character :
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- Switch to the before attribute name state
.
- U+0026 AMPERSAND (&)
- Switch to the character reference in
attribute value state, with no
additional allowed
character.
- U+003E GREATER-THAN SIGN (>)
- Emit the current tag token. Switch to the data state .
- U+0022 QUOTATION MARK (")
- U+0027 APOSTROPHE (')
- U+003D EQUALS SIGN (=)
- Parse error. Treat it as per the "anything else" entry
below.
- EOF
- Parse error . Emit the current tag token.
Reconsume the character in the data state
.
- Anything else
- Append the current input character to the current attribute's
value. Stay in the attribute value (unquoted)
state .
- Character reference in
attribute value state
-
Attempt to consume a character reference.
If nothing is returned, append a U+0026 AMPERSAND character to
the current attribute's value.
Otherwise, append the returned character token to the current
attribute's value.
Finally, switch back to the attribute value state that you were
in when you were switched into this state.
- After attribute value (quoted) state
-
Consume the next input character:
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- Switch to the before attribute name state.
- U+003E GREATER-THAN SIGN (>)
- Emit the current tag token. Switch to the data state.
- U+002F SOLIDUS (/)
- Switch to the self-closing start tag state.
- EOF
- Parse error. Emit the current tag token. Reconsume the
EOF character in the data state.
- Anything else
- Parse error. Reconsume the character in the
before attribute name state.
- Self-closing start tag state
-
Consume the next input character:
- U+003E GREATER-THAN SIGN (>)
- Set the self-closing flag of the
current tag token. Emit the current tag token. Switch to the
data state.
- EOF
- Parse error. Emit the current tag token. Reconsume the
EOF character in the data state.
- Anything else
- Parse error. Reconsume the character in the
before attribute name state.
- Bogus comment state
-
(This can only happen if the content
model flag is set to the PCDATA state.)
Consume every character up to the first U+003E GREATER-THAN SIGN
character (>) or the end of the file (EOF), whichever comes
first. Emit a comment token whose data is the concatenation of all
the characters starting from and including the character that
caused the state machine to switch into the bogus comment state, up
to and including the last consumed character before the U+003E
character, if any, or up to the end of the file otherwise. (If the
comment was started by the end of the file (EOF), the token is
empty.)
Switch to the data state .
If the end of the file was reached, reconsume the EOF
character.
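A minimal sketch of the bogus comment state follows. The function name `bogus_comment` is hypothetical, and `start` is assumed to be the index of the character that caused the switch into this state. The function returns the comment data and the position from which the data state continues.

```python
def bogus_comment(text, start):
    """Return (comment_data, next_pos) for a bogus comment beginning at `start`."""
    end = text.find(">", start)
    if end == -1:
        # No '>' before EOF: the comment runs to the end of the input,
        # and EOF is then reconsumed in the data state.
        return text[start:], len(text)
    # The '>' itself is consumed but is not part of the comment data.
    return text[start:end], end + 1
```

If the state is entered at EOF, `start` equals `len(text)` and the comment data is empty, matching the parenthetical note above.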
- Markup declaration open state
-
(This can only happen if the content
model flag is set to the PCDATA state.)
If the next two characters are both U+002D HYPHEN-MINUS (-)
characters, consume those two characters, create a comment token
whose data is the empty string, and switch to the comment start state .
Otherwise, if the next seven characters are a
case-insensitive match for the word "DOCTYPE", then
consume those characters and switch to the DOCTYPE state.
Otherwise, if the insertion mode is "in
foreign content" and the
current node is
not an element in the HTML namespace
and the next seven characters are a
case-sensitive match for the string "[CDATA[" (the five uppercase
letters "CDATA" with a U+005B LEFT SQUARE BRACKET character before
and after), then consume those characters and switch to the
CDATA block state
(which is unrelated to the content model flag's
CDATA state).
Otherwise, this is a parse error . Switch to the bogus
comment state . The next character that is consumed, if any, is
the first character that will be in the comment.
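The lookahead performed by the markup declaration open state can be sketched as below. The function name is hypothetical, `pos` is assumed to point just after `<!`, and the single boolean `in_foreign_content` is an assumption that folds together the two tree-construction conditions (the insertion mode and the namespace of the current node).

```python
def markup_declaration_open(text, pos, in_foreign_content=False):
    """Return (next_state, next_pos) for the characters following '<!'."""
    if text.startswith("--", pos):
        return "comment start", pos + 2
    if text[pos:pos + 7].lower() == "doctype":  # case-insensitive match
        return "DOCTYPE", pos + 7
    if in_foreign_content and text.startswith("[CDATA[", pos):  # case-sensitive match
        return "CDATA block", pos + 7
    # Parse error: nothing is consumed, so the next character becomes
    # the first character of the bogus comment.
    return "bogus comment", pos
```

Note that `[CDATA[` outside foreign content falls through to the bogus comment branch.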
- Comment start state
-
Consume the next input character :
- U+002D HYPHEN-MINUS (-)
- Switch to the comment start dash state
.
- U+003E GREATER-THAN SIGN (>)
- Parse error . Emit the comment token.
Switch to the data state .
- EOF
- Parse error . Emit the comment token.
Reconsume the EOF character in the data
state .
- Anything else
- Append the input character to the comment token's data. Switch
to the comment state .
- Comment start dash state
-
Consume the next input character :
- U+002D HYPHEN-MINUS (-)
- Switch to the comment end state
- U+003E GREATER-THAN SIGN (>)
- Parse error . Emit the comment token.
Switch to the data state .
- EOF
- Parse error . Emit the comment token.
Reconsume the EOF character in the data
state .
- Anything else
- Append a U+002D HYPHEN-MINUS (-) character and the input
character to the comment token's data. Switch to the comment state .
- Comment state
-
Consume the next input character :
- U+002D HYPHEN-MINUS (-)
- Switch to the comment end dash
state
- EOF
- Parse error . Emit the comment token.
Reconsume the EOF character in the data
state .
- Anything else
- Append the input character to the comment token's data. Stay in
the comment state .
- Comment end dash state
-
Consume the next input character :
- U+002D HYPHEN-MINUS (-)
- Switch to the comment end state
- EOF
- Parse error . Emit the comment token.
Reconsume the EOF character in the data
state .
- Anything else
- Append a U+002D HYPHEN-MINUS (-) character and the input
character to the comment token's data. Switch to the comment state .
- Comment end state
-
Consume the next input character :
- U+003E GREATER-THAN SIGN (>)
- Emit the comment token. Switch to the data state .
- U+002D HYPHEN-MINUS (-)
- Parse error . Append a U+002D
HYPHEN-MINUS (-) character to the comment token's data. Stay in the
comment end state .
- EOF
- Parse error . Emit the comment token.
Reconsume the EOF character in the data
state .
- Anything else
- Parse error . Append two U+002D
HYPHEN-MINUS (-) characters and the input character to the comment
token's data. Switch to the comment state
.
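The five comment states (comment start, comment start dash, comment, comment end dash, and comment end) can be sketched as one loop. This is an illustrative sketch: the function name `tokenise_comment` is an assumption, `pos` is assumed to point just after `<!--`, and the function returns the comment data, the position at which the data state resumes, and the number of parse errors.

```python
def tokenise_comment(text, pos=0):
    """Run the comment states from just after '<!--'.

    Returns (comment_data, next_pos, parse_error_count).
    """
    state = "comment start"
    data = []
    errors = 0
    while True:
        if pos >= len(text):
            # EOF in any comment state: parse error, emit the comment;
            # EOF is left unconsumed for the data state.
            return "".join(data), pos, errors + 1
        ch = text[pos]
        pos += 1
        if state in ("comment start", "comment start dash"):
            if ch == "-":
                state = "comment start dash" if state == "comment start" else "comment end"
            elif ch == ">":
                return "".join(data), pos, errors + 1  # parse error, emit
            else:
                if state == "comment start dash":
                    data.append("-")
                data.append(ch)
                state = "comment"
        elif state == "comment":
            if ch == "-":
                state = "comment end dash"
            else:
                data.append(ch)
        elif state == "comment end dash":
            if ch == "-":
                state = "comment end"
            else:
                data.append("-" + ch)
                state = "comment"
        else:  # comment end
            if ch == ">":
                return "".join(data), pos, errors
            errors += 1
            if ch == "-":
                data.append("-")  # stay in the comment end state
            else:
                data.append("--" + ch)
                state = "comment"
```

For instance, the body `a--b-->` yields the data `a--b` with one parse error, because `--` inside a comment is an error but is still preserved in the data.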
- DOCTYPE state
-
Consume the next input character :
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- Switch to the before DOCTYPE name state
.
- Anything else
- Parse error . Reconsume the current
character in the before DOCTYPE name state
.
- Before DOCTYPE name state
-
Consume the next input character :
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- Stay in the before DOCTYPE name state
.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Create a new DOCTYPE token.
Set its
force-quirks flag to on.
Emit the token. Switch to the data state.
- EOF
- Parse error. Create a new DOCTYPE token.
Set its
force-quirks flag to on.
Emit the token. Reconsume the EOF character in the data state.
- Anything else
- Create a new DOCTYPE token. Set the token's
name to the current input character. Switch to the
DOCTYPE name state.
- DOCTYPE name state
-
First, consume the next input
character :
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- Switch to the after DOCTYPE name state
.
- U+003E GREATER-THAN SIGN (>)
- Emit the current DOCTYPE token. Switch to the data state .
- EOF
- Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
- Anything else
- Append the current input character to the current DOCTYPE
token's name. Stay in the DOCTYPE name
state .
- After DOCTYPE name state
-
Consume the next input character :
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- Stay in the after DOCTYPE name state
.
- U+003E GREATER-THAN SIGN (>)
- Emit the current DOCTYPE token. Switch to the data state .
- EOF
- Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
- Anything else
-
If the next six characters are a case-insensitive
match for the word "PUBLIC", then consume those characters and
switch to the before DOCTYPE public identifier
state .
Otherwise, if the next six characters are a
case-insensitive match for the word "SYSTEM", then
consume those characters and switch to the before DOCTYPE system identifier state .
Otherwise, this is a parse error.
Set the DOCTYPE token's force-quirks flag to on. Switch
to the bogus DOCTYPE state.
- Before DOCTYPE public identifier
state
-
Consume the next input character :
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- Stay in the before DOCTYPE public identifier
state .
- U+0022 QUOTATION MARK (")
- Set the DOCTYPE token's public identifier to the empty
string (not missing), then switch to the DOCTYPE
public identifier (double-quoted) state.
- U+0027 APOSTROPHE (')
- Set the DOCTYPE token's public identifier to the empty
string (not missing), then switch to the DOCTYPE
public identifier (single-quoted) state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Switch to the data
state.
- EOF
- Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
- Anything else
- Parse error. Set
the DOCTYPE token's force-quirks
flag to on. Switch to the bogus
DOCTYPE state.
- DOCTYPE public identifier (double-quoted)
state
-
Consume the next input character :
- U+0022 QUOTATION MARK (")
- Switch to the after DOCTYPE public identifier
state .
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's force-quirks flag to on.
Emit that DOCTYPE token. Switch to the data state.
- EOF
- Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
- Anything else
- Append the current input character to the current DOCTYPE
token's public identifier. Stay in the DOCTYPE
public identifier (double-quoted) state .
- DOCTYPE public identifier (single-quoted)
state
-
Consume the next input character :
- U+0027 APOSTROPHE (')
- Switch to the after DOCTYPE public identifier
state .
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's force-quirks flag to on.
Emit that DOCTYPE token. Switch to the data state.
- EOF
- Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
- Anything else
- Append the current input character to the current DOCTYPE
token's public identifier. Stay in the DOCTYPE
public identifier (single-quoted) state .
- After DOCTYPE public identifier
state
-
Consume the next input character :
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- Stay in the after DOCTYPE public identifier
state .
- U+0022 QUOTATION MARK (")
- Set the DOCTYPE token's system identifier to the empty
string (not missing), then switch to the DOCTYPE
system identifier (double-quoted) state.
- U+0027 APOSTROPHE (')
- Set the DOCTYPE token's system identifier to the empty
string (not missing), then switch to the DOCTYPE
system identifier (single-quoted) state.
- U+003E GREATER-THAN SIGN (>)
- Emit the current DOCTYPE token. Switch to the data state.
- EOF
- Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
- Anything else
- Parse error. Set
the DOCTYPE token's force-quirks
flag to on. Switch to the bogus
DOCTYPE state.
- Before DOCTYPE system identifier
state
-
Consume the next input character :
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- Stay in the before DOCTYPE system identifier
state .
- U+0022 QUOTATION MARK (")
- Set the DOCTYPE token's system identifier to the empty
string (not missing), then switch to the DOCTYPE
system identifier (double-quoted) state.
- U+0027 APOSTROPHE (')
- Set the DOCTYPE token's system identifier to the empty
string (not missing), then switch to the DOCTYPE
system identifier (single-quoted) state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Switch to the data
state.
- EOF
- Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
- Anything else
- Parse error. Set
the DOCTYPE token's force-quirks
flag to on. Switch to the bogus
DOCTYPE state.
- DOCTYPE system identifier (double-quoted)
state
-
Consume the next input character :
- U+0022 QUOTATION MARK (")
- Switch to the after DOCTYPE system identifier
state .
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's force-quirks flag to on.
Emit that DOCTYPE token. Switch to the data state.
- EOF
- Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
- Anything else
- Append the current input character to the current DOCTYPE
token's system identifier. Stay in the DOCTYPE
system identifier (double-quoted) state .
- DOCTYPE system identifier (single-quoted)
state
-
Consume the next input character :
- U+0027 APOSTROPHE (')
- Switch to the after DOCTYPE system identifier
state .
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's force-quirks flag to on.
Emit that DOCTYPE token. Switch to the data state.
- EOF
- Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
- Anything else
- Append the current input character to the current DOCTYPE
token's system identifier. Stay in the DOCTYPE
system identifier (single-quoted) state .
- After DOCTYPE system identifier
state
-
Consume the next input character :
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- Stay in the after DOCTYPE system identifier
state .
- U+003E GREATER-THAN SIGN (>)
- Emit the current DOCTYPE token. Switch to the data state .
- EOF
- Parse error. Set the DOCTYPE token's
force-quirks flag to on.
Emit that DOCTYPE token. Reconsume the EOF character in the
data state.
- Anything else
- Parse error . Switch to the bogus DOCTYPE state . (This
does not set the DOCTYPE token's force-quirks flag to on .)
- Bogus DOCTYPE state
-
Consume the next input character :
- U+003E GREATER-THAN SIGN (>)
- Emit the DOCTYPE token. Switch to the
data state.
- EOF
- Emit the DOCTYPE token. Reconsume the EOF character in the data state.
- Anything else
- Stay in the bogus DOCTYPE state .
- CDATA block state
-
(This can only happen if the content model
flag is set to the PCDATA state,
and is unrelated to the content model flag's
CDATA state.)
Consume every character up to the next
occurrence of the three character sequence U+005D RIGHT SQUARE
BRACKET U+005D RIGHT SQUARE BRACKET U+003E GREATER-THAN SIGN
(]]>), or the end of the file (EOF), whichever
comes first. Emit a series of text tokens consisting of all the
characters consumed except the matching three character sequence at
the end (if one was found before the end of the file).
Switch to the data state.
If the end of the file was reached,
reconsume the EOF character.
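The CDATA block state amounts to a search for the terminating `]]>` sequence. A minimal sketch follows, in which the function name `cdata_block` is hypothetical and `pos` is assumed to point just after `<![CDATA[`. It returns the characters to emit and the position at which the data state resumes.

```python
def cdata_block(text, pos):
    """Return (characters_to_emit, next_pos) for a CDATA block starting at `pos`."""
    end = text.find("]]>", pos)
    if end == -1:
        # No terminator before EOF: everything up to the end is emitted,
        # and EOF is then reconsumed in the data state.
        return text[pos:], len(text)
    # The three-character terminator is consumed but not emitted.
    return text[pos:end], end + 3
```

A lone `]]` that is not followed by `>` is ordinary content, so `a]]b]]>` produces the characters `a]]b`.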
8.2.4.1. Tokenising character references
This section defines how to consume
a character reference. This definition is used when
parsing character references in text and in attributes.
The behavior depends on the identity of the next
character (the one immediately after the U+0026 AMPERSAND
character):
- U+0009 CHARACTER TABULATION
- U+000A LINE FEED (LF)
- U+000B LINE TABULATION
- U+000C FORM FEED (FF)
- U+0020 SPACE
- U+003C LESS-THAN SIGN
- U+0026 AMPERSAND
- EOF
- The additional allowed character, if there is one
- Not a character reference. No characters are consumed,
and nothing is returned. (This is not an error, either.)
- U+0023 NUMBER SIGN (#)
-
Consume the U+0023 NUMBER SIGN.
The behavior further depends on the character after
the U+0023 NUMBER SIGN:
- U+0078 LATIN SMALL LETTER X
- U+0058 LATIN CAPITAL LETTER X
-
Consume the X.
Follow the steps below, but using the range of characters U+0030
DIGIT ZERO through to U+0039 DIGIT NINE, U+0061 LATIN SMALL LETTER
A through to U+0066 LATIN SMALL LETTER F, and U+0041 LATIN CAPITAL
LETTER A, through to U+0046 LATIN CAPITAL LETTER F (in other words,
0-9, A-F, a-f).
When it comes to interpreting the number, interpret it as a
hexadecimal number.
- Anything else
-
Follow the steps below, but using the range of characters U+0030
DIGIT ZERO through to U+0039 DIGIT NINE (i.e. just 0-9).
When it comes to interpreting the number, interpret it as a
decimal number.
Consume as many characters as match the range of characters
given above.
If no characters match the range, then don't consume any
characters (and unconsume the U+0023 NUMBER SIGN character and, if
appropriate, the X character). This is a parse error; nothing is
returned.
Otherwise, if the next character is a U+003B SEMICOLON, consume
that too. If it isn't, there is a parse error.
If one or more characters match the range, then take them all
and interpret the string of characters as a number (either
hexadecimal or decimal as appropriate).
If that number is one of the numbers in the first column of the
following table, then this is a parse error.
Find the row with that number in the first column, and return a
character token for the Unicode character given in the second
column of that row.
Number | Unicode character |
0x0D | U+000A | LINE FEED (LF) |
0x80 | U+20AC | EURO SIGN ('€') |
0x81 | U+FFFD | REPLACEMENT CHARACTER |
0x82 | U+201A | SINGLE LOW-9 QUOTATION MARK ('‚') |
0x83 | U+0192 | LATIN SMALL LETTER F WITH HOOK ('ƒ') |
0x84 | U+201E | DOUBLE LOW-9 QUOTATION MARK ('„') |
0x85 | U+2026 | HORIZONTAL ELLIPSIS ('…') |
0x86 | U+2020 | DAGGER ('†') |
0x87 | U+2021 | DOUBLE DAGGER ('‡') |
0x88 | U+02C6 | MODIFIER LETTER CIRCUMFLEX ACCENT ('ˆ') |
0x89 | U+2030 | PER MILLE SIGN ('‰') |
0x8A | U+0160 | LATIN CAPITAL LETTER S WITH CARON ('Š') |
0x8B | U+2039 | SINGLE LEFT-POINTING ANGLE QUOTATION MARK ('‹') |
0x8C | U+0152 | LATIN CAPITAL LIGATURE OE ('Œ') |
0x8D | U+FFFD | REPLACEMENT CHARACTER |
0x8E | U+017D | LATIN CAPITAL LETTER Z WITH CARON ('Ž') |
0x8F | U+FFFD | REPLACEMENT CHARACTER |
0x90 | U+FFFD | REPLACEMENT CHARACTER |
0x91 | U+2018 | LEFT SINGLE QUOTATION MARK ('‘') |
0x92 | U+2019 | RIGHT SINGLE QUOTATION MARK ('’') |
0x93 | U+201C | LEFT DOUBLE QUOTATION MARK ('“') |
0x94 | U+201D | RIGHT DOUBLE QUOTATION MARK ('”') |
0x95 | U+2022 | BULLET ('•') |
0x96 | U+2013 | EN DASH ('–') |
0x97 | U+2014 | EM DASH ('—') |
0x98 | U+02DC | SMALL TILDE ('˜') |
0x99 | U+2122 | TRADE MARK SIGN ('™') |
0x9A | U+0161 | LATIN SMALL LETTER S WITH CARON ('š') |
0x9B | U+203A | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK ('›') |
0x9C | U+0153 | LATIN SMALL LIGATURE OE ('œ') |
0x9D | U+FFFD | REPLACEMENT CHARACTER |
0x9E | U+017E | LATIN SMALL LETTER Z WITH CARON ('ž') |
0x9F | U+0178 | LATIN CAPITAL LETTER Y WITH DIAERESIS ('Ÿ') |
Otherwise, if the number is in the range 0x0000 to 0x0008,
0x000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF,
0xFDD0 to 0xFDDF, or is one of
0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE,
0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF,
0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE,
0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF,
0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, or is
higher than 0x10FFFF, then this is a parse
error; return a character token for the U+FFFD REPLACEMENT
CHARACTER character instead.
Otherwise, return a character token for the Unicode character
whose code point is that number.
- Anything else
-
Consume the maximum number of characters possible, with the
consumed characters case-sensitively matching one of the
identifiers in the first column of the named character references table.
If no match can be made, then this is a parse
error. No characters are consumed, and nothing is
returned.
If the last character matched is not a U+003B SEMICOLON (;),
there is a parse error.
If the character reference is being consumed as part of an
attribute, and the last character matched is not a U+003B
SEMICOLON (;), and the next character is in
the range U+0030 DIGIT ZERO to U+0039 DIGIT NINE, U+0041 LATIN
CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN
SMALL LETTER A to U+007A LATIN SMALL LETTER Z, then, for historical
reasons, all the characters that were matched after the U+0026
AMPERSAND (&) must be unconsumed, and nothing is returned.
Otherwise, return a character token for the character
corresponding to the character reference name (as given by the second
column of the named character references table).
If the markup contains I'm &notit; I tell you, the character
reference is parsed as "not", as in, I'm ¬it; I tell you. But if
the markup was I'm &notin; I tell you, the character reference
would be parsed as "notin;", resulting in I'm ∉ I tell you.
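The steps in this section can be sketched in Python as follows. This is an informal illustration under several assumptions, not a definitive implementation: the names `consume_character_reference`, `numeric_to_char`, `NAMED` and `BAD_RANGES` are invented for the example, the input stream is modelled as a string plus an index, parse errors are collapsed into a single boolean, and `NAMED` is a tiny hypothetical excerpt standing in for the full named character references table. The sketch also assumes that Python's `cp1252` codec agrees with the table above for 0x80 to 0x9F.

```python
import re

REPLACEMENT = "\ufffd"
# Hypothetical excerpt; the real table is far larger.
NAMED = {"not": "\u00ac", "notin;": "\u2209", "amp": "&", "amp;": "&"}
# 0xFFFE and 0xFFFF in every plane, i.e. 0xFFFE ... 0x10FFFF.
NONCHARACTERS = {p << 16 | low for p in range(0x11) for low in (0xFFFE, 0xFFFF)}
BAD_RANGES = [(0x0000, 0x0008), (0x000E, 0x001F), (0x007F, 0x009F),
              (0xD800, 0xDFFF), (0xFDD0, 0xFDDF)]
NUMERIC = re.compile(r"#[xX]([0-9A-Fa-f]+)|#([0-9]+)")


def numeric_to_char(number):
    """Map a parsed number to (character, parse_error)."""
    if number == 0x0D:
        return "\n", True
    if 0x80 <= number <= 0x9F:
        # Assumption: cp1252 leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
        # undefined, so errors="replace" yields U+FFFD for them.
        return bytes([number]).decode("cp1252", errors="replace"), True
    if (number > 0x10FFFF or number in NONCHARACTERS
            or any(lo <= number <= hi for lo, hi in BAD_RANGES)):
        return REPLACEMENT, True
    return chr(number), False


def consume_character_reference(text, pos, additional_allowed=None,
                                in_attribute=False):
    """text[pos] is the character right after the '&'.

    Returns (character or None, new position, parse_error).
    """
    nxt = text[pos] if pos < len(text) else None  # None stands for EOF
    if nxt is None or nxt in "\t\n\x0b\x0c <&" or nxt == additional_allowed:
        return None, pos, False  # not a character reference, not an error
    if nxt == "#":
        m = NUMERIC.match(text, pos)
        if m is None:
            return None, pos, True  # no digits: nothing is consumed
        number = int(m.group(1), 16) if m.group(1) else int(m.group(2), 10)
        end = m.end()
        has_semicolon = text.startswith(";", end)
        char, bad_number = numeric_to_char(number)
        return char, end + has_semicolon, bad_number or not has_semicolon
    # Longest case-sensitive match against the table.
    name = max((n for n in NAMED if text.startswith(n, pos)),
               key=len, default=None)
    if name is None:
        return None, pos, True
    end = pos + len(name)
    if name.endswith(";"):
        return NAMED[name], end, False
    if in_attribute and re.fullmatch(r"[0-9A-Za-z]", text[end:end + 1]):
        return None, pos, True  # historical rule: unconsume everything
    return NAMED[name], end, True
```

With this sketch, `consume_character_reference("notit; I tell you", 0)` returns the NOT SIGN character with a parse error flagged, while the same input with `in_attribute=True` returns nothing and leaves the position unchanged, mirroring the example above.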