This section only describes the rules for text/html resources. Rules for XML resources are discussed in the section below entitled "The XHTML syntax".
This section only applies to documents, authoring tools, and markup generators. In particular, it does not apply to conformance checkers; conformance checkers must use the requirements given in the next section ("parsing HTML documents").
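Documents must consist of the following parts, in the given order:
1. Optionally, a single U+FEFF BYTE ORDER MARK (BOM) character.
2. Any number of comments and space characters.
3. A DOCTYPE.
4. Any number of comments and space characters.
5. The root element, in the form of an html element.
6. Any number of comments and space characters.
The various types of content mentioned above are described in the next few sections.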
In addition, there are some restrictions on how character encoding declarations are to be serialized, as discussed in the section on that topic.
Space characters before the root html element, and space characters at the start of the html element and before the head element, will be dropped when the document is parsed; space characters after the root html element will be parsed as if they were at the end of the body element. Thus, space characters around the root element do not round-trip.
It is suggested that newlines be inserted after the DOCTYPE, after any comments that are before the root element, after the html element's start tag (if it is not omitted), and after any comments that are inside the html element but before the head element.
Many strings in the HTML syntax (e.g. the names of elements and their attributes) are case-insensitive, but only for characters in the ranges U+0041 .. U+005A (LATIN CAPITAL LETTER A to LATIN CAPITAL LETTER Z) and U+0061 .. U+007A (LATIN SMALL LETTER A to LATIN SMALL LETTER Z). For convenience, in this section this is just referred to as "case-insensitive".
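The restriction to the two ASCII letter ranges matters in practice, because a general-purpose Unicode case conversion folds more characters than the rule above allows. The following is a minimal sketch of such a comparison; the function names `ascii_lower` and `ascii_case_insensitive_match` are illustrative and do not come from this specification.

```python
def ascii_lower(s: str) -> str:
    """Lowercase only U+0041..U+005A; every other character is left untouched."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def ascii_case_insensitive_match(a: str, b: str) -> bool:
    """Compare two strings, ignoring case for ASCII letters only."""
    return ascii_lower(a) == ascii_lower(b)


print(ascii_case_insensitive_match("DIV", "div"))      # True
# U+212A KELVIN SIGN is not an ASCII letter, so it does not match "k",
# even though a full Unicode lowercase conversion would map it to "k".
print(ascii_case_insensitive_match("\u212A", "k"))     # False
```

Using a plain `str.lower()` comparison instead would accept strings that this section treats as distinct.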
A DOCTYPE is a mostly useless, but required, header.
DOCTYPEs are required for legacy reasons. When omitted, browsers tend to use a different rendering mode that is incompatible with some specifications. Including the DOCTYPE in a document ensures that the browser makes a best-effort attempt at following the relevant specifications.
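A DOCTYPE must consist of the following characters, in this order:
1. A U+003C LESS-THAN SIGN (<) character.
2. A U+0021 EXCLAMATION MARK (!) character.
3. A string that is a case-insensitive match for the string "DOCTYPE".
4. One or more space characters.
5. A string that is a case-insensitive match for the string "HTML".
6. Optionally, a DOCTYPE legacy string (defined below).
7. Zero or more space characters.
8. A U+003E GREATER-THAN SIGN (>) character.
In other words, <!DOCTYPE HTML>, case-insensitively.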
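For the purposes of HTML generators that cannot output HTML markup with the short DOCTYPE "<!DOCTYPE HTML>", a DOCTYPE legacy string may be inserted into the DOCTYPE (in the position defined above). This string must consist of:
1. One or more space characters.
2. A string that is a case-insensitive match for the string "SYSTEM".
3. One or more space characters.
4. A U+0022 QUOTATION MARK or U+0027 APOSTROPHE character (the quote mark).
5. The literal string "about:legacy-compat".
6. A matching U+0022 QUOTATION MARK or U+0027 APOSTROPHE character (that is, the same character as in the earlier step labeled quote mark).
In other words, <!DOCTYPE HTML SYSTEM "about:legacy-compat"> or <!DOCTYPE HTML SYSTEM 'about:legacy-compat'>, case-insensitively except for the part in quotes.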
The DOCTYPE legacy string should not be used unless the document is generated from a system that cannot output the shorter string.
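The DOCTYPE rules above are compact enough to express as a single regular expression. The sketch below is one possible reading of them, not a normative checker. It assumes that "space characters" means U+0020, U+0009, U+000A, U+000C, and U+000D (that set is defined elsewhere in the specification, not in this section), and the name `is_conforming_doctype` is hypothetical.

```python
import re

# Assumption: the specification's "space characters" are these five.
SPACE = r"[ \t\n\f\r]"

DOCTYPE_RE = re.compile(
    (
        # "<!", then DOCTYPE and HTML matched case-insensitively.
        r"<!(?i:DOCTYPE){sp}+(?i:HTML)"
        # Optional legacy string; the quoted part is case-sensitive and
        # the closing quote must be the same character as the opening one.
        r"(?:{sp}+(?i:SYSTEM){sp}+(?P<quote>[\"'])about:legacy-compat(?P=quote))?"
        r"{sp}*>"
    ).format(sp=SPACE)
)


def is_conforming_doctype(text: str) -> bool:
    """Return True if text is exactly one DOCTYPE in the syntax described above."""
    return DOCTYPE_RE.fullmatch(text) is not None


print(is_conforming_doctype("<!doctype html>"))
print(is_conforming_doctype('<!DOCTYPE HTML SYSTEM "ABOUT:LEGACY-COMPAT">'))
```

The first call prints `True` and the second prints `False`, since the part in quotes is not matched case-insensitively.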
There are five different kinds of elements: void elements, CDATA elements, RCDATA elements, foreign elements, and normal elements.
Void elements: base, command, eventsource, link, meta, hr, br, img, embed, param, area, col, input, source
CDATA elements: style, script
RCDATA elements: title, textarea
Tags are used to delimit the start and end of elements in the markup. CDATA, RCDATA, and normal elements have a start tag to indicate where they begin, and an end tag to indicate where they end. The start and end tags of certain normal elements can be omitted, as described later. Those that cannot be omitted must not be omitted. Void elements only have a start tag; end tags must not be specified for void elements. Foreign elements must either have a start tag and an end tag, or a start tag that is marked as self-closing, in which case they must not have an end tag.
The contents of the element must be placed between just after the start tag (which might be implied, in certain cases) and just before the end tag (which, again, might be implied in certain cases). The exact allowed contents of each individual element depend on the content model of that element, as described earlier in this specification. Elements must not contain content that their content model disallows. In addition to the restrictions placed on the contents by those content models, however, the five types of elements have additional syntactic requirements.
Void elements can't have any contents (since there's no end tag, no content can be put between the start tag and the end tag).
CDATA elements can have text, though it has restrictions described below.
RCDATA elements can have text and character references, but the text must not contain an ambiguous ampersand. There are also further restrictions described below.
Foreign elements whose start tag is marked as self-closing can't have any contents (since, again, as there's no end tag, no content can be put between the start tag and the end tag). Foreign elements whose start tag is not marked as self-closing can have text, character references, CDATA sections, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand.
Normal elements can have text, character references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.
Tags contain a tag name, giving the element's name. HTML elements all have names that only use characters in the range U+0030 DIGIT ZERO .. U+0039 DIGIT NINE, U+0061 LATIN SMALL LETTER A .. U+007A LATIN SMALL LETTER Z, U+0041 LATIN CAPITAL LETTER A .. U+005A LATIN CAPITAL LETTER Z, and U+002D HYPHEN-MINUS (-). In the HTML syntax, tag names may be written with any mix of lower- and uppercase letters that, when converted to all-lowercase, matches the element's tag name; tag names are case-insensitive.
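Start tags must have the following format:
1. The first character of a start tag must be a U+003C LESS-THAN SIGN (<).
2. The next few characters of a start tag must be the element's tag name.
3. If there are to be any attributes in the next step, there must first be one or more space characters.
4. Then, the start tag may have a number of attributes, the syntax for which is described below. Attributes may be separated from each other by one or more space characters.
5. After the attributes, there may be one or more space characters. (Some attributes are required to be followed by a space. See the attributes section below.)
6. Then, if the element is one of the void elements, or if the element is a foreign element, then there may be a single U+002F SOLIDUS (/) character. This character has no effect on void elements, but on foreign elements it marks the start tag as self-closing.
7. Finally, start tags must be closed by a U+003E GREATER-THAN SIGN (>) character.
End tags must have the following format:
1. The first character of an end tag must be a U+003C LESS-THAN SIGN (<).
2. The second character of an end tag must be a U+002F SOLIDUS (/).
3. The next few characters of an end tag must be the element's tag name.
4. After the tag name, there may be one or more space characters.
5. Finally, end tags must be closed by a U+003E GREATER-THAN SIGN (>) character.
Attributes for an element are expressed inside the element's start tag.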
Attributes have a name and a value. Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), U+003E GREATER-THAN SIGN (>), U+002F SOLIDUS (/), and U+003D EQUALS SIGN (=) characters, the control characters, and any characters that are not defined by Unicode. In the HTML syntax, attribute names may be written with any mix of lower- and uppercase letters that are an ASCII case-insensitive match for the attribute's name.
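As a rough illustration of the attribute name rule, the following sketch rejects the characters listed above. It makes several assumptions: "space characters" is taken to be U+0020, U+0009, U+000A, U+000C, and U+000D; "control characters" is approximated by Unicode general category Cc; and "not defined by Unicode" is approximated by category Cn, which depends on the Unicode version shipped with the Python interpreter. The function name is illustrative only.

```python
import unicodedata

# Assumption: the specification's "space characters" are these five.
SPACE_CHARACTERS = " \t\n\f\r"

# Characters named explicitly in the rule: NULL, ", ', >, / and =.
FORBIDDEN = set(SPACE_CHARACTERS) | set("\x00\"'>/=")


def is_valid_attribute_name(name: str) -> bool:
    """Approximate check of the attribute name syntax described above."""
    if not name:
        return False  # names consist of one or more characters
    for ch in name:
        if ch in FORBIDDEN:
            return False
        # "Cc" approximates control characters, "Cn" unassigned code points.
        if unicodedata.category(ch) in ("Cc", "Cn"):
            return False
    return True


print(is_valid_attribute_name("data-id"))   # True
print(is_valid_attribute_name("on click"))  # False
```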
Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand.
Attributes can be specified in four different ways:
Empty attribute syntax: just the attribute name.
In the following example, the disabled attribute is given with the empty attribute syntax:
<input disabled>
If an attribute using the empty attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
Unquoted attribute value syntax: the attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal space characters, any U+0022 QUOTATION MARK (") characters, U+0027 APOSTROPHE (') characters, U+003D EQUALS SIGN (=) characters, or U+003E GREATER-THAN SIGN (>) characters, and must not be the empty string.
In the following example, the value attribute is given with the unquoted attribute value syntax:
<input value=yes>
If an attribute using the unquoted attribute syntax is to be followed by another attribute or by one of the optional U+002F SOLIDUS (/) characters allowed in step 6 of the start tag syntax above, then there must be a space character separating the two.
Single-quoted attribute value syntax: the attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by a single U+0027 APOSTROPHE (') character, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0027 APOSTROPHE (') characters, and finally followed by a second single U+0027 APOSTROPHE (') character.
In the following example, the type attribute is given with the single-quoted attribute value syntax:
<input type='checkbox'>
If an attribute using the single-quoted attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
Double-quoted attribute value syntax: the attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by a single U+0022 QUOTATION MARK (") character, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0022 QUOTATION MARK (") characters, and finally followed by a second single U+0022 QUOTATION MARK (") character.
In the following example, the name attribute is given with the double-quoted attribute value syntax:
<input name="be evil">
If an attribute using the double-quoted attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
There must never be two or more attributes on the same start tag whose names are an ASCII case-insensitive match for each other.
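A markup generator has to pick one of the four attribute syntaxes for each attribute it writes. The sketch below shows one possible selection strategy under stated assumptions: `None` stands for "no value" (empty attribute syntax), "space characters" is taken to be U+0020, U+0009, U+000A, U+000C, and U+000D, and `&amp;` and `&quot;` are assumed to be available named character references (the list of names is not part of this section). The function name `serialize_attribute` is hypothetical.

```python
from typing import Optional

# Assumption: the specification's "space characters" are these five.
SPACE_CHARACTERS = " \t\n\f\r"

# Characters that rule out the unquoted attribute value syntax.
UNQUOTED_FORBIDDEN = set(SPACE_CHARACTERS) | set("\"'=>")


def serialize_attribute(name: str, value: Optional[str] = None) -> str:
    """Write one attribute, choosing the shortest syntax that is allowed."""
    if value is None:
        return name  # empty attribute syntax
    # Escape every ampersand so that no ambiguous ampersand can remain.
    value = value.replace("&", "&amp;")
    if value and not (set(value) & UNQUOTED_FORBIDDEN):
        return f"{name}={value}"  # unquoted syntax
    if '"' not in value:
        return f'{name}="{value}"'  # double-quoted syntax
    if "'" not in value:
        return f"{name}='{value}'"  # single-quoted syntax
    # Both quote characters occur: escape the double quotes.
    return '{}="{}"'.format(name, value.replace('"', "&quot;"))


print(serialize_attribute("disabled"))         # disabled
print(serialize_attribute("value", "yes"))     # value=yes
print(serialize_attribute("name", "be evil"))  # name="be evil"
```

The caller remains responsible for the separating space characters between attributes, and for the space required before a trailing U+002F SOLIDUS after an unquoted value.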
Certain tags can be omitted.
An html element's start tag may be omitted if the first thing inside the html element is not a comment.
An html element's end tag may be omitted if the html element is not immediately followed by a comment.
A head element's start tag may be omitted if the first thing inside the head element is an element.
A head element's end tag may be omitted if the head element is not immediately followed by a space character or a comment.
A body element's start tag may be omitted if the first thing inside the body element is not a space character or a comment, except if the first thing inside the body element is a script or style element.
A body element's end tag may be omitted if the body element is not immediately followed by a comment.
An li element's end tag may be omitted if the li element is immediately followed by another li element or if there is no more content in the parent element.
A dt element's end tag may be omitted if the dt element is immediately followed by another dt element or a dd element.
A dd element's end tag may be omitted if the dd element is immediately followed by another dd element or a dt element, or if there is no more content in the parent element.
A p element's end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, datagrid, dialog, dir, div, dl, fieldset, footer, form, h1, h2, h3, h4, h5, h6, header, hr, menu, nav, ol, p, pre, section, table, or ul element, or if there is no more content in the parent element and the parent element is not an a element.
An rt element's end tag may be omitted if the rt element is immediately followed by an rt or rp element, or if there is no more content in the parent element.
An rp element's end tag may be omitted if the rp element is immediately followed by an rt or rp element, or if there is no more content in the parent element.
An optgroup element's end tag may be omitted if the optgroup element is immediately followed by another optgroup element, or if there is no more content in the parent element.
An option element's end tag may be omitted if the option element is immediately followed by another option element, or if it is immediately followed by an optgroup element, or if there is no more content in the parent element.
A colgroup element's start tag may be omitted if the first thing inside the colgroup element is a col element, and if the element is not immediately preceded by another colgroup element whose end tag has been omitted.
A colgroup element's end tag may be omitted if the colgroup element is not immediately followed by a space character or a comment.
A thead element's end tag may be omitted if the thead element is immediately followed by a tbody or tfoot element.
A tbody element's start tag may be omitted if the first thing inside the tbody element is a tr element, and if the element is not immediately preceded by a tbody, thead, or tfoot element whose end tag has been omitted.
A tbody element's end tag may be omitted if the tbody element is immediately followed by a tbody or tfoot element, or if there is no more content in the parent element.
A tfoot element's end tag may be omitted if the tfoot element is immediately followed by a tbody element, or if there is no more content in the parent element.
A tr element's end tag may be omitted if the tr element is immediately followed by another tr element, or if there is no more content in the parent element.
A td element's end tag may be omitted if the td element is immediately followed by a td or th element, or if there is no more content in the parent element.
A th element's end tag may be omitted if the th element is immediately followed by a td or th element, or if there is no more content in the parent element.
However, a start tag must never be omitted if it has any attributes.
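Several of the end tag rules above share one shape: the end tag may be omitted when the element is immediately followed by one of a fixed set of elements, or (for most of them) when there is no more content in the parent element. The sketch below encodes only that subset as a lookup table. It is illustrative rather than complete: the p, html, head, body, and colgroup rules are deliberately left out, and the name `end_tag_may_be_omitted` and the convention of passing `None` for "no more content in the parent" are assumptions of this example.

```python
from typing import Optional

# element -> (elements that may immediately follow,
#             whether "no more content in the parent" also allows omission)
END_TAG_RULES = {
    "li": ({"li"}, True),
    "dt": ({"dt", "dd"}, False),
    "dd": ({"dd", "dt"}, True),
    "rt": ({"rt", "rp"}, True),
    "rp": ({"rt", "rp"}, True),
    "optgroup": ({"optgroup"}, True),
    "option": ({"option", "optgroup"}, True),
    "thead": ({"tbody", "tfoot"}, False),
    "tbody": ({"tbody", "tfoot"}, True),
    "tfoot": ({"tbody"}, True),
    "tr": ({"tr"}, True),
    "td": ({"td", "th"}, True),
    "th": ({"td", "th"}, True),
}


def end_tag_may_be_omitted(element: str, next_item: Optional[str]) -> bool:
    """next_item is the name of the element that immediately follows,
    None if there is no more content in the parent, or any other marker
    (for example "#text") for content that is not an element."""
    if element not in END_TAG_RULES:
        return False  # not covered by this sketch
    followers, allowed_at_end_of_parent = END_TAG_RULES[element]
    if next_item is None:
        return allowed_at_end_of_parent
    return next_item in followers


print(end_tag_may_be_omitted("li", "li"))  # True
print(end_tag_may_be_omitted("dt", None))  # False
```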
For historical reasons, certain elements have extra restrictions beyond even the restrictions given by their content model.
A table element must not contain tr elements, even though these elements are technically allowed inside table elements according to the content models described in this specification. (If a tr element is put inside a table in the markup, it will in fact imply a tbody start tag before it.)
A single U+000A LINE FEED (LF) character may be placed immediately after the start tag of pre and textarea elements. This does not affect the processing of the element. The otherwise optional U+000A LINE FEED (LF) character must be included if the element's contents start with that character (because otherwise the leading newline in the contents would be treated like the optional newline, and ignored).
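For a markup generator, the rule about the leading newline reduces to a single check. The sketch below is a minimal illustration; it assumes the contents have already been escaped as required for the element in question, and the function name `serialize_text_container` is hypothetical.

```python
def serialize_text_container(tag: str, contents: str) -> str:
    """Wrap already-escaped contents in a pre or textarea element."""
    if tag not in ("pre", "textarea"):
        raise ValueError("this sketch only covers pre and textarea")
    # If the contents begin with LF, emit the otherwise optional LF first,
    # so that the parser drops the extra one and keeps the real one.
    prefix = "\n" if contents.startswith("\n") else ""
    return f"<{tag}>{prefix}{contents}</{tag}>"


print(repr(serialize_text_container("pre", "\nabc")))  # '<pre>\n\nabc</pre>'
print(repr(serialize_text_container("pre", "abc")))    # '<pre>abc</pre>'
```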
The text in CDATA and RCDATA elements must not contain any occurrences of the string "</" (U+003C LESS-THAN SIGN, U+002F SOLIDUS) followed by characters that case-insensitively match the tag name of the element followed by one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+0020 SPACE, U+003E GREATER-THAN SIGN (>), or U+002F SOLIDUS (/), unless that string is part of an escaping text span.
An escaping text span is a span of text that starts with an escaping text span start that is not itself in an escaping text span, and ends at the next escaping text span end. There cannot be any character references inside an escaping text span — sequences of characters that would look like character references do not have special meaning.
An escaping text span start is a part of text that consists of the four character sequence "<!--" (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS).
An escaping text span end is a part of text that consists of the three character sequence "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).
An escaping text span start may share its U+002D HYPHEN-MINUS characters with its corresponding escaping text span end.
The text in CDATA and RCDATA elements must not have an escaping text span start that is not followed by an escaping text span end.
Text is allowed inside elements, attributes, and comments. Text must consist of Unicode characters. Text must not contain U+0000 characters. Text must not contain permanently undefined Unicode characters. Text must not contain control characters other than space characters. Extra constraints are placed on what is and what is not allowed in text based on where the text is to be put, as described in the other sections.
Newlines in HTML may be represented either as U+000D CARRIAGE RETURN (CR) characters, U+000A LINE FEED (LF) characters, or pairs of U+000D CARRIAGE RETURN (CR), U+000A LINE FEED (LF) characters in that order.
In certain cases described in other sections, text may be mixed with character references. These can be used to escape characters that couldn't otherwise legally be included in text.
Character references must start with a U+0026 AMPERSAND (&). Following this, there are three possible kinds of character references:
Named character references: The ampersand must be followed by one of the names given in the named character references section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON (;) character.
Decimal numeric character reference: The ampersand must be followed by a U+0023 NUMBER SIGN (#) character, followed by one or more digits in the range U+0030 DIGIT ZERO .. U+0039 DIGIT NINE, representing a base-ten integer that itself is a Unicode code point that is not U+0000, U+000D, in the range U+0080 .. U+009F, or in the range 0xD800 .. 0xDFFF (surrogates). The digits must then be followed by a U+003B SEMICOLON character (;).
Hexadecimal numeric character reference: The ampersand must be followed by a U+0023 NUMBER SIGN (#) character, which must be followed by either a U+0078 LATIN SMALL LETTER X or a U+0058 LATIN CAPITAL LETTER X character, which must then be followed by one or more digits in the range U+0030 DIGIT ZERO .. U+0039 DIGIT NINE, U+0061 LATIN SMALL LETTER A .. U+0066 LATIN SMALL LETTER F, and U+0041 LATIN CAPITAL LETTER A .. U+0046 LATIN CAPITAL LETTER F, representing a base-sixteen integer that itself is a Unicode code point that is not U+0000, U+000D, in the range U+0080 .. U+009F, or in the range 0xD800 .. 0xDFFF (surrogates). The digits must then be followed by a U+003B SEMICOLON character (;).
An ambiguous ampersand is a U+0026 AMPERSAND (&) character that is followed by some text other than a space character, a U+003C LESS-THAN SIGN character ('<'), or another U+0026 AMPERSAND (&) character.
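The code point restrictions on numeric character references can be checked before a reference is written out. The following sketch is one way a generator might do that; the function names are illustrative, and the upper bound 0x10FFFF is the general limit of Unicode code points rather than something stated in this section.

```python
def is_referenceable(code_point: int) -> bool:
    """True if a numeric character reference may name this code point."""
    if not 0 <= code_point <= 0x10FFFF:
        return False  # not a Unicode code point at all
    if code_point in (0x0000, 0x000D):
        return False
    if 0x0080 <= code_point <= 0x009F:
        return False
    if 0xD800 <= code_point <= 0xDFFF:
        return False  # surrogates
    return True


def numeric_character_reference(code_point: int, hexadecimal: bool = False) -> str:
    """Build a decimal or hexadecimal numeric character reference."""
    if not is_referenceable(code_point):
        raise ValueError(f"U+{code_point:04X} must not be referenced")
    if hexadecimal:
        return f"&#x{code_point:X};"
    return f"&#{code_point};"


print(numeric_character_reference(0x3C))        # &#60;
print(numeric_character_reference(0x3C, True))  # &#x3C;
```

Both forms always end with the required U+003B SEMICOLON.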
CDATA sections must start with the character sequence U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+005B LEFT SQUARE BRACKET, U+0043 LATIN CAPITAL LETTER C, U+0044 LATIN CAPITAL LETTER D, U+0041 LATIN CAPITAL LETTER A, U+0054 LATIN CAPITAL LETTER T, U+0041 LATIN CAPITAL LETTER A, U+005B LEFT SQUARE BRACKET (<![CDATA[). Following this sequence, the CDATA section may have text, with the additional restriction that the text must not contain the three character sequence U+005D RIGHT SQUARE BRACKET, U+005D RIGHT SQUARE BRACKET, U+003E GREATER-THAN SIGN (]]>). Finally, the CDATA section must be ended by the three character sequence U+005D RIGHT SQUARE BRACKET, U+005D RIGHT SQUARE BRACKET, U+003E GREATER-THAN SIGN (]]>).
Comments must start with the four character sequence U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS (<!--). Following this sequence, the comment may have text, with the additional restriction that the text must not start with a single U+003E GREATER-THAN SIGN ('>') character, nor start with a U+002D HYPHEN-MINUS (-) character followed by a U+003E GREATER-THAN SIGN ('>') character, nor contain two consecutive U+002D HYPHEN-MINUS (-) characters, nor end with a U+002D HYPHEN-MINUS (-) character. Finally, the comment must be ended by the three character sequence U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN (-->).
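The four restrictions on comment text translate directly into string checks. The sketch below covers only those comment-specific restrictions; the general requirements on text (for example, no U+0000 characters) are not checked here, and the function names are illustrative.

```python
def is_valid_comment_text(text: str) -> bool:
    """Check the comment-specific restrictions described above."""
    return not (
        text.startswith(">")      # must not start with ">"
        or text.startswith("->")  # must not start with "->"
        or "--" in text           # must not contain two consecutive hyphens
        or text.endswith("-")     # must not end with a hyphen
    )


def serialize_comment(text: str) -> str:
    """Wrap valid comment text in the comment delimiters."""
    if not is_valid_comment_text(text):
        raise ValueError("text cannot be written as an HTML comment")
    return "<!--" + text + "-->"


print(serialize_comment(" generated file "))  # <!-- generated file -->
print(is_valid_comment_text("a--b"))          # False
```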