This section only applies to documents, authoring tools, and markup generators. In particular, it does not apply to conformance checkers; conformance checkers must use the requirements given in the next section ("parsing HTML documents").
Documents must consist of the following parts, in the given order:
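1. Optionally, a single U+FEFF BYTE ORDER MARK (BOM) character.
2. Any number of comments and space characters.
3. A DOCTYPE.
4. Any number of comments and space characters.
5. The root element, in the form of an html element.
6. Any number of comments and space characters.
The various types of content mentioned above are described in the next few sections.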
In addition, there are some restrictions on how character encoding declarations are to be serialized, as discussed in the section on that topic.
The U+0000 NULL character must not appear anywhere in a document. Space characters before the root html element, and space characters at the start of the html element and before the head element, will be dropped when the document is parsed; space characters after the root html element will be parsed as if they were at the end of the body element. Thus, space characters around the root element do not round-trip.
It is suggested that newlines be inserted after the DOCTYPE, after any comments that are before the root element, after the html element's start tag (if it is not omitted), and after any comments that are inside the html element but before the head element.
A DOCTYPE is a mostly useless, but required, header.
DOCTYPEs are required for legacy reasons. When omitted, browsers tend to use a different rendering mode that is incompatible with some specifications. Including the DOCTYPE in a document ensures that the browser makes a best-effort attempt at following the relevant specifications.
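A DOCTYPE must consist of the following characters, in this order:
1. A U+003C LESS-THAN SIGN (<) character.
2. A U+0021 EXCLAMATION MARK (!) character.
3. The string "DOCTYPE".
4. One or more space characters.
5. The string "HTML".
6. Zero or more space characters.
7. A U+003E GREATER-THAN SIGN (>) character.
In other words, <!DOCTYPE HTML>, case-insensitively.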
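As an illustration, the DOCTYPE rule can be sketched as a small checker. This is a minimal sketch, not a conformance checker. The function name is hypothetical, and it assumes that "space characters" means tab, LF, VT, FF, CR, and space, with at least one between the two words and any number before the closing ">".

```python
import re

# Assumption: the HTML "space characters" are tab, LF, VT, FF, CR and space.
_SPACE = "\t\n\x0b\x0c\r "

# <!DOCTYPE, one or more spaces, HTML, optional spaces, then ">".
# re.ASCII keeps the case-insensitive match limited to ASCII letters.
_DOCTYPE_RE = re.compile(
    "<!DOCTYPE[" + _SPACE + "]+HTML[" + _SPACE + "]*>",
    re.IGNORECASE | re.ASCII,
)


def is_conforming_doctype(s: str) -> bool:
    """Return True if s is, in full, a DOCTYPE of the form <!DOCTYPE HTML>."""
    return _DOCTYPE_RE.fullmatch(s) is not None
```

For example, `is_conforming_doctype("<!doctype html>")` is true, while a legacy DOCTYPE carrying a public identifier is rejected by this sketch.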
There are five different kinds of elements: void elements, CDATA elements, RCDATA elements, foreign elements, and normal elements.
Void elements: base, link, meta, hr, br, img, embed, param, area, col, input
CDATA elements: style, script
RCDATA elements: title, textarea
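The element kinds can be sketched as a lookup on the tag name. This is only an illustration. The set and function names are hypothetical, the grouping of the names above into void, CDATA, and RCDATA elements is assumed, and foreign elements are not modeled because recognizing them needs namespace information that a bare tag name does not carry.

```python
# Assumed grouping of the element names listed above.
VOID_ELEMENTS = {
    "base", "link", "meta", "hr", "br", "img",
    "embed", "param", "area", "col", "input",
}
CDATA_ELEMENTS = {"style", "script"}
RCDATA_ELEMENTS = {"title", "textarea"}


def element_kind(tag_name: str) -> str:
    """Classify an HTML tag name; anything not listed counts as normal."""
    name = tag_name.lower()  # tag names are case-insensitive
    if name in VOID_ELEMENTS:
        return "void"
    if name in CDATA_ELEMENTS:
        return "cdata"
    if name in RCDATA_ELEMENTS:
        return "rcdata"
    return "normal"
```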
Tags are used to delimit the start and end of elements in the markup. CDATA, RCDATA, and normal elements have a start tag to indicate where they begin, and an end tag to indicate where they end. The start and end tags of certain normal elements can be omitted, as described later. Those that cannot be omitted must not be omitted. Void elements only have a start tag; end tags must not be specified for void elements. Foreign elements must either have a start tag and an end tag, or a start tag that is marked as self-closing, in which case they must not have an end tag.
The contents of the element must be placed between just after the start tag (which might be implied, in certain cases) and just before the end tag (which again, might be implied in certain cases). The exact allowed contents of each individual element depend on the content model of that element, as described earlier in this specification. Elements must not contain content that their content model disallows. In addition to the restrictions placed on the contents by those content models, however, the five types of elements have additional syntactic requirements.
Void elements can't have any contents (since there's no end tag, no content can be put between the start tag and the end tag).
CDATA elements can have text, though it has restrictions described below.
RCDATA elements can have text and character entity references, but the text must not contain an ambiguous ampersand. There are also further restrictions described below.
Foreign elements whose start tag is marked as self-closing can't have any contents (since, again, there's no end tag, no content can be put between the start tag and the end tag). Foreign elements whose start tag is not marked as self-closing can have text, character references, CDATA blocks, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand.
Normal elements can have text, character entity references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.
Tags contain a tag name, giving the element's name. HTML elements all have names that only use characters in the range U+0030 DIGIT ZERO .. U+0039 DIGIT NINE, U+0061 LATIN SMALL LETTER A .. U+007A LATIN SMALL LETTER Z, U+0041 LATIN CAPITAL LETTER A .. U+005A LATIN CAPITAL LETTER Z, and U+002D HYPHEN-MINUS (-). In the HTML syntax, tag names may be written with any mix of lower- and uppercase letters that, when converted to all-lowercase, matches the element's tag name; tag names are case-insensitive.
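The character ranges and the case-insensitive comparison can be sketched as follows. Both function names are hypothetical, and the sketch checks only the character set of a name, not whether the name belongs to an actual HTML element.

```python
import re

# Digits, ASCII letters in either case, and HYPHEN-MINUS, per the ranges above.
_TAG_NAME_RE = re.compile(r"[0-9A-Za-z-]+")


def uses_only_tag_name_characters(name: str) -> bool:
    """Return True if name uses only the characters allowed in HTML tag names."""
    return _TAG_NAME_RE.fullmatch(name) is not None


def matches_tag_name(written: str, element_name: str) -> bool:
    """Compare a tag name as written in markup with an element's tag name,
    after converting both to lowercase."""
    return written.lower() == element_name.lower()
```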
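Start tags must have the following format:
1. The first character of a start tag must be a U+003C LESS-THAN SIGN (<).
2. The next few characters of a start tag must be the element's tag name.
3. If there are to be any attributes in the next step, there must first be one or more space characters.
4. Then, the start tag may have a number of attributes, the syntax for which is described below. Attributes may be separated from each other by one or more space characters.
5. After the attributes, there may be one or more space characters. (Some attributes are required to be followed by a space. See the attributes section below.)
6. Then, if the element is one of the void elements, or if the element is a foreign element, then there may be a single U+002F SOLIDUS (/) character. This character has no effect on void elements, but on foreign elements it marks the start tag as self-closing.
7. Finally, start tags must be closed by a U+003E GREATER-THAN SIGN (>) character.
End tags must have the following format:
1. The first character of an end tag must be a U+003C LESS-THAN SIGN (<).
2. The second character of an end tag must be a U+002F SOLIDUS (/).
3. The next few characters of an end tag must be the element's tag name.
4. After the tag name, there may be one or more space characters.
5. Finally, end tags must be closed by a U+003E GREATER-THAN SIGN (>) character.
Attributes for an element are expressed inside the element's start tag.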
Attributes have a name and a value. Attribute names must consist of one character other than the space characters, U+003E GREATER-THAN SIGN (>), and U+002F SOLIDUS (/), followed by zero or more characters other than the space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), U+003E GREATER-THAN SIGN (>), U+002F SOLIDUS (/), and U+003D EQUALS SIGN (=) characters, the control characters, and any characters that are not defined by Unicode. In the HTML syntax, attribute names may be written with any mix of lower- and uppercase letters that, when converted to all-lowercase, matches the attribute's name; attribute names are case-insensitive.
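Read literally, the attribute-name rule has one set of exclusions for the first character and a larger set for the rest. The sketch below follows that literal reading. The function name is hypothetical, the set of space characters is assumed, and Unicode general categories Cc and Cn are used as an approximation of "control characters" and "characters not defined by Unicode".

```python
import unicodedata

# Assumption: the HTML "space characters" are tab, LF, VT, FF, CR and space.
SPACE_CHARACTERS = set("\t\n\x0b\x0c\r ")


def is_valid_attribute_name(name: str) -> bool:
    """Check an attribute name against the character restrictions above."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    # First character: anything but a space character, ">" or "/".
    if first in SPACE_CHARACTERS or first in ">/":
        return False
    for ch in rest:
        # Later characters: also exclude NULL, quotes and "=".
        if ch in SPACE_CHARACTERS or ch in "\x00\"'>/=":
            return False
        # Approximation: Cc = control characters, Cn = unassigned code points.
        if unicodedata.category(ch) in ("Cc", "Cn"):
            return False
    return True
```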
Attribute values are a mixture of text and character entity references, except with the additional restriction that the text cannot contain an ambiguous ampersand.
Attributes can be specified in four different ways:
Empty attribute syntax
Just the attribute name.
In the following example, the disabled attribute is given with the empty attribute syntax:
<input disabled>
If an attribute using the empty attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
Unquoted attribute value syntax
The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal space characters, U+0022 QUOTATION MARK (") characters, U+0027 APOSTROPHE (') characters, U+003D EQUALS SIGN (=) characters, or U+003E GREATER-THAN SIGN (>) characters.
In the following example, the value attribute is given with the unquoted attribute value syntax:
<input value=yes>
If an attribute using the unquoted attribute syntax is to be followed by another attribute or by one of the optional U+002F SOLIDUS (/) characters allowed in step 6 of the start tag syntax above, then there must be a space character separating the two.
Single-quoted attribute value syntax
The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by a single U+0027 APOSTROPHE (') character, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0027 APOSTROPHE (') characters, and finally followed by a second single U+0027 APOSTROPHE (') character.
In the following example, the type attribute is given with the single-quoted attribute value syntax:
<input type='checkbox'>
If an attribute using the single-quoted attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
Double-quoted attribute value syntax
The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by a single U+0022 QUOTATION MARK (") character, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0022 QUOTATION MARK (") characters, and finally followed by a second single U+0022 QUOTATION MARK (") character.
In the following example, the name attribute is given with the double-quoted attribute value syntax:
<input name="be evil">
If an attribute using the double-quoted attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
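A markup generator has to pick one of the four syntaxes for each attribute. The following is a minimal sketch of one possible policy, not a prescribed algorithm. The function name is hypothetical. It assumes that an empty value may be written with the empty attribute syntax, that any ambiguous ampersands in the value have already been escaped, and that `&quot;` is an available named entity for the fallback case. The unquoted check is deliberately conservative.

```python
# Assumption: the HTML "space characters" are tab, LF, VT, FF, CR and space.
SPACE_CHARACTERS = "\t\n\x0b\x0c\r "


def serialize_attribute(name: str, value: str) -> str:
    """Serialize one attribute, choosing the simplest syntax that is allowed."""
    if value == "":
        return name  # empty attribute syntax (assumes an implied empty value)
    forbidden_unquoted = SPACE_CHARACTERS + "\"'=>"
    if not any(ch in forbidden_unquoted for ch in value):
        return name + "=" + value  # unquoted attribute value syntax
    if '"' not in value:
        return name + '="' + value + '"'  # double-quoted syntax
    if "'" not in value:
        return name + "='" + value + "'"  # single-quoted syntax
    # Both quote characters occur: escape the double quotes (assumed entity).
    return name + '="' + value.replace('"', "&quot;") + '"'
```

When emitting several attributes, the caller would still need to separate them with a space character, as required above.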
Certain tags can be omitted.
An html element's start tag may be omitted if the first thing inside the html element is not a space character or a comment.
An html element's end tag may be omitted if the html element is not immediately followed by a space character or a comment, and the element contains a body element that is either not empty or whose start tag has not been omitted.
A head element's start tag may be omitted if the first thing inside the head element is an element.
A head element's end tag may be omitted if the head element is not immediately followed by a space character or a comment.
A body element's start tag may be omitted if the first thing inside the body element is not a space character or a comment, except if the first thing inside the body element is a script or style element.
A body element's end tag may be omitted if the body element is not immediately followed by a space character or a comment, and the element is either not empty or its start tag has not been omitted.
A li element's end tag may be omitted if the li element is immediately followed by another li element or if there is no more content in the parent element.
A dt element's end tag may be omitted if the dt element is immediately followed by another dt element or a dd element.
A dd element's end tag may be omitted if the dd element is immediately followed by another dd element or a dt element, or if there is no more content in the parent element.
A p element's end tag may be omitted if the p element is immediately followed by an address, blockquote, dl, fieldset, form, h1, h2, h3, h4, h5, h6, hr, menu, ol, p, pre, table, or ul element, or if there is no more content in the parent element.
An optgroup element's end tag may be omitted if the optgroup element is immediately followed by another optgroup element, or if there is no more content in the parent element.
An option element's end tag may be omitted if the option element is immediately followed by another option element, or if there is no more content in the parent element.
A colgroup element's start tag may be omitted if the first thing inside the colgroup element is a col element, and if the element is not immediately preceded by another colgroup element whose end tag has been omitted.
A colgroup element's end tag may be omitted if the colgroup element is not immediately followed by a space character or a comment.
A thead element's end tag may be omitted if the thead element is immediately followed by a tbody or tfoot element.
A tbody element's start tag may be omitted if the first thing inside the tbody element is a tr element, and if the element is not immediately preceded by a tbody, thead, or tfoot element whose end tag has been omitted.
A tbody element's end tag may be omitted if the tbody element is immediately followed by a tbody or tfoot element, or if there is no more content in the parent element.
A tfoot element's end tag may be omitted if the tfoot element is immediately followed by a tbody element, or if there is no more content in the parent element.
A tr element's end tag may be omitted if the tr element is immediately followed by another tr element, or if there is no more content in the parent element.
A td element's end tag may be omitted if the td element is immediately followed by a td or th element, or if there is no more content in the parent element.
A th element's end tag may be omitted if the th element is immediately followed by a td or th element, or if there is no more content in the parent element.
However, a start tag must never be omitted if it has any attributes.
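Each omission rule reduces to a test on what immediately follows the element. As one illustration, the rule for the p element's end tag can be sketched as below. The function and set names are hypothetical, and the caller is assumed to pass the lowercase tag name of the immediately following element, `None` when there is no more content in the parent, or any other marker (such as `"#text"`) for non-element content.

```python
from typing import Optional

# Elements that allow a preceding p element's end tag to be omitted.
P_END_TAG_OMISSIBLE_BEFORE = {
    "address", "blockquote", "dl", "fieldset", "form",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "hr", "menu", "ol", "p", "pre", "table", "ul",
}


def can_omit_p_end_tag(next_sibling: Optional[str]) -> bool:
    """Return True if a p element's end tag may be omitted.

    next_sibling is None when nothing follows the p element in its parent.
    """
    if next_sibling is None:
        return True  # no more content in the parent element
    return next_sibling in P_END_TAG_OMISSIBLE_BEFORE
```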
For historical reasons, certain elements have extra restrictions beyond even the restrictions given by their content model.
A p element must not contain blockquote, dl, menu, ol, pre, table, or ul elements, even though these elements are technically allowed inside p elements according to the content models described in this specification. (In fact, if one of those elements is put inside a p element in the markup, it will instead imply a p element end tag before it.)
An optgroup element must not contain optgroup elements, even though these elements are technically allowed to be nested according to the content models described in this specification. (If an optgroup element is put inside another in the markup, it will in fact imply an optgroup end tag before it.)
A table element must not contain tr elements, even though these elements are technically allowed inside table elements according to the content models described in this specification. (If a tr element is put inside a table in the markup, it will in fact imply a tbody start tag before it.)
A single U+000A LINE FEED (LF) character may be placed immediately after the start tag of pre and textarea elements. This does not affect the processing of the element. The otherwise optional U+000A LINE FEED (LF) character must be included if the element's contents start with that character (because otherwise the leading newline in the contents would be treated like the optional newline, and ignored).
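For a serializer, the newline rule means adding one extra LF whenever the contents of a pre or textarea element themselves begin with LF. A minimal sketch, with a hypothetical function name:

```python
def contents_after_start_tag(tag_name: str, contents: str) -> str:
    """Return the text to emit right after the start tag of an element.

    For pre and textarea, a leading LF in the contents is doubled so that
    the parser drops only the extra one and the real newline survives.
    """
    if tag_name.lower() in ("pre", "textarea") and contents.startswith("\n"):
        return "\n" + contents
    return contents
```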
The text in CDATA and RCDATA elements must not contain any occurrences of the string "</" (U+003C LESS-THAN SIGN, U+002F SOLIDUS) followed by characters that case-insensitively match the tag name of the element followed by one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000B LINE TABULATION, U+000C FORM FEED (FF), U+0020 SPACE, U+003E GREATER-THAN SIGN (>), or U+002F SOLIDUS (/), unless that string is part of an escaping text span.
An escaping text span is a span of text (in CDATA and RCDATA elements) and character entity references (in RCDATA elements) that starts with an escaping text span start that is not itself in an escaping text span, and ends at the next escaping text span end. There cannot be any character references inside an escaping text span.
An escaping text span start is a part of text that consists of the four character sequence "<!--" (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS).
An escaping text span end is a part of text that consists of the three character sequence "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).
An escaping text span start may share its U+002D HYPHEN-MINUS characters with its corresponding escaping text span end.
The text in CDATA and RCDATA elements must not have an escaping text span start that is not followed by an escaping text span end.
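The two restrictions on CDATA and RCDATA text (no end-tag-like string outside an escaping text span, and no unterminated escaping text span) can be checked with a single scan. This is a rough sketch under simplifying assumptions. The function name is hypothetical, and an end-tag-like string that straddles the boundary of a span is not treated specially.

```python
import re


def is_safe_raw_text(tag_name: str, text: str) -> bool:
    """Check text destined for a CDATA or RCDATA element named tag_name."""
    # "</" + tag name + one of: tab, LF, VT, FF, space, ">" or "/".
    end_tag = re.compile(
        "</" + re.escape(tag_name) + "[\t\n\x0b\x0c />]",
        re.IGNORECASE,
    )
    pos = 0
    while True:
        start = text.find("<!--", pos)
        limit = len(text) if start == -1 else start
        # Text outside any escaping text span must not look like the end tag.
        if end_tag.search(text, pos, limit):
            return False
        if start == -1:
            return True
        # The span end may share hyphens with the span start ("<!-->").
        end = text.find("-->", start + 2)
        if end == -1:
            return False  # escaping text span start without an end
        pos = end + 3
```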
Text is allowed inside elements, attributes, and comments. Text must consist of Unicode characters. Text must not contain U+0000 characters. Text must not contain permanently undefined Unicode characters. Text must not contain control characters other than space characters. Extra constraints are placed on what is and what is not allowed in text based on where the text is to be put, as described in the other sections.
Newlines in HTML may be represented either as U+000D CARRIAGE RETURN (CR) characters, U+000A LINE FEED (LF) characters, or pairs of U+000D CARRIAGE RETURN (CR), U+000A LINE FEED (LF) characters in that order.
In certain cases described in other sections, text may be mixed with character entity references. These can be used to escape characters that couldn't otherwise legally be included in text.
Character entity references must start with a U+0026 AMPERSAND (&). Following this, there are three possible kinds of character entity references:
Named entities: The ampersand must be followed by one of the names given in the entities section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON (;) character.
Decimal numeric entities: The ampersand must be followed by a U+0023 NUMBER SIGN (#) character, followed by one or more digits in the range U+0030 DIGIT ZERO .. U+0039 DIGIT NINE, representing a base-ten integer that itself is a valid Unicode code point. The digits must then be followed by a U+003B SEMICOLON character (;).
Hexadecimal numeric entities: The ampersand must be followed by a U+0023 NUMBER SIGN (#) character, which must be followed by either a U+0078 LATIN SMALL LETTER X or a U+0058 LATIN CAPITAL LETTER X character, which must then be followed by one or more digits in the range U+0030 DIGIT ZERO .. U+0039 DIGIT NINE, U+0061 LATIN SMALL LETTER A .. U+0066 LATIN SMALL LETTER F, and U+0041 LATIN CAPITAL LETTER A .. U+0046 LATIN CAPITAL LETTER F, representing a base-sixteen integer that itself is a valid Unicode code point. The digits must then be followed by a U+003B SEMICOLON character (;).
An ambiguous ampersand is a U+0026 AMPERSAND (&) character that is followed by some text other than a space character, a U+003C LESS-THAN SIGN character ('<'), or another U+0026 AMPERSAND (&) character.
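The decimal and hexadecimal forms of character entity reference described above can be recognized with one regular expression. The sketch below is illustrative only. The function name is hypothetical, named entities are not handled, and the code point check (non-zero and no greater than U+10FFFF) is an assumption standing in for the exact restrictions.

```python
import re
from typing import Optional

# "&#" + decimal digits + ";"  or  "&#" + x/X + hexadecimal digits + ";"
_NUMERIC_REF = re.compile(r"&#(?:([0-9]+)|[xX]([0-9A-Fa-f]+));")


def parse_numeric_reference(ref: str) -> Optional[int]:
    """Return the code point named by a numeric reference, or None if the
    string is not a well-formed numeric reference."""
    m = _NUMERIC_REF.fullmatch(ref)
    if m is None:
        return None
    if m.group(1) is not None:
        code_point = int(m.group(1), 10)
    else:
        code_point = int(m.group(2), 16)
    # Assumed validity check: reject U+0000 and values beyond Unicode.
    if code_point == 0 or code_point > 0x10FFFF:
        return None
    return code_point
```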
CDATA blocks must start with the character sequence U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+005B LEFT SQUARE BRACKET, U+0043 LATIN CAPITAL LETTER C, U+0044 LATIN CAPITAL LETTER D, U+0041 LATIN CAPITAL LETTER A, U+0054 LATIN CAPITAL LETTER T, U+0041 LATIN CAPITAL LETTER A, U+005B LEFT SQUARE BRACKET (<![CDATA[). Following this sequence, the block may have text, with the additional restriction that the text must not contain the three character sequence U+005D RIGHT SQUARE BRACKET, U+005D RIGHT SQUARE BRACKET, U+003E GREATER-THAN SIGN (]]>). Finally, the CDATA block must be ended by the three character sequence U+005D RIGHT SQUARE BRACKET, U+005D RIGHT SQUARE BRACKET, U+003E GREATER-THAN SIGN (]]>).
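A generator can enforce the CDATA block restriction when it wraps text. A minimal sketch follows. The function name is hypothetical, and the general restrictions on text (such as no U+0000) are assumed to have been checked elsewhere.

```python
def make_cdata_block(text: str) -> str:
    """Wrap text in a CDATA block, refusing text that contains "]]>"."""
    if "]]>" in text:
        raise ValueError('CDATA block text must not contain "]]>"')
    return "<![CDATA[" + text + "]]>"
```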
Comments must start with the four character sequence U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS (<!--). Following this sequence, the comment may have text, with the additional restriction that the text must not start with a single U+003E GREATER-THAN SIGN ('>') character, nor start with a U+002D HYPHEN-MINUS (-) character followed by a U+003E GREATER-THAN SIGN ('>') character, nor contain two consecutive U+002D HYPHEN-MINUS (-) characters, nor end with a U+002D HYPHEN-MINUS (-) character. Finally, the comment must be ended by the three character sequence U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN (-->).
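The four restrictions on comment text translate directly into string checks. The sketch below is an illustration with hypothetical function names. It covers only the comment-specific rules, not the general restrictions on text.

```python
def is_valid_comment_text(text: str) -> bool:
    """Check the comment-specific restrictions on the text of a comment."""
    if text.startswith(">") or text.startswith("->"):
        return False  # must not start with ">" or "->"
    if "--" in text:
        return False  # must not contain two consecutive hyphens
    if text.endswith("-"):
        return False  # must not end with a hyphen
    return True


def make_comment(text: str) -> str:
    """Serialize a comment, refusing text that breaks the rules above."""
    if not is_valid_comment_text(text):
        raise ValueError("text is not allowed inside a comment")
    return "<!--" + text + "-->"
```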