This section only describes the rules for resources labeled with an HTML MIME type. Rules for XML resources are discussed in the section below entitled "The XHTML syntax".
Status: Last call for comments
This section only applies to documents, authoring tools, and markup generators. In particular, it does not apply to conformance checkers; conformance checkers must use the requirements given in the next section ("parsing HTML documents").
Documents must consist of the following parts, in the given order:
html element.
The various types of content mentioned above are described in the next few sections.
In addition, there are some restrictions on how character encoding declarations are to be serialized, as discussed in the section on that topic.
Space characters before the root html element, and
   space characters at the start of the html element and
   before the head element, will be dropped when the
   document is parsed; space characters after the root
   html element will be parsed as if they were at the end
   of the body element. Thus, space characters around the
   root element do not round-trip.
It is suggested that newlines be inserted after the DOCTYPE,
   after any comments that are before the root element, after the
   html element's start tag (if it is not omitted), and after any comments
   that are inside the html element but before the
   head element.
Many strings in the HTML syntax (e.g. the names of elements and their attributes) are case-insensitive, but only for characters in the ranges U+0041 to U+005A (LATIN CAPITAL LETTER A to LATIN CAPITAL LETTER Z) and U+0061 to U+007A (LATIN SMALL LETTER A to LATIN SMALL LETTER Z). For convenience, in this section this is just referred to as "case-insensitive".
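A minimal sketch of this restricted notion of case-insensitivity, in Python. The helper names `ascii_lower` and `ascii_case_insensitive_match` are illustrative and are not defined by this specification. The point is that only the letters A to Z are folded, so characters outside the two ranges above never match an ASCII letter.

```python
# Map only U+0041..U+005A onto U+0061..U+007A; every other character is untouched.
_ASCII_UPPER_TO_LOWER = {code: code + 0x20 for code in range(0x41, 0x5B)}


def ascii_lower(text: str) -> str:
    """Lowercase the ASCII capital letters in text and nothing else."""
    return text.translate(_ASCII_UPPER_TO_LOWER)


def ascii_case_insensitive_match(a: str, b: str) -> bool:
    """Compare two strings, ignoring case for ASCII letters only."""
    return ascii_lower(a) == ascii_lower(b)
```

For instance, "HTML" and "html" match, whereas U+212A KELVIN SIGN does not match "k", even though Python's general-purpose `str.lower()` maps it to "k".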
Status: Last call for comments. ISSUE-4 (html-versioning) and ISSUE-84 (legacy-doctypes) block progress to Last Call
A DOCTYPE is a required preamble.
DOCTYPEs are required for legacy reasons. When omitted, browsers tend to use a different rendering mode that is incompatible with some specifications. Including the DOCTYPE in a document ensures that the browser makes a best-effort attempt at following the relevant specifications.
A DOCTYPE must consist of the following characters, in this order:
"<!DOCTYPE".
"HTML".
In other words, <!DOCTYPE HTML>, case-insensitively.
For the purposes of HTML generators that cannot output HTML
  markup with the short DOCTYPE "<!DOCTYPE
  HTML>", a DOCTYPE legacy string may be inserted
  into the DOCTYPE (in the position defined above). This string must
  consist of:
"SYSTEM".
"about:legacy-compat".
In other words, <!DOCTYPE HTML SYSTEM "about:legacy-compat"> or <!DOCTYPE HTML SYSTEM 'about:legacy-compat'>, case-insensitively except for the part in single or double quotes.
The DOCTYPE legacy string should not be used unless the document is generated from a system that cannot output the shorter string.
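As a rough illustration, a markup generator could check its own output against the short DOCTYPE and the DOCTYPE legacy string with a regular expression such as the one below. This is a sketch under stated assumptions: the function name `is_conforming_doctype` is hypothetical, the pattern assumes that one or more space characters separate the keywords and that space characters may precede the closing ">", and obsolete permitted DOCTYPEs are deliberately not covered.

```python
import re

# The HTML space characters: tab, line feed, form feed, carriage return, space.
_SP = r"[ \t\n\f\r]"

# The quoted part is matched case-sensitively; the keywords are not.
_LEGACY = r"""(?:"about:legacy-compat"|'about:legacy-compat')"""

# Assumption: keywords are separated by one or more space characters, and
# zero or more space characters may come before the final ">".
_DOCTYPE = re.compile(
    rf"(?i:<!DOCTYPE){_SP}+(?i:HTML)(?:{_SP}+(?i:SYSTEM){_SP}+{_LEGACY})?{_SP}*>",
    re.ASCII,  # keep case folding limited to ASCII letters
)


def is_conforming_doctype(doctype: str) -> bool:
    """Return True for the short DOCTYPE or the legacy-compat form."""
    return _DOCTYPE.fullmatch(doctype) is not None
```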
To help authors transition from HTML4 and XHTML1, an obsolete permitted DOCTYPE string can be inserted into the DOCTYPE (in the position defined above). This string must consist of:
"PUBLIC".

| Public identifier | System identifier |
|---|---|
| -//W3C//DTD HTML 4.0//EN | |
| -//W3C//DTD HTML 4.0//EN | http://www.w3.org/TR/REC-html40/strict.dtd | 
| -//W3C//DTD HTML 4.01//EN | |
| -//W3C//DTD HTML 4.01//EN | http://www.w3.org/TR/html4/strict.dtd | 
| -//W3C//DTD XHTML 1.0 Strict//EN | http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd | 
| -//W3C//DTD XHTML 1.1//EN | http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd | 
A DOCTYPE containing an obsolete permitted DOCTYPE string is an obsolete permitted DOCTYPE. Authors should not use obsolete permitted DOCTYPEs, as they are unnecessarily long.
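The table above lends itself to a simple lookup. The following sketch assumes that an empty System identifier cell means the system identifier is absent (represented here as `None`); the names `OBSOLETE_PERMITTED` and `is_obsolete_permitted` are illustrative only.

```python
from typing import Optional

# (public identifier, system identifier) pairs copied from the table;
# None stands for a row whose system identifier cell is empty.
OBSOLETE_PERMITTED = frozenset({
    ("-//W3C//DTD HTML 4.0//EN", None),
    ("-//W3C//DTD HTML 4.0//EN", "http://www.w3.org/TR/REC-html40/strict.dtd"),
    ("-//W3C//DTD HTML 4.01//EN", None),
    ("-//W3C//DTD HTML 4.01//EN", "http://www.w3.org/TR/html4/strict.dtd"),
    ("-//W3C//DTD XHTML 1.0 Strict//EN",
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"),
    ("-//W3C//DTD XHTML 1.1//EN",
     "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"),
})


def is_obsolete_permitted(public_id: str, system_id: Optional[str] = None) -> bool:
    """Return True if the identifier pair appears as a row of the table."""
    return (public_id, system_id) in OBSOLETE_PERMITTED
```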
There are five different kinds of elements: void elements, raw text elements, RCDATA elements, foreign elements, and normal elements.
Void elements: area, base, br, col, command, embed, hr, img, input, keygen, link, meta, param, source, track, wbr
Raw text elements: script, style
RCDATA elements: textarea, title
Foreign elements: elements from the MathML and SVG namespaces
Normal elements: all other allowed HTML elements
Tags are used to delimit the start and end of elements in the markup. Raw text, RCDATA, and normal elements have a start tag to indicate where they begin, and an end tag to indicate where they end. The start and end tags of certain normal elements can be omitted, as described later. Those that cannot be omitted must not be omitted. Void elements only have a start tag; end tags must not be specified for void elements. Foreign elements must either have a start tag and an end tag, or a start tag that is marked as self-closing, in which case they must not have an end tag.
The contents of the element must be placed between just after the start tag (which might be implied, in certain cases) and just before the end tag (which again, might be implied in certain cases). The exact allowed contents of each individual element depends on the content model of that element, as described earlier in this specification. Elements must not contain content that their content model disallows. In addition to the restrictions placed on the contents by those content models, however, the five types of elements have additional syntactic requirements.
Void elements can't have any contents (since there's no end tag, no content can be put between the start tag and the end tag).
Raw text elements can have text, though it has restrictions described below.
RCDATA elements can have text and character references, but the text must not contain an ambiguous ampersand. There are also further restrictions described below.
Foreign elements whose start tag is marked as self-closing can't have any contents (since, again, as there's no end tag, no content can be put between the start tag and the end tag). Foreign elements whose start tag is not marked as self-closing can have text, character references, CDATA sections, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand.
The HTML syntax does not support namespace declarations, even in foreign elements.
For instance, consider the following HTML fragment:
<p>
 <svg>
  <metadata>
   <!-- this is invalid -->
   <cdr:license xmlns:cdr="http://www.example.com/cdr/metadata" name="MIT"/>
  </metadata>
 </svg>
</p>
The innermost element, cdr:license, is
   actually in the SVG namespace, as the "xmlns:cdr" attribute has no effect (unlike in
   XML). In fact, as the comment in the fragment above says, the
   fragment is actually non-conforming. This is because the SVG
   specification does not define any elements called "cdr:license" in the SVG namespace.
Normal elements can have text, character references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.
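The per-kind content rules above might be applied by a markup generator roughly as follows. This is a sketch, not a definitive serializer: it handles only elements whose content is plain text, it leaves foreign elements out of scope, and the names `element_kind` and `serialize_text_element` are hypothetical. Escaping every "&" and "<" goes further than the rules require (only ambiguous ampersands are forbidden), but it is a simple way to stay within them.

```python
VOID_ELEMENTS = frozenset(
    "area base br col command embed hr img input keygen link meta "
    "param source track wbr".split()
)
RAW_TEXT_ELEMENTS = frozenset({"script", "style"})
RCDATA_ELEMENTS = frozenset({"textarea", "title"})

# Tag names are case-insensitive for ASCII letters only.
_ASCII_LOWER = {code: code + 0x20 for code in range(0x41, 0x5B)}


def element_kind(tag_name: str) -> str:
    """Classify an HTML element by tag name (foreign elements are out of scope)."""
    name = tag_name.translate(_ASCII_LOWER)
    if name in VOID_ELEMENTS:
        return "void"
    if name in RAW_TEXT_ELEMENTS:
        return "raw text"
    if name in RCDATA_ELEMENTS:
        return "RCDATA"
    return "normal"


def serialize_text_element(tag_name: str, text: str = "") -> str:
    """Serialize an element whose only content is text."""
    kind = element_kind(tag_name)
    if kind == "void":
        if text:
            raise ValueError(f"void element {tag_name!r} cannot have contents")
        return f"<{tag_name}>"  # start tag only, never an end tag
    if kind == "raw text":
        # Raw text elements take text only, so nothing can be escaped here.
        body = text
    else:
        # Stricter than required, but always within the rules.
        body = text.replace("&", "&amp;").replace("<", "&lt;")
    return f"<{tag_name}>{body}</{tag_name}>"
```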
Tags contain a tag name, giving the element's name. HTML elements all have names that only use characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z, and U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z. In the HTML syntax, tag names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that, when converted to all-lowercase, matches the element's tag name; tag names are case-insensitive.
Start tags must have the following format:
End tags must have the following format:
Attributes for an element are expressed inside the element's start tag.
Attributes have a name and a value. Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), U+003E GREATER-THAN SIGN (>), U+002F SOLIDUS (/), and U+003D EQUALS SIGN (=) characters, the control characters, and any characters that are not defined by Unicode. In the HTML syntax, attribute names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that are an ASCII case-insensitive match for the attribute's name.
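The attribute name rule can be approximated in a few lines. In this sketch, "control characters" is taken to mean Unicode general category Cc and "not defined by Unicode" is taken to mean category Cn as reported by Python's `unicodedata` module (which depends on the Unicode version bundled with the interpreter); both readings are assumptions, and `is_valid_attribute_name` is an illustrative name.

```python
import unicodedata

# Space characters, U+0000, quotation mark, apostrophe, ">", "/" and "=".
_FORBIDDEN = set(" \t\n\f\r\x00\"'>/=")


def is_valid_attribute_name(name: str) -> bool:
    """Check a candidate attribute name against the character restrictions."""
    if not name:
        return False  # one or more characters are required
    for ch in name:
        if ch in _FORBIDDEN:
            return False
        # Cc = control character, Cn = code point not assigned by Unicode.
        if unicodedata.category(ch) in ("Cc", "Cn"):
            return False
    return True
```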
Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand.
Attributes can be specified in four different ways:
Empty attribute syntax
Just the attribute name. The value is implicitly the empty string.
In the following example, the disabled attribute is given with
     the empty attribute syntax:
<input disabled>
If an attribute using the empty attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
Unquoted attribute value syntax
The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal space characters, any U+0022 QUOTATION MARK characters ("), U+0027 APOSTROPHE characters ('), U+003D EQUALS SIGN characters (=), U+003C LESS-THAN SIGN characters (<), U+003E GREATER-THAN SIGN characters (>), or U+0060 GRAVE ACCENT characters (`), and must not be the empty string.
In the following example, the value attribute is given
     with the unquoted attribute value syntax:
<input value=yes>
If an attribute using the unquoted attribute syntax is to be followed by another attribute or by the optional U+002F SOLIDUS character (/) allowed in step 6 of the start tag syntax above, then there must be a space character separating the two.
Single-quoted attribute value syntax
The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by a single U+0027 APOSTROPHE character ('), followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0027 APOSTROPHE characters ('), and finally followed by a second single U+0027 APOSTROPHE character (').
In the following example, the type attribute is given with the
     single-quoted attribute value syntax:
<input type='checkbox'>
If an attribute using the single-quoted attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
Double-quoted attribute value syntax
The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by a single U+0022 QUOTATION MARK character ("), followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0022 QUOTATION MARK characters ("), and finally followed by a second single U+0022 QUOTATION MARK character (").
In the following example, the name attribute is given with the
     double-quoted attribute value syntax:
<input name="be evil">
If an attribute using the double-quoted attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
There must never be two or more attributes on the same start tag whose names are an ASCII case-insensitive match for each other.
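One possible way for a markup generator to choose among the four attribute syntaxes is sketched below. The function name `serialize_attribute` and the order of preference (empty, unquoted, double-quoted, single-quoted) are illustrative choices, not requirements. Every "&" is written as `&amp;`, which is more than the rules demand but guarantees that no ambiguous ampersand is produced.

```python
SPACE_CHARACTERS = " \t\n\f\r"
# Characters that rule out the unquoted attribute value syntax.
UNQUOTED_FORBIDDEN = set(SPACE_CHARACTERS + "\"'=<>`")


def serialize_attribute(name: str, value: str) -> str:
    """Serialize one attribute, picking the shortest syntax that is allowed."""
    if value == "":
        return name  # empty attribute syntax
    escaped = value.replace("&", "&amp;")
    if not (set(escaped) & UNQUOTED_FORBIDDEN):
        return f"{name}={escaped}"  # unquoted
    if '"' not in escaped:
        return f'{name}="{escaped}"'  # double-quoted
    if "'" not in escaped:
        return f"{name}='{escaped}'"  # single-quoted
    # Both quote characters occur: escape the double quotes.
    return '{}="{}"'.format(name, escaped.replace('"', "&quot;"))
```

When several attributes are written on one start tag, the caller still has to separate them with a space character and reject names that are an ASCII case-insensitive match for each other.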
When a foreign element has one of the namespaced attributes given by the local name and namespace of the first and second cells of a row from the following table, it must be written using the name given by the third cell from the same row.
| Local name | Namespace | Attribute name | 
|---|---|---|
| actuate | XLink namespace | xlink:actuate | 
| arcrole | XLink namespace | xlink:arcrole | 
| href | XLink namespace | xlink:href | 
| role | XLink namespace | xlink:role | 
| show | XLink namespace | xlink:show | 
| title | XLink namespace | xlink:title | 
| type | XLink namespace | xlink:type | 
| base | XML namespace | xml:base | 
| lang | XML namespace | xml:lang | 
| space | XML namespace | xml:space | 
| xmlns | XMLNS namespace | xmlns | 
| xlink | XMLNS namespace | xmlns:xlink | 
No other namespaced attribute can be expressed in the HTML syntax.
Certain tags can be omitted.
Omitting an element's start tag does not mean the element
  is not present; it is implied, but it is still there. An HTML
  document always has a root html element, even if the
  string <html> doesn't appear anywhere in
  the markup.
An html element's start tag may be omitted if the
  first thing inside the html element is not a comment.
An html element's end
  tag may be omitted if the html element is not
  immediately followed by a comment.
A head element's start tag may be omitted if the
  element is empty, or if the first thing inside the
  head element is an element.
A head element's end
  tag may be omitted if the head element is not
  immediately followed by a space character or a comment.
A body element's start tag may be omitted if the
  element is empty, or if the first thing inside the body
  element is not a space character or a comment, except if the first thing
  inside the body element is a script or
  style element. 
A body element's end
  tag may be omitted if the body element is not
  immediately followed by a comment.
A li element's end
  tag may be omitted if the li element is
  immediately followed by another li element or if there
  is no more content in the parent element.
A dt element's end
  tag may be omitted if the dt element is
  immediately followed by another dt element or a
  dd element.
A dd element's end
  tag may be omitted if the dd element is
  immediately followed by another dd element or a
  dt element, or if there is no more content in the
  parent element.
A p element's end
  tag may be omitted if the p element is
  immediately followed by an address,
  article, aside, blockquote,
  dir, div, dl, fieldset,
  footer, form, h1,
  h2, h3, h4, h5,
  h6, header, hgroup,
  hr, menu, nav,
  ol, p, pre,
  section, table, or ul
  element, or if there is no more content in the parent element and
  the parent element is not an a element.
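The rule for the p element's end tag can be expressed as a small predicate. This sketch assumes a simplified model in which the caller passes the lowercase tag name of the node immediately following the p element, a placeholder such as "#text" for anything that is not an element, or `None` when there is no more content in the parent; the function and constant names are hypothetical.

```python
from typing import Optional

# Elements that, when they immediately follow a p element, allow its end tag to be omitted.
P_END_TAG_OMISSION_FOLLOWERS = frozenset(
    "address article aside blockquote dir div dl fieldset footer form "
    "h1 h2 h3 h4 h5 h6 header hgroup hr menu nav ol p pre section table ul".split()
)


def may_omit_p_end_tag(next_sibling: Optional[str], parent: str) -> bool:
    """Decide whether </p> may be left out in the given position."""
    if next_sibling is None:
        # No more content in the parent: allowed unless the parent is an a element.
        return parent != "a"
    return next_sibling in P_END_TAG_OMISSION_FOLLOWERS
```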
An rt element's end
  tag may be omitted if the rt element is
  immediately followed by an rt or rp
  element, or if there is no more content in the parent element.
An rp element's end
  tag may be omitted if the rp element is
  immediately followed by an rt or rp
  element, or if there is no more content in the parent element.
An optgroup element's end tag may be omitted if the
  optgroup element is immediately followed by
  another optgroup element, or if there is no
  more content in the parent element.
An option element's end
  tag may be omitted if the option element is
  immediately followed by another option element, or if
  it is immediately followed by an optgroup element, or
  if there is no more content in the parent element.
A colgroup element's start tag may be omitted if the
  first thing inside the colgroup element is a
  col element, and if the element is not immediately
  preceded by another colgroup element whose end tag has been omitted. (It can't be
  omitted if the element is empty.)
A colgroup element's end tag may be omitted if the
  colgroup element is not immediately followed by a
  space character or a comment.
A thead element's end
  tag may be omitted if the thead element is
  immediately followed by a tbody or tfoot
  element.
A tbody element's start tag may be omitted if the
  first thing inside the tbody element is a
  tr element, and if the element is not immediately
  preceded by a tbody, thead, or
  tfoot element whose end
  tag has been omitted. (It can't be omitted if the element is
  empty.)
A tbody element's end
  tag may be omitted if the tbody element is
  immediately followed by a tbody or tfoot
  element, or if there is no more content in the parent element.
A tfoot element's end
  tag may be omitted if the tfoot element is
  immediately followed by a tbody element, or if there is
  no more content in the parent element.
A tr element's end
  tag may be omitted if the tr element is
  immediately followed by another tr element, or if there
  is no more content in the parent element.
A td element's end
  tag may be omitted if the td element is
  immediately followed by a td or th
  element, or if there is no more content in the parent element.
A th element's end
  tag may be omitted if the th element is
  immediately followed by a td or th
  element, or if there is no more content in the parent element.
However, a start tag must never be omitted if it has any attributes.
For historical reasons, certain elements have extra restrictions beyond even the restrictions given by their content model.
A table element must not contain tr
  elements, even though these elements are technically allowed inside
  table elements according to the content models
  described in this specification. (If a tr element is
  put inside a table in the markup, it will in fact imply
  a tbody start tag before it.)
A single newline may be
  placed immediately after the start
  tag of pre and textarea
  elements. This does not affect the processing of the element. The
  otherwise optional newline
  must be included if the element's contents themselves start
  with a newline (because
  otherwise the leading newline in the contents would be treated like
  the optional newline, and ignored).
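The newline rule for pre and textarea can be handled with a small guard when writing out an element's contents. The sketch below is illustrative only: `contents_after_start_tag` is a hypothetical helper, it treats a leading CR as a newline as well as a leading LF, and it does not perform any escaping of the contents.

```python
def contents_after_start_tag(contents: str) -> str:
    """Return the text to write right after a <pre> or <textarea> start tag."""
    if contents.startswith(("\n", "\r")):
        # The first newline after the start tag is ignored, so an extra one
        # is written to keep the leading newline of the real contents.
        return "\n" + contents
    return contents
```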
The text in raw text and
  RCDATA elements must not contain any occurrences of the
  string "</" (U+003C LESS-THAN SIGN, U+002F
  SOLIDUS) followed by characters that case-insensitively match the
  tag name of the element followed by one of U+0009 CHARACTER
  TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D
  CARRIAGE RETURN (CR), U+0020 SPACE, U+003E GREATER-THAN SIGN (>), or
  U+002F SOLIDUS (/).
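A generator that emits script, style, textarea, or title elements could test their text for the forbidden sequence before writing it out. The following is a minimal sketch; the name `contains_forbidden_end_tag` is an assumption made for illustration.

```python
import re


def contains_forbidden_end_tag(tag_name: str, text: str) -> bool:
    """Return True if text contains "</" + tag_name followed by a terminating character."""
    # Tab, LF, FF, CR, space, "/" or ">" must follow the tag name for a match.
    pattern = "</" + re.escape(tag_name) + "[\t\n\f\r />]"
    # re.ASCII keeps the case-insensitive comparison limited to ASCII letters.
    return re.search(pattern, text, re.IGNORECASE | re.ASCII) is not None
```

For example, a script whose text contains `</script>` inside a string literal is rejected, while the common workaround of splitting the string as `'</scr' + 'ipt>'` passes.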
Text is allowed inside elements, attributes, and comments. Text must consist of Unicode characters. Text must not contain U+0000 characters. Text must not contain permanently undefined Unicode characters (noncharacters). Text must not contain control characters other than space characters. Extra constraints are placed on what is and what is not allowed in text based on where the text is to be put, as described in the other sections.
Newlines in HTML may be represented either as U+000D CARRIAGE RETURN (CR) characters, U+000A LINE FEED (LF) characters, or pairs of U+000D CARRIAGE RETURN (CR), U+000A LINE FEED (LF) characters in that order.
Where character references are allowed, a character reference of a U+000A LINE FEED (LF) character (but not a U+000D CARRIAGE RETURN (CR) character) also represents a newline.
In certain cases described in other sections, text may be mixed with character references. These can be used to escape characters that couldn't otherwise legally be included in text.
Character references must start with a U+0026 AMPERSAND character (&). Following this, there are three possible kinds of character references:
The numeric character reference forms described above are allowed to reference any Unicode code point other than U+0000, U+000D, permanently undefined Unicode characters (noncharacters), and control characters other than space characters.
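The restriction on numeric character references can be sketched as a predicate over code points. Two assumptions are made here: "control characters" is read as Unicode general category Cc, and the noncharacters are taken to be U+FDD0 to U+FDEF plus every code point ending in FFFE or FFFF. The name `may_be_numeric_reference` is hypothetical.

```python
import unicodedata

# Space characters that are also control characters, minus U+000D, which is excluded outright.
_ALLOWED_CONTROLS = {0x09, 0x0A, 0x0C}


def may_be_numeric_reference(code_point: int) -> bool:
    """Return True if a numeric character reference may refer to code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        return False
    if code_point in (0x0000, 0x000D):
        return False
    # Noncharacters: U+FDD0..U+FDEF and U+xxFFFE / U+xxFFFF in every plane.
    if 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE:
        return False
    if unicodedata.category(chr(code_point)) == "Cc":
        return code_point in _ALLOWED_CONTROLS
    return True
```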
An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is followed by one or more characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z, and U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z, followed by a U+003B SEMICOLON character (;), where these characters do not match any of the names given in the named character references section.
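A rough way to spot ambiguous ampersands in text is shown below. It relies on the `html5` table in Python's standard `html.entities` module as a stand-in for the named character references section; that the two lists agree exactly is an assumption, and `find_ambiguous_ampersands` is an illustrative name.

```python
import re
from html.entities import html5

# "&", then one or more ASCII letters or digits, then ";".
_CANDIDATE = re.compile(r"&([0-9A-Za-z]+);")


def find_ambiguous_ampersands(text: str) -> list:
    """Return every "&name;" sequence whose name is not a known named reference."""
    return [
        match.group(0)
        for match in _CANDIDATE.finditer(text)
        if match.group(1) + ";" not in html5  # keys in html5 include the ";"
    ]
```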
CDATA sections must start with
  the character sequence U+003C LESS-THAN SIGN, U+0021 EXCLAMATION
  MARK, U+005B LEFT SQUARE BRACKET, U+0043 LATIN CAPITAL LETTER C,
  U+0044 LATIN CAPITAL LETTER D, U+0041 LATIN CAPITAL LETTER A, U+0054
  LATIN CAPITAL LETTER T, U+0041 LATIN CAPITAL LETTER A, U+005B LEFT
  SQUARE BRACKET (<![CDATA[). Following this
  sequence, the CDATA section may have text, with the additional restriction
  that the text must not contain the three character sequence U+005D
  RIGHT SQUARE BRACKET, U+005D RIGHT SQUARE BRACKET, U+003E
  GREATER-THAN SIGN (]]>). Finally, the CDATA
  section must be ended by the three character sequence U+005D RIGHT
  SQUARE BRACKET, U+005D RIGHT SQUARE BRACKET, U+003E GREATER-THAN
  SIGN (]]>).
CDATA sections can only be used in foreign content (MathML or
   SVG). In this example, a CDATA section is used to escape the
   contents of an ms element:
<p>You can add a string to a number, but this stringifies the number:</p>
<math>
 <ms><![CDATA[x<y]]></ms>
 <mo>+</mo>
 <mn>3</mn>
 <mo>=</mo>
 <ms><![CDATA[x<y3]]></ms>
</math>
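A generator producing CDATA sections for foreign content might wrap text as in the sketch below. The helper name `cdata_section` is hypothetical; the only check it performs is the one on the closing sequence, so other restrictions on text still have to be enforced separately.

```python
def cdata_section(text: str) -> str:
    """Wrap text in a CDATA section, refusing text that would end it early."""
    if "]]>" in text:
        raise ValueError('CDATA section text must not contain "]]>"')
    return "<![CDATA[" + text + "]]>"
```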
Comments must start with the
  four character sequence U+003C LESS-THAN SIGN, U+0021 EXCLAMATION
  MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS (<!--). Following this sequence, the comment may
  have text, with the additional
  restriction that the text must not start with a single U+003E
  GREATER-THAN SIGN character (>), nor start with a U+002D
  HYPHEN-MINUS character (-) followed by a U+003E GREATER-THAN SIGN
  (>) character, nor contain two consecutive U+002D HYPHEN-MINUS
  characters (--), nor end with a U+002D
  HYPHEN-MINUS character (-). Finally, the comment must be ended by
  the three character sequence U+002D HYPHEN-MINUS, U+002D
  HYPHEN-MINUS, U+003E GREATER-THAN SIGN (-->).
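The restrictions on comment text translate directly into a check that a generator can run before emitting a comment. This is a minimal sketch; `is_valid_comment_text` is an illustrative name, and the general restrictions on text (such as the ban on U+0000) are not repeated here.

```python
def is_valid_comment_text(text: str) -> bool:
    """Check the text that would go between "<!--" and "-->"."""
    if text.startswith(">") or text.startswith("->"):
        return False
    if "--" in text:
        return False
    if text.endswith("-"):
        return False
    return True
```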