3 On SGML and HTML


  1. Introduction to SGML
  2. SGML constructs used in HTML
    1. Elements
    2. Attributes
    3. Entities
  3. How to read the HTML DTD
    1. DTD Comments
    2. Parameter entity definitions
    3. Element declarations
    4. Attribute definitions

This section of the document introduces SGML and discusses its relationship to HTML. A complete discussion of SGML is left to the SGML standard (see [ISO8879]).

3.1 Introduction to SGML

SGML is a system for defining markup languages. Authors "mark up" their documents by representing structural, presentational, and semantic information alongside content. HTML is one example of a markup language. Here is an example of an HTML document:

      <HTML>
         <HEAD>
            <TITLE>My first HTML document</TITLE>
         </HEAD>
         <BODY>
            <P>Hello world!
         </BODY>
      </HTML>

An HTML document is divided into a head section (here, between <HEAD> and </HEAD>) and a body (here, between <BODY> and </BODY>). The title of the document appears in the head, and the bulk of the document appears in the body. This document body contains just one paragraph, marked up with <P>.

Each markup language defined in SGML is called an SGML application. An SGML application is generally characterized by:

  1. An SGML declaration. The SGML declaration specifies which characters and delimiters may appear in the application.
  2. A document type definition (DTD). The DTD defines the syntax of markup constructs. The DTD may include additional definitions such as character entity references.
  3. A specification that describes the semantics to be ascribed to the markup. This specification also imposes syntax restrictions that cannot be expressed within the DTD.
  4. Document instances containing data (content) and markup. Each instance contains a reference to the DTD to be used to interpret it.

The HTML 4.0 specification includes an SGML declaration, three document type definitions (see the section on HTML version information for a description of the three), and a list of character references.

3.2 SGML constructs used in HTML

The following sections introduce SGML constructs that are used in HTML.

The appendix lists some SGML features that are not widely supported by HTML tools and user agents and should be avoided.

3.2.1 Elements

An SGML document type definition declares element types that represent structures or desired behavior. HTML includes element types that represent paragraphs, hypertext links, lists, tables, images, etc.

Each element type declaration generally describes three parts: a start tag, content, and an end tag.

The element's name appears in the start tag (written <element-name>) and the end tag (written </element-name>); note the slash before the element name in the end tag. For example, the start and end tags of the UL element type delimit the items in a list:

<UL>
<LI><P>...list item 1...
<LI><P>...list item 2...
</UL>

Some HTML element types allow authors to omit end tags (e.g., the P and LI element types). A few element types also allow the start tags to be omitted; for example, HEAD and BODY. The HTML DTD indicates for each element type whether the start tag and end tag are required.

Some HTML element types have no content. For example, the line break element BR has no content; its only role is to terminate a line of text. Such "empty" elements never have end tags. The document type definition and the text of the specification indicate whether an element type is empty (has no content) or, if it can have content, what is considered legal content.

Element names are always case-insensitive.

Elements are not tags. Some people refer to elements as tags (e.g., "the P tag"). Remember that the element is one thing, and the tag (be it start or end tag) is another. For instance, the HEAD element is always present, even though both start and end HEAD tags may be missing in the markup.

All the element types declared in this specification are listed in the element index.

3.2.2 Attributes

Elements may have associated properties, called attributes, which may have values (by default, or set by authors or scripts). Attribute/value pairs appear before the final ">" of an element's start tag. Any number of (legal) attribute value pairs, separated by spaces, may appear in an element's start tag. They may appear in any order.

In this example, the id attribute is set for an H1 element:

<H1 id="section1">
This is an identified heading thanks to the id attribute
</H1>

By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa. Authors may also use numeric character references to represent double quotes (&#34;) and single quotes (&#39;). For double quotes authors can also use the character entity reference &quot;.

In certain cases, authors may specify the value of an attribute without any quotation marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), and periods (ASCII decimal 46). We recommend using quotation marks even when it is possible to eliminate them.
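The quoting rules above can be sketched in a few lines of Python. This is only an illustration: the helper names may_omit_quotes and write_attribute are invented for this example, and the sketch does not escape "&" or any other character.

```python
import re

# Characters allowed in an unquoted attribute value, per the rule above:
# letters, digits, hyphens, and periods.
_UNQUOTED_OK = re.compile(r"[A-Za-z0-9.-]+")


def may_omit_quotes(value: str) -> bool:
    """Return True if the value could legally be written without quotes."""
    return _UNQUOTED_OK.fullmatch(value) is not None


def write_attribute(name: str, value: str) -> str:
    """Serialize an attribute, always quoting the value as recommended."""
    if '"' not in value:
        return f'{name}="{value}"'
    if "'" not in value:
        return f"{name}='{value}'"
    # Both kinds of quote occur: delimit with double quotes and replace
    # embedded double quotes with the numeric character reference &#34;.
    return name + '="' + value.replace('"', "&#34;") + '"'
```

For instance, may_omit_quotes("section1") holds, while a value containing a space or a slash must be quoted.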

Attribute names are always case-insensitive.

Attribute values are generally case-insensitive. The definition of each attribute in the reference manual indicates whether its value is case-insensitive.

All the attributes defined by this specification are listed in the attribute index.

3.2.3 Entities

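Character references are numeric or symbolic names for characters that may be included in an HTML document. They are useful for referring to rarely used characters, or those that authoring tools make it difficult or impossible to enter. You will see character entities throughout this document; they begin with a "&" sign and end with a semi-colon (;). Some common examples include "&lt;" (the < sign), "&gt;" (the > sign), "&quot;" (the " mark), "&#229;" (in decimal, the letter "a" with a small circle above it), "&#1048;" (in decimal, the Cyrillic capital letter "I"), and "&#x6C34;" (in hexadecimal, the Chinese character for water).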

We discuss HTML character entities in detail later in the section on the HTML document character set. The specification also contains a list of character references that may appear in HTML 4.0 documents.
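As a small illustration of how character references resolve to characters, the following Python sketch uses html.unescape from the standard library. Note the assumption: html.unescape follows HTML5 parsing rules rather than the SGML rules of HTML 4.0, so it is used here only to show the mapping for a few well-known references. The wrapper name decode_references is invented for this example.

```python
import html


def decode_references(text: str) -> str:
    """Replace numeric and named character references with their characters."""
    return html.unescape(text)


# A few references and the characters they stand for.
SAMPLES = {
    "&lt;": "<",
    "&gt;": ">",
    "&quot;": '"',
    "&#229;": "\u00e5",    # decimal reference: "a" with a small circle above
    "&#1048;": "\u0418",   # decimal reference: Cyrillic capital letter "I"
    "&#x6C34;": "\u6c34",  # hexadecimal reference: Chinese character for water
}
```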


Comments

HTML comments have the following syntax:

<!-- this is a comment -->
<!-- and so is this one,
    which occupies more than one line -->

White space is not permitted between the markup declaration open delimiter ("<!") and the comment open delimiter ("--"), but is permitted between the comment close delimiter ("--") and the markup declaration close delimiter (">"). A common error is to include a string of hyphens ("---") within a comment. Authors should avoid putting two or more adjacent hyphens inside comments.

Information that appears inside comments has no special meaning (e.g., character references are not interpreted as such).
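The comment rules can be checked mechanically. The following Python sketch is a hypothetical helper (check_comment is not part of any HTML tool) that accepts a single comment and reports the two problems described above: white space after "<!" and adjacent hyphens inside the comment.

```python
import re

# "<!" immediately followed by "--", then the comment text, then "--",
# optional white space, and ">".
_COMMENT = re.compile(r"<!--(?P<body>.*?)--\s*>", re.DOTALL)


def check_comment(markup: str) -> str:
    """Classify a string as 'ok', 'adjacent hyphens', or 'not a comment'."""
    match = _COMMENT.fullmatch(markup)
    if match is None:
        return "not a comment"
    body = match["body"]
    # A trailing hyphen would run into the closing "--" and form "---".
    if "--" in body or body.endswith("-"):
        return "adjacent hyphens"
    return "ok"
```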

3.3 How to read the HTML DTD

Each element and attribute declaration in this specification is accompanied by its document type definition fragment. We have chosen to include the DTD fragments in the specification rather than seek a more approachable, but longer and less precise means of describing an element's properties. The following tutorial should allow readers unfamiliar with SGML to read the DTD and understand the technical details of the HTML specification.

3.3.1 DTD Comments

In DTDs, comments may spread over one or more lines and are delimited by a pair of "--" marks, e.g.

<!ELEMENT PARAM - O EMPTY       -- named property value -->

Here, the comment "named property value" explains the use of the PARAM element type. Comments in the DTD are informative only.

3.3.2 Parameter entity definitions

The HTML DTD begins with a series of parameter entity definitions. A parameter entity definition defines a kind of macro that may be referenced and expanded elsewhere in the DTD. These macros may not appear in HTML documents, only in the DTD. Other types of macros, called character references, may be used in the text of an HTML document or within attribute values.

When the parameter entity is referred to by name in the DTD, it is expanded into a string.

A parameter entity definition begins with the keyword <!ENTITY % followed by the entity name, the quoted string the entity expands to, and finally a closing >. The following example defines the string that the %font entity will expand to.

<!ENTITY % font "TT | I | B | U | S | BIG | SMALL">

The string the parameter entity expands to may contain other parameter entity names. These names are expanded recursively. In the following example, the %inline parameter entity is defined to include the %font, %phrase, %special and %formctrl parameter entities.

<!ENTITY % inline "#PCDATA | %font | %phrase | %special | %formctrl">
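Recursive expansion of parameter entities can be sketched as a simple textual substitution. In the Python example below, the function expand and the definitions given for %phrase, %special, and %formctrl are placeholders invented for illustration; only %font and %inline are taken from the examples above. A real SGML parser does considerably more than this.

```python
import re

# A parameter entity reference: "%" followed directly by a name and an
# optional ";". The declaration syntax "% name" (with a space) is not matched.
_REFERENCE = re.compile(r"%([A-Za-z][A-Za-z0-9.-]*);?")


def expand(text: str, entities: dict, _active: tuple = ()) -> str:
    """Recursively replace parameter entity references with their strings."""

    def replace(match: "re.Match") -> str:
        name = match.group(1)
        if name in _active:
            raise ValueError(f"circular reference to %{name}")
        if name not in entities:
            raise KeyError(f"undefined parameter entity %{name}")
        return expand(entities[name], entities, _active + (name,))

    return _REFERENCE.sub(replace, text)


ENTITIES = {
    "font": "TT | I | B | U | S | BIG | SMALL",
    "phrase": "EM | STRONG",        # placeholder definition
    "special": "A | IMG",           # placeholder definition
    "formctrl": "INPUT | SELECT",   # placeholder definition
    "inline": "#PCDATA | %font | %phrase | %special | %formctrl",
}
```

With these definitions, expand("(%inline)*", ENTITIES) yields a single parenthesized group listing #PCDATA followed by every element name from the four nested entities.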

You will encounter two DTD entities frequently in the HTML DTD: %block and %inline. They are used when the content model includes block-level and inline elements, respectively (defined in the section on the global structure of an HTML document).

3.3.3 Element declarations

The bulk of the HTML DTD consists of the declarations of element types and their attributes. The <!ELEMENT keyword begins a declaration and the > character ends it. Between these are specified:

  1. The element's name.
  2. Whether the element's start and end tags are optional. Two hyphens that appear after the element name mean that the start and end tags are mandatory. One hyphen followed by the letter "O" indicates that the end tag can be omitted. A pair of letter "O"s indicates that both the start and end tags can be omitted.
  3. The element's content, if any. The allowed content for an element is called its content model. Element types that are designed to have no content are called empty elements. The content model for such element types is declared using the keyword "EMPTY".
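As a rough illustration of the three parts listed above, the following Python sketch splits an element declaration into its name, its tag-omission flags, and its content model. The names ElementDecl and parse_element_decl and the regular expression are assumptions made for this example; the pattern handles only a single declaration with at most one trailing comment.

```python
import re
from typing import NamedTuple


class ElementDecl(NamedTuple):
    name: str
    start_tag_optional: bool
    end_tag_optional: bool
    content_model: str


_DECLARATION = re.compile(
    r"<!ELEMENT\s+(?P<name>\S+)\s+(?P<start>[-O])\s+(?P<end>[-O])\s+"
    r"(?P<model>.*?)\s*(?:--.*?--\s*)?>",
    re.DOTALL,
)


def parse_element_decl(text: str) -> ElementDecl:
    """Parse one <!ELEMENT ...> declaration into its three parts."""
    match = _DECLARATION.fullmatch(text.strip())
    if match is None:
        raise ValueError("not an element declaration: " + text)
    return ElementDecl(
        name=match["name"],
        start_tag_optional=match["start"] == "O",  # "O" means the tag may be omitted
        end_tag_optional=match["end"] == "O",
        content_model=match["model"],
    )
```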

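In this example:

    <!ELEMENT UL - - (LI)+>

  1. The element type being declared is UL.
  2. The two hyphens indicate that both the start tag <UL> and the end tag </UL> are required.
  3. The content model is declared to be "at least one LI element".

This example illustrates the declaration of an empty element type:

    <!ELEMENT IMG - O EMPTY>

  1. The element type being declared is IMG.
  2. The hyphen and the following "O" indicate that the start tag is required and the end tag can be omitted; combined with the content model "EMPTY", this becomes the rule that the end tag must be omitted.
  3. The "EMPTY" keyword means that instances of this type must not have content.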

Content model definitions 

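The content model describes what may be contained by an instance of an element type. Content definitions may include the names of allowed or forbidden element types, DTD entities (e.g., %inline), and document text (indicated by the SGML construct "#PCDATA").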

The content model uses the following syntax to define what markup an element may contain:

( ... )
Specifies a group.
A | B
Either A or B must occur but not both.
A , B
Both A and B must occur in that order.
A & B
Both A and B must occur, but may do so in any order.
A?
A can occur zero or one times.
A*
A can occur zero or more times.
A+
A can occur one or more times.

Here are some examples from the HTML DTD:


<!ELEMENT SELECT - - (OPTION)+>

The SELECT element must contain one or more OPTION elements.

<!ELEMENT DL - - (DT|DD)+>

The DL element must contain one or more DT or DD elements in any order.


<!ELEMENT OPTION - O (#PCDATA)>

The OPTION element may only contain text and entities, such as &amp; -- this is indicated by the SGML data type #PCDATA.
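Groups, the "|" and "," connectors, and the "?", "*" and "+" occurrence indicators map closely onto regular-expression operators, so a simple content model can be checked against a list of child element names. The Python sketch below is an illustration under stated assumptions: model_to_regex and children_match are invented names, and the "&" connector, #PCDATA, parameter entities, and exclusions are not handled.

```python
import re


def model_to_regex(model: str) -> "re.Pattern":
    """Translate a simple content model into a regex over child names."""
    if "&" in model:
        raise ValueError("the '&' connector is not supported by this sketch")
    parts = []
    for token in re.findall(r"[A-Za-z][A-Za-z0-9]*|[()|,?*+]", model):
        if token[0].isalpha():
            # One child element, encoded as its name plus a separator.
            parts.append(f"(?:{token.upper()} )")
        elif token == "(":
            parts.append("(?:")
        elif token == ",":
            continue  # "A , B" is plain concatenation in a regex
        else:
            parts.append(token)  # ")", "|", "?", "*", "+" carry over unchanged
    return re.compile("".join(parts))


def children_match(model: str, children: list) -> bool:
    """Return True if the sequence of child names satisfies the model."""
    encoded = "".join(name.upper() + " " for name in children)
    return model_to_regex(model).fullmatch(encoded) is not None
```

For the DL declaration above, children_match("(DT|DD)+", ["DT", "DD", "DD"]) holds, while an empty list of children does not satisfy the model.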

A few HTML element types use an additional SGML feature to exclude elements from a content model. Excluded elements are preceded by a hyphen. Explicit exclusions override permitted elements.

In this example, the -(A) signifies that the element A cannot appear in another A element (i.e., anchors may not be nested).

   <!ELEMENT A - - (%inline)* -(A)>

Note that the A element type is part of the DTD parameter entity %inline, but is excluded explicitly because of -(A).

Similarly, the following element type declaration for FORM prohibits nested forms:

   <!ELEMENT FORM - - %block -(FORM)>
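An exclusion such as -(A) or -(FORM) applies at any depth inside the element, so a violation can be detected by tracking how many instances of the excluded element are currently open. The Python sketch below uses the standard html.parser module, which is an HTML tokenizer and not an SGML parser; the class name ExclusionChecker is invented for this example, and the sketch assumes end tags are written explicitly.

```python
from html.parser import HTMLParser


class ExclusionChecker(HTMLParser):
    """Count places where an element appears inside another instance of itself."""

    def __init__(self, excluded: str = "a") -> None:
        super().__init__()
        self.excluded = excluded.lower()  # html.parser reports lowercase tag names
        self.depth = 0
        self.violations = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.excluded:
            if self.depth > 0:
                self.violations += 1
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.excluded and self.depth > 0:
            self.depth -= 1


def count_violations(markup: str, excluded: str = "a") -> int:
    checker = ExclusionChecker(excluded)
    checker.feed(markup)
    checker.close()
    return checker.violations
```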

3.3.4 Attribute definitions

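The <!ATTLIST keyword begins the definition of attributes that an element may take. It is followed by the name of the element in question, a list of attribute definitions, and a closing >. An attribute definition is a triplet that defines:

  1. The name of the attribute.
  2. The type of the attribute's value, or an explicit set of possible values.
  3. The attribute's default: either an explicit default value, or a keyword such as "#IMPLIED" (no default is declared and the attribute is optional), "#REQUIRED" (the attribute must always be specified), or "#FIXED" (the attribute is fixed to the given value).

In this example, the name attribute is defined for the MAP element. The attribute is optional for this element.

<!ATTLIST MAP
  name        CDATA     #IMPLIED
  >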

The type of values permitted for the attribute is given as CDATA, an SGML data type. CDATA is text that may contain character references.

For more information about "CDATA", "NAME", "ID", and other data types, please consult the section on HTML data types.

The following examples illustrate several attribute definitions:

rowspan     NUMBER     1         -- number of rows spanned by cell --
http-equiv  NAME       #IMPLIED  -- HTTP response header name  --
id          ID         #IMPLIED  -- document-wide unique id -- 
valign      (top|middle|bottom|baseline) #IMPLIED

The rowspan attribute requires values of type NUMBER. The default value is given explicitly as "1". The optional http-equiv attribute requires values of type NAME. The optional id attribute requires values of type ID. The optional valign attribute is constrained to take values from the set {top, middle, bottom, baseline}.
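To make the triplet structure concrete, the following Python sketch splits one attribute definition line into its name, type, default, and optional comment. The names AttributeDef, parse_attribute_def, and allowed_values are invented for this illustration, and the pattern assumes one definition per line with no parameter entity references left to expand.

```python
import re
from typing import List, NamedTuple, Optional


class AttributeDef(NamedTuple):
    name: str
    value_type: str
    default: str
    comment: Optional[str]


_ATTRIBUTE_DEF = re.compile(
    r"\s*(?P<name>\S+)\s+(?P<type>\([^)]*\)|\S+)\s+(?P<default>\S+)"
    r"(?:\s+--\s*(?P<comment>.*?)\s*--)?\s*"
)


def parse_attribute_def(line: str) -> AttributeDef:
    """Parse a line such as 'rowspan NUMBER 1 -- comment --'."""
    match = _ATTRIBUTE_DEF.fullmatch(line)
    if match is None:
        raise ValueError("not an attribute definition: " + line)
    return AttributeDef(match["name"], match["type"], match["default"], match["comment"])


def allowed_values(definition: AttributeDef) -> Optional[List[str]]:
    """Return the enumerated values, or None if the type is a named data type."""
    if definition.value_type.startswith("("):
        return [v.strip() for v in definition.value_type[1:-1].split("|")]
    return None
```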

DTD entities in attribute definitions 

Attribute definitions may also contain parameter entity references.

In this example, we see that the attribute definition list for the LINK element begins with the %attrs parameter entity.

<!ELEMENT LINK - O EMPTY -- a media-independent link -->
<!ATTLIST LINK
  %attrs;                               -- %coreattrs, %i18n, %events --
  charset     %Charset;        #IMPLIED -- char encoding of linked resource --
  href        %URL;            #IMPLIED -- URL for linked resource --
  hreflang    %LanguageCode;   #IMPLIED -- language code --
  type        %ContentType;    #IMPLIED -- advisory content type --
  rel         %LinkTypes;      #IMPLIED -- forward link types --
  rev         %LinkTypes;      #IMPLIED -- reverse link types --
  media       %MediaDesc;      #IMPLIED -- for rendering on these media --
  target      %FrameTarget;    #IMPLIED -- render in this frame --
  >

Start tag: required, End tag: forbidden

The %attrs parameter entity is defined as follows:

<!ENTITY % attrs "%coreattrs; %i18n; %events;">

The %coreattrs parameter entity in the %attrs definition expands as follows:

<!ENTITY % coreattrs
 "id          ID         #IMPLIED  -- document-wide unique id --
  class       CDATA      #IMPLIED  -- space separated list of classes --
  style       CDATA      #IMPLIED  -- associated style info --
  title       %Text;     #IMPLIED  -- advisory title/amplification --"
  >

The %attrs parameter entity has been defined for convenience since these attributes are defined for most HTML element types.

Similarly, the DTD defines the %URL parameter entity as expanding into the string "CDATA".

<!ENTITY % URL "CDATA"
    -- a Uniform Resource Locator,
       see [RFC1808] and [RFC1738]
    -->

As this example illustrates, the parameter entity %URL provides readers of the DTD with more information as to the type of data expected for an attribute. Similar entities have been defined for %Color, %Charset, %Length, %Pixels, etc.

Boolean attributes 

Some attributes play the role of boolean variables (e.g., the selected attribute for the OPTION element). Their appearance in the start tag of an element implies that the value of the attribute is "true". Their absence implies a value of "false".

Boolean attributes may legally take a single value: the name of the attribute itself (e.g., selected="selected").

This example defines the selected attribute to be a boolean attribute.

selected     (selected)  #IMPLIED

The attribute is set to "true" by appearing in the element's start tag:

<OPTION selected="selected">

In HTML, boolean attributes may appear in "minimized form" -- the attribute's value appears alone in the element's start tag. Thus, selected may be set by writing:

<OPTION selected>

instead of:

<OPTION selected="selected">

Authors should be aware that many user agents only recognize the minimized form of boolean attributes and not the full form.
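A program that reads HTML should treat the minimized and full forms of a boolean attribute alike. The Python sketch below shows one way to do this with the standard html.parser module, which reports a minimized attribute with the value None and lowercases element and attribute names. The class name BooleanAttributeReader and the helper read_flags are invented for this example, and the set of boolean attribute names is an assumption supplied by the caller.

```python
from html.parser import HTMLParser


class BooleanAttributeReader(HTMLParser):
    """Record which boolean attributes are set on each start tag."""

    def __init__(self, boolean_names=("selected",)) -> None:
        super().__init__()
        self.boolean_names = set(boolean_names)
        self.seen = []  # list of (tag, {attribute: True}) pairs

    def handle_starttag(self, tag, attrs):
        flags = {}
        for name, value in attrs:
            if name not in self.boolean_names:
                continue
            # Minimized form: value is None. Full form: value names the attribute.
            if value is None or value.lower() == name:
                flags[name] = True
        self.seen.append((tag, flags))


def read_flags(markup: str) -> list:
    reader = BooleanAttributeReader()
    reader.feed(markup)
    reader.close()
    return reader.seen
```

Both <OPTION selected> and <OPTION selected="selected"> produce the same result here, while a plain <OPTION> yields no flags, which corresponds to the value "false".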