[W3C] HTML40-970708

HTML 4.0 Specification

W3C Working Draft 8-July-1997

This is: http://www.w3.org/TR/WD-html40-970708/

Abstract

This specification defines the HyperText Markup Language (HTML), version 4.0, the publishing language of the World Wide Web. In addition to the text, multimedia, and hyperlink features of the previous versions of HTML, HTML 4.0 supports more multimedia options, scripting languages, style sheets, better printing facilities, and documents that are more accessible to users with disabilities. HTML 4.0 also takes great strides towards the internationalization of documents, with the goal of making the Web truly World Wide.

Status of this document

This is a W3C Working Draft for review by W3C members and other interested parties. It is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". This is work in progress and does not imply endorsement by, or the consensus of, either W3C or members of the HTML working group.

This document has been produced as part of the W3C HTML Activity, and is intended as a draft of a proposed recommendation for HTML. The latest version of this document can be retrieved from the list of W3C technical reports and is available as a gzip'ed tar file, a zip file, and a PostScript file (about 200 pages). We also plan to provide translations in other languages, although the English version provides the normative specification.

HTML 4.0 replaces HTML 3.2, specified in http://www.w3.org/TR/REC-html32.

Editors

* Dave Raggett
* Arnaud Le Hors
* Ian Jacobs

Comments

Please send detailed comments on this document to www-html-editor@w3.org. We cannot guarantee a personal response, but we will try when it is appropriate. Public discussion on HTML features takes place on www-html@w3.org.

Table of Contents

1. About the HTML 4.0 Specification
2. Introduction to HTML 4.0
   1. Design principles of HTML 4.0
   2. Designing documents with HTML 4.0
   3. A brief SGML tutorial
3. Definitions and Conventions
4. HTML and URLs - Locating resources on the Web
5. HTML Document Character Set - Character sets, character encodings, and entities
6. Basic HTML data types - Character data, colors, and lengths
7. Structure of HTML documents - Detailed Table of Contents
   1. Global structure - The HEAD and BODY of a document
   2. Language information and text direction - International considerations for text
   3. Text - Paragraphs, Lines, and Phrases
   4. Lists - Unordered, Ordered, and Definition Lists
   5. Tables
   6. Links - Hypertext and Media-Independent Links
   7. Inclusions - Objects, Images, and Applets in HTML documents
8. Presentation of HTML documents - Detailed Table of Contents
   1. Style Sheets - Controlling the presentation of an HTML document
   2. Alignment, font styles, and horizontal rules
   3. Frames - Multi-view presentation of documents
9. Interactive HTML documents - Detailed Table of Contents
   1. Forms - User-input Forms: Text Fields, Buttons, Menus, and more
   2. Scripts - Animated Documents and Smart Forms
10. SGML reference information for HTML - Formal definition of HTML and validation
   1. SGML Declaration
   2. Document Type Definition
   3. Named character entities
11. References
12. Indexes
   1. Index of Elements
   2. Index of Attributes
13. Appendixes
   1. Changes between HTML 3.2 and HTML 4.0
   2. Performance, Implementation, and Design Notes
   3. HTML and Organizations (W3C, IETF, ISO)

About the HTML 4.0 Specification

Contents
1. How to read the specification
2. How the specification is organized
3. Acknowledgments

This document has been written with two types of readers in mind: HTML authors and HTML implementors. We hope the specification will provide authors with the tools they need to write efficient, attractive, and accessible documents, without overexposing them to HTML's implementation details.
Implementors, however, should find all they need to build user agents that interpret HTML correctly. The specification has been written with two modes of presentation in mind: electronic and printed. Although the two presentations will no doubt be similar, readers will find some differences. For example, links will not work in the printed version (obviously), and page numbers will not appear in the electronic version. In case of a discrepancy, the electronic version is considered the authoritative version of the document. How to read the specification The specification may be approached in several ways: * Read from beginning to end. The specification begins with a general presentation of HTML and becomes more and more technical and specific towards the end. This is reflected in the specification's main table of contents, which presents topical information, and the indexes, which present lower level information in alphabetical order. * Quick access to information. In order to get information about syntax and semantics as quickly as possible, the electronic version of the specification includes the following features: 1. Every reference to an element or attribute is linked to its definition in the specification. 2. Every page will include links to the indexes, so you will never be more than two links away from finding the definition of an element or attribute. 3. The front pages of the three sections of the language reference manual extend the initial table of contents with more detail about each section. How the specification is organized This specification includes the following sections: Section 2: Introduction to HTML 4.0. The introduction gives an overview of what can be done with HTML 4.0. It also provides some design tips for developing good HTML habits. Sections 3 - 11: HTML 4.0 reference manual. The bulk of the reference manual consists of the HTML language reference, which defines all elements and attributes of the language. 
This document has been organized by topic rather than by the grammar of HTML. Topics are grouped into three categories: structure, presentation, and interactivity. Although it is not easy to divide HTML constructs perfectly into these three categories, the model reflects the designers' experience that separating a document's structure from its presentation produces more effective and maintainable documents. The language reference consists of the following information:
   o Conventions used by the editors of this specification.
   o How HTML fits into the World Wide Web and an introduction to related Web languages and protocols such as URLs.
   o What characters may appear in an HTML document.
   o Basic data types of an HTML document.
   o Elements that pertain to the structure of an HTML document, including text, lists, tables, links, and included objects, images, and applets.
   o Elements that pertain to the presentation of an HTML document, including style sheets, fonts, colors, rules, and other visual presentation, and frames for multi-windowed presentations.
   o Elements that pertain to interactivity with an HTML document, including forms for user input and scripts for active documents.
   o The SGML definition of HTML, including the SGML declaration of HTML, the HTML DTD, and the list of character entities.
   o References.

Section 12: Quick reference indexes. Two indexes give readers rapid access to the definition of all elements and attributes. The indexes also summarize some key characteristics of each element and attribute.

Section 13: Appendixes. The appendixes contain information about changes from HTML 3.2, performance and implementation notes, and how W3C and other organizations interact with respect to HTML.

Acknowledgments

Thanks to everyone who has helped to author the working drafts that went into the HTML 4.0 specification, and all those who have sent suggestions and corrections. A particular thanks to T.V.
Raman for his work on improving the accessibility of HTML forms for people with disabilities.

The authors of this specification, the members of the W3C HTML Working Group, deserve much applause for their diligent review of this document, their constructive comments, and their hard work: John D. Burger, Steve Byrne, Martin J. Dürst, Daniel Glazman, Scott Isaacs, Murray Maloney, Steven Pemberton, Jared Sorensen, Powell Smith, Robert Stevahn, Ed Tecot, Jeffrey Veen, Mike Wexler, Misha Wolf, and Lauren Wood.

Thank you Dan Connolly for thoughtful input and guidance as chairman of the HTML working group. Thank you Sally Khudairi for your indispensable work on the press release.

Of particular help from the Inria at Sophia-Antipolis were Janet Bertot, Bert Bos, Stephane Boyera, Daniel Dardailler, Yves Lafon, Håkon Lie, Chris Lilley, and Colas Nahaboo.

Lastly, thanks to Tim Berners-Lee without whom none of this would have been possible.

Introduction to HTML 4.0

Contents

This is being written ...

Design principles of HTML 4.0

As you read the specification, you may find it enlightening to keep in mind the following principles that guided the design of HTML 4.0.

* Interoperability

While most people agree that HTML documents should work well across different browsers and platforms, achieving interoperability implies higher costs to content providers since they must develop different versions of documents. If the effort is not made, however, there is much greater risk that the Web will devolve into a proprietary world of incompatible formats, ultimately reducing the Web's commercial potential for all participants. Each version of HTML attempts to reach greater consensus among industry players so that the investment made by content providers will not be wasted and that their documents will not become unreadable in a short period of time.
HTML has been developed with the vision that all manner of devices should be able to use information on the Web: PCs with graphics displays of varying resolution and color depths, cellular telephones, hand held devices, devices for speech output and input, computers with high or low bandwidth, and so on.

* Internationalization

This version of HTML has been designed with the help of experts in the field of internationalization, so that documents may be written in every language and be transported easily around the world. This has been accomplished by incorporating [RFC2070], which deals with the internationalization of HTML. One important step has been the adoption of the ISO/IEC 10646 standard (see [ISO10646]) as the document character set for HTML. This is the world's most inclusive standard dealing with issues of the representation of international characters, text direction, punctuation, and other world language issues. HTML now offers greater support for diverse human languages within a document. This allows for more effective indexing of documents for search engines, higher-quality typography, better text-to-speech conversion, correct hyphenation, etc.

* Accessibility

As the Web community grows and its members diversify in their abilities and skills, it is crucial that the underlying technologies be appropriate to their specific needs. HTML has been designed to make Web pages more accessible to those with physical limitations. HTML 4.0 developments in the area of accessibility include:
   o Encouraging the use of style sheets (rather than tables) to achieve layout effects.
   o Making it easier to provide alternate (textual and aural) descriptions of images for non-visual browsers.
   o Providing active labels for form fields.
   o Providing labeled hierarchical groupings for form fields.
   o Providing the ability to associate a longer text description with an HTML element.
Authors who design pages with accessibility issues in mind will not only receive the blessings of the accessibility community, but will benefit in other ways as well: well-designed HTML documents that distinguish structure and presentation will adapt more easily to new technologies.

* Tables

The new table model in HTML is based on [RFC1942]. Authors now have greater control over structure and layout (e.g., column groups). The ability of designers to recommend column widths allows user agents to display table data incrementally (as it arrives) rather than waiting for the entire table before rendering.

* Compound documents

HTML now offers a standard mechanism for embedding generic media objects and applications in HTML documents. The OBJECT element (together with its more specific ancestor elements IMG and APPLET) provides a mechanism for including images, video, sound, mathematics, specialized applications, and other objects in a document. It also allows authors to specify a hierarchy of alternate renderings for user agents that don't support a specific rendering.

* Style sheets

Style sheets simplify HTML markup and largely relieve HTML of the responsibilities of presentation. They give both authors and users control over the presentation of documents: font information, alignment, colors, etc. Stylistic information can be:
   o Attached to a specific element to affect, say, the color or font of its content.
   o Placed in the document header as a series of styles comprising a style sheet.
   o Linked to an HTML document from an external style sheet.
The mechanism for associating a style sheet with a document is independent of the style sheet language.

* Scripting

Through scripts, authors may create "smart forms" that react as users fill them out. Scripting allows designers to create dynamic Web pages, and to use HTML as a means to build networked applications. The mechanisms provided to associate HTML with scripts are independent of particular scripting languages.
* Printing HTML features allow user agents to print a collection of documents in an intelligent manner based on descriptions of the relationships among documents acting as parts of a larger work. * Ease of use This version of HTML has been designed to remain easy to learn and adequate for many common publishing needs. The language offers more complex constructs (e.g., forms, scripting) for more sophisticated tasks, but even these mechanisms will become easier to use as powerful HTML authoring tools flourish. Beware - at the time of writing, some HTML authoring tools rely extensively on tables for formatting, which may easily cause accessibility problems. Designing documents with HTML 4.0 General principles for good HTML design and implementation include: * Separate structure and presentation HTML has its roots in SGML which has always been a language for the specification of structural markup. As HTML matures, more and more of its presentational elements and attributes are being replaced by other mechanisms, in particular style sheets. Experience has shown that separating the structure of a document from its presentational aspects reduces the cost of serving a wide range of platforms, media, etc., and facilitates document revisions. * Consider universal accessibility to the Web To make the Web more accessible to everyone, notably those with disabilities, authors should consider how their documents may be rendered on a variety of platforms: speech-based browsers, braille-readers, etc. We do not recommend that designers limit their creativity, only that they consider alternate renderings in their design. HTML offers a number of mechanisms to this end (e.g., the alt attribute, the accesskey attribute, etc.) Furthermore, authors should keep in mind that their documents may be reaching a far-off audience with different computer configurations. 
In order for documents to be interpreted correctly, designers should include in their documents information about the language and direction of the text, how the document is encoded, and other issues related to internationalization. * Help user agents with incremental rendering By carefully designing their tables and making use of new table features in HTML 4.0, designers can help user agents render documents more quickly. A brief SGML tutorial Contents 1. About SGML 2. HTML syntax 1. Entities 2. Elements 3. Attributes 4. HTML comments 3. How to read the HTML DTD 1. Block level and Inline elements 2. DTD Comments 3. Entity Definitions 4. Element definitions 5. Attribute definitions This section of the document presents introductory information about SGML and its relationship to HTML. It discusses: * HTML syntax: How to write elements, attributes, and comments. * The HTML DTD: How to read the HTML DTD. About SGML The Standard Generalized Markup Language (SGML, defined in [ISO8879]), is a language for defining markup languages. HTML is one such "application" of SGML. An SGML application consists of several parts: 1. The SGML declaration. The SGML declaration specifies which characters and delimiters may appear in the application. 2. The document type definition (DTD). The DTD defines the syntax of markup constructs. The DTD may include additional definitions such as numeric and named character entities. 3. A specification that describes the semantics to be ascribed to the markup. This specification also imposes syntax restrictions that cannot be expressed within the DTD. 4. Document instances containing data (contents) and markup. Each instance contains a reference to the DTD to be used to interpret it. The SGML declaration for HTML 4.0 and the DTD for HTML 4.0 are included in this reference manual, along with the entity sets referenced by the DTD. HTML syntax In this section, we discuss the syntax of HTML elements, attributes, and comments. 
Entities

Character entities are numeric or symbolic names for characters that may be included in an HTML document. They are useful when your authoring tools make it difficult or impossible to enter a character you may not enter often. You will see character entities throughout this document; they begin with a "&" sign and end with a semi-colon (;). We discuss HTML character entities in detail later in the section on the HTML document character set.

Elements

An SGML application defines elements that represent structures or desired behavior. An element typically consists of three parts: a start tag, content, and an end tag. An element's start tag is written <element-name>, where element-name is the name of the element. An element's end tag is written with a slash before the element name: </element-name>. For example,

<PRE>The content of the PRE element is preformatted text.</PRE>
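
The three parts of an element, and the resolution of character entities inside its content, can be sketched in Python. This is only an illustration under stated assumptions: it uses the standard library's html.parser module, which is a lenient HTML parser rather than an SGML parser, and the names ElementParts and element_parts and the sample markup are hypothetical, not part of this specification.

```python
from html.parser import HTMLParser


class ElementParts(HTMLParser):
    """Record start tags, character data, and end tags in document order."""

    def __init__(self):
        # convert_charrefs=True resolves numeric and named character
        # entities before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("content", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def element_parts(markup):
    """Return the (kind, value) events found in a markup string."""
    parser = ElementParts()
    parser.feed(markup)
    parser.close()
    return parser.events


# A PRE element whose content holds one named and one numeric entity.
events = element_parts("<PRE>caf&eacute; &#38; bar</PRE>")
for kind, value in events:
    print(kind, repr(value))
```

The parser reports element names in lower case whatever their case in the markup, and an empty element such as BR produces a start event with no matching end event.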
The SGML definition of HTML specifies that some HTML elements are not required to have end tags. The definition of each element in the reference manual indicates whether it requires an end tag. Some HTML elements have no content. For example, the line break element BR has no content; its only role is to terminate a line of text. Such "empty" elements never have end tags. The definition of each element in the reference manual indicates whether it is empty (has no content) or, if it can have content, what is considered legal content. Element names are always case-insensitive. Elements are not tags. Some people refer incorrectly to elements as tags (e.g., "the P tag"). Remember that the element is one thing, and the tag (be it start or end tag) is another. For instance, the HEAD element is always present, even though both start and end HEAD tags may be missing in the markup. Attributes Elements may have associated properties, called attributes, to which authors assign values. Attribute/value pairs appear before the final ">" of an element's start tag. Any number of (legal) attribute value pairs, separated by spaces, may appear in an element's start tag. They may appear in any order. In this example, the align attribute is set for the H1 element:

<H1 align="center">This is a centered heading thanks to the align attribute</H1>
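
As a rough sketch of the points above (attribute/value pairs sit inside the start tag, may come in any order, and have case-insensitive names), the following Python fragment collects the attributes of each start tag. It relies on the standard library's html.parser module, which reports element and attribute names in lower case; the names AttributeCollector and start_tags and the id attribute in the sample markup are illustrative assumptions.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect (element name, attribute dictionary) for every start tag."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed.
        self.start_tags.append((tag, dict(attrs)))


def start_tags(markup):
    """Return the start tags found in a markup string."""
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


# Same element written with different case, quoting style, and attribute order.
upper = start_tags('<H1 ALIGN="center" ID=\'intro\'>Heading</H1>')
lower = start_tags('<h1 id="intro" align="center">Heading</h1>')
print(upper == lower)
```

Both spellings yield the same element name and the same attribute dictionary, since a dictionary comparison ignores the order in which the attributes were written.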

By default, SGML requires you to delimit all attribute values using either double quotation marks (") or single quotation marks ('). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa. You may also use numeric character entities to represent double quotes (&#34;) and single quotes (&#39;). For double quotes you can also use the named character entity &quot;.

In certain cases, it is possible in HTML to specify the value of an attribute without any quotation marks. In that case, the attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), and periods (ASCII decimal 46). We suggest using quotation marks even when it is possible to eliminate them.

Attribute names are always case-insensitive. Attribute values are generally case-insensitive. The definition of each attribute in the reference manual indicates whether its value is case-insensitive.

Note: HTML documents may compress better if you use lower case letters for element and attribute names. The reason is that compression algorithms do a better job on more frequently repeated patterns, and lower case letters are more frequent than upper case ones.

HTML comments

HTML comments have the following syntax:

<!-- this is a comment -->

Comments must not be rendered by user agents as part of a document. Similarly, user agents must not render SGML processing instructions.

How to read the HTML DTD

This specification presents pertinent fragments of the DTD each time an element or attribute is defined. Though cryptic and dissuasive at first, the DTD fragment gives concise information about an element and its attributes. We have chosen to include the DTD fragments in the specification rather than seek a more approachable, but longer and less precise means of describing an element.
While almost all of the definitions include enough English text to make them comprehensible, for those who require definitive information, we complete this specification with a brief tutorial on reading the HTML DTD.

Block level and Inline elements

Certain HTML elements are said to be "block level" while others are "inline" (also known as "text level"). The distinction is founded on several notions:

Content model
   Generally, block level elements may contain inline elements and other block level elements. Generally, inline elements may contain only data and other inline elements. Inherent in this structural distinction is the idea that block elements create "larger" structures than inline elements.

Formatting
   By default, block level elements are formatted differently from inline elements. Block level elements generally begin on new lines; inline elements generally do not. Block level elements end an unterminated paragraph element. This enables you to omit end tags for paragraphs in many cases.

Directionality
   For technical reasons involving the [UNICODE] bidirectional text algorithm, block level and inline elements differ in how they inherit directionality information. For details, see the section on inheritance of text direction.

Style sheets provide the means to specify the rendering of arbitrary elements, including whether an element is rendered as block or inline. In some cases, such as an inline style for list elements, this may be appropriate, but generally speaking, authors are discouraged from overriding the conventional interpretation of HTML elements in this way. The alteration of the traditional presentation idioms for block level and inline elements also has an impact on the bidirectional text algorithm. See the section on the effect of style sheets on bidirectionality for more information.

DTD Comments

In DTDs, comments may spread over one or more lines. In the DTD, comments are delimited by a pair of "--" marks, e.g.
<!ELEMENT PARAM - O EMPTY -- named property value -->

Here, the comment "named property value" explains the use of the PARAM element. DTD comments for HTML do not have normative value.

Entity Definitions

The HTML DTD begins with a series of entity definitions. An entity definition (not to be confused with an SGML entity) defines a kind of macro that may be expanded elsewhere in the DTD. When the macro is referred to by name in the DTD, it is expanded into a string. An entity definition begins with the keyword <!ENTITY %. The following example defines the string that the %font entity will expand to. The string the entity expands to may contain other entity names. These names are expanded recursively. In the following example, the %inline entity is defined to include the %font, %phrase, %special and %formctrl entities. You will encounter two DTD entities frequently in the HTML DTD: %inline and %block. They are used when the content model includes inline and block level elements respectively.

Element definitions

The bulk of the HTML DTD consists of the definitions of elements and their attributes. The keyword <!ELEMENT begins an element definition and the > character ends it. Between these are specified:
1. The element's name.
2. Whether the element's tags are optional. Two hyphens that appear after the element name mean that the start and end tags are mandatory. One hyphen followed by the letter "O" (not zero) indicates that the end tag can be omitted. A pair of letter "O"s indicates that both the start and end tags can be omitted.
3. The element's content, if any. The allowed content for an element is called its content model. Elements with no content are called empty elements. Empty elements are defined with the keyword "EMPTY".

<!ELEMENT UL - - (LI)+>

In this example:
* The element being defined is UL.
* The two hyphens indicate that both the start tag and the end tag for this element are required.
* The content model for this element is defined to be "at least one LI element". We describe content models in detail below.
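
The three parts of an element definition listed above can be pulled apart mechanically. The following Python sketch is only an illustration: the function name parse_element_decl is hypothetical, the regular expression handles a single declaration with no embedded DTD comments or unexpanded entity references in the tag-omission positions, and the sample declarations are written to match the descriptions in this section (the last one, EXAMPLE, is invented to show the "O O" case).

```python
import re

# <!ELEMENT name start-flag end-flag content>, where each flag is "-" or "O".
ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+(?P<name>\S+)\s+(?P<start>[-O])\s+(?P<end>[-O])"
    r"\s+(?P<content>.+?)\s*>",
    re.DOTALL,
)


def parse_element_decl(decl):
    """Split one element definition into name, tag rules, and content model."""
    match = ELEMENT_DECL.fullmatch(decl.strip())
    if match is None:
        raise ValueError("not a simple element definition: %r" % decl)
    content = match["content"]
    return {
        "name": match["name"],
        # "O" (the letter) means the tag may be omitted; "-" means required.
        "start_tag_optional": match["start"] == "O",
        "end_tag_optional": match["end"] == "O",
        # The EMPTY keyword marks an element that must not have content.
        "empty": content == "EMPTY",
        "content_model": None if content == "EMPTY" else content,
    }


ul = parse_element_decl("<!ELEMENT UL - - (LI)+>")
img = parse_element_decl("<!ELEMENT IMG - O EMPTY>")
both = parse_element_decl("<!ELEMENT EXAMPLE O O (#PCDATA)>")
print(ul)
print(img)
print(both)
```

A fuller reader of the DTD would first strip "--" comments and expand DTD entities before applying a pattern like this one.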
This example illustrates the definition of an empty element:

<!ELEMENT IMG - O EMPTY>

* The element being defined is IMG.
* The hyphen and the following "O" indicate that the end tag can be omitted, but together with the content model "EMPTY", this is strengthened to the rule that the end tag must be omitted.
* The "EMPTY" keyword means the element must not have content.

Content model definitions

The content model describes what may be contained by an element. Content definitions may include:
* The names of allowed or forbidden elements (e.g., the UL element includes instances of the LI element).
* DTD entities (e.g., the LABEL element includes instances of the %inline entity).
* Document text (indicated by the SGML construct "#PCDATA"). Text may contain numeric and named character entities. Recall that these begin with & and end with a semicolon (e.g., "Herg&eacute;'s adventures of Tintin" includes the named entity for the "acute e" character).

The content model uses the following syntax to define what markup is allowed for the content of the element:

   ( ... )   Specifies a group.
   A | B     Either A or B may occur, but not both.
   A , B     A must occur before B.
   A & B     A and B must both occur once, but may do so in any order.
   A?        A can occur zero or one times.
   A*        A can occur zero or more times.
   A+        A can occur one or more times.

Here are some examples from the HTML DTD: The SELECT element must contain one or more OPTION elements. The DL element must contain one or more DT or DD elements in any order. The OPTION element may only contain text and entities, such as &amp;.

A few HTML elements use an additional SGML feature to exclude certain elements from a content model. Excluded elements are preceded by a hyphen. Explicit exclusions override inclusions. In this example, the -(A) signifies that the element A cannot be included in another A element (i.e., anchors may not be nested). Note that the A element is part of the DTD entity %inline, but is excluded explicitly because of -(A).
Similarly, the following element definition for FORM prohibits nested forms:

Attribute definitions

The keyword <!ATTLIST begins the definition of attributes that an element may take. It is followed by the name of the element in question and a list of attribute definitions. An attribute definition is a triplet that defines:
* The name of an attribute.
* The type of the attribute's value or an explicit set of possible values. Values defined explicitly by the DTD are case-insensitive.
* Whether the default value of the attribute is implicit (keyword "#IMPLIED"), in which case the default value must be supplied by the user agent (in some cases via inheritance from parent elements); always required (keyword "#REQUIRED"); or fixed to the given value (keyword "#FIXED"). Some attributes explicitly specify a default value for the attribute.

<!ATTLIST MAP name CDATA #IMPLIED>

In this example, the name attribute is defined for the MAP element. The attribute is optional for this element. The type of values permitted for the attribute is given as CDATA, an SGML data type. CDATA is text that may include character entities. For more information about "CDATA", "NAME", "ID", and other data types, please consult the section on HTML data types.

The following examples illustrate possible attribute definitions:

   rowspan     NUMBER     1          -- number of rows spanned by cell --
   http-equiv  NAME       #IMPLIED   -- HTTP response header name --
   id          ID         #IMPLIED   -- document-wide unique id --
   valign      (top|middle|bottom|baseline) #IMPLIED

The rowspan attribute requires values of type NUMBER. The default value is given explicitly as "1". The optional http-equiv attribute requires values of type NAME. The optional id attribute requires values of type ID. The optional valign attribute is constrained to take values from the set {top, middle, bottom, baseline}.

DTD entities in attribute definitions

Attribute definitions may also include DTD entities. In this example, we see that the attribute definition list for the LINK element begins with the %attrs entity.
The %attrs entity expands to: The %attrs entity has been defined for convenience since these seven attributes are defined for most HTML elements.

Similarly, the DTD defines the %URL entity as expanding into the string CDATA. As this example illustrates, the entity %URL provides readers of the DTD with more information as to the type of data expected for an attribute. Similar entities have been defined for %color, %Content-Type, %Length, %Pixels, etc.

Boolean attributes

Some attributes play the role of boolean variables (e.g., selected). Their appearance in the start tag of an element implies that the value of the attribute is "true". Their absence implies a value of "false". Boolean attributes may legally take a single value: the name of the attribute itself (e.g., selected="selected"). This example defines the selected attribute to be a boolean attribute.

   selected    (selected) #IMPLIED   -- reduced interitem spacing --

The attribute is set to "true" by appearing in the element's start tag: