Arnaud Le Hors, W3C
Ian Jacobs, W3C
Copyright ©1997-1999 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C
liability, trademark, document use and software licensing rules apply.
_________________________________________________________________
Abstract
This specification defines the HyperText Markup Language (HTML), the
publishing language of the World Wide Web. This specification defines
HTML 4.01, which is a subversion of HTML 4. In addition to the text,
multimedia, and hyperlink features of the previous versions of HTML
(HTML 3.2 [HTML32] and HTML 2.0 [RFC1866]), HTML 4 supports more
multimedia options, scripting languages, style sheets, better printing
facilities, and documents that are more accessible to users with
disabilities. HTML 4 also takes great strides towards the
internationalization of documents, with the goal of making the Web
truly World Wide.
HTML 4 is an SGML application conforming to International Standard ISO
8879 -- Standard Generalized Markup Language [ISO8879].
Status of this document
This section describes the status of this document at the time of its
publication. Other documents may supersede this document. The latest
status of this document series is maintained at the W3C.
This document specifies HTML 4.01, which is part of the HTML 4 line of
specifications. The first version of HTML 4 was HTML 4.0 [HTML40],
published on 18 December 1997 and revised 24 April 1998. This
specification is the first HTML 4.01 Recommendation. It includes
non-editorial changes since the 24 April version of HTML 4.0. There
have been some changes to the DTDs, for example. This document
obsoletes previous versions of HTML 4.0, although W3C will continue to
make those specifications and their DTDs available at the W3C Web
site.
This document has been reviewed by W3C Members and other interested
parties and has been endorsed by the Director as a W3C Recommendation.
It is a stable document and may be used as reference material or cited
as a normative reference from another document. W3C's role in making
the Recommendation is to draw attention to the specification and to
promote its widespread deployment. This enhances the functionality and
interoperability of the Web.
W3C recommends that user agents and authors (and in particular,
authoring tools) produce HTML 4.01 documents rather than HTML 4.0
documents. W3C recommends that authors produce HTML 4 documents
instead of HTML 3.2 documents. For reasons of backward compatibility,
W3C also recommends that tools interpreting HTML 4 continue to support
HTML 3.2 and HTML 2.0 as well.
For information about the next generation of HTML, "The Extensible
HyperText Markup Language" [XHTML], please refer to the W3C HTML
Activity and the list of W3C Technical Reports.
This document has been produced as part of the W3C HTML Activity. The
goals of the HTML Working Group (Members only) are discussed in the
HTML Working Group charter (Members only).
A list of current W3C Recommendations and other technical documents
can be found at http://www.w3.org/TR.
Public discussion on HTML features takes place on www-html@w3.org
(archives of www-html@w3.org).
Available languages
The English version of this specification is the only normative
version. However, for translations of this document, see
http://www.w3.org/MarkUp/html4-updates/translations.
Errata
The list of known errors in this specification is available at:
http://www.w3.org/MarkUp/html4-updates/errata
Please report errors in this document to www-html-editor@w3.org.
Quick Table of Contents
1. About the HTML 4 Specification
2. Introduction to HTML 4
3. On SGML and HTML
4. Conformance: requirements and recommendations
5. HTML Document Representation - Character sets, character
encodings, and entities
6. Basic HTML data types - Character data, colors, lengths, URIs,
content types, etc.
7. The global structure of an HTML document - The HEAD and BODY of a
document
8. Language information and text direction - International
considerations for text
9. Text - Paragraphs, Lines, and Phrases
10. Lists - Unordered, Ordered, and Definition Lists
11. Tables
12. Links - Hypertext and Media-Independent Links
13. Objects, Images, and Applets
14. Style Sheets - Adding style to HTML documents
15. Alignment, font styles, and horizontal rules
16. Frames - Multi-view presentation of documents
17. Forms - User-input Forms: Text Fields, Buttons, Menus, and more
18. Scripts - Animated Documents and Smart Forms
19. SGML reference information for HTML - Formal definition of HTML
and validation
20. SGML Declaration of HTML 4
21. Document Type Definition
22. Transitional Document Type Definition
23. Frameset Document Type Definition
24. Character entity references in HTML 4
A. Changes
B. Performance, Implementation, and Design Notes
* References
* Index of Elements
* Index of Attributes
* Index
Full Table of Contents
1. About the HTML 4 Specification
1. How the specification is organized
2. Document conventions
1. Elements and attributes
2. Notes and examples
3. Acknowledgments
1. Acknowledgments for the current revision
4. Copyright Notice
2. Introduction to HTML 4
1. What is the World Wide Web?
1. Introduction to URIs
2. Fragment identifiers
3. Relative URIs
2. What is HTML?
1. A brief history of HTML
3. HTML 4
1. Internationalization
2. Accessibility
3. Tables
4. Compound documents
5. Style sheets
6. Scripting
7. Printing
4. Authoring documents with HTML 4
1. Separate structure and presentation
2. Consider universal accessibility to the Web
3. Help user agents with incremental rendering
3. On SGML and HTML
1. Introduction to SGML
2. SGML constructs used in HTML
1. Elements
2. Attributes
3. Character references
4. Comments
3. How to read the HTML DTD
1. DTD Comments
2. Parameter entity definitions
3. Element declarations
# Content model definitions
4. Attribute declarations
# DTD entities in attribute definitions
# Boolean attributes
4. Conformance: requirements and recommendations
1. Definitions
2. SGML
3. The text/html content type
5. HTML Document Representation - Character sets, character
encodings, and entities
1. The Document Character Set
2. Character encodings
1. Choosing an encoding
# Notes on specific encodings
2. Specifying the character encoding
3. Character references
1. Numeric character references
2. Character entity references
4. Undisplayable characters
6. Basic HTML data types - Character data, colors, lengths, URIs,
content types, etc.
1. Case information
2. SGML basic types
3. Text strings
4. URIs
5. Colors
1. Notes on using colors
6. Lengths
7. Content types (MIME types)
8. Language codes
9. Character encodings
10. Single characters
11. Dates and times
12. Link types
13. Media descriptors
14. Script data
15. Style sheet data
16. Frame target names
7. The global structure of an HTML document - The HEAD and BODY of a
document
1. Introduction to the structure of an HTML document
2. HTML version information
3. The HTML element
4. The document head
1. The HEAD element
2. The TITLE element
3. The title attribute
4. Meta data
# Specifying meta data
# The META element
# Meta data profiles
5. The document body
1. The BODY element
2. Element identifiers: the id and class attributes
3. Block-level and inline elements
4. Grouping elements: the DIV and SPAN elements
5. Headings: The H1, H2, H3, H4, H5, H6 elements
6. The ADDRESS element
8. Language information and text direction - International
considerations for text
1. Specifying the language of content: the lang attribute
1. Language codes
2. Inheritance of language codes
3. Interpretation of language codes
2. Specifying the direction of text and tables: the dir
attribute
1. Introduction to the bidirectional algorithm
2. Inheritance of text direction information
3. Setting the direction of embedded text
4. Overriding the bidirectional algorithm: the BDO element
5. Character references for directionality and joining
control
6. The effect of style sheets on bidirectionality
9. Text - Paragraphs, Lines, and Phrases
1. White space
2. Structured text
1. Phrase elements: EM, STRONG, DFN, CODE, SAMP, KBD, VAR,
CITE, ABBR, and ACRONYM
2. Quotations: The BLOCKQUOTE and Q elements
# Rendering quotations
3. Subscripts and superscripts: the SUB and SUP elements
3. Lines and Paragraphs
1. Paragraphs: the P element
2. Controlling line breaks
# Forcing a line break: the BR element
# Prohibiting a line break
3. Hyphenation
4. Preformatted text: The PRE element
5. Visual rendering of paragraphs
4. Marking document changes: The INS and DEL elements
10. Lists - Unordered, Ordered, and Definition Lists
1. Introduction to lists
2. Unordered lists (UL), ordered lists (OL), and list items (LI)
3. Definition lists: the DL, DT, and DD elements
1. Visual rendering of lists
4. The DIR and MENU elements
11. Tables
1. Introduction to tables
2. Elements for constructing tables
1. The TABLE element
# Table directionality
2. Table Captions: The CAPTION element
3. Row groups: the THEAD, TFOOT, and TBODY elements
4. Column groups: the COLGROUP and COL elements
# The COLGROUP element
# The COL element
# Calculating the number of columns in a table
# Calculating the width of columns
5. Table rows: The TR element
6. Table cells: The TH and TD elements
# Cells that span several rows or columns
3. Table formatting by visual user agents
1. Borders and rules
2. Horizontal and vertical alignment
# Inheritance of alignment specifications
3. Cell margins
4. Table rendering by non-visual user agents
1. Associating header information with data cells
2. Categorizing cells
3. Algorithm to find heading information
5. Sample table
12. Links - Hypertext and Media-Independent Links
1. Introduction to links and anchors
1. Visiting a linked resource
2. Other link relationships
3. Specifying anchors and links
4. Link titles
5. Internationalization and links
2. The A element
1. Syntax of anchor names
2. Nested links are illegal
3. Anchors with the id attribute
4. Unavailable and unidentifiable resources
3. Document relationships: the LINK element
1. Forward and reverse links
2. Links and external style sheets
3. Links and search engines
4. Path information: the BASE element
1. Resolving relative URIs
13. Objects, Images, and Applets
1. Introduction to objects, images, and applets
2. Including an image: the IMG element
3. Generic inclusion: the OBJECT element
1. Rules for rendering objects
2. Object initialization: the PARAM element
3. Global naming schemes for objects
4. Object declarations and instantiations
4. Including an applet: the APPLET element
5. Notes on embedded documents
6. Image maps
1. Client-side image maps: the MAP and AREA elements
# Client-side image map examples
2. Server-side image maps
7. Visual presentation of images, objects, and applets
1. Width and height
2. White space around images and objects
3. Borders
4. Alignment
8. How to specify alternate text
14. Style Sheets - Adding style to HTML documents
1. Introduction to style sheets
2. Adding style to HTML
1. Setting the default style sheet language
2. Inline style information
3. Header style information: the STYLE element
4. Media types
3. External style sheets
1. Preferred and alternate style sheets
2. Specifying external style sheets
4. Cascading style sheets
1. Media-dependent cascades
2. Inheritance and cascading
5. Hiding style data from user agents
6. Linking to style sheets with HTTP headers
15. Alignment, font styles, and horizontal rules
1. Formatting
1. Background color
2. Alignment
3. Floating objects
# Float an object
# Float text around an object
2. Fonts
1. Font style elements: the TT, I, B, BIG, SMALL, STRIKE,
S, and U elements
2. Font modifier elements: FONT and BASEFONT
3. Rules: the HR element
16. Frames - Multi-view presentation of documents
1. Introduction to frames
2. Layout of frames
1. The FRAMESET element
# Rows and columns
# Nested frame sets
# Sharing data among frames
2. The FRAME element
# Setting the initial contents of a frame
# Visual rendering of a frame
3. Specifying target frame information
1. Setting the default target for links
2. Target semantics
4. Alternate content
1. The NOFRAMES element
2. Long descriptions of frames
5. Inline frames: the IFRAME element
17. Forms - User-input Forms: Text Fields, Buttons, Menus, and more
1. Introduction to forms
2. Controls
1. Control types
3. The FORM element
4. The INPUT element
1. Control types created with INPUT
2. Examples of forms containing INPUT controls
5. The BUTTON element
6. The SELECT, OPTGROUP, and OPTION elements
1. Pre-selected options
7. The TEXTAREA element
8. The ISINDEX element
9. Labels
1. The LABEL element
10. Adding structure to forms: the FIELDSET and LEGEND elements
11. Giving focus to an element
1. Tabbing navigation
2. Access keys
12. Disabled and read-only controls
1. Disabled controls
2. Read-only controls
13. Form submission
1. Form submission method
2. Successful controls
3. Processing form data
# Step one: Identify the successful controls
# Step two: Build a form data set
# Step three: Encode the form data set
# Step four: Submit the encoded form data set
4. Form content types
# application/x-www-form-urlencoded
# multipart/form-data
18. Scripts - Animated Documents and Smart Forms
1. Introduction to scripts
2. Designing documents for user agents that support scripting
1. The SCRIPT element
2. Specifying the scripting language
# The default scripting language
# Local declaration of a scripting language
# References to HTML elements from a script
3. Intrinsic events
4. Dynamic modification of documents
3. Designing documents for user agents that don't support
scripting
1. The NOSCRIPT element
2. Hiding script data from user agents
19. SGML reference information for HTML - Formal definition of HTML
and validation
1. Document Validation
2. Sample SGML catalog
20. SGML Declaration of HTML 4
1. SGML Declaration
21. Document Type Definition
22. Transitional Document Type Definition
23. Frameset Document Type Definition
24. Character entity references in HTML 4
1. Introduction to character entity references
2. Character entity references for ISO 8859-1 characters
1. The list of characters
3. Character entity references for symbols, mathematical
symbols, and Greek letters
1. The list of characters
4. Character entity references for markup-significant and
internationalization characters
1. The list of characters
A. Changes
1. Changes between 24 April 1998 HTML 4.0 and 24 December 1999
HTML 4.01 versions
1. Changes to the specification
# General changes
# On SGML and HTML
# HTML Document Representation
# Basic HTML data types
# Global structure of an HTML document
# Language information and text direction
# Tables
# Links
# Objects, Images, and Applets
# Style Sheets in HTML Documents
# Frames
# Forms
# SGML Declaration
# Strict DTD
# Notes
# References
2. Errors that were corrected
3. Minor typographical errors that were corrected
4. Clarifications
5. Known Browser problems
2. Changes between 18 December 1997 and 24 April 1998 versions
1. Errors that were corrected
2. Minor typographical errors that were corrected
3. Changes between HTML 3.2 and HTML 4.0 (18 December 1997)
1. Changes to elements
# New elements
# Deprecated elements
# Obsolete elements
2. Changes to attributes
3. Changes for accessibility
4. Changes for meta data
5. Changes for text
6. Changes for links
7. Changes for tables
8. Changes for images, objects, and image maps
9. Changes for forms
10. Changes for style sheets
11. Changes for frames
12. Changes for scripting
13. Changes for internationalization
B. Performance, Implementation, and Design Notes
1. Notes on invalid documents
2. Special characters in URI attribute values
1. Non-ASCII characters in URI attribute values
2. Ampersands in URI attribute values
3. SGML implementation notes
1. Line breaks
2. Specifying non-HTML data
# Element content
# Attribute values
3. SGML features with limited support
4. Boolean attributes
5. Marked Sections
6. Processing Instructions
7. Shorthand markup
4. Notes on helping search engines index your Web site
1. Search robots
# The robots.txt file
# Robots and the META element
5. Notes on tables
1. Design rationale
# Dynamic reformatting
# Incremental display
# Structure and presentation
# Row and column groups
# Accessibility
2. Recommended Layout Algorithms
# Fixed Layout Algorithm
# Autolayout Algorithm
6. Notes on forms
1. Incremental display
2. Future projects
7. Notes on scripting
1. Reserved syntax for future script macros
# Current Practice for Script Macros
8. Notes on frames
9. Notes on accessibility
10. Notes on security
1. Security issues for forms
* References
1. Normative references
2. Informative references
* Index of Elements
* Index of Attributes
* Index
1 About the HTML 4 Specification
1.1 How the specification is organized
This specification is divided into the following sections:
Sections 2 and 3: Introduction to HTML 4
The introduction describes HTML's place in the scheme of the
World Wide Web, provides a brief history of the development of
HTML, highlights what can be done with HTML 4, and provides
some HTML authoring tips.
The brief SGML tutorial gives readers some understanding of
HTML's relationship to SGML and gives summary information on
how to read the HTML Document Type Definition (DTD).
Sections 4 - 24: HTML 4 reference manual
The bulk of the reference manual consists of the HTML language
reference, which defines all elements and attributes of the
language.
This document has been organized by topic rather than by the
grammar of HTML. Topics are grouped into three categories:
structure, presentation, and interactivity. Although it is not
easy to divide HTML constructs perfectly into these three
categories, the model reflects the HTML Working Group's
experience that separating a document's structure from its
presentation produces more effective and maintainable
documents.
The language reference consists of the following information:
+ What characters may appear in an HTML document.
+ Basic data types of an HTML document.
+ Elements that govern the structure of an HTML document,
including text, lists, tables, links, and included objects,
images, and applets.
+ Elements that govern the presentation of an HTML document,
including style sheets, fonts, colors, rules, and other
visual presentation, and frames for multi-windowed
presentations.
+ Elements that govern interactivity with an HTML document,
including forms for user input and scripts for active
documents.
+ The SGML formal definition of HTML:
o The SGML declaration of HTML.
o Three DTDs: strict, transitional, and frameset.
o The list of character references.
Appendixes
The first appendix contains information about changes from HTML
3.2 to help authors and implementors with the transition to
HTML 4, and changes from the 18 December 1997 specification.
The second appendix contains performance and implementation
notes, and is primarily intended to help implementors create
user agents for HTML 4.
References
A list of normative and informative references.
Indexes
Three indexes give readers rapid access to the definition of
key concepts, elements and attributes.
1.2 Document conventions
This document has been written with two types of readers in mind:
authors and implementors. We hope the specification will provide
authors with the tools they need to write efficient, attractive, and
accessible documents, without over-exposing them to HTML's
implementation details. Implementors, however, should find all they
need to build conforming user agents.
The specification may be approached in several ways:
* Read from beginning to end. The specification begins with a
general presentation of HTML and becomes more and more technical
and specific towards the end.
* Quick access to information. In order to get information about
syntax and semantics as quickly as possible, the online version of
the specification includes the following features:
1. Every reference to an element or attribute is linked to its
definition in the specification. Each element or attribute is
defined in only one location.
2. Every page includes links to the indexes, so you are never
more than two links away from finding the definition of an
element or attribute.
3. The front pages of each section of the language reference
manual extend the initial table of contents with more detail
about that section.
1.2.1 Elements and attributes
Element names are written in uppercase letters (e.g., BODY). Attribute
names are written in lowercase letters (e.g., lang, onsubmit). Recall
that in HTML, element and attribute names are case-insensitive; the
convention is meant to encourage readability.
Element and attribute names in this document have been marked up and
may be rendered specially by some user agents.
Each attribute definition specifies the type of its value. If the type
allows a small set of possible values, the definition lists the set of
values, separated by a bar (|).
After the type information, each attribute definition indicates the
case-sensitivity of its values, between square brackets ("[]"). See
the section on case information for details.
1.2.2 Notes and examples
Informative notes are emphasized to stand out from surrounding text
and may be rendered specially by some user agents.
All examples illustrating deprecated usage are marked as "DEPRECATED
EXAMPLE". Deprecated examples also include recommended alternate
solutions. All examples that illustrate illegal usage are clearly
marked "ILLEGAL EXAMPLE".
Examples and notes have been marked up and may be rendered specially
by some user agents.
1.3 Acknowledgments
Thanks to everyone who has helped to author the working drafts that
went into the HTML 4 specification, and to all those who have sent
suggestions and corrections.
Many thanks to the Web Accessibility Initiative task force (WAI HC
group) for their work on improving the accessibility of HTML and to
T.V. Raman (Adobe) for his early work on developing accessible forms.
The authors of this specification, the members of the W3C HTML Working
Group, deserve much applause for their diligent review of this
document, their constructive comments, and their hard work: John D.
Burger (MITRE), Steve Byrne (JavaSoft), Martin J. Dürst (University of
Zurich), Daniel Glazman (Electricité de France), Scott Isaacs
(Microsoft), Murray Maloney (GRIF), Steven Pemberton (CWI), Robert
Pernett (Lotus), Jared Sorensen (Novell), Powell Smith (IBM), Robert
Stevahn (HP), Ed Tecot (Microsoft), Jeffrey Veen (HotWired), Mike
Wexler (Adobe), Misha Wolf (Reuters), and Lauren Wood (SoftQuad).
Thank you Dan Connolly (W3C) for rigorous and bountiful input as
part-time editor and thoughtful guidance as chairman of the HTML
Working Group. Thank you Sally Khudairi (W3C) for your indispensable
work on press releases.
Thanks to David M. Abrahamson and Roger Price for their careful
reading of the specification and constructive comments.
Thanks to Jan Kärrman, author of html2ps, for helping so much in
creating the PostScript version of the specification.
Of particular help from the W3C at Sophia-Antipolis were Janet Bertot,
Bert Bos, Stephane Boyera, Daniel Dardailler, Yves Lafon, Håkon Lie,
Chris Lilley, and Colas Nahaboo (Bull).
Lastly, thanks to Tim Berners-Lee without whom none of this would have
been possible.
1.3.1 Acknowledgments for the current revision
Many thanks to Shane McCarron for tracking errata for this revision of
the specification.
1.4 Copyright Notice
For information about copyrights, please refer to the W3C Intellectual
Property Notice, the W3C Document Notice, and the W3C IPR Software
Notice.
2 Introduction to HTML 4
2.1 What is the World Wide Web?
The World Wide Web (Web) is a network of information resources. The
Web relies on three mechanisms to make these resources readily
available to the widest possible audience:
1. A uniform naming scheme for locating resources on the Web (e.g.,
URIs).
2. Protocols, for access to named resources over the Web (e.g.,
HTTP).
3. Hypertext, for easy navigation among resources (e.g., HTML).
The ties between the three mechanisms are apparent throughout this
specification.
2.1.1 Introduction to URIs
Every resource available on the Web -- HTML document, image, video
clip, program, etc. -- has an address that may be encoded by a
Uniform Resource Identifier, or "URI".
URIs typically consist of three pieces:
1. The naming scheme of the mechanism used to access the resource.
2. The name of the machine hosting the resource.
3. The name of the resource itself, given as a path.
Consider the URI that designates the W3C Technical Reports page:
http://www.w3.org/TR
This URI may be read as follows: There is a document available via the
HTTP protocol (see [RFC2616]), residing on the machine www.w3.org,
accessible via the path "/TR". Other schemes you may see in HTML
documents include "mailto" for email and "ftp" for FTP.
Here is another example of a URI. This one refers to a user's mailbox:
   ...this is text...
   For all comments, please send email to
   <A href="mailto:joe@someplace.com">Joe Cool</A>.
Note. Most readers may be familiar with the term "URL" and not the
term "URI". URLs form a subset of the more general URI naming scheme.
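As a rough illustration of the three pieces described in this section, the following Python sketch splits a URI with the standard library function urllib.parse.urlsplit. The helper name uri_pieces and the mailbox address are assumptions made for this example only and are not part of this specification.

```python
from urllib.parse import urlsplit


def uri_pieces(uri):
    """Return (scheme, machine, path) for a URI string."""
    parts = urlsplit(uri)
    # scheme: the naming scheme of the access mechanism, e.g. "http"
    # netloc: the machine hosting the resource (empty for "mailto")
    # path:   the name of the resource itself
    return parts.scheme, parts.netloc, parts.path


if __name__ == "__main__":
    print(uri_pieces("http://www.w3.org/TR"))
    # Hypothetical mailbox, used only for illustration.
    print(uri_pieces("mailto:someone@example.org"))
```

Note that a "mailto" URI names no host machine, so only the scheme and the path carry information in that case.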
2.1.2 Fragment identifiers
Some URIs refer to a location within a resource. This kind of URI ends
with "#" followed by an anchor identifier (called the fragment
identifier). For instance, here is a URI pointing to an anchor named
section_2:
http://somesite.com/html/top.html#section_2
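The split between a resource and its fragment identifier can be sketched in Python with the standard library function urllib.parse.urldefrag. The helper name split_fragment is hypothetical, and the sketch only separates the two parts; it does not check that the named anchor exists.

```python
from urllib.parse import urldefrag


def split_fragment(uri):
    """Return (resource_uri, fragment_identifier).

    The fragment identifier is the text after "#"; it is an empty
    string when the URI has no fragment.
    """
    resource, fragment = urldefrag(uri)
    return resource, fragment


if __name__ == "__main__":
    print(split_fragment("http://somesite.com/html/top.html#section_2"))
```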
2.1.3 Relative URIs
A relative URI doesn't contain any naming scheme information. Its path
generally refers to a resource on the same machine as the current
document. Relative URIs may contain relative path components (e.g.,
".." means one level up in the hierarchy defined by the path), and may
contain fragment identifiers.
Relative URIs are resolved to full URIs using a base URI. As an
example of relative URI resolution, assume we have the base URI
"http://www.acme.com/support/intro.html". The relative URI in the
following markup for a hypertext link:
   <A href="suppliers.html">Suppliers</A>
would expand to the full URI
"http://www.acme.com/support/suppliers.html", while the relative URI
in the following markup for an image
   <IMG src="../icons/logo.gif" alt="logo">
would expand to the full URI "http://www.acme.com/icons/logo.gif".
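The resolution described above can be reproduced with the Python standard library function urllib.parse.urljoin. This is a minimal sketch, not a normative algorithm: the helper name resolve is hypothetical, and the relative URIs "suppliers.html" and "../icons/logo.gif" are assumed values chosen to be consistent with the expanded URIs given in the text.

```python
from urllib.parse import urljoin

# Base URI taken from the example in the text.
BASE_URI = "http://www.acme.com/support/intro.html"


def resolve(relative_uri, base_uri=BASE_URI):
    """Resolve a relative URI against a base URI and return the full URI."""
    return urljoin(base_uri, relative_uri)


if __name__ == "__main__":
    # A sibling document in the same directory as the base document.
    print(resolve("suppliers.html"))
    # ".." moves one level up in the path hierarchy before descending.
    print(resolve("../icons/logo.gif"))
```

The last path segment of the base URI ("intro.html") is dropped before the relative path is appended, which is why both results are rooted in the directory of the base document rather than in the document itself.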
In HTML, URIs are used to:
* Link to another document or resource (see the A and LINK
elements).
* Link to an external style sheet or script (see the LINK and SCRIPT
elements).
* Include an image, object, or applet in a page (see the IMG,
OBJECT, APPLET and INPUT elements).
* Create an image map (see the MAP and AREA elements).
* Submit a form (see FORM).
* Create a frame document (see the FRAME and IFRAME elements).
* Cite an external reference (see the Q, BLOCKQUOTE, INS and DEL
elements).
* Refer to metadata conventions describing a document (see the HEAD
element).
Please consult the section on the URI type for more information about
URIs.
2.2 What is HTML?
To publish information for global distribution, one needs a
universally understood language, a kind of publishing mother tongue
that all computers may potentially understand. The publishing language
used by the World Wide Web is HTML (from HyperText Markup Language).
HTML gives authors the means to:
* Publish online documents with headings, text, tables, lists,
photos, etc.
* Retrieve online information via hypertext links, at the click of a
button.
* Design forms for conducting transactions with remote services, for
use in searching for information, making reservations, ordering
products, etc.
* Include spread-sheets, video clips, sound clips, and other
applications directly in their documents.
2.2.1 A brief history of HTML
HTML was originally developed by Tim Berners-Lee while at CERN, and
popularized by the Mosaic browser developed at NCSA. During the course
of the 1990s it has blossomed with the explosive growth of the Web.
During this time, HTML has been extended in a number of ways. The Web
depends on Web page authors and vendors sharing the same conventions
for HTML. This has motivated joint work on specifications for HTML.
HTML 2.0 (November 1995, see [RFC1866]) was developed under the aegis
of the Internet Engineering Task Force (IETF) to codify common
practice in late 1994. HTML+ (1993) and HTML 3.0 (1995, see [HTML30])
proposed much richer versions of HTML. Despite never receiving
consensus in standards discussions, these drafts led to the adoption
of a range of new features. The efforts of the World Wide Web
Consortium's HTML Working Group to codify common practice in 1996
resulted in HTML 3.2 (January 1997, see [HTML32]). Changes from HTML
3.2 are summarized in Appendix A.
Most people agree that HTML documents should work well across
different browsers and platforms. Achieving interoperability lowers
costs to content providers since they must develop only one version of
a document. If the effort is not made, there is much greater risk that
the Web will devolve into a proprietary world of incompatible formats,
ultimately reducing the Web's commercial potential for all
participants.
Each version of HTML has attempted to reflect greater consensus among
industry players so that the investment made by content providers will
not be wasted and that their documents will not become unreadable in a
short period of time.
HTML has been developed with the vision that all manner of devices
should be able to use information on the Web: PCs with graphics
displays of varying resolution and color depths, cellular telephones,
hand-held devices, devices for speech output and input, computers
with high or low bandwidth, and so on.
2.3 HTML 4
HTML 4 extends HTML with mechanisms for style sheets, scripting,
frames, embedding objects, improved support for right-to-left and
mixed-direction text, richer tables, and enhancements to forms,
offering improved accessibility for people with disabilities.
HTML 4.01 is a revision of HTML 4.0 that corrects errors and makes
some changes since the previous revision.
2.3.1 Internationalization
This version of HTML has been designed with the help of experts in the
field of internationalization, so that documents may be written in
every language and be transported easily around the world. This has
been accomplished by incorporating [RFC2070], which deals with the
internationalization of HTML.
One important step has been the adoption of the ISO/IEC 10646 standard
(see [ISO10646]) as the document character set for HTML. This is the
world's most inclusive standard dealing with issues of the
representation of international characters, text direction,
punctuation, and other world language issues.
HTML now offers greater support for diverse human languages within a
document. This allows for more effective indexing of documents for
search engines, higher-quality typography, better text-to-speech
conversion, better hyphenation, etc.
2.3.2 Accessibility
As the Web community grows and its members diversify in their
abilities and skills, it is crucial that the underlying technologies
be appropriate to their specific needs. HTML has been designed to make
Web pages more accessible to those with physical limitations. HTML 4
developments inspired by concerns for accessibility include:
* Better distinction between document structure and presentation,
thus encouraging the use of style sheets instead of HTML
presentation elements and attributes.
* Better forms, including the addition of access keys, the ability
to group form controls semantically, the ability to group SELECT
options semantically, and active labels.
* The ability to mark up a text description of an included object
(with the OBJECT element).
* A new client-side image map mechanism (the MAP element) that
allows authors to integrate image and text links.
* The requirement that alternate text accompany images included with
the IMG element and image maps included with the AREA element.
* Support for the title and lang attributes on all elements.
* Support for the ABBR and ACRONYM elements.
* A wider range of target media (tty, braille, etc.) for use with
style sheets.
* Better tables, including captions, column groups, and mechanisms
to facilitate non-visual rendering.
* Long descriptions of tables, images, frames, etc.
Authors who design pages with accessibility issues in mind will not
only receive the blessings of the accessibility community, but will
benefit in other ways as well: well-designed HTML documents that
distinguish structure and presentation will adapt more easily to new
technologies.
Note. For more information about designing accessible HTML documents,
please consult [WAI].
2.3.3 Tables
The new table model in HTML is based on [RFC1942]. Authors now have
greater control over structure and layout (e.g., column groups). The
ability of designers to recommend column widths allows user agents to
display table data incrementally (as it arrives) rather than waiting
for the entire table before rendering.
Note. At the time of writing, some HTML authoring tools rely
extensively on tables for formatting, which may easily cause
accessibility problems.
2.3.4 Compound documents
HTML now offers a standard mechanism for embedding generic media
objects and applications in HTML documents. The OBJECT element
(together with its more specific ancestor elements IMG and APPLET)
provides a mechanism for including images, video, sound, mathematics,
specialized applications, and other objects in a document. It also
allows authors to specify a hierarchy of alternate renderings for user
agents that don't support a specific rendering.
2.3.5 Style sheets
Style sheets simplify HTML markup and largely relieve HTML of the
responsibilities of presentation. They give both authors and users
control over the presentation of documents -- font information,
alignment, colors, etc.
Style information can be specified for individual elements or groups
of elements. Style information may be specified in an HTML document or
in external style sheets.
The mechanism for associating a style sheet with a document is
independent of the style sheet language.
Before the advent of style sheets, authors had limited control over
rendering. HTML 3.2 included a number of attributes and elements
offering control over alignment, font size, and text color. Authors
also exploited tables and images as a means for laying out pages. The
relatively long time it takes for users to upgrade their browsers
means that these features will continue to be used for some time.
However, since style sheets offer more powerful presentation
mechanisms, the World Wide Web Consortium will eventually phase out
many of HTML's presentation elements and attributes. Throughout the
specification, elements and attributes at risk are marked as
"deprecated". They are accompanied by examples of how to achieve the
same effects with other elements or style sheets.
2.3.6 Scripting
Through scripts, authors may create dynamic Web pages (e.g., "smart
forms" that react as users fill them out) and use HTML as a means to
build networked applications.
The mechanisms provided to include scripts in an HTML document are
independent of the scripting language.
2.3.7 Printing
Sometimes, authors will want to make it easy for users to print more
than just the current document. When documents form part of a larger
work, the relationships between them can be described using the HTML
LINK element or using W3C's Resource Description Framework (RDF) (see
[RDF10]).
2.4 Authoring documents with HTML 4
We recommend that authors and implementors observe the following
general principles when working with HTML 4.
2.4.1 Separate structure and presentation
HTML has its roots in SGML which has always been a language for the
specification of structural markup. As HTML matures, more and more of
its presentational elements and attributes are being replaced by other
mechanisms, in particular style sheets. Experience has shown that
separating the structure of a document from its presentational aspects
reduces the cost of serving a wide range of platforms, media, etc.,
and facilitates document revisions.
2.4.2 Consider universal accessibility to the Web
To make the Web more accessible to everyone, notably those with
disabilities, authors should consider how their documents may be
rendered on a variety of platforms: speech-based browsers,
braille-readers, etc. We do not recommend that authors limit their
creativity, only that they consider alternate renderings in their
design. HTML offers a number of mechanisms to this end (e.g., the alt
attribute, the accesskey attribute, etc.).
Furthermore, authors should keep in mind that their documents may be
reaching a far-off audience with different computer configurations. In
order for documents to be interpreted correctly, authors should
include in their documents information about the natural language and
direction of the text, how the document is encoded, and other issues
related to internationalization.
2.4.3 Help user agents with incremental rendering
By carefully designing their tables and making use of new table
features in HTML 4, authors can help user agents render documents more
quickly. Authors can learn how to design tables for incremental
rendering (see the TABLE element). Implementors should consult the
notes on tables in the appendix for information on incremental
algorithms.
3 On SGML and HTML
Contents
1. Introduction to SGML
2. SGML constructs used in HTML
1. Elements
2. Attributes
3. Character references
4. Comments
3. How to read the HTML DTD
1. DTD Comments
2. Parameter entity definitions
3. Element declarations
o Content model definitions
4. Attribute declarations
o DTD entities in attribute definitions
o Boolean attributes
This section of the document introduces SGML and discusses its
relationship to HTML. A complete discussion of SGML is left to the
standard (see [ISO8879]).
3.1 Introduction to SGML
SGML is a system for defining markup languages. Authors mark up their
documents by representing structural, presentational, and semantic
information alongside content. HTML is one example of a markup
language. Here is an example of an HTML document:
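   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
      "http://www.w3.org/TR/html4/strict.dtd">
   <HTML>
      <HEAD>
         <TITLE>My first HTML document</TITLE>
      </HEAD>
      <BODY>
         <P>Hello world!
      </BODY>
   </HTML>
An HTML document is divided into a head section (here, between <HEAD>
and </HEAD>) and a body (here, between <BODY> and </BODY>). The title
of the document appears in the head (along with other information
about the document), and the content of the document appears in the
body. The body in this example contains just one paragraph, marked up
with <P>.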
Each markup language defined in SGML is called an SGML application. An
SGML application is generally characterized by:
1. An SGML declaration. The SGML declaration specifies which
characters and delimiters may appear in the application.
2. A document type definition (DTD). The DTD defines the syntax of
markup constructs. The DTD may include additional definitions such
as character entity references.
3. A specification that describes the semantics to be ascribed to the
markup. This specification also imposes syntax restrictions that
cannot be expressed within the DTD.
4. Document instances containing data (content) and markup. Each
instance contains a reference to the DTD to be used to interpret
it.
This specification includes an SGML declaration, three document type
definitions (see the section on HTML version information for a
description of the three), and a list of character references.
3.2 SGML constructs used in HTML
The following sections introduce SGML constructs that are used in
HTML.
The appendix lists some SGML features that are not widely supported by
HTML tools and user agents and should be avoided.
3.2.1 Elements
An SGML document type definition declares element types that represent
structures or desired behavior. HTML includes element types that
represent paragraphs, hypertext links, lists, tables, images, etc.
Each element type declaration generally describes three parts: a start
tag, content, and an end tag.
The element's name appears in the start tag (written <element-name>)
and the end tag (written </element-name>); note the slash before the
element name in the end tag. For example, the start and end tags of
the UL element type delimit the items in a list:
   <UL>
   <LI><P>...list item 1...
   <LI><P>...list item 2...
   </UL>
Some HTML element types allow authors to omit end tags (e.g., the P
and LI element types). A few element types also allow the start tags
to be omitted; for example, HEAD and BODY. The HTML DTD indicates for
each element type whether the start tag and end tag are required.
Some HTML element types have no content. For example, the line break
element BR has no content; its only role is to terminate a line of
text. Such empty elements never have end tags. The document type
definition and the text of the specification indicate whether an
element type is empty (has no content) or, if it can have content,
what is considered legal content.
Element names are always case-insensitive.
Please consult the SGML standard for the rules governing elements. For
example, elements must be properly nested, and an end tag closes, back
to the matching start tag, all unclosed intervening start tags whose
end tags were omitted (section 7.5.1).
For example, the following paragraph:
   <P>This is the first paragraph.</P>
   ...a block element...
may be rewritten without its end tag:
   <P>This is the first paragraph.
   ...a block element...
since the <P> start tag is closed by the following block element.
Similarly, if a paragraph is enclosed by a block element, as in:
   <DIV>
   <P>This is the paragraph.
   </DIV>
the end tag of the enclosing block element (here, </DIV>) implies the
end tag of the open <P> start tag.
Elements are not tags. Some people refer to elements as tags (e.g.,
"the P tag"). Remember that the element is one thing, and the tag (be
it start or end tag) is another. For instance, the HEAD element is
always present, even though both start and end HEAD tags may be
missing in the markup.
All the element types declared in this specification are listed in the
element index.
3.2.2 Attributes
Elements may have associated properties, called attributes, which may
have values (by default, or set by authors or scripts).
Attribute/value pairs appear before the final ">" of an element's
start tag. Any number of (legal) attribute value pairs, separated by
spaces, may appear in an element's start tag. They may appear in any
order.
In this example, the id attribute is set for an H1 element:
   <H1 id="section1">
   This is an identified heading thanks to the id attribute
   </H1>
By default, SGML requires that all attribute values be delimited using
either double quotation marks (ASCII decimal 34) or single quotation
marks (ASCII decimal 39). Single quote marks can be included within
the attribute value when the value is delimited by double quote marks,
and vice versa. Authors may also use numeric character references to
represent double quotes (&#34;) and single quotes (&#39;). For double
quotes authors can also use the character entity reference &quot;.
In certain cases, authors may specify the value of an attribute
without any quotation marks. The attribute value may only contain
letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45),
periods (ASCII decimal 46), underscores (ASCII decimal 95), and colons
(ASCII decimal 58). We recommend using quotation marks even when it is
possible to eliminate them.
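As a rough illustration of the rule above, the following Python sketch
tests whether a value consists only of the characters listed. The
function name may_be_unquoted is hypothetical, and rejecting the empty
string is an assumption of this sketch rather than something stated in
the text.

```python
import re

# Characters the text above allows in an unquoted attribute value:
# letters, digits, hyphens, periods, underscores, and colons.
_UNQUOTED = re.compile(r"[A-Za-z0-9._:-]+")


def may_be_unquoted(value: str) -> bool:
    """Return True if `value` uses only the characters listed above."""
    # fullmatch requires the whole string to match. An empty value is
    # rejected here, which is an assumption of this sketch.
    return _UNQUOTED.fullmatch(value) is not None


print(may_be_unquoted("section1"))   # True
print(may_be_unquoted("two words"))  # False: contains a space
print(may_be_unquoted("a/b.html"))   # False: "/" is not in the list
```

Even when such a check succeeds, the recommendation to quote the value
still applies.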
Attribute names are always case-insensitive.
Attribute values are generally case-insensitive. The definition of
each attribute in the reference manual indicates whether its value is
case-insensitive.
All the attributes defined by this specification are listed in the
attribute index.
3.2.3 Character references
Character references are numeric or symbolic names for characters that
may be included in an HTML document. They are useful for referring to
rarely used characters, or those that authoring tools make it
difficult or impossible to enter. You will see character references
throughout this document; they begin with a "&" sign and end with a
semi-colon (;). Some common examples include:
* "&lt;" represents the < sign.
* "&gt;" represents the > sign.
* "&quot;" represents the " mark.
* "&#229;" (in decimal) represents the letter "a" with a small
circle above it.
* "&#1048;" (in decimal) represents the Cyrillic capital letter "I".
* "&#x6C34;" (in hexadecimal) represents the Chinese character for
water.
We discuss HTML character references in detail later in the section on
the HTML document character set. The specification also contains a
list of character references that may appear in HTML 4 documents.
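To make the mapping from references to characters concrete, the
following Python sketch decodes the six references discussed in this
section with the standard library function html.unescape. Note that
html.unescape follows the rules of a later version of HTML, so it is
used here only as a convenient approximation; the samples table is
purely illustrative.

```python
from html import unescape

# Each reference from this section paired with the character it names.
samples = {
    "&lt;": "<",
    "&gt;": ">",
    "&quot;": '"',
    "&#229;": "\u00e5",    # letter "a" with a small circle above it
    "&#1048;": "\u0418",   # Cyrillic capital letter "I"
    "&#x6C34;": "\u6c34",  # Chinese character for water
}

for reference, expected in samples.items():
    decoded = unescape(reference)
    assert decoded == expected
    # Print the code position rather than the character itself so the
    # output does not depend on the terminal's encoding.
    print(f"{reference} -> U+{ord(decoded):04X}")
```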
3.2.4 Comments
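HTML comments have the following syntax:
   <!-- this is a comment -->
   <!-- and so is this one,
       which occupies more than one line -->
White space is not permitted between the markup declaration open
delimiter ("<!") and the comment open delimiter ("--"), but is
permitted between the comment close delimiter ("--") and the markup
declaration close delimiter (">"). A common error is to include a
string of hyphens ("---") within a comment. Authors should avoid
putting two or more adjacent hyphens inside comments.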
Information that appears inside a comment has no special meaning
(e.g., character references are not interpreted as such).
Note that comments are markup.
3.3 How to read the HTML DTD
Each element and attribute declaration in this specification is
accompanied by its document type definition fragment. We have chosen
to include the DTD fragments in the specification rather than seek a
more approachable, but longer and less precise means of describing an
element's properties. The following tutorial should allow readers
unfamiliar with SGML to read the DTD and understand the technical
details of the HTML specification.
3.3.1 DTD Comments
In DTDs, comments may spread over one or more lines. In the DTD,
comments are delimited by a pair of "--" marks, e.g.
   <!ELEMENT PARAM - O EMPTY       -- named property value -->
Here, the comment "named property value" explains the use of the PARAM
element type. Comments in the DTD are informative only.
3.3.2 Parameter entity definitions
The HTML DTD begins with a series of parameter entity definitions. A
parameter entity definition defines a kind of macro that may be
referenced and expanded elsewhere in the DTD. These macros may not
appear in HTML documents, only in the DTD. Other types of macros,
called character references, may be used in the text of an HTML
document or within attribute values.
When the parameter entity is referred to by name in the DTD, it is
expanded into a string.
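A parameter entity definition begins with the keyword <!ENTITY %
followed by the entity name, the quoted string the entity expands to,
and finally a closing >. Instances of parameter entities in a DTD
begin with "%", then the parameter entity name, and are terminated by
an optional ";".
The following example defines the string that the "%fontstyle;" entity
will expand to.
   <!ENTITY % fontstyle "TT | I | B | BIG | SMALL">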
The string the parameter entity expands to may contain other parameter
entity names. These names are expanded recursively. In the following
example, the "%inline;" parameter entity is defined to include the
"%fontstyle;", "%phrase;", "%special;" and "%formctrl;" parameter
entities.
   <!ENTITY % inline "#PCDATA | %fontstyle; | %phrase; | %special; | %formctrl;">
You will encounter two DTD entities frequently in the HTML DTD:
"%block;" and "%inline;". They are used when the content model includes
block-level and inline elements, respectively (defined in the section
on the global structure of an HTML document).
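The recursive expansion of parameter entities described in this
section can be sketched in a few lines of Python. This is only an
illustration: the function name expand, the regular expression, and the
small entity table are assumptions of the sketch (the "phrase" value in
particular is abbreviated), not a description of how an SGML parser is
implemented.

```python
import re

# "%", then an entity name, then an optional ";".
PE_REFERENCE = re.compile(r"%([A-Za-z][A-Za-z0-9._-]*);?")


def expand(text: str, entities: dict, _seen: tuple = ()) -> str:
    """Recursively replace parameter entity references in `text`."""

    def replace(match):
        name = match.group(1)
        if name in _seen:
            # Guard against definitions that refer to themselves.
            raise ValueError(f"recursive entity: {name}")
        return expand(entities[name], entities, _seen + (name,))

    return PE_REFERENCE.sub(replace, text)


# A toy entity table, abbreviated for illustration only.
ENTITIES = {
    "fontstyle": "TT | I | B | BIG | SMALL",
    "phrase": "EM | STRONG",
    "inline": "#PCDATA | %fontstyle; | %phrase;",
}

print(expand("%inline;", ENTITIES))
# #PCDATA | TT | I | B | BIG | SMALL | EM | STRONG
```

A reference to an entity missing from the table raises KeyError in this
sketch; a real parser would report an error in the DTD instead.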
3.3.3 Element declarations
The bulk of the HTML DTD consists of the declarations of element types
and their attributes. The <!ELEMENT keyword begins a declaration and
the > character ends it. Between these are specified:
1. The element's name.
2. Whether the element's tags are optional. Two hyphens that appear
after the element name mean that the start and end tags are
mandatory. One hyphen followed by the letter "O" indicates that
the end tag can be omitted. A pair of letter "O"s indicate that
both the start and end tags can be omitted.
3. The element's content, if any. The allowed content for an element
is called its content model. Element types that are designed to
have no content are called empty elements. The content model for
such element types is declared using the keyword "EMPTY".
In this example:
   <!ELEMENT UL - - (LI)+>
* The element type being declared is UL.
* The two hyphens indicate that both the start tag <UL> and the end
tag </UL> for this element type are required.
* The content model for this element type is declared to be "at
least one LI element". Below, we explain how to specify content
models.
This example illustrates the declaration of an empty element type:
   <!ELEMENT IMG - O EMPTY>
* The element type being declared is IMG.
* The hyphen and the following "O" indicate that the end tag can be
omitted, but together with the content model "EMPTY", this is
strengthened to the rule that the end tag must be omitted.
* The "EMPTY" keyword means that instances of this type must not
have content.
Content model definitions
The content model describes what may be contained by an instance of an
element type. Content model definitions may include:
* The names of allowed or forbidden element types (e.g., the UL
element contains instances of the LI element type, and the P
element type may not contain other P elements).
* DTD entities (e.g., the LABEL element contains instances of the
"%inline;" parameter entity).
* Document text (indicated by the SGML construct "#PCDATA"). Text
may contain character references. Recall that these begin with &
and end with a semicolon (e.g., "Herg&eacute;'s adventures of
Tintin" contains the character entity reference for the "e acute"
character).
The content model of an element is specified with the following
syntax. Please note that the list below is a simplification of the
full SGML syntax rules and does not address, e.g., precedences.
( ... )
Delimits a group.
A
A must occur, one time only.
A+
A must occur one or more times.
A?
A must occur zero or one time.
A*
A may occur zero or more times.
+(A)
A may occur.
-(A)
A must not occur.
A | B
Either A or B must occur, but not both.
A , B
Both A and B must occur, in that order.
A & B
Both A and B must occur, in any order.
Here are some examples from the HTML DTD:
   <!ELEMENT UL - - (LI)+>
The UL element must contain one or more LI elements.
   <!ELEMENT DL - - (DT|DD)+>
The DL element must contain one or more DT or DD elements in any
order.
   <!ELEMENT OPTION - O (#PCDATA)>
The OPTION element may only contain text and entities, such as &amp;
-- this is indicated by the SGML data type #PCDATA.
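The occurrence indicators "+", "?", and "*" behave much like their
counterparts in regular expressions. As an informal illustration, the
Python sketch below hand-translates the UL and DL content models into
regular expressions over a list of child element names. The table
CONTENT_MODELS and the function children_allowed are hypothetical, and
the sketch ignores omitted tags, inclusions, exclusions, and text
content.

```python
import re

# Hand translations of two content models. Each child element name is
# followed by one space so that names cannot run together.
CONTENT_MODELS = {
    "UL": r"(LI )+",       # (LI)+
    "DL": r"((DT|DD) )+",  # (DT|DD)+
}


def children_allowed(parent, children):
    """Check a list of child element names against the parent's model."""
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(CONTENT_MODELS[parent], sequence) is not None


print(children_allowed("UL", ["LI", "LI"]))        # True
print(children_allowed("UL", []))                  # False: "+" needs one
print(children_allowed("DL", ["DT", "DD", "DD"]))  # True
print(children_allowed("UL", ["LI", "P"]))         # False: P not allowed
```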
A few HTML element types use an additional SGML feature to exclude
elements from their content model. Excluded elements are preceded by a
hyphen. Explicit exclusions override permitted elements.
In this example, the -(A) signifies that the element A cannot appear
in another A element (i.e., anchors may not be nested).
   <!ELEMENT A - - (%inline;)* -(A)>
Note that the A element type is part of the DTD parameter entity
"%inline;", but is excluded explicitly because of -(A).
Similarly, the following element type declaration for FORM prohibits
nested forms:
   <!ELEMENT FORM - - (%block;|SCRIPT)+ -(FORM)>
3.3.4 Attribute declarations
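The <!ATTLIST keyword begins the declaration of attributes that an
element may take. It is followed by the name of the element in
question, a list of attribute definitions, and a closing >. Each
attribute definition is a triplet that defines: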
* The name of an attribute.
* The type of the attribute's value or an explicit set of possible
values. Values defined explicitly by the DTD are case-insensitive.
Please consult the section on basic HTML data types for more
information about attribute value types.
* Whether the default value of the attribute is implicit (keyword
"#IMPLIED"), in which case the default value must be supplied by
the user agent (in some cases via inheritance from parent
elements); always required (keyword "#REQUIRED"); or fixed to the
given value (keyword "#FIXED"). Some attribute definitions
explicitly specify a default value for the attribute.
In this example, the name attribute is defined for the MAP element.
The attribute is optional for this element.
   <!ATTLIST MAP
     name        CDATA     #IMPLIED
     >
The type of values permitted for the attribute is given as CDATA, an
SGML data type. CDATA is text that may contain character references.
For more information about "CDATA", "NAME", "ID", and other data
types, please consult the section on HTML data types.
The following examples illustrate several attribute definitions:
rowspan NUMBER 1 -- number of rows spanned by cell --
http-equiv NAME #IMPLIED -- HTTP response header name --
id ID #IMPLIED -- document-wide unique id --
valign (top|middle|bottom|baseline) #IMPLIED
The rowspan attribute requires values of type NUMBER. The default
value is given explicitly as "1". The optional http-equiv attribute
requires values of type NAME. The optional id attribute requires
values of type ID. The optional valign attribute is constrained to
take values from the set {top, middle, bottom, baseline}.
DTD entities in attribute definitions
Attribute definitions may also contain parameter entity references.
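In this example, we see that the attribute definition list for the
LINK element begins with the "%attrs;" parameter entity.
   <!ELEMENT LINK - O EMPTY               -- a media-independent link -->
   <!ATTLIST LINK
     %attrs;                              -- %coreattrs, %i18n, %events --
     charset     %Charset;      #IMPLIED  -- char encoding of linked resource --
     href        %URI;          #IMPLIED  -- URI for linked resource --
     hreflang    %LanguageCode; #IMPLIED  -- language code --
     type        %ContentType;  #IMPLIED  -- advisory content type --
     rel         %LinkTypes;    #IMPLIED  -- forward link types --
     rev         %LinkTypes;    #IMPLIED  -- reverse link types --
     media       %MediaDesc;    #IMPLIED  -- for rendering on these media --
     >
Start tag: required, End tag: forbidden
The "%attrs;" parameter entity is defined as follows:
   <!ENTITY % attrs "%coreattrs; %i18n; %events;">
The "%coreattrs;" parameter entity in the "%attrs;" definition expands
as follows:
   <!ENTITY % coreattrs
    "id          ID             #IMPLIED  -- document-wide unique id --
     class       CDATA          #IMPLIED  -- space-separated list of classes --
     style       %StyleSheet;   #IMPLIED  -- associated style info --
     title       %Text;         #IMPLIED  -- advisory title --"
     >
The "%attrs;" parameter entity has been defined for convenience since
these attributes are defined for most HTML element types.
Similarly, the DTD defines the "%URI;" parameter entity as expanding
into the string "CDATA".
   <!ENTITY % URI "CDATA"
       -- a Uniform Resource Identifier,
          see [URI]
       -->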
As this example illustrates, the parameter entity "%URI;" provides
readers of the DTD with more information as to the type of data
expected for an attribute. Similar entities have been defined for
"%Color;", "%Charset;", "%Length;", "%Pixels;", etc.
Boolean attributes
Some attributes play the role of boolean variables (e.g., the selected
attribute for the OPTION element). Their appearance in the start tag
of an element implies that the value of the attribute is "true". Their
absence implies a value of "false".
Boolean attributes may legally take a single value: the name of the
attribute itself (e.g., selected="selected").
This example defines the selected attribute to be a boolean attribute.
selected (selected) #IMPLIED -- option is pre-selected --
The attribute is set to "true" by appearing in the element's start
tag:
   <OPTION selected="selected">
   ...contents...
   </OPTION>
In HTML, boolean attributes may appear in minimized form -- the
attribute's value appears alone in the element's start tag. Thus,
selected may be set by writing:
   <OPTION selected>
instead of:
   <OPTION selected="selected">
Authors should be aware that many user agents only recognize the
minimized form of boolean attributes and not the full form.
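The following Python sketch shows one way a tool might treat both forms
alike, by testing only for the presence of the attribute. It uses the
standard library module html.parser, which is not an SGML parser and is
used here only for illustration; the class AttributeCollector and the
function is_selected are hypothetical names.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record (tag, attributes) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names; a minimized
        # attribute is reported with the value None.
        self.start_tags.append((tag, dict(attrs)))


def is_selected(markup: str) -> bool:
    """Treat the presence of `selected` as "true", whatever its value."""
    parser = AttributeCollector()
    parser.feed(markup)
    parser.close()
    return any("selected" in attrs for _tag, attrs in parser.start_tags)


print(is_selected("<OPTION selected>"))             # True
print(is_selected('<OPTION selected="selected">'))  # True
print(is_selected("<OPTION>"))                      # False
```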
4 Conformance: requirements and recommendations
Contents
1. Definitions
2. SGML
3. The text/html content type
In this section, we begin the specification of HTML 4, starting with
the contract between authors, documents, users, and user agents.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119]. However, for
readability, these words do not appear in all uppercase letters in
this specification.
At times, the authors of this specification recommend good practice
for authors and user agents. These recommendations are not normative
and conformance with this specification does not depend on their
realization. These recommendations contain the expression "We
recommend ...", "This specification recommends ...", or some similar
wording.
4.1 Definitions
HTML document
An HTML document is an SGML document that meets the constraints
of this specification.
Author
An author is a person or program that writes or generates HTML
documents. An authoring tool is a special case of an author,
namely, it's a program that generates HTML.
We recommend that authors write documents that conform to the
strict DTD rather than the other DTDs defined by this
specification. Please see the section on version information
for details about the DTDs defined in HTML 4.
User
A user is a person who interacts with a user agent to view,
hear, or otherwise use a rendered HTML document.
HTML user agent
An HTML user agent is any device that interprets HTML
documents. User agents include visual browsers (text-only and
graphical), non-visual browsers (audio, Braille), search
robots, proxies, etc.
A conforming user agent for HTML 4 is one that observes the
mandatory conditions ("must") set forth in this specification,
including the following points:
+ A user agent should avoid imposing arbitrary length limits on
attribute value literals (see the section on capacities in
the SGML Declaration). For introductory information on SGML
attributes, please consult the section on attribute
definitions.
+ A user agent must ensure that rendering is unchanged by the
presence or absence of start tags and end tags when the HTML
DTD indicates that these are optional. See the section on
element definitions for introductory information on SGML
elements.
+ For reasons of backwards compatibility, we recommend that
tools interpreting HTML 4 continue to support HTML 3.2 (see
[HTML32]) and HTML 2.0 (see [RFC1866]).
Error conditions
This specification does not define how conforming user agents
handle general error conditions, including how user agents
behave when they encounter elements, attributes, attribute
values, or entities not specified in this document.
However, for recommended error handling behavior, please
consult the notes on invalid documents.
Deprecated
A deprecated element or attribute is one that has been outdated
by newer constructs. Deprecated elements are defined in the
reference manual in appropriate locations, but are clearly
marked as deprecated. Deprecated elements may become obsolete
in future versions of HTML.
User agents should continue to support deprecated elements for
reasons of backward compatibility.
Definitions of elements and attributes clearly indicate which
are deprecated.
This specification includes examples that illustrate how to
avoid using deprecated elements. In most cases these depend on
user agent support for style sheets. In general, authors should
use style sheets to achieve stylistic and formatting effects
rather than HTML presentational attributes. HTML presentational
attributes have been deprecated when style sheet alternatives
exist (see, for example, [CSS1]).
Obsolete
An obsolete element or attribute is one for which there is no
guarantee of support by a user agent. Obsolete elements are no
longer defined in the specification, but are listed for
historical purposes in the changes section of the reference
manual.
4.2 SGML
HTML 4 is an SGML application conforming to International Standard ISO
8879 -- Standard Generalized Markup Language SGML (defined in
[ISO8879]).
Examples in the text conform to the strict document type definition
unless the example in question refers to elements or attributes only
defined by the transitional document type definition or frameset
document type definition. For the sake of brevity, most of the
examples in this specification do not begin with the document type
declaration that is mandatory at the beginning of each HTML document.
DTD fragments in element definitions come from the strict document
type definition except for the elements related to frames.
Please consult the section on HTML version information for details
about when to use the strict, transitional, or frameset DTD.
Comments appearing in the HTML 4 DTD have no normative value; they are
informative only.
User agents must not render SGML processing instructions (e.g.,
<?full volume>) or comments. For more information about this and other SGML
features that may be legal in HTML but aren't widely supported by HTML
user agents, please consult the section on SGML features with limited
support.
4.3 The text/html content type
HTML documents are sent over the Internet as a sequence of bytes
accompanied by encoding information (described in the section on
character encodings). The structure of the transmission, termed a
message entity, is defined by [RFC2045] and [RFC2616]. A message
entity with a content type of "text/html" represents an HTML document.
The content type for HTML documents is defined as follows:
Content type name:
text
Content subtype name:
html
Required parameters:
none
Optional parameters:
charset
Encoding considerations:
any encoding is allowed
Security considerations:
See the notes on security
The optional parameter "charset" refers to the character encoding used
to represent the HTML document as a sequence of bytes. Legal values
for this parameter are defined in the section on character encodings.
Although this parameter is optional, we recommend that it always be
present.
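As an illustration of how a receiver might read the content type and
its optional "charset" parameter, the following Python sketch parses a
Content-Type value with the standard library class
email.message.Message. The function name parse_content_type is
hypothetical, and the header values are examples only.

```python
from email.message import Message


def parse_content_type(header_value: str):
    """Return (content type, charset or None) for a Content-Type value."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_type() returns "type/subtype" in lower case;
    # get_content_charset() returns the charset in lower case, or None
    # when the parameter is absent.
    return message.get_content_type(), message.get_content_charset()


print(parse_content_type("text/html; charset=ISO-8859-1"))
# ('text/html', 'iso-8859-1')
print(parse_content_type("text/html"))
# ('text/html', None)
```

The second call shows why the parameter is worth sending: without it,
the receiver learns nothing from the header about the encoding.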
5 HTML Document Representation
Contents
1. The Document Character Set
2. Character encodings
1. Choosing an encoding
o Notes on specific encodings
2. Specifying the character encoding
3. Character references
1. Numeric character references
2. Character entity references
4. Undisplayable characters
In this chapter, we discuss how HTML documents are represented on a
computer and over the Internet.
The section on the document character set addresses the issue of what
abstract characters may be part of an HTML document. Characters
include the Latin letter "A", the Cyrillic letter "I", the Chinese
character meaning "water", etc.
The section on character encodings addresses the issue of how those
characters may be represented in a file or when transferred over the
Internet. As some character encodings cannot directly represent all
characters an author may want to include in a document, HTML offers
other mechanisms, called character references, for referring to any
character.
Since there are a great number of characters throughout human
languages, and a great variety of ways to represent those characters,
proper care must be taken so that documents may be understood by user
agents around the world.
5.1 The Document Character Set
To promote interoperability, SGML requires that each application
(including HTML) specify its document character set. A document
character set consists of:
* A Repertoire: A set of abstract characters, such as the Latin
letter "A", the Cyrillic letter "I", the Chinese character meaning
"water", etc.
* Code positions: A set of integer references to characters in the
repertoire.
Each SGML document (including each HTML document) is a sequence of
characters from the repertoire. Computer systems identify each
character by its code position; for example, in the ASCII character
set, code positions 65, 66, and 67 refer to the characters 'A', 'B',
and 'C', respectively.
The ASCII character set is not sufficient for a global information
system such as the Web, so HTML uses the much more complete character
set called the Universal Character Set (UCS), defined in [ISO10646].
This standard defines a repertoire of thousands of characters used by
communities all over the world.
The character set defined in [ISO10646] is character-by-character
equivalent to Unicode ([UNICODE]). Both of these standards are updated
from time to time with new characters, and the amendments should be
consulted at the respective Web sites. In the current specification,
"[ISO10646]" is used to refer to the document character set while
"[UNICODE]" is reserved for references to the Unicode bidirectional
text algorithm.
The document character set, however, does not suffice to allow user
agents to correctly interpret HTML documents as they are typically
exchanged -- encoded as a sequence of bytes in a file or during a
network transmission. User agents must also know the specific
character encoding that was used to transform the document character
stream into a byte stream.
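The difference between a code position and its encoded bytes can be
seen in a short Python sketch. Python strings are sequences of Unicode
code positions, so ord() reports the position and str.encode() produces
the bytes for a given character encoding. The helper encoded_bytes is a
hypothetical name used only for this illustration.

```python
def encoded_bytes(character: str, encoding: str) -> bytes:
    """Return the byte sequence for `character` in `encoding`."""
    return character.encode(encoding)


# Code positions: the integer assigned to each abstract character.
assert ord("A") == 65
assert ord("\u0418") == 1048    # Cyrillic capital letter "I"
assert ord("\u6c34") == 0x6C34  # Chinese character meaning "water"

# The same code position maps to different bytes in different encodings.
assert encoded_bytes("\u0418", "iso-8859-5") == b"\xb8"
assert encoded_bytes("\u0418", "utf-8") == b"\xd0\x98"
assert encoded_bytes("\u6c34", "utf-8") == b"\xe6\xb0\xb4"

# Decoding bytes with the wrong encoding does not recover the character.
assert b"\xd0\x98".decode("iso-8859-1") != "\u0418"
print("all checks passed")
```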
5.2 Character encodings
What this specification calls a character encoding is known by
different names in other specifications (which may cause some
confusion). However, the concept is largely the same across the
Internet. Also, protocol headers, attributes, and parameters referring
to character encodings share the same name -- "charset" -- and use the
same values from the [IANA] registry (see [CHARSETS] for a complete
list).
The "charset" parameter identifies a character encoding, which is a
method of converting a sequence of bytes into a sequence of
characters. This conversion fits naturally with the scheme of Web
activity: servers send HTML documents to user agents as a stream of
bytes; user agents interpret them as a sequence of characters. The
conversion method can range from simple one-to-one correspondence to
complex switching schemes or algorithms.
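The following sketch, in Python and with a hypothetical helper name, suggests why the "charset" label matters: the same byte is interpreted as a different character under two single-byte encodings, while UTF-8 uses two bytes for a character that ISO-8859-1 stores in one.

```python
# One byte, two interpretations, depending on the character encoding.
RAW = b"\xe5"


def decode(data, charset):
    """Convert a byte sequence into characters using the named encoding."""
    return data.decode(charset)


latin1_char = decode(RAW, "iso-8859-1")    # U+00E5, small a with ring
cyrillic_char = decode(RAW, "iso-8859-5")  # U+0445, a Cyrillic small letter

# UTF-8 uses a variable number of bytes per character.
utf8_bytes = "\u00e5".encode("utf-8")
```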
A simple one-byte-per-character encoding technique is not sufficient
for text strings over a character repertoire as large as [ISO10646].
There are several different encodings of parts of [ISO10646] in
addition to encodings of the entire character set (such as UCS-4).
5.2.1 Choosing an encoding
Authoring tools (e.g., text editors) may encode HTML documents in the
character encoding of their choice, and the choice largely depends on
the conventions used by the system software. These tools may employ
any convenient encoding that covers most of the characters contained
in the document, provided the encoding is correctly labeled.
Occasional characters that fall outside this encoding may still be
represented by character references. These always refer to the
document character set, not the character encoding.
Servers and proxies may change a character encoding (called
transcoding) on the fly to meet the requests of user agents (see
section 14.2 of [RFC2616], the "Accept-Charset" HTTP request header).
Servers and proxies do not have to serve a document in a character
encoding that covers the entire document character set.
Commonly used character encodings on the Web include ISO-8859-1 (also
referred to as "Latin-1"; usable for most Western European languages),
ISO-8859-5 (which supports Cyrillic), SHIFT_JIS (a Japanese encoding),
EUC-JP (another Japanese encoding), and UTF-8 (an encoding of ISO
10646 using a different number of bytes for different characters).
Names for character encodings are case-insensitive, so that for
example "SHIFT_JIS", "Shift_JIS", and "shift_jis" are equivalent.
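A minimal sketch of a case-insensitive comparison of encoding names, assuming the names are plain ASCII labels; the function name is hypothetical and no alias resolution against the IANA registry is attempted:

```python
def same_charset_name(name_a, name_b):
    """Compare two character encoding names without regard to case."""
    return name_a.lower() == name_b.lower()
```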
This specification does not mandate which character encodings a user
agent must support.
Conforming user agents must correctly map to ISO 10646 all characters
in any character encodings that they recognize (or they must behave as
if they did).
Notes on specific encodings
When HTML text is transmitted in UTF-16 (charset=UTF-16), text data
should be transmitted in network byte order ("big-endian", high-order
byte first) in accordance with [ISO10646], Section 6.3 and [UNICODE],
clause C3, page 3-1.
Furthermore, to maximize chances of proper interpretation, it is
recommended that documents transmitted as UTF-16 always begin with a
ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called
Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal
FFFE, a character guaranteed never to be assigned. Thus, a user agent
receiving a hexadecimal FFFE as the first bytes of a text would know
that bytes have to be reversed for the remainder of the text.
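The byte-order check described above might be sketched as follows. The function names are hypothetical, and falling back to big-endian when no BOM is present is an assumption based on the network byte order recommendation, not a requirement stated here.

```python
def utf16_byte_order(data):
    """Guess the byte order of UTF-16 data from its first two bytes.

    Returns "big" for FE FF, "little" for FF FE (a byte-reversed BOM),
    and None when the data does not start with a BOM.
    """
    if data[:2] == b"\xfe\xff":
        return "big"
    if data[:2] == b"\xff\xfe":
        return "little"
    return None


def decode_utf16(data):
    """Decode UTF-16 data, honoring a leading BOM if there is one."""
    order = utf16_byte_order(data)
    if order == "little":
        return data[2:].decode("utf-16-le")
    if order == "big":
        return data[2:].decode("utf-16-be")
    # Assumption: no BOM means network byte order (big-endian).
    return data.decode("utf-16-be")
```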
The UTF-1 transformation format of [ISO10646] (registered by IANA as
ISO-10646-UTF-1) should not be used. For information about ISO 8859-8
and the bidirectional algorithm, please consult the section on
bidirectionality and character encoding.
5.2.2 Specifying the character encoding
How does a server determine which character encoding applies for a
document it serves? Some servers examine the first few bytes of the
document, or check against a database of known files and encodings.
Many modern servers give Web masters more control over charset
configuration than old servers do. Web masters should use these
mechanisms to send out a "charset" parameter whenever possible, but
should take care not to identify a document with the wrong "charset"
parameter value.
How does a user agent know which character encoding has been used? The
server should provide this information. The most straightforward way
for a server to inform the user agent about the character encoding of
the document is to use the "charset" parameter of the "Content-Type"
header field of the HTTP protocol ([RFC2616], sections 3.4 and 14.17).
For example, the following HTTP header announces that the character
encoding is EUC-JP:
Content-Type: text/html; charset=EUC-JP
Please consult the section on conformance for the definition of
text/html.
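One way a client might extract the "charset" parameter from a "Content-Type" value is sketched below using Python's standard email.message module. The function name is hypothetical, and the result is lowercased by the library, which is harmless because encoding names are case-insensitive.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the parameter in lowercase,
    # or None when the header carries no charset parameter.
    return msg.get_content_charset()
```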
The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a
default character encoding when the "charset" parameter is absent from
the "Content-Type" header field. In practice, this recommendation has
proved useless because some servers don't allow a "charset" parameter
to be sent, and others may not be configured to send the parameter.
Therefore, user agents must not assume any default value for the
"charset" parameter.
To address server or configuration limitations, HTML documents may
include explicit information about the document's character encoding;
the META element can be used to provide user agents with this
information.
For example, to specify that the character encoding of the current
document is "EUC-JP", a document should include the following META
declaration:
<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
The META declaration must only be used when the character encoding is
organized such that ASCII-valued bytes stand for ASCII characters (at
least until the META element is parsed). META declarations should
appear as early as possible in the HEAD element.
For cases where neither the HTTP protocol nor the META element
provides information about the character encoding of a document, HTML
also provides the charset attribute on several elements. By combining
these mechanisms, an author can greatly improve the chances that, when
the user retrieves a resource, the user agent will recognize the
character encoding.
To sum up, conforming user agents must observe the following
priorities when determining a document's character encoding (from
highest priority to lowest):
1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a
value set for "charset".
3. The charset attribute set on an element that designates an
external resource.
In addition to this list of priorities, the user agent may use
heuristics and user settings. For example, many user agents use a
heuristic to distinguish the various encodings used for Japanese text.
Also, user agents typically have a user-definable, local default
character encoding which they apply in the absence of other
indicators.
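The priority order above can be sketched as a simple selection function. All parameter names are hypothetical, heuristics are left out, and the local default stands for the user-definable setting mentioned in the preceding paragraph.

```python
def pick_charset(http_charset=None, meta_charset=None,
                 link_charset=None, local_default=None):
    """Choose a character encoding following the stated priorities.

    1. the HTTP "charset" parameter,
    2. a META declaration,
    3. the charset attribute on an element designating the resource.
    The local default applies only when none of these is available.
    """
    for candidate in (http_charset, meta_charset, link_charset):
        if candidate:
            return candidate.lower()
    return local_default
```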
User agents may provide a mechanism that allows users to override
incorrect "charset" information. However, if a user agent offers such
a mechanism, it should only offer it for browsing and not for editing,
to avoid the creation of Web pages marked with an incorrect "charset"
parameter.
Note. If, for a specific application, it becomes necessary to refer to
characters outside [ISO10646], characters should be assigned to a
private zone to avoid conflicts with present or future versions of the
standard. This is highly discouraged, however, for reasons of
portability.
5.3 Character references
A given character encoding may not be able to express all characters
of the document character set. For such encodings, or when hardware or
software configurations do not allow users to input some document
characters directly, authors may use SGML character references.
Character references are a character encoding-independent mechanism
for entering any character from the document character set.
Character references in HTML may appear in two forms:
* Numeric character references (either decimal or hexadecimal).
* Character entity references.
Character references within comments have no special meaning; they are
comment data only.
Note. HTML provides other ways to present character data, in
particular inline images.
Note. In SGML, it is possible to eliminate the final ";" after a
character reference in some cases (e.g., at a line break or
immediately before a tag). In other circumstances it may not be
eliminated (e.g., in the middle of a word). We strongly suggest using
the ";" in all cases to avoid problems with user agents that require
this character to be present.
5.3.1 Numeric character references
Numeric character references specify the code position of a character
in the document character set. Numeric character references may take
two forms:
* The syntax "&#D;", where D is a decimal number, refers to the ISO
10646 decimal character number D.
* The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number,
refers to the ISO 10646 hexadecimal character number H.
Hexadecimal numbers in numeric character references are
case-insensitive.
Here are some examples of numeric character references:
* &#229; (in decimal) represents the letter "a" with a small circle
above it (used, for example, in Norwegian).
* &#xE5; (in hexadecimal) represents the same character.
* &#Xe5; (in hexadecimal) represents the same character as well.
* &#1048; (in decimal) represents the Cyrillic capital letter "I".
* &#x6C34; (in hexadecimal) represents the Chinese character for
water.
Note. Although the hexadecimal representation is not defined in
[ISO8879], it is expected to be in the revision, as described in
[WEBSGML]. This convention is particularly useful since character
standards generally use hexadecimal representations.
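A minimal sketch of how numeric character references could be expanded, assuming the decimal form "&#" followed by digits and ";" and the hexadecimal form "&#x" or "&#X" followed by hexadecimal digits and ";". The function name is hypothetical, and references without the final ";" are deliberately not handled.

```python
import re

# Group 1 captures hexadecimal digits, group 2 captures decimal digits.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def expand_numeric_refs(text):
    """Replace numeric character references with the characters they name."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)
    return _NUMERIC_REF.sub(replace, text)
```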
5.3.2 Character entity references
In order to give authors a more intuitive way of referring to
characters in the document character set, HTML offers a set of
character entity references. Character entity references use symbolic
names so that authors need not remember code positions. For example,
the character entity reference &aring; refers to the lowercase "a"
character topped with a ring; "&aring;" is easier to remember than
&#229;.
HTML 4 does not define a character entity reference for every
character in the document character set. For instance, there is no
character entity reference for the Cyrillic capital letter "I". Please
consult the full list of character references defined in HTML 4.
Character entity references are case-sensitive. Thus, &Aring; refers
to a different character (uppercase A, ring) than &aring; (lowercase
a, ring).
Four character entity references deserve special mention since they
are frequently used to escape special characters:
* "&lt;" represents the < sign.
* "&gt;" represents the > sign.
* "&amp;" represents the & sign.
* "&quot;" represents the " mark.
Authors wishing to put the "<" character in text should use "&lt;"
(ASCII decimal 60) to avoid possible confusion with the beginning of a
tag (start tag open delimiter). Similarly, authors should use "&gt;"
(ASCII decimal 62) in text instead of ">" to avoid problems with older
user agents that incorrectly perceive this as the end of a tag (tag
close delimiter) when it appears in quoted attribute values.
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid
confusion with the beginning of a character reference (entity
reference open delimiter). Authors should also use "&amp;" in
attribute values since character references are allowed within CDATA
attribute values.
Some authors use the character entity reference "&quot;" to encode
instances of the double quote mark (") since that character may be
used to delimit attribute values.
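The escaping practice described above might be automated along these lines; the function name is hypothetical. The ampersand is handled first so that the ampersands introduced by the other replacements are not escaped a second time.

```python
def escape_markup(text):
    """Escape the four characters that have special meaning in markup."""
    return (text.replace("&", "&amp;")   # must come first
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace('"', "&quot;"))
```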
5.4 Undisplayable characters
A user agent may not be able to render all characters in a document
meaningfully, for instance, because the user agent lacks a suitable
font, a character has a value that may not be expressed in the user
agent's internal character encoding, etc.
Because there are many different things that may be done in such
cases, this document does not prescribe any specific behavior.
Depending on the implementation, undisplayable characters may also be
handled by the underlying display system and not the application
itself. In the absence of more sophisticated behavior, for example
tailored to the needs of a particular script or language, we recommend
the following behavior for user agents:
1. Adopt a clearly visible, but unobtrusive mechanism to alert the
user of missing resources.
2. If missing characters are presented using their numeric
representation, use the hexadecimal (not decimal) form since this
is the form used in character set standards.
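As a sketch of the second recommendation, a user agent could substitute a hexadecimal placeholder for each character it cannot display. The function name, the bracketed "U+" notation, and the can_display callback are all assumptions made for illustration.

```python
def render_with_fallback(text, can_display):
    """Replace undisplayable characters by a hexadecimal placeholder.

    can_display is a caller-supplied function that reports whether
    a single character can be rendered.
    """
    pieces = []
    for ch in text:
        if can_display(ch):
            pieces.append(ch)
        else:
            # Hexadecimal, not decimal, as used in character set standards.
            pieces.append("[U+%04X]" % ord(ch))
    return "".join(pieces)
```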
6 Basic HTML data types
Contents
1. Case information
2. SGML basic types
3. Text strings
4. URIs
5. Colors
1. Notes on using colors
6. Lengths
7. Content types (MIME types)
8. Language codes
9. Character encodings
10. Single characters
11. Dates and times
12. Link types
13. Media descriptors
14. Script data
15. Style sheet data
16. Frame target names
This section of the specification describes the basic data types that
may appear as an element's content or an attribute's value.
For introductory information about reading the HTML DTD, please
consult the SGML tutorial.
6.1 Case information
Each attribute definition includes information about the
case-sensitivity of its values. The case information is presented with
the following keys:
CS
The value is case-sensitive (i.e., user agents interpret "a"
and "A" differently).
CI
The value is case-insensitive (i.e., user agents interpret "a"
and "A" as the same).
CN
The value is not subject to case changes, e.g., because it is a
number or a character from the document character set.
CA
The element or attribute definition itself gives case
information.
CT
Consult the type definition for details about case-sensitivity.
If an attribute value is a list, the keys apply to every value in the
list, unless otherwise indicated.
6.2 SGML basic types
The document type definition specifies the syntax of HTML element
content and attribute values using SGML tokens (e.g., PCDATA, CDATA,
NAME, ID, etc.). See [ISO8879] for their full definitions. The
following is a summary of key information:
* CDATA is a sequence of characters from the document character set
and may include character entities. User agents should interpret
attribute values as follows:
+ Replace character entities with characters,
+ Ignore line feeds,
+ Replace each carriage return or tab with a single space.
User agents may ignore leading and trailing white space in CDATA
attribute values (e.g., " myval " may be interpreted as
"myval"). Authors should not declare attribute values with leading
or trailing white space.
For some HTML 4 attributes with CDATA attribute values, the
specification imposes further constraints on the set of legal
values for the attribute that may not be expressed by the DTD.
Although the STYLE and SCRIPT elements use CDATA for their data
model, for these elements, CDATA must be handled differently by
user agents. Markup and entities must be treated as raw text and
passed to the application as is. The first occurrence of the
character sequence "</" (end-tag open delimiter) is treated as
terminating the end of the element's content. In valid documents,
this would be the end tag for the element.
* ID and NAME tokens must begin with a letter ([A-Za-z]) and may be
followed by any number of letters, digits ([0-9]), hyphens ("-"),
underscores ("_"), colons (":"), and periods (".").
* IDREF and IDREFS are references to ID tokens defined by other
attributes. IDREF is a single token and IDREFS is a
space-separated list of tokens.
* NUMBER tokens must contain at least one digit ([0-9]).
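The token rules summarized above can be sketched as follows. The function names are hypothetical, and two points are assumptions: NUMBER tokens are taken to consist of digits only, and leading and trailing white space is trimmed from CDATA attribute values, which user agents are permitted but not required to do. Replacement of character entities is omitted.

```python
import re

_NAME_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9\-_:.]*\Z")
_NUMBER_TOKEN = re.compile(r"[0-9]+\Z")


def is_name_token(value):
    """Check the ID/NAME rule: a letter, then letters, digits, - _ : or ."""
    return _NAME_TOKEN.match(value) is not None


def is_number_token(value):
    """Check that a NUMBER token is made of one or more digits."""
    return _NUMBER_TOKEN.match(value) is not None


def normalize_cdata_attribute(value):
    """Ignore line feeds and map each carriage return or tab to a space."""
    value = value.replace("\n", "")
    value = value.replace("\r", " ").replace("\t", " ")
    return value.strip()  # optional trimming, see the lead-in
```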
6.3 Text strings
A number of attributes ( %Text; in the DTD) take text that is meant to
be "human readable". For introductory information about attributes,
please consult the tutorial discussion of attributes.
6.4 URIs
This specification uses the term URI as defined in [URI] (see also
[RFC1630]).
Note that URIs include URLs (as defined in [RFC1738] and [RFC1808]).
Relative URIs are resolved to full URIs using a base URI. [RFC1808],
section 3, defines the normative algorithm for this process. For more
information about base URIs, please consult the section on base URIs
in the chapter on links.
URIs are represented in the DTD by the parameter entity %URI;.
URIs in general are case-sensitive. There may be URIs, or parts of
URIs, where case doesn't matter (e.g., machine names), but identifying
these may not be easy. Users should always consider that URIs are
case-sensitive (to be on the safe side).
Please consult the appendix for information about non-ASCII characters
in URI attribute values.
6.5 Colors
The attribute value type "color" (%Color;) refers to color definitions
as specified in [SRGB]. A color value may either be a hexadecimal
number (prefixed by a hash mark) or one of the following sixteen color
names. The color names are case-insensitive.
CAPTION: Color names and sRGB values
Black   = "#000000"    Green  = "#008000"
Silver  = "#C0C0C0"    Lime   = "#00FF00"
Gray    = "#808080"    Olive  = "#808000"
White   = "#FFFFFF"    Yellow = "#FFFF00"
Maroon  = "#800000"    Navy   = "#000080"
Red     = "#FF0000"    Blue   = "#0000FF"
Purple  = "#800080"    Teal   = "#008080"
Fuchsia = "#FF00FF"    Aqua   = "#00FFFF"
Thus, the color values "#800080" and "Purple" both refer to the color
purple.
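A sketch of how a color value might be resolved to red, green, and blue components. The names are hypothetical, and the hexadecimal form is assumed to have exactly six digits, as in the table above.

```python
import re

COLOR_NAMES = {
    "black": "#000000", "green": "#008000",
    "silver": "#C0C0C0", "lime": "#00FF00",
    "gray": "#808080", "olive": "#808000",
    "white": "#FFFFFF", "yellow": "#FFFF00",
    "maroon": "#800000", "navy": "#000080",
    "red": "#FF0000", "blue": "#0000FF",
    "purple": "#800080", "teal": "#008080",
    "fuchsia": "#FF00FF", "aqua": "#00FFFF",
}

_HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}\Z")


def to_rgb(value):
    """Return (red, green, blue) for a color name or "#RRGGBB" value."""
    key = value.strip().lower()          # color names are case-insensitive
    hex_value = COLOR_NAMES.get(key, key)
    if not _HEX_COLOR.match(hex_value):
        raise ValueError("not a recognized color: %r" % value)
    return tuple(int(hex_value[i:i + 2], 16) for i in (1, 3, 5))
```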
6.5.1 Notes on using colors
Although colors can add significant amounts of information to
documents and make them more readable, please consider the following
guidelines when including color in your documents:
* The use of HTML elements and attributes for specifying color is
deprecated. You are encouraged to use style sheets instead.
* Don't use color combinations that cause problems for people with
color blindness in its various forms.
* If you use a background image or set the background color, then be
sure to set the various text colors as well.
* Colors specified with the BODY and FONT elements and bgcolor on
tables look different on different platforms (e.g., workstations,
Macs, Windows, and LCD panels vs. CRTs), so you shouldn't rely
entirely on a specific effect. In the future, support for the
[SRGB] color model together with ICC color profiles should
mitigate this problem.
* When practical, adopt common conventions to minimize user
confusion.
6.6 Lengths
HTML specifies three types of length values for attributes:
1. Pixels: The value (%Pixels; in the DTD) is an integer that
represents the number of pixels of the canvas (screen, paper).
Thus, the value "50" means fifty pixels. For normative information
about the definition of a pixel, please consult [CSS1].
2. Length: The value (%Length; in the DTD) may be either a %Pixels; or
a percentage of the available horizontal or vertical space. Thus,
the value "50%" means half of the available space.
3. MultiLength: The value ( %MultiLength; in the DTD) may be a
%Length; or a relative length. A relative length has the form
"i*", where "i" is an integer. When allotting space among elements
competing for that space, user agents allot pixel and percentage
lengths first, then divide up remaining available space among
relative lengths. Each relative length receives a portion of the
available space that is proportional to the integer preceding the
"*". The value "*" is equivalent to "1*". Thus, if 60 pixels of
space are available after the user agent allots pixel and
percentage space, and the competing relative lengths are 1*, 2*,
and 3*, the 1* will be allotted 10 pixels, the 2* will be allotted
20 pixels, and the 3* will be allotted 30 pixels.
Length values are case-neutral.
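The allocation of remaining space among relative lengths can be sketched as below. The function names are hypothetical, and the use of integer division, which may leave a few pixels unallotted, is an assumption: this specification does not say how to round.

```python
def parse_relative(value):
    """Return the integer weight of a relative length such as "3*" or "*"."""
    if not value.endswith("*"):
        raise ValueError("not a relative length: %r" % value)
    digits = value[:-1]
    return int(digits) if digits else 1   # "*" is equivalent to "1*"


def allot_relative(available, weights):
    """Divide the available pixels in proportion to the given weights."""
    total = sum(weights)
    if total == 0:
        return [0 for _ in weights]
    return [available * w // total for w in weights]
```

With 60 pixels remaining and the relative lengths 1*, 2*, and 3*, this yields 10, 20, and 30 pixels, matching the example in the text.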
6.7 Content types (MIME types)
Note. A "media type" (defined in [RFC2045] and [RFC2046]) specifies
the nature of a linked resource. This specification employs the term
"content type" rather than "media type" in accordance with current
usage. Furthermore, in this specification, "media type" may refer to
the media where a user agent renders a document.
This type is represented in the DTD by %ContentType;.
Content types are case-insensitive.
Examples of content types include "text/html", "image/png",
"image/gif", "video/mpeg", "text/css", and "audio/basic". For the
current list of registered MIME types, please consult [MIMETYPES].
6.8 Language codes
The value of attributes whose type is a language code ( %LanguageCode
in the DTD) refers to a language code as specified by [RFC1766],
section 2. For information on specifying language codes in HTML,
please consult the section on language codes. Whitespace is not
allowed within the language-code.
Language codes are case-insensitive.
6.9 Character encodings
The "charset" attributes (%Charset in the DTD) refer to a character
encoding as described in the section on character encodings. Values
must be strings (e.g., "euc-jp") from the IANA registry (see
[CHARSETS] for a complete list).
Names of character encodings are case-insensitive.
User agents must follow the steps set out in the section on specifying
character encodings in order to determine the character encoding of an
external resource.
6.10 Single characters
Certain attributes call for a single character from the document
character set. These attributes take the %Character type in the DTD.
Single characters may be specified with character references (e.g.,
"&amp;").
6.11 Dates and times
[ISO8601] allows many options and variations in the representation of
dates and times. The current specification uses one of the formats
described in the profile [DATETIME] for its definition of legal
date/time strings ( %Datetime in the DTD).
The format is:
YYYY-MM-DDThh:mm:ssTZD
where:
YYYY = four-digit year
MM = two-digit month (01=January, etc.)
DD = two-digit day of month (01 through 31)
hh = two digits of hour (00 through 23) (am/pm NOT allowed)
mm = two digits of minute (00 through 59)
ss = two digits of second (00 through 59)
TZD = time zone designator
The time zone designator is one of:
Z
indicates UTC (Coordinated Universal Time). The "Z" must be
uppercase.
+hh:mm
indicates that the time is a local time which is hh hours and
mm minutes ahead of UTC.
-hh:mm
indicates that the time is a local time which is hh hours and
mm minutes behind UTC.
Exactly the components shown here must be present, with exactly this
punctuation. Note that the "T" appears literally in the string (it
must be uppercase), to indicate the beginning of the time element, as
specified in [ISO8601].
If a generating application does not know the time to the second, it
may use the value "00" for the seconds (and minutes and hours if
necessary).
Note. [DATETIME] does not address the issue of leap seconds.
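A sketch of a checker for the date/time format above. The function name is hypothetical. It verifies the punctuation, the uppercase "T" and "Z", and the numeric ranges listed, but as a simplification it does not check the day against the length of the month or the range of the time zone offset.

```python
import re

_DATETIME = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})"      # YYYY-MM-DD
    r"T([0-9]{2}):([0-9]{2}):([0-9]{2})"     # Thh:mm:ss
    r"(Z|[+-][0-9]{2}:[0-9]{2})\Z"           # TZD
)


def is_html_datetime(value):
    """Return True if value matches YYYY-MM-DDThh:mm:ssTZD."""
    match = _DATETIME.match(value)
    if match is None:
        return False
    month, day, hour, minute, second = (int(g) for g in match.groups()[1:6])
    return (1 <= month <= 12 and 1 <= day <= 31
            and hour <= 23 and minute <= 59 and second <= 59)
```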
6.12 Link types
Authors may use the following recognized link types, listed here with
their conventional interpretations. In the DTD, %LinkTypes refers to a
space-separated list of link types. White space characters are not
permitted within link types.
These link types are case-insensitive, i.e., "Alternate" has the same
meaning as "alternate".
User agents, search engines, etc. may interpret these link types in a
variety of ways. For example, user agents may provide access to linked
documents through a navigation bar.
Alternate
Designates substitute versions for the document in which the
link occurs. When used together with the lang attribute, it
implies a translated version of the document. When used
together with the media attribute, it implies a version
designed for a different medium (or media).
Stylesheet
Refers to an external style sheet. See the section on external
style sheets for details. This is used together with the link
type "Alternate" for user-selectable alternate style sheets.
Start
Refers to the first document in a collection of documents. This
link type tells search engines which document is considered by
the author to be the starting point of the collection.
Next
Refers to the next document in a linear sequence of documents.
User agents may choose to preload the "next" document, to
reduce the perceived load time.
Prev
Refers to the previous document in an ordered series of
documents. Some user agents also support the synonym
"Previous".
Contents
Refers to a document serving as a table of contents. Some user
agents also support the synonym ToC (from "Table of Contents").
Index
Refers to a document providing an index for the current
document.
Glossary
Refers to a document providing a glossary of terms that pertain
to the current document.
Copyright
Refers to a copyright statement for the current document.
Chapter
Refers to a document serving as a chapter in a collection of
documents.
Section
Refers to a document serving as a section in a collection of
documents.
Subsection
Refers to a document serving as a subsection in a collection of
documents.
Appendix
Refers to a document serving as an appendix in a collection of
documents.
Help
Refers to a document offering help (more information, links to
other sources of information, etc.).
Bookmark
Refers to a bookmark. A bookmark is a link to a key entry point
within an extended document. The title attribute may be used,
for example, to label the bookmark. Note that several bookmarks
may be defined in each document.
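Since a link type value is a space-separated, case-insensitive list, a consumer might split it as sketched below. The names are hypothetical; unrecognized types are returned separately rather than discarded, because authors may define additional link types.

```python
RECOGNIZED_LINK_TYPES = {
    "alternate", "stylesheet", "start", "next", "prev", "contents",
    "index", "glossary", "copyright", "chapter", "section",
    "subsection", "appendix", "help", "bookmark",
}


def split_link_types(value):
    """Return (recognized, other) link types, both lowercased."""
    tokens = [token.lower() for token in value.split()]
    recognized = [t for t in tokens if t in RECOGNIZED_LINK_TYPES]
    other = [t for t in tokens if t not in RECOGNIZED_LINK_TYPES]
    return recognized, other
```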
Authors may wish to define additional link types not described in this
specification. If they do so, they should use a profile to cite the
conventions used to define the link types. Please see the profile
attribute of the HEAD element for more details.
For further discussions about link types, please consult the section
on links in HTML documents.
6.13 Media descriptors
The following is a list of recognized media descriptors ( %MediaDesc
in the DTD).
screen
Intended for non-paged computer screens.
tty
Intended for media using a fixed-pitch character grid, such as
teletypes, terminals, or portable devices with limited display
capabilities.
tv
Intended for television-type devices (low resolution, color,
limited scrollability).
projection
Intended for projectors.
handheld
Intended for handheld devices (small screen, monochrome,
bitmapped graphics, limited bandwidth).
print
Intended for paged, opaque material and for documents viewed on
screen in print preview mode.
braille
Intended for braille tactile feedback devices.
aural
Intended for speech synthesizers.
all
Suitable for all devices.
Future versions of HTML may introduce new values and may allow
parameterized values. To facilitate the introduction of these
extensions, conforming user agents must be able to parse the media
attribute value as follows:
1. The value is a comma-separated list of entries. For example,
media="screen, 3d-glasses, print and resolution > 90dpi"
is mapped to:
"screen"
"3d-glasses"
"print and resolution > 90dpi"
2. Each entry is truncated just before the first character that isn't
a US ASCII letter [a-zA-Z] (ISO 10646 hex 41-5a, 61-7a), digit
[0-9] (hex 30-39), or hyphen (hex 2d). In the example, this gives:
"screen"
"3d-glasses"
"print"
3. A case-sensitive match is then made with the set of media types
defined above. User agents may ignore entries that don't match. In
the example we are left with screen and print.
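The three parsing steps above can be sketched as follows. The names are hypothetical, and trimming white space around each entry is an assumption inferred from the example, where " 3d-glasses" is shown as "3d-glasses".

```python
import re

KNOWN_MEDIA = {
    "screen", "tty", "tv", "projection", "handheld",
    "print", "braille", "aural", "all",
}

# Leading run of US ASCII letters, digits, and hyphens.
_LEADING = re.compile(r"[A-Za-z0-9-]*")


def parse_media(value):
    """Return the recognized media descriptors in a media attribute value."""
    result = []
    for entry in value.split(","):                # step 1: split on commas
        entry = entry.strip()
        descriptor = _LEADING.match(entry).group(0)   # step 2: truncate
        if descriptor in KNOWN_MEDIA:             # step 3: case-sensitive match
            result.append(descriptor)
    return result
```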
Note. Style sheets may include media-dependent variations within them
(e.g., the CSS @media construct). In such cases it may be appropriate
to use "media=all".
6.14 Script data
Script data ( %Script; in the DTD) can be the content of the SCRIPT
element and the value of intrinsic event attributes. User agents must
not evaluate script data as HTML markup but instead must pass it on as
data to a script engine.
The case-sensitivity of script data depends on the scripting language.
Please note that script data that is element content may not contain
character references, but script data that is the value of an
attribute may contain them. The appendix provides further information
about specifying non-HTML data.
6.15 Style sheet data
Style sheet data (%StyleSheet; in the DTD) can be the content of the
STYLE element and the value of the style attribute. User agents must
not evaluate style data as HTML markup.
The case-sensitivity of style data depends on the style sheet
language.
Please note that style sheet data that is element content may not
contain character references, but style sheet data that is the value
of an attribute may contain them. The appendix provides further
information about specifying non-HTML data.
6.16 Frame target names
Except for the reserved names listed below, frame target names
(%FrameTarget; in the DTD) must begin with an alphabetic character
(a-zA-Z). User agents should ignore all other target names.
The following target names are reserved and have special meanings.
_blank
The user agent should load the designated document in a new,
unnamed window.
_self
The user agent should load the document in the same frame as
the element that refers to this target.
_parent
The user agent should load the document into the immediate
FRAMESET parent of the current frame. This value is equivalent
to _self if the current frame has no parent.
_top
The user agent should load the document into the full, original
window (thus canceling all other frames). This value is
equivalent to _self if the current frame has no parent.
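A sketch of the naming rule for frame targets; the function name is hypothetical. Only the first character is checked, since that is the only constraint stated here.

```python
import string

RESERVED_TARGETS = {"_blank", "_self", "_parent", "_top"}


def is_valid_target(name):
    """Return True for a reserved name or a name starting with a-z or A-Z."""
    if name in RESERVED_TARGETS:
        return True
    return bool(name) and name[0] in string.ascii_letters
```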
7 The global structure of an HTML document
Contents
1. Introduction to the structure of an HTML document
2. HTML version information
3. The HTML element
4. The document head
1. The HEAD element
2. The TITLE element
3. The title attribute
4. Meta data
o Specifying meta data
o The META element
o Meta data profiles
5. The document body
1. The BODY element
2. Element identifiers: the id and class attributes
3. Block-level and inline elements
4. Grouping elements: the DIV and SPAN elements
5. Headings: The H1, H2, H3, H4, H5, H6 elements
6. The ADDRESS element
7.1 Introduction to the structure of an HTML document
An HTML 4 document is composed of three parts:
1. a line containing HTML version information,
2. a declarative header section (delimited by the HEAD element),
3. a body, which contains the document's actual content. The body may
be implemented by the BODY element or the FRAMESET element.
White space (spaces, newlines, tabs, and comments) may appear before
or after each section. Sections 2 and 3 should be delimited by the
HTML element.
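Here's an example of a simple HTML document:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<HTML>
   <HEAD>
      <TITLE>My first HTML document</TITLE>
   </HEAD>
   <BODY>
      <P>Hello world!
   </BODY>
</HTML>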
7.2 HTML version information
A valid HTML document declares what version of HTML is used in the
document. The document type declaration names the document type
definition (DTD) in use for the document (see [ISO8879]).
HTML 4.01 specifies three DTDs, so authors must include one of the
following document type declarations in their documents. The DTDs vary
in the elements they support.
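* The HTML 4.01 Strict DTD includes all elements and attributes that
have not been deprecated or do not appear in frameset documents.
For documents that use this DTD, use this document type
declaration:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">
* The HTML 4.01 Transitional DTD includes everything in the strict
DTD plus deprecated elements and attributes (most of which concern
visual presentation). For documents that use this DTD, use this
document type declaration:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
* The HTML 4.01 Frameset DTD includes everything in the transitional
DTD plus frames as well. For documents that use this DTD, use this
document type declaration:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
        "http://www.w3.org/TR/html4/frameset.dtd">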
The URI in each document type declaration allows user agents to
download the DTD and any entity sets that are needed. The following
(relative) URIs refer to DTDs and entity sets for HTML 4:
* "strict.dtd" -- default strict DTD
* "loose.dtd" -- loose DTD
* "frameset.dtd" -- DTD for frameset documents
* "HTMLlat1.ent" -- Latin-1 entities
* "HTMLsymbol.ent" -- Symbol entities
* "HTMLspecial.ent" -- Special entities
The binding between public identifiers and files can be specified
using a catalog file following the format recommended by the Oasis
Open Consortium (see [OASISOPEN]). A sample catalog file for HTML 4.01
is included at the beginning of the section on SGML reference
information for HTML. The last two letters of the declaration indicate
the language of the DTD. For HTML, this is always English ("EN").
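The binding between public identifiers and DTD files can be pictured as a lookup table, in the spirit of a catalog file. The sketch below is illustrative only: the function names are hypothetical, and the table pairs the three HTML 4.01 public identifiers with the relative DTD URIs listed above.

```python
DTD_FILES = {
    "-//W3C//DTD HTML 4.01//EN": "strict.dtd",
    "-//W3C//DTD HTML 4.01 Transitional//EN": "loose.dtd",
    "-//W3C//DTD HTML 4.01 Frameset//EN": "frameset.dtd",
}


def dtd_for(public_id):
    """Return the relative DTD URI bound to a public identifier, or None."""
    return DTD_FILES.get(public_id)


def dtd_language(public_id):
    """Return the language code that ends a public identifier."""
    return public_id.rsplit("//", 1)[-1]
```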
Note. As of the 24 December version of HTML 4.01, the HTML Working
Group commits to the following policy:
* Any changes to future HTML 4 DTDs will not invalidate documents
that conform to the DTDs of the present specification. The HTML
Working Group reserves the right to correct known bugs.
* Software conforming to the DTDs of the present specification may
ignore features of future HTML 4 DTDs that it does not recognize.
This means that in a document type declaration, authors may safely use
a system identifier that refers to the latest version of an HTML 4
DTD. Authors may also choose to use a system identifier that refers to
a specific (dated) version of an HTML 4 DTD when validation to that
particular DTD is required. W3C will make every effort to make
archival documents indefinitely available at their original address in
their original form.
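To illustrate the choice, the two declarations below differ only in the
system identifier. The first refers to the latest HTML 4 strict DTD. The
second is assumed to be the dated address of the strict DTD published
with this Recommendation; it should be checked against the W3C site
before use.

```html
<!-- System identifier for the latest version of the strict DTD -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">

<!-- System identifier for a specific (dated) version; URI is an assumption -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">
```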
7.3 The HTML element
Start tag: optional, End tag: optional
Attribute definitions
version = cdata [CN]
Deprecated. The value of this attribute specifies which HTML
DTD version governs the current document. This attribute has
been deprecated because it is redundant with version
information provided by the document type declaration.
Attributes defined elsewhere
* lang (language information), dir (text direction)
After the document type declaration, the remainder of an HTML document is
contained by the HTML element. Thus, a typical HTML document has this
structure:
...The head, body, etc. goes here...
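A minimal sketch of that structure, assuming the strict DTD; the title
and paragraph text are placeholders:

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">
<HTML>
<HEAD>
<TITLE>Placeholder title</TITLE>
</HEAD>
<BODY>
<P>Placeholder body content.
</BODY>
</HTML>
```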
7.4 The document head
7.4.1 The HEAD element
Start tag: optional, End tag: optional
Attribute definitions
profile = uri [CT]
This attribute specifies the location of one or more meta data
profiles, separated by white space. For future extensions, user
agents should consider the value to be a list even though this
specification only considers the first URI to be significant.
Profiles are discussed below in the section on meta data.
Attributes defined elsewhere
* lang (language information), dir (text direction)
The HEAD element contains information about the current document, such
as its title, keywords that may be useful to search engines, and other
data that is not considered document content. User agents do not
generally render elements that appear in the HEAD as content. They
may, however, make information in the HEAD available to users through
other mechanisms.
7.4.2 The TITLE element
Start tag: required, End tag: required
Attributes defined elsewhere
* lang (language information), dir (text direction)
Every HTML document must have a TITLE element in the HEAD section.
Authors should use the TITLE element to identify the contents of a
document. Since users often consult documents out of context, authors
should provide context-rich titles. Thus, instead of a title such as
"Introduction", which doesn't provide much contextual background,
authors should supply a title such as "Introduction to Medieval
Bee-Keeping".
For reasons of accessibility, user agents must always make the content
of the TITLE element available to users (including TITLE elements that
occur in frames). The mechanism for doing so depends on the user agent
(e.g., as a caption, spoken).
Titles may contain character entities (for accented characters,
special characters, etc.), but may not contain other markup (including
comments). Here is a sample document title:
A study of population dynamics
... other head elements...
... document body...
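In markup, the sample title above might appear as follows. This is a
sketch assuming the strict DTD; the comments stand for content that is
not specified here.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">
<HTML>
<HEAD>
<TITLE>A study of population dynamics</TITLE>
<!-- other head elements -->
</HEAD>
<BODY>
<!-- document body -->
</BODY>
</HTML>
```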
7.4.3 The title attribute
Attribute definitions
title = text [CS]
This attribute offers advisory information about the element
for which it is set.
Unlike the TITLE element, which provides information about an entire
document and may only appear once, the title attribute may annotate
any number of elements. Please consult an element's definition to
verify that it supports this attribute.
Values of the title attribute may be rendered by user agents in a
variety of ways. For instance, visual browsers frequently display the
title as a "tool tip" (a short message that appears when the pointing
device pauses over an object). Audio user agents may speak the title
information in a similar context. For example, setting the attribute
on a link allows user agents (visual and non-visual) to tell users
about the nature of the linked resource:
...some text...
Here's a photo of
me scuba diving last summer
...some more text...
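A sketch of such a link follows. The image URI and the title text are
hypothetical, chosen only to show where the attribute goes.

```html
<BODY>
<P>...some text...
Here's a photo of
<!-- The title value is advisory text a user agent may show or speak -->
<A href="http://example.org/photos/scuba.gif" title="Me scuba diving">
   me scuba diving last summer
</A>
...some more text...
</BODY>
```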
The title attribute has an additional role when used with the LINK
element to designate an external style sheet. Please consult the
section on links and style sheets for details.
Note. To improve the quality of speech synthesis for cases handled
poorly by standard techniques, future versions of HTML may include an
attribute for encoding phonemic and prosodic information.
7.4.4 Meta data
Note. The W3C Resource Description Framework (see [RDF10]) became a
W3C Recommendation in February 1999. RDF allows authors to specify
machine-readable metadata about HTML documents and other
network-accessible resources.
HTML lets authors specify meta data -- information about a document
rather than document content -- in a variety of ways.
For example, to specify the author of a document, one may use the META
element as follows:
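A minimal form of that declaration, using the property name and value
discussed in this section:

```html
<META name="Author" content="Dave Raggett">
```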
The META element specifies a property (here "Author") and assigns a
value to it (here "Dave Raggett").
This specification does not define a set of legal meta data
properties. The meaning of a property and the set of legal values for
that property should be defined in a reference lexicon called a
profile. For example, a profile designed to help search engines index
documents might define properties such as "author", "copyright",
"keywords", etc.
Specifying meta data
In general, specifying meta data involves two steps:
1. Declaring a property and a value for that property. This may be
done in two ways:
1. From within a document, via the META element.
2. From outside a document, by linking to meta data via the LINK
element (see the section on link types).
2. Referring to a profile where the property and its legal values are
defined. To designate a profile, use the profile attribute of the
HEAD element.
Note that since a profile is defined for the HEAD element, the same
profile applies to all META and LINK elements in the document head.
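The two steps can be sketched together as follows. The profile URI, the
title, and the property values are hypothetical; a real profile would
define which property names and values are legal.

```html
<!-- Step 2: the profile attribute designates the profile for the whole head -->
<HEAD profile="http://example.org/profiles/core">
<TITLE>How to complete memorandum cover sheets</TITLE>
<!-- Step 1: properties and values declared from within the document -->
<META name="author" content="John Doe">
<META name="keywords" content="corporate, guidelines, cataloging">
</HEAD>
```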
User agents are not required to support meta data mechanisms. For
those that choose to support meta data, this specification does not
define how meta data should be interpreted.
The META element
Start tag: required, End tag: forbidden
Attribute definitions
For the following attributes, the permitted values and their
interpretation are profile dependent:
name = name [CS]
This attribute identifies a property name. This specification
does not list legal values for this attribute.
content = cdata [CS]
This attribute specifies a property's value. This specification
does not list legal values for this attribute.
scheme = cdata [CS]
This attribute names a scheme to be used to interpret the
property's value (see the section on profiles for details).
http-equiv = name [CI]
This attribute may be used in place of the name attribute. HTTP
servers use this attribute to gather information for HTTP
response message headers.
Attributes defined elsewhere
* lang (language information), dir (text direction)
The META element can be used to identify properties of a document
(e.g., author, expiration date, a list of key words, etc.) and assign
values to those properties. This specification does not define a
normative set of properties.
Each META element specifies a property/value pair. The name attribute
identifies the property and the content attribute specifies the
property's value.
For example, the following declaration sets a value for the Author
property:
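Such a declaration might read as follows, with the name attribute
carrying the property and the content attribute carrying its value:

```html
<META name="Author" content="Dave Raggett">
```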
The lang attribute can be used with META to specify the language for
the value of the content attribute. This enables speech synthesizers
to apply language dependent pronunciation rules.
In this example, the author's name is declared to be French:
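A sketch of such a declaration, using the name of one of this
specification's editors as an illustrative value:

```html
<!-- lang applies to the value of the content attribute -->
<META name="Author" lang="fr" content="Arnaud Le Hors">
```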
Note. The META element is a generic mechanism for specifying meta
data. However, some HTML elements and attributes already handle
certain pieces of meta data and may be used by authors instead of META
to specify those pieces: the TITLE element, the ADDRESS element, the
INS and DEL elements, the title attribute, and the cite attribute.
Note. When a property specified by a META element takes a value that
is a URI, some authors prefer to specify the meta data via the LINK
element. Thus, the following meta data declaration:
might also be written:
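The two equivalent forms can be sketched side by side. The property name
"DC.identifier" and the URI are illustrative choices, not values required
by this specification.

```html
<!-- Property declared with META: the URI is the content value -->
<META name="DC.identifier"
      content="http://www.ietf.org/rfc/rfc1866.txt">

<!-- The same property declared with LINK: the URI moves to href -->
<LINK rel="DC.identifier"
      type="text/plain"
      href="http://www.ietf.org/rfc/rfc1866.txt">
```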
META and HTTP headers
The http-equiv attribute can be used in place of the name attribute
and has a special significance when documents are retrieved via the
Hypertext Transfer Protocol (HTTP). HTTP servers may use the property
name specified by the http-equiv attribute to create an [RFC822]-style
header in the HTTP response. Please see the HTTP specification
([RFC2616]) for details on valid HTTP headers.
The following sample META declaration:
will result in the HTTP header:
Expires: Tue, 20 Aug 1996 14:25:27 GMT
This can be used by caches to determine when to fetch a fresh copy of
the associated document.
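A META declaration that would correspond to the Expires header shown
above can be sketched as follows; the header name goes in http-equiv and
the header value in content:

```html
<META http-equiv="Expires" content="Tue, 20 Aug 1996 14:25:27 GMT">
```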
Note. Some user agents support the use of META to refresh the current
page after a specified number of seconds, with the option of replacing
it by a different URI. Authors should not use this technique to
forward users to different pages, as this makes the page inaccessible
to some users. Instead, automatic page forwarding should be done using
server-side redirects.
META and search engines
A common use for META is to specify keywords that a search engine may
use to improve the quality of search results. When several META
elements provide language-dependent information about a document,
search engines may filter on the lang attribute to display search
results using the language preferences of the user. For example,
<!-- For speakers of US English -->
<!-- For speakers of British English -->
<!-- For speakers of French -->
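A sketch of such language-dependent keyword declarations follows. The
keyword lists are illustrative; only the pairing of each META element
with a lang value matters here.

```html
<!-- For speakers of US English -->
<META name="keywords" lang="en-us"
      content="vacation, Greece, sunshine">
<!-- For speakers of British English -->
<META name="keywords" lang="en"
      content="holiday, Greece, sunshine">
<!-- For speakers of French -->
<META name="keywords" lang="fr"
      content="vacances, Gr&egrave;ce, soleil">
```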
The effectiveness of search engines can also be increased by using the
LINK element to specify links to translations of the document in other
languages, links to versions of the document in other media (e.g.,
PDF), and, when the document is part of a collection, links to an
appropriate starting point for browsing the collection.
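These three uses of LINK might be sketched as follows. The file names
and titles are hypothetical; the rel, hreflang, media, and type
attributes carry the information a search engine could use.

```html
<HEAD>
<TITLE>Reference manual, chapter 2</TITLE>
<!-- A French translation of this document -->
<LINK rel="alternate" type="text/html" hreflang="fr"
      href="manual-fr.html" title="Manuel en fran&ccedil;ais">
<!-- A version of this document for print media -->
<LINK rel="alternate" media="print" type="application/pdf"
      href="manual.pdf" title="Printable version (PDF)">
<!-- Starting point for browsing the collection -->
<LINK rel="start" type="text/html"
      href="index.html" title="First page of the manual">
</HEAD>
```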
Further help is provided in the section on helping search engines
index your Web site.
META and PICS
The Platform for Internet Content Selection (PICS, specified in
[PICS]) is an infrastructure for associating labels (meta data) with
Internet content. Originally designed to help parents and teachers
control what children can access on the Internet, it also facilitates
other uses for labels, including code signing, privacy, and
intellectual property rights management.
This example illustrates how one can use a META declaration to include
a PICS 1.1 label:
... document title ...
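A sketch of such a declaration follows. The rating service URI, dates,
and rating values are illustrative and should be treated as assumptions;
consult [PICS] for the label syntax.

```html
<HEAD>
<!-- The whole PICS label is carried as the content value -->
<META http-equiv="PICS-Label" content='
 (PICS-1.1 "http://www.gcf.org/v2.5"
    labels on "1994.11.05T08:15-0500"
      until "1995.12.31T23:59-0000"
      for "http://w3.org/PICS/Overview.html"
    ratings (suds 0.5 density 0 color/hue 1))
 '>
<TITLE>... document title ...</TITLE>
</HEAD>
```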
META and default information
The META element may be used to specify the default information for a
document in the following instances:
* The default scripting language.
* The default style sheet language.
* The document character encoding.
The following example specifies the character encoding for a document
as being ISO-8859-5:
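A sketch of the character encoding declaration, followed by
declarations for the other two defaults listed above. The script and
style sheet types shown ("text/javascript", "text/css") are example
values, not requirements.

```html
<!-- Document character encoding -->
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-5">

<!-- Default scripting language (example value) -->
<META http-equiv="Content-Script-Type" content="text/javascript">

<!-- Default style sheet language (example value) -->
<META http-equiv="Content-Style-Type" content="text/css">
```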
Meta data profiles
The profile attribute of the HEAD specifies the location of a meta
data profile. The value of the profile attribute is a URI. User agents
may use this URI in two ways:
* As a globally unique name. User agents may be able to recognize
the name (without actually retrieving the profile) and perform
some activity based on known conventions for that profile. For
instance, search engines could provide an interface for searching
through catalogs of HTML documents, where these documents all use
the same profile for representing catalog entries.
* As a link. User agents may dereference