Structured Text

An HTML instance is like a text file, except that some of the characters are interpreted as markup. The markup gives structure to the document.

The instance represents a hierarchy of elements. Each element has a name, some attributes, and some content. Most elements are represented in the document as a start tag, which gives the name and attributes, followed by the content, followed by the end tag. For example:

	<!DOCTYPE HTML PUBLIC
	 	"-//W3 Organization//DTD W3 HTML 2.0//EN">
	<HTML>
	  <HEAD>
	    <TITLE>
	      A sample HTML document
	    </TITLE>
	  </HEAD>

	  <BODY>
	    <H1>
	      An Example of Structure
	      <br>
	      In HTML
	    </H1>
	    <P>
	      Here's a typical paragraph.
	    <UL>
	      <LI>
	        Item one has an
	        <A NAME="anchor">
	          anchor
	        </A>
	      <LI>
	        Here's item two.
	    </UL>
	  </BODY>
	</HTML>

Some elements (e.g. BR) are empty: they have no content and appear in the document as just a start tag.

For the rest of the elements, the content is a sequence of data characters and nested elements. Some elements, such as forms and anchors, cannot be nested; where this is the case, the text says so. Anchors and character highlighting may be put inside other constructs.

Tags

Most elements start and end with tags. Empty elements have no end tag. Start tags are delimited by < and >, and end tags are delimited by </ and >. For example:

	<h1> ... </H1>   <!-- uppercase = lowercase  -->
	<h1 > ... </h1 > <!-- spaces OK before > -->

The following are not valid tags:

	< h1>             <!-- this is not a tag at all -->
	<H1/> <H=1>       <!-- these are markup errors -->
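The tag examples above can be checked mechanically. The following Python sketch recognizes only attribute-free tags and assumes that an element name is a letter followed by letters, digits, periods, or hyphens; the function name `parse_simple_tag` is hypothetical, and the name-length limit is not checked here.

```python
import re

# "<" or "</", immediately followed by a name, then optional whitespace, then ">".
# No whitespace is allowed between the opening delimiter and the name.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)\s*>")


def parse_simple_tag(text):
    """Return (kind, NAME) for an attribute-free tag, or None if it is not one."""
    match = TAG_RE.fullmatch(text)
    if match is None:
        return None
    kind = "end" if match.group(1) else "start"
    # Upper-case the name so that <h1> and <H1> compare equal.
    return kind, match.group(2).upper()
```

Under these assumptions, `<h1>` and `<h1 >` are both reported as start tags for H1, while `< h1>`, `<H1/>`, and `<H=1>` are all rejected. The sketch does not distinguish "not a tag at all" from "markup error".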

NOTE: The SGML declaration for HTML specifies SHORTTAG YES, which means that there are some other valid syntaxes for tags, e.g. NET tags: <em/.../, empty start tags: <>, empty end tags: </>. Until such time as support for these idioms is widely deployed, their use is strongly discouraged.

The start and end tags for the HTML, HEAD, and BODY elements are omissible. The end tags of some other elements (e.g. P, LI, DT, DD) can be omitted (see the DTD for details). This does not change the document structure -- the following documents are equivalent:

	<!DOCTYPE HTML PUBLIC
	 	"-//W3 Organization//DTD W3 HTML 2.0//EN">
	  <TITLE>Structural Example</TITLE>
	  <H1>Structural Example</H1>
	  <P>A paragraph...

	<!DOCTYPE HTML PUBLIC
	 	"-//W3 Organization//DTD W3 HTML 2.0//EN">
	  <HTML><HEAD>
	  <TITLE>Structural Example</TITLE>
	  </HEAD>
	  <BODY>
	  <H1>Structural Example</H1>
	  <P>A paragraph...</P>
	  </BODY>

Names

The element name immediately follows the tag open delimiter. Names consist of a letter followed by up to 33 letters, digits, periods, or hyphens. Names are not case sensitive. For example:

	A H1 h1 another.name name-with-hyphens
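As an illustration, the name rule can be written as a regular expression. This is a minimal Python sketch; it assumes "letter" means an ASCII letter, and the helper names `is_valid_name` and `same_name` are hypothetical.

```python
import re

# One letter, then up to 33 more letters, digits, periods, or hyphens.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,33}")


def is_valid_name(name):
    """Return True if `name` fits the name syntax described above."""
    return NAME_RE.fullmatch(name) is not None


def same_name(a, b):
    """Names are not case sensitive, so compare them case-folded."""
    return a.lower() == b.lower()
```

With this sketch, every name in the example line above is accepted, `same_name("H1", "h1")` holds, and a name that starts with a digit or runs to 35 characters is rejected.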

Attributes

In a start tag, whitespace and attributes are allowed between the element name and the closing delimiter. An attribute consists of a name, an equal sign, and a value. Whitespace is allowed around the equal sign.

The value is either:

A string literal, delimited by single quotes or double quotes, or
A name token; that is, a sequence of letters, digits, periods, or hyphens.

For example:

	<A HREF="http://host/dir/file.html">
	<A HREF=foo.html >
	<IMG SRC="mrbill.gif" ALT="Mr. Bill says, &#34;Oh Noooo&#34;">

The length of an attribute value (after replacing entity and numeric character references) is limited to 1024 characters.

NOTE 1: Some implementations allowed any character except space or '>' in a name token, for example <A HREF=foo/bar.html>. As a result, there are many documents that contain attribute values that should be quoted but are not. While parser implementors are encouraged to support this idiom, its use in future documents is strictly prohibited.
NOTE 2: Some implementations also consider any occurrence of the > character to signal the end of a tag. For compatibility with such implementations, it may be necessary to represent > with an entity or numeric character reference; for example: <IMG SRC="eq1.ps" ALT="a &#62; b">
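A generator that follows these rules might serialize attributes as in the Python sketch below. The function name `format_attribute` is hypothetical. The sketch leaves a value unquoted only when it is a name token, always uses double quotes otherwise, and writes `>` as a numeric character reference for the sake of the implementations described in NOTE 2.

```python
import re

# A name token: letters, digits, periods, or hyphens only.
NAME_TOKEN_RE = re.compile(r"[A-Za-z0-9.-]+")
MAX_VALUE_LENGTH = 1024


def format_attribute(name, value):
    """Serialize one attribute; `value` is the plain, unescaped string."""
    # The limit applies to the value after references have been replaced,
    # which is the form this function receives.
    if len(value) > MAX_VALUE_LENGTH:
        raise ValueError("attribute value longer than 1024 characters")
    if NAME_TOKEN_RE.fullmatch(value):
        return f"{name}={value}"
    escaped = (
        value.replace("&", "&amp;")   # must come first
        .replace('"', "&#34;")        # the quote used as the delimiter
        .replace(">", "&#62;")        # some parsers end the tag at any ">"
    )
    return f'{name}="{escaped}"'
```

For instance, `format_attribute("HREF", "foo.html")` yields `HREF=foo.html`, while a value containing slashes, spaces, or quotes is emitted as a quoted string literal.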

Attributes with a declared value of NAME (e.g. ISMAP, COMPACT) may be written using a minimized syntax. The markup:

	<UL COMPACT="COMPACT">

can be written as

	<UL COMPACT>

Undefined tag and attribute names

It is a principle to be conservative in that which one produces, and liberal in that which one accepts. HTML parsers should be liberal except when verifying code. HTML generators should generate strictly conforming HTML.

When a WWW application reading an HTML document encounters a tag or attribute name that it does not understand, it should behave, in the case of a tag, as though the whole tag had not been there but its content had, and in the case of an attribute, as though the attribute had not been present.
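The behaviour described above can be sketched with the `html.parser` module from the Python standard library. The `KNOWN` table, the `LiberalFilter` class, and the `filter_unknown` function are hypothetical names chosen for illustration; the vocabulary is deliberately tiny, and a real application would take its known elements and attributes from the DTD.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical, deliberately tiny vocabulary: element name -> known attributes.
KNOWN = {
    "p": set(),
    "em": set(),
    "a": {"href", "name"},
}


class LiberalFilter(HTMLParser):
    """Re-emit a document without the tags and attributes we do not know.

    Comments and declarations are also dropped by this sketch.
    """

    def __init__(self):
        # Leave references in text unconverted so they can be passed through.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN:
            return  # unknown tag: drop the tag itself, keep its content
        parts = [tag]
        for name, value in attrs:
            if name not in KNOWN[tag]:
                continue  # unknown attribute: act as if it were absent
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        if tag in KNOWN:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def filter_unknown(document):
    """Return `document` with unknown tags and attributes removed."""
    parser = LiberalFilter()
    parser.feed(document)
    parser.close()
    return "".join(parser.out)
```

Note that `HTMLParser` reports tag and attribute names in lower case, so the output is lower-cased as a side effect. The text inside an unknown element still reaches the output, which is the point of the rule.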

Character Data

The characters between the tags represent text in the ISO-Latin-1 character set, which is a superset of ASCII. Because certain characters will be interpreted as markup, they should be "escaped"; that is, represented by markup -- entity or numeric character references. For example:

	When a&#60;b, we can show that...
	Brought to you by AT&amp;T

The HTML DTD includes entities for each of the non-ASCII characters so that one may reference them by name if it is inconvenient to enter them directly:

	Kurt G&ouml;del was a famous logician and mathematician.

NOTE 1: To ensure that a string of characters has no markup, it is sufficient to represent all occurrences of <, >, and & by character or entity references.
NOTE 2: There are SGML features (CDATA, RCDATA) to allow most <, >, and & characters to be entered without the use of entity or character references. Because these features tend to be used and implemented inconsistently, and because they require 8-bit characters to represent non-ASCII characters, they are not employed in this version of the HTML DTD. An earlier HTML specification included an XMP element whose syntax is not expressible in SGML. Inside the XMP, no markup was recognized except the </XMP> end tag. While implementations are encouraged to support this idiom, its use is obsolete.
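The escaping rule in NOTE 1 is small enough to show in full. This Python sketch uses the hypothetical name `escape_text` and follows the examples above in writing < and > as numeric character references and & as the entity reference `&amp;`.

```python
def escape_text(text):
    """Represent <, >, and & by references so the result contains no markup."""
    # Replace & first; otherwise the ampersands introduced by the other
    # two replacements would themselves be escaped.
    return (
        text.replace("&", "&amp;")
        .replace("<", "&#60;")
        .replace(">", "&#62;")
    )
```

Applied to the plain strings "When a<b, we can show that..." and "Brought to you by AT&T", it produces the two escaped lines shown in the example above.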

Comments

To include comments in an HTML document that will be ignored by the parser, surround them with <!-- and -->. After the comment delimiter, all text up to the next occurrence of -- is ignored. Hence comments cannot be nested. Whitespace is allowed between the closing -- and >, but not between the opening <! and --.

For example:

	<HEAD>
	<TITLE>HTML Guide: Recommended Usage</TITLE>
	<!-- Id: Text.html,v 1.6 1994/04/25 17:33:48 connolly Exp -->
	</HEAD>

NOTE 3: Some historical implementations incorrectly consider a > sign to terminate a comment.
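The comment rules can be sketched as a small scanner in Python. The function name `strip_comments` is hypothetical, and raising `ValueError` on malformed input is a choice made for this sketch; a liberal parser might prefer to recover instead.

```python
def strip_comments(document):
    """Return `document` with comments removed, following the rules above."""
    out = []
    pos = 0
    while True:
        # A comment opens with "<!--"; no whitespace may separate "<!" and "--".
        start = document.find("<!--", pos)
        if start == -1:
            out.append(document[pos:])
            break
        # Everything up to the next "--" is ignored, including any ">".
        close = document.find("--", start + 4)
        if close == -1:
            raise ValueError("unterminated comment")
        end = close + 2
        # Whitespace is allowed between the closing "--" and ">".
        while end < len(document) and document[end].isspace():
            end += 1
        if end >= len(document) or document[end] != ">":
            raise ValueError("expected '>' after closing '--'")
        out.append(document[pos:start])
        pos = end + 1
    return "".join(out)
```

Because the scanner stops at the first `--` after the opening delimiter, a > inside a comment does not end it, and an attempt to nest comments is reported as an error rather than silently accepted.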