Structured Text

An HTML instance is like a text file, except that some of the characters are interpreted as markup. The markup gives structure to the document.

The instance represents a hierarchy of elements. Each element has a name, some attributes, and some content. Most elements are represented in the document as a start tag, which gives the name and attributes, followed by the content, followed by the end tag. For example:

<HTML>
 <TITLE>
  A sample HTML instance
 </TITLE>
 <H1>
  An Example of Structure
 </H1>
 Here's a typical paragraph.
 <P>
 <UL>
  <LI>
  Item one has an
  <A NAME="anchor">
   anchor
  </A>
  <LI>
  Here's item two.
 </UL>
</HTML>
Some elements (e.g. P, LI) are empty. They have no content. They show up as just a start tag.

For the rest of the elements, the content is a sequence of data characters and nested elements.

Tags

Every element starts with a tag, and every non-empty element ends with a tag. Start tags are delimited by < and >, and end tags are delimited by </ and >.

Names

The element name immediately follows the tag open delimiter. Names consist of a letter followed by up to 33 letters, digits, periods, or hyphens. Names are not case sensitive.
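
The name rule above can be sketched as a regular expression. This is a minimal illustration, not part of the specification; the helper names `is_valid_name` and `same_name` are hypothetical.

```python
import re

# A letter followed by up to 33 letters, digits, periods, or hyphens.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,33}")


def is_valid_name(name: str) -> bool:
    """Return True if `name` fits the name syntax described above."""
    return NAME_RE.fullmatch(name) is not None


def same_name(a: str, b: str) -> bool:
    """Names are not case sensitive, so compare them case-folded."""
    return a.upper() == b.upper()
```

Under this sketch, `TITLE` and `title` name the same element, while `1H` is rejected because it does not start with a letter.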

Attributes

In a start tag, whitespace and attributes are allowed between the element name and the closing delimiter. An attribute consists of a name, an equal sign, and a value. Whitespace is allowed around the equal sign.

The value is specified in a string surrounded by single quotes or a string surrounded by double quotes. (See: other tolerated forms @@)

The string is parsed like RCDATA (see below) to determine the attribute value. This allows, for example, quote characters in attribute values to be represented by character references.

The length of an attribute value (after parsing) is limited to 1024 characters.
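
As a rough sketch of the attribute rules above, the following parses the text between the element name and the closing delimiter. The function name `parse_attributes` is hypothetical, and Python's `html.unescape` is used only as a stand-in for RCDATA parsing: it is more lenient than the rules described here and knows more entity names than the HTML DTD declares.

```python
import html
import re

MAX_VALUE_LENGTH = 1024  # limit on the parsed attribute value

ATTR_RE = re.compile(
    r"""\s*([A-Za-z][A-Za-z0-9.-]*)   # attribute name
        \s*=\s*                        # equal sign, optional whitespace
        (?:"([^"]*)"|'([^']*)')        # double- or single-quoted string
    """,
    re.VERBOSE,
)


def parse_attributes(text: str) -> dict[str, str]:
    """Parse `name = "value"` pairs into a dict keyed by lowercased name."""
    attrs: dict[str, str] = {}
    pos = 0
    while pos < len(text):
        m = ATTR_RE.match(text, pos)
        if m is None:
            if text[pos:].strip():
                raise ValueError(f"unexpected text: {text[pos:]!r}")
            break  # only trailing whitespace is left
        raw = m.group(2) if m.group(2) is not None else m.group(3)
        value = html.unescape(raw)  # stand-in for RCDATA parsing
        if len(value) > MAX_VALUE_LENGTH:
            raise ValueError(f"value of {m.group(1)} is too long")
        attrs[m.group(1).lower()] = value
        pos = m.end()
    return attrs
```

For instance, `parse_attributes(' NAME = "anchor"')` yields a dict with the single key `name`.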

Element Types

The name of a tag refers to an element type declaration in the HTML DTD. An element type declaration associates an element name with

Empty Elements

Empty elements have the keyword EMPTY in their declaration. For example:

<!ELEMENT NEXTID - O EMPTY>
<!ATTLIST NEXTID N NUMBER #REQUIRED>
This means that the following:

<nextid n="27">
is legal, but these others are not:

<nextid>
<nextid n="abc">
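
The NEXTID declaration can be read as a small validation rule. The sketch below assumes the attributes have already been parsed into a dict with lowercased names, and that a NUMBER value is a non-empty string of decimal digits; `nextid_attrs_ok` is a hypothetical helper.

```python
import re


def nextid_attrs_ok(attrs: dict[str, str]) -> bool:
    """Check the N attribute of NEXTID: required, and must be a NUMBER."""
    value = attrs.get("n")
    if value is None:
        return False  # #REQUIRED attribute is missing
    return re.fullmatch(r"[0-9]+", value) is not None
```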

Character Data

The keyword CDATA indicates that the content of an element is character data. Character data is all the text up to the next end tag open delimiter-in-context. For example:

<!ELEMENT XMP - - CDATA>
specifies that the following text is a legal XMP element:

<xmp>Here's a title. It looks like it has <tags> and <!--comments-->
in it, but it does not. Even this </ is data.</xmp>
The string </ is recognized as the opening delimiter of an end tag only when it is "in context," that is, when it is followed by a letter. However, as soon as the end tag open delimiter is recognized, it terminates the CDATA content. The following is an error:

<xmp>There is no way to represent </end> tags in CDATA </xmp>
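
To illustrate how CDATA content ends, the sketch below finds the first `</` that is followed by a letter and splits the text there. It assumes the input is the text immediately after the start tag; `split_cdata` is a hypothetical name.

```python
import re

# "</" counts as an end tag open delimiter only when a letter follows it.
ETAGO_IN_CONTEXT = re.compile(r"</(?=[A-Za-z])")


def split_cdata(text: str) -> tuple[str, str]:
    """Split into (character data, remainder starting at the end tag)."""
    m = ETAGO_IN_CONTEXT.search(text)
    if m is None:
        raise ValueError("CDATA content is never terminated by an end tag")
    return text[:m.start()], text[m.start():]
```

Applied to the erroneous example above, the character data stops just before `</end>`, so the parser then meets an end tag that does not match XMP.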

Replaceable Character Data

Elements with RCDATA content behave much like those with CDATA, except for character references and entity references. Elements declared like:

<!ELEMENT TITLE - - RCDATA>
can have any sequence of characters in their content.

Character References

To represent a character that would otherwise be recognized as markup, use a character reference. The string &# signals a character reference when it is followed by a letter or a digit. The delimiter is followed by the decimal character number and a semicolon. For example:

<title>You can even represent &#60;/end> tags in RCDATA </title>
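
A decimal character reference can be resolved with a short substitution. This sketch handles only the `&#` plus digits plus semicolon form shown above and leaves everything else untouched; `decode_char_refs` is a hypothetical name.

```python
import re

# "&#", a decimal character number, then a semicolon.
CHAR_REF = re.compile(r"&#([0-9]+);")


def decode_char_refs(text: str) -> str:
    """Replace each decimal character reference with its character."""
    return CHAR_REF.sub(lambda m: chr(int(m.group(1))), text)
```

Decoding the TITLE content above turns `&#60;/end>` into the literal characters `</end>`.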

Entity References

The HTML DTD declares entities for the less than, greater than, and ampersand characters and each of the ISO Latin 1 characters so that you can reference them by name rather than by number.

The string & signals an entity reference when it is followed by a letter or a digit. The delimiter is followed by the entity name and a semicolon. For example:

Kurt G&ouml;del was a famous logician and mathematician.
Note:
To be sure that a string of characters has no markup, HTML writers should represent all occurrences of <, >, and & by character or entity references.
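
One way to follow this advice when generating HTML from a program is to escape the three characters before writing them out. The sketch below leans on Python's standard `html.escape`, which emits the entity references `&lt;`, `&gt;`, and `&amp;`; the wrapper name `escape_data` is hypothetical.

```python
import html


def escape_data(text: str) -> str:
    """Represent <, >, and & by entity references; leave quotes alone."""
    return html.escape(text, quote=False)
```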

Element Content

Some elements have, instead of a keyword that states the type of content, a content model, which tells what patterns of data and nested elements are allowed. If the content model of an element does not include the symbol #PCDATA, the content is element content.

Whitespace in element content is considered markup and ignored. Any characters that are not markup, that is, data characters, are illegal.

For example:

<!ELEMENT HEAD - - (TITLE? & ISINDEX? & NEXTID? & LINK*)>
declares an element that may be used as follows:

<head>
 <isindex>
 <title>Head Example</title>
</head>
But the following are illegal:

<head> no data allowed! </head>
<head><isindex><title>Two isindex tags</title><isindex></head>
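
The HEAD content model can be checked with a simple count, as sketched below. The sketch assumes the children have already been parsed into a list of element names and that any data characters between them are passed separately; `head_content_ok` is a hypothetical helper, not a general content-model engine.

```python
from collections import Counter

AT_MOST_ONCE = {"TITLE", "ISINDEX", "NEXTID"}  # TITLE? & ISINDEX? & NEXTID?
ANY_NUMBER = {"LINK"}                          # LINK*


def head_content_ok(children: list[str], data: str = "") -> bool:
    """Check child element names and data against the HEAD content model."""
    if data.strip():
        return False  # data characters are illegal in element content
    counts = Counter(name.upper() for name in children)
    for name, n in counts.items():
        if name in AT_MOST_ONCE:
            if n > 1:
                return False
        elif name not in ANY_NUMBER:
            return False  # element not allowed in HEAD
    return True
```

Because the `&` connector allows the members in any order, the check counts occurrences and ignores order.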

Mixed Content

If the content model includes the symbol #PCDATA, the content of the element is parsed as mixed content. For example:

<!ELEMENT PRE - - (#PCDATA | A | B | I | U | P)+>
<!ATTLIST PRE
	WIDTH NUMBER #implied
	>
This says that the PRE element contains one or more A, B, I, U, or P elements or data characters. Here's an example of a PRE element:

<pre>
<b>NAME</b>
    cat -- concatenate<a href="terms.html#file">files</a>
<b>EXAMPLE</b>
    cat &#60;xyz
</pre>
The content of the above PRE element is:

Comments and Other Markup

To include comments in an HTML document that will be ignored by the parser, surround them with <!-- and -->. After the comment delimiter, all text up to the next occurrence of -- is ignored. Hence comments cannot be nested. Whitespace is allowed between the closing -- and >. (But not between the opening <! and --.)

For example:

<HEAD>
<TITLE>HTML Guide: Recommended Usage</TITLE>
<!-- $Id: recommended.html,v 1.3 93/01/06 18:38:11 connolly Exp $ -->
</HEAD>
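
The comment syntax can be approximated with a regular expression, as in this sketch. It is lenient: if the first `--` is not followed by optional whitespace and `>`, it keeps scanning to a later `--` instead of reporting an error. The name `strip_comments` is hypothetical.

```python
import re

# "<!--", any text up to "--", optional whitespace, then ">".
COMMENT = re.compile(r"<!--.*?--\s*>", re.DOTALL)


def strip_comments(text: str) -> str:
    """Remove comments; whitespace is allowed only before the final '>'."""
    return COMMENT.sub("", text)
```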
 
There are a few other SGML markup constructs that are deprecated or illegal.

Delimiter   Signals...
<?          Processing instruction. Terminated by >.
<![L        Marked section. Marked sections are deprecated. See the SGML standard for complete information.
<!L         Markup declaration. HTML defines no short reference maps, so these are errors. Terminated by >.

Line Breaks

A line break character is considered markup (and ignored) if it is the first or last piece of content in an element. This allows you to write either

<PRE>some example text</pre>
or

<pre>
some example text
</pre>
and these will be processed identically.
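
The first-and-last line break rule can be sketched as follows. The sketch assumes line breaks are single `\n` characters and covers only this rule, not lines that consist solely of markup; `strip_boundary_line_breaks` is a hypothetical name.

```python
def strip_boundary_line_breaks(content: str) -> str:
    """Drop a line break that is the first or last piece of content."""
    if content.startswith("\n"):
        content = content[1:]
    if content.endswith("\n"):
        content = content[:-1]
    return content
```

With this rule, the raw content of both PRE examples above reduces to the same string.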

Also, a line that's not empty but contains no content will be ignored altogether. For example, the element

<pre>
<!-- this line is ignored, including the linebreak character -->
first line
 
third line<!-- the following linebreak is content: -->
fourth line<!-- this one's ignored cuz it's the last piece of content: -->
</pre>
contains only the string first line\n\nthird line\nfourth line.

Summary of Markup Signals

The following delimiters may signal markup, depending on context.

Delimiter   Signals
<!--        Comment
&#          Character reference
&           Entity reference
</          End tag
<!          Markup declaration
]]>         Marked section close (an error)
<           Start tag