Structured Text

An HTML instance is like a text file, except that some of the characters are interpreted as markup. The markup gives structure to the document.

The instance represents a hierarchy of elements. Each element has a name, some attributes, and some content. Most elements are represented in the document as a start tag, which gives the name and attributes, followed by the content, followed by the end tag. For example:

<HTML>
 <TITLE>
  A sample HTML instance
 </TITLE>
 <H1>
  An Example of Structure
 </H1>
 Here's a typical paragraph.
 <P>
 <UL>
  <LI>
  Item one has an
  <A NAME="anchor">
   anchor
  </A>
  <LI>
  Here's item two.
 </UL>
</HTML>

Some elements (e.g. P, LI) are empty. They have no content. They show up as just a start tag.

For the rest of the elements, the content is a sequence of data characters and nested elements.

Element Types

The name of a tag refers to an element type declaration in the HTML DTD. An element type declaration associates an element name with

A list of attributes and their types and statuses
A content type (one of EMPTY, CDATA, RCDATA, ELEMENT, or MIXED) which determines the syntax of the element's content
A content model, which specifies the pattern of nested elements and data
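
The three parts of an element type declaration can be pictured as a small record. The following Python sketch is illustrative only: the class and field names (ElementType, AttributeDecl, content_type, and so on) are hypothetical and are not part of the DTD or of any library. It encodes the NEXTID declaration that appears in the Empty Elements section.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AttributeDecl:
    """One attribute from an ATTLIST declaration (hypothetical model)."""
    name: str        # e.g. "N"
    value_type: str  # e.g. "NUMBER"
    status: str      # e.g. "#REQUIRED" or "#IMPLIED"


@dataclass
class ElementType:
    """What an element type declaration associates with an element name."""
    name: str
    content_type: str                    # EMPTY, CDATA, RCDATA, ELEMENT, or MIXED
    content_model: Optional[str] = None  # only for ELEMENT and MIXED content
    attributes: list = field(default_factory=list)


# <!ELEMENT NEXTID - O EMPTY>
# <!ATTLIST NEXTID N NUMBER #REQUIRED>
NEXTID = ElementType(
    name="NEXTID",
    content_type="EMPTY",
    attributes=[AttributeDecl("N", "NUMBER", "#REQUIRED")],
)
```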

Empty Elements

Empty elements have the keyword EMPTY in their declaration. For example:

<!ELEMENT NEXTID - O EMPTY>
<!ATTLIST NEXTID N NUMBER #REQUIRED>

This means that the following:

<nextid n="27">

is legal, but these others are not:

<nextid>

<nextid n="abc">
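
The check implied by the NEXTID declaration can be sketched in Python. This is a minimal illustration, not a validator: the function name nextid_is_valid and the dictionary representation of attributes are assumptions, and NUMBER is approximated as "one or more ASCII digits" without the length limit that SGML imposes.

```python
def nextid_is_valid(attributes):
    """Check a NEXTID start tag's attributes against the declaration.

    `attributes` is assumed to map lower-cased attribute names to their
    string values, e.g. {"n": "27"}.
    """
    value = attributes.get("n")
    if value is None:
        return False  # N is #REQUIRED, so it may not be omitted
    # NUMBER: digits only (approximation; SGML also limits the length)
    return value.isascii() and value.isdigit()
```

With this sketch, {"n": "27"} passes, while an empty attribute set and {"n": "abc"} are rejected, matching the three examples above.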

Character Data

The keyword CDATA indicates that the content of an element is character data. Character data is all the text up to the next end tag open delimiter-in-context. For example:

<!ELEMENT XMP - - CDATA>

specifies that the following text is a legal XMP element:

<xmp>Here's a title. It looks like it has <tags> and <!--comments-->
in it, but it does not. Even this </ is data.</xmp>

The string </ is only recognized as the opening delimiter of an end tag when it is "in context," that is, when it is followed by a letter. However, as soon as the end tag open delimiter is recognized, it terminates the CDATA content. The following is an error:

<xmp>There is no way to represent </end> tags in CDATA </xmp>
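
The termination rule for CDATA can be sketched as a search for the first </ that is followed by a letter. The Python below is only an illustration of that rule; the function name cdata_content is hypothetical.

```python
import re

# End tag open delimiter "in context": "</" immediately followed by a letter.
ETAGO_IN_CONTEXT = re.compile(r"</(?=[A-Za-z])")


def cdata_content(text):
    """Return the CDATA content at the start of `text`.

    The content is everything before the first "</" that is followed by
    a letter, or all of `text` if no such delimiter occurs.
    """
    match = ETAGO_IN_CONTEXT.search(text)
    return text if match is None else text[:match.start()]
```

Applied to the text after <xmp> in the erroneous example, the content stops just before </end>, which is why the rest of that line cannot be part of the XMP element.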

Replaceable Character Data

Elements with RCDATA content behave much like those with CDATA, except that character references and entity references are recognized. Elements declared like:

<!ELEMENT TITLE - - RCDATA>

can have any sequence of characters in their content.

Character References

To represent a character that would otherwise be recognized as markup, use a character reference. The string &# signals a character reference when it is followed by a letter or a digit. The delimiter is followed by the decimal character number and a semicolon. For example:

<title>You can even represent &#60;/end> tags in RCDATA </title>
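
As a rough illustration of how a decimal character reference is resolved, the following Python sketch replaces each &#number; with the character that has that number. The function name decode_char_refs is hypothetical, and the sketch handles only the decimal form described above.

```python
import re

# "&#", a decimal character number, and a closing semicolon.
NUMERIC_REF = re.compile(r"&#([0-9]+);")


def decode_char_refs(text):
    """Replace decimal character references with the characters they name."""
    return NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)
```

For the TITLE example, decode_char_refs turns "&#60;/end>" into "</end>" only after the markup has been recognized, so the result is data rather than an end tag.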

Entity References

The HTML DTD declares entities for the less than, greater than, and ampersand characters and each of the ISO Latin 1 characters so that you can reference them by name rather than by number.

The string & signals an entity reference when it is followed by a letter or a digit. The delimiter is followed by the entity name and a semicolon. For example:

Kurt G&ouml;del was a famous logician and mathematician.

Note: To be sure that a string of characters has no markup, HTML writers should represent all occurrences of <, >, and & by character or entity references.
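
One way to follow the advice in the note is sketched below in Python. It relies on the standard library function html.escape, which with quote=False rewrites only &, <, and > using the entity names amp, lt, and gt. The wrapper name escape_markup is hypothetical.

```python
import html


def escape_markup(text):
    """Replace &, <, and > with entity references so `text` is pure data.

    quote=False leaves quotation marks alone; only the three characters
    named in the note are rewritten.
    """
    return html.escape(text, quote=False)
```

The ampersand has to be replaced before the other two characters, since the replacements themselves contain ampersands; html.escape takes care of that ordering.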

Element Content

Some elements have, instead of a keyword that states the type of content, a content model, which tells what patterns of data and nested elements are allowed. If the content model of an element does not include the symbol #PCDATA, the content is element content.

Whitespace in element content is considered markup and ignored. Any characters that are not markup, that is, data characters, are illegal.

For example:

<!ELEMENT HEAD - - (TITLE? & ISINDEX? & NEXTID? & LINK*)>

declares an element that may be used as follows:

<head>
 <isindex>
 <title>Head Example</title>
</head>

But the following are illegal:

<head> no data allowed! </head>

<head><isindex><title>Two isindex tags</title><isindex></head>
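
The HEAD content model can be checked with a few lines of Python. This sketch assumes the parser has already produced the list of child element names, and it reflects one reading of the & connector: each member of the group is matched once, in any order, so the LINK elements matched by LINK* must be adjacent. The function name head_children_are_valid is hypothetical, and data characters (always illegal in element content) are not modeled here.

```python
from itertools import groupby

SINGLE = {"title", "isindex", "nextid"}  # declared with "?": at most one
REPEATABLE = {"link"}                    # declared with "*": any number


def head_children_are_valid(child_names):
    """Check child element names against
    (TITLE? & ISINDEX? & NEXTID? & LINK*)."""
    seen = set()
    # groupby collapses adjacent repeats into runs of the same name.
    for name, run in groupby(n.lower() for n in child_names):
        run_length = sum(1 for _ in run)
        if name not in SINGLE and name not in REPEATABLE:
            return False  # element not allowed in HEAD
        if name in seen:
            return False  # each member of the "&" group matches only once
        if name in SINGLE and run_length > 1:
            return False  # "?" allows at most one occurrence
        seen.add(name)
    return True
```

The legal example corresponds to ["isindex", "title"], and the second illegal example to ["isindex", "title", "isindex"].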

Mixed Content

If the content model includes the symbol #PCDATA, the content of the element is parsed as mixed content. For example:

<!ELEMENT PRE - - (#PCDATA | A | B | I | U | P)+>
<!ATTLIST PRE
	WIDTH NUMBER #IMPLIED>

This says that the PRE element contains one or more A, B, I, U, or P elements or data characters. Here's an example of a PRE element:

<pre>
<b>NAME</b>
    cat -- concatenate<a href="terms.html#file">files</a>
<b>EXAMPLE</b>
    cat &#60;xyz
</pre>

The content of the above PRE element is:

A B element
The string "\n    cat -- concatenate"
An A element
The string "\n"
Another B element
The string "\n    cat <xyz"

Comments and Other Markup

To include comments in an HTML document that will be ignored by the parser, surround them with <!-- and -->. After the comment delimiter, all text up to the next occurrence of -- is ignored. Hence comments cannot be nested. Whitespace is allowed between the closing -- and >. (But not between the opening <! and --.)

For example:

<HEAD>
<TITLE>HTML Guide: Recommended Usage</TITLE>
<!-- $Id: recommended.html,v 1.3 93/01/06 18:38:11 connolly Exp $ -->
</HEAD>
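
The comment syntax can be approximated with a regular expression. The Python sketch below is a simplification that covers only well-formed comments of the kind shown above: <!-- with no intervening whitespace, any text up to the next --, optional whitespace, then >. It does not diagnose malformed comments, and the function name strip_comments is hypothetical.

```python
import re

# "<!--" with no whitespace after "<!", text up to the closing "--",
# optional whitespace, then ">". DOTALL lets comments span lines.
COMMENT = re.compile(r"<!--.*?--\s*>", re.DOTALL)


def strip_comments(text):
    """Remove simple, well-formed comments from `text`."""
    return COMMENT.sub("", text)
```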

There are a few other SGML markup constructs that are deprecated or illegal.

Delimiter: Signals...
<?: Processing instruction. Terminated by >.
<![L: Marked section. Marked sections are deprecated. See the SGML standard for complete information.
<!L: Markup declaration. HTML defines no short reference maps, so these are errors. Terminated by >.

Line Breaks

A line break character is considered markup (and ignored) if it is the first or last piece of content in an element. This allows you to write either of the following:

<PRE>some example text</pre>

<pre>
some example text
</pre>

and these will be processed identically.
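
The first-or-last rule can be sketched in Python as follows. This is a minimal illustration that assumes the content of the element is already available as a string with "\n" as the line break character; the function name strip_boundary_line_breaks is hypothetical.

```python
def strip_boundary_line_breaks(content):
    """Drop one line break at the very start and one at the very end of an
    element's content; line breaks elsewhere are kept as data."""
    if content.startswith("\n"):
        content = content[1:]
    if content.endswith("\n"):
        content = content[:-1]
    return content
```

Under this rule both forms of the PRE element above yield the same content, "some example text".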

Also, a line that is not empty but contains only markup (and so no content) is ignored altogether, including its line break character. For example, the element

<pre>
<!-- this line is ignored, including the linebreak character -->
first line

third line<!-- the following linebreak is content: -->
fourth line<!-- this one's ignored cuz it's the last piece of content: -->
</pre>

contains only the string "first line\n\nthird line\nfourth line".

Summary of Markup Signals

The following delimiters may signal markup, depending on context.

Delimiter: Signals
<!--: Comment
&#: Character reference
&: Entity reference
</: End tag
<!: Markup declaration
]]>: Marked section close (an error)
<: Start tag
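
The table can be read as an ordered list of checks, longest delimiter first. The Python sketch below classifies the text at a given position. The contexts for &#, &, and </ follow the rules stated earlier in this section; requiring a letter after < and after <! is an assumption made for this illustration. A real SGML parser also takes the current parsing mode into account (inside CDATA content, for instance, only </ is recognized), which this sketch ignores. The names SIGNALS and classify are hypothetical.

```python
import re

# (pattern, signal) pairs, tried in order so that longer delimiters win.
SIGNALS = [
    (re.compile(r"<!--"), "comment"),
    (re.compile(r"&#[A-Za-z0-9]"), "character reference"),
    (re.compile(r"&[A-Za-z0-9]"), "entity reference"),
    (re.compile(r"</[A-Za-z]"), "end tag"),
    (re.compile(r"<![A-Za-z]"), "markup declaration"),   # context assumed
    (re.compile(r"\]\]>"), "marked section close"),
    (re.compile(r"<[A-Za-z]"), "start tag"),             # context assumed
]


def classify(text, pos=0):
    """Return the kind of markup signaled at text[pos:], or "data"."""
    for pattern, signal in SIGNALS:
        if pattern.match(text, pos):
            return signal
    return "data"
```

For example, classify("</ is data") returns "data" because the </ is not followed by a letter, while classify("</pre>") returns "end tag".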