PUBLIC DRAFT -- HTML

HYPERTEXT MARKUP LANGUAGE
A REPRESENTATION FOR NODES IN THE WORLD WIDE WEB

Daniel W. Connolly, Convex Computer Corp.
January, 1993

Status of this Document

Distribution of this document is unlimited. Please send comments to Dan Connolly <connolly@convex.com>.

Abstract

The World Wide Web project involves the processing of structured documents by diverse systems around the globe. Existing document representations geared towards typesetting, information retrieval, or multimedia are too tightly coupled to a hardware system, authoring environment, publication style, or field of study. HyperText Markup Language was created to fill the need to

    Represent existing bodies of information
    Connect information entities with hypertext links
    Scale to a world-wide scope
    Fit into existing and evolving user interface paradigms
    Provide an experimental platform for collaborative hypermedia

Contents

    Introduction
    Structured Text
        Tags
        Element Types
        Comments and Other Markup
        Line Breaks
        Summary of Markup Signals
    HTML semantics @@
    Rationale @@
    References
    HTML DTD

INTRODUCTION

The HyperText Markup Language is defined in terms of the ISO Standard Generalized Markup Language [ISO 8879:1986]. SGML is a system for defining structured document types and markup languages to represent instances of those document types. Every SGML document has three parts:

    An SGML declaration, which binds SGML processing quantities and syntax token names to specific values. For example, the SGML declaration in the HTML DTD specifies that the string that opens an end tag is </ and the maximum length of a name is 34 characters.

    A prologue including one or more document type declarations, which specify the element types, element relationships and attributes, and references that can be represented by markup. The HTML DTD specifies, for example, that the HEAD element contains at most one TITLE element.

    An instance, which contains the data and markup of the document.
We use the term HTML to mean both the document type and the markup language for representing instances of that document type. All HTML documents share the same SGML declaration and prologue. Hence implementations of the World Wide Web generally only transmit and store the instance part of an HTML document. To construct an SGML document entity for processing by an SGML parser, it is necessary to prefix the text of the ``HTML DTD'' section to the HTML instance. Conversely, to implement an HTML parser, one need only implement those parts of an SGML parser that are needed to parse an instance after parsing the HTML DTD.

STRUCTURED TEXT

An HTML instance is like a text file, except that some of the characters are interpreted as markup. The markup gives structure to the document. The instance represents a hierarchy of elements. Each element has a name, some attributes, and some content. Most elements are represented in the document as a start tag, which gives the name and attributes, followed by the content, followed by the end tag. For example:

    <HTML>
    <TITLE> A sample HTML instance </TITLE>
    <H1> An Example of Structure </H1>
    Here's a typical paragraph.
    <P>
    <UL>
    <LI> Item one has an <A NAME="anchor"> anchor </A>
    <LI> Here's item two.
    </UL>
    </HTML>

Some elements (e.g. P, LI) are empty. They have no content. They show up as just a start tag. For the rest of the elements, the content is a sequence of data characters and nested elements.

Tags

Every element starts with a tag, and every non-empty element ends with a tag. Start tags are delimited by < and >, and end tags are delimited by </ and >.

NAMES

The element name immediately follows the tag open delimiter. Names consist of a letter followed by up to 33 letters, digits, periods, or hyphens. Names are not case sensitive.

ATTRIBUTES

In a start tag, whitespace and attributes are allowed between the element name and the closing delimiter. An attribute consists of a name, an equal sign, and a value.
Whitespace is allowed around the equal sign. The value is specified in a string surrounded by single quotes or a string surrounded by double quotes. (See: other tolerated forms @@) The string is parsed like RCDATA (see below) to determine the attribute value. This allows, for example, quote characters in attribute values to be represented by character references. The length of an attribute value (after parsing) is limited to 1024 characters.

Element Types

The name of a tag refers to an element type declaration in the HTML DTD. An element type declaration associates an element name with

    A list of attributes and their types and statuses
    A content type (one of EMPTY, CDATA, RCDATA, ELEMENT, or MIXED) which determines the syntax of the element's content
    A content model, which specifies the pattern of nested elements and data

EMPTY ELEMENTS

Empty elements have the keyword EMPTY in their declaration. For example:

    <!ELEMENT NEXTID - O EMPTY>
    <!ATTLIST NEXTID N NUMBER #REQUIRED>

This means that the following: is legal, but these others are not:

    <nextid>

CHARACTER DATA

The keyword CDATA indicates that the content of an element is character data. Character data is all the text up to the next end tag open delimiter-in-context. For example:

    <!ELEMENT XMP - - CDATA>

specifies that the following text is a legal XMP element:

    <xmp>Here's a title. It looks like it has <tags> and <!--comments--> in it, but it does not. Even this </ is data.</xmp>

The string </ is only recognized as the opening delimiter of an end tag when it is ``in context,'' that is, when it is followed by a letter. However, as soon as the end tag open delimiter is recognized, it terminates the CDATA content. The following is an error:

    <xmp>There is no way to represent </end> tags in CDATA </xmp>

REPLACEABLE CHARACTER DATA

Elements with RCDATA content behave much like those with CDATA, except for character references and entity references.
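As an illustration only, the delimiter-in-context rule given under CHARACTER DATA can be sketched in Python. The function name split_cdata and the regular expression are assumptions of this sketch, not part of the HTML definition, and a real SGML parser does considerably more.

```python
import re

# "</" counts as the end tag open delimiter only when a letter follows it.
ETAGO_IN_CONTEXT = re.compile(r"</(?=[A-Za-z])")


def split_cdata(text):
    """Split text at the first end tag open delimiter-in-context.

    Returns (content, rest): content is the character data, rest starts
    at the delimiter. If no delimiter is found, rest is the empty string.
    """
    match = ETAGO_IN_CONTEXT.search(text)
    if match is None:
        return text, ""
    return text[:match.start()], text[match.start():]


# The "</" followed by a space is data; the "</xmp" ends the content.
content, rest = split_cdata("Even this </ is data.</xmp>")
print(repr(content))
print(repr(rest))
```

Applied to the erroneous XMP example, the same function stops at </end, which is why end tags cannot be represented inside CDATA content.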
Elements declared like:

    <!ELEMENT TITLE - - RCDATA>

can have any sequence of characters in their content.

Character References

To represent a character that would otherwise be recognized as markup, use a character reference. The string &# signals a character reference when it is followed by a letter or a digit. The delimiter is followed by the decimal character number and a semicolon. For example:

    <title>You can even represent &#60;/end> tags in RCDATA </title>

Entity References

The HTML DTD declares entities for the less than, greater than, and ampersand characters and each of the ISO Latin 1 characters so that you can reference them by name rather than by number. The string & signals an entity reference when it is followed by a letter or a digit. The delimiter is followed by the entity name and a semicolon. For example:

    Kurt G&ouml;del was a famous logician and mathematician.

Note: To be sure that a string of characters has no markup, HTML writers should represent all occurrences of <, >, and & by character or entity references.

ELEMENT CONTENT

Some elements have, instead of a keyword that states the type of content, a content model, which tells what patterns of data and nested elements are allowed. If the content model of an element does not include the symbol #PCDATA, the content is element content. Whitespace in element content is considered markup and ignored. Any characters that are not markup, that is, data characters, are illegal. For example:

    <!ELEMENT HEAD - - (TITLE? & ISINDEX? & NEXTID? & LINK*)>

declares an element that may be used as follows:

    <head>
    <isindex>
    <title>Head Example</title>
    </head>

But the following are illegal:

    <head> no data allowed!
    </head>

    <head><isindex><title>Two isindex tags</title><isindex></head>

MIXED CONTENT

If the content model includes the symbol #PCDATA, the content of the element is parsed as mixed content. For example:

    <!ELEMENT PRE - - (#PCDATA | A | B | I | U | P)+>
    <!ATTLIST PRE WIDTH NUMBER #implied >

This says that the PRE element contains one or more A, B, I, U, or P elements or data characters. Here's an example of a PRE element:

    <pre>
    <b>NAME</b> cat -- concatenate<a href="terms.html#file">files</a>
    <b>EXAMPLE</b>
     cat &#60;xyz
    </pre>

The content of the above PRE element is:

    A B element
    The string `` cat -- concatenate''
    An A element
    The string ``\n''
    Another B element
    The string ``\n cat <xyz''

Comments and Other Markup

To include comments in an HTML document that will be ignored by the parser, surround them with <!-- and -->. After the comment delimiter, all text up to the next occurrence of -- is ignored. Hence comments cannot be nested. Whitespace is allowed between the closing -- and >. (But not between the opening <! and --.) For example:

    <HEAD>
    <TITLE>HTML Guide: Recommended Usage</TITLE>
    <!-- $Id: recommended.html,v 1.3 93/01/06 18:38:11 connolly Exp $ -->
    </HEAD>

There are a few other SGML markup constructs that are deprecated or illegal.

    Delimiter   Signals...
    <?          Processing instruction. Terminated by >.
    <![L        Marked section. Marked sections are deprecated. See the SGML standard for complete information.
    <!L         Markup declaration. HTML defines no short reference maps, so these are errors. Terminated by >.

Line Breaks

A line break character is considered markup (and ignored) if it is the first or last piece of content in an element.
This allows you to write either

    <PRE>some example text</pre>

or

    <pre>
    some example text
    </pre>

and these will be processed identically. Also, a line that's not empty but contains no content will be ignored altogether. For example, the element

    <pre>
    <!-- this line is ignored, including the linebreak character -->
    first line

    third line<!-- the following linebreak is content: -->
    fourth line<!-- this one's ignored cuz it's the last piece of content: -->
    </pre>

contains only the string first line\n\nthird line\nfourth line.

Summary of Markup Signals

The following delimiters may signal markup, depending on context.

    Delimiter   Signals
    <!--        Comment
    &#          Character reference
    &           Entity reference
    </          End tag
    <!          Markup declaration
    ]]>         Marked section close (an error)
    <           Start tag

REFERENCES

    ISO 8879:1986, Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML)

    sgmls, an SGML parser by James Clark <jjc@jclark.com>, derived from the ARCSGML parser materials which were written by Charles F. Goldfarb. The source is available on the ifi.uio.no FTP server in the directory /pub/SGML/SGMLS.

    WWW URL

HTML DTD

<!SGML  "ISO 8879:1986"
--
        HTML DTD

        Document Type Definition for the HyperText Markup Language
        as used by the World Wide Web application (HTML DTD).

        NOTE: This is a definition of HTML with respect to
        SGML, and assumes an understanding of SGML terms.

        For a description of HTML in layman's terms, see
        "HTML: A Representation for Nodes in the World Wide Web"
        by Dan Connolly. aka
        http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html
        by <connolly@convex.com>
--

CHARSET
         BASESET  "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
         DESCSET    0   9 UNUSED
                    9   2 9
                   11   2 UNUSED
                   13   1 13
                   14  18 UNUSED
                   32  95 32
                  127   1 UNUSED
         BASESET  "ISO Registration Number 100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr.
                   1//ESC 2/13 4/1"
         DESCSET  128  32 UNUSED
                  160  95 32
                  255   1 UNUSED

CAPACITY SGMLREF
         TOTALCAP 150000
         GRPCAP   150000

SCOPE    DOCUMENT

SYNTAX
         SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
                  18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
         BASESET  "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
         DESCSET  0 128 0
         FUNCTION RE          13
                  RS          10
                  SPACE       32
                  TAB SEPCHAR  9
         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-"
                  UCNMCHAR ".-"
                  NAMECASE GENERAL YES
                           ENTITY  NO
         DELIM    GENERAL  SGMLREF
                  SHORTREF SGMLREF
         NAMES    SGMLREF
         QUANTITY SGMLREF
                  NAMELEN  34
                  TAGLVL   100
                  LITLEN   1024
                  GRPGTCNT 150
                  GRPCNT   64

FEATURES
  MINIMIZE
    DATATAG  NO
    OMITTAG  NO
    RANK     NO
    SHORTTAG NO
  LINK
    SIMPLE   NO
    IMPLICIT NO
    EXPLICIT NO
  OTHER
    CONCUR   NO
    SUBDOC   NO
    FORMAL   YES
  APPINFO NONE
>

<!DOCTYPE HTML [

<!-- $Id: html.dtd,v 1.4 93/01/20 20:56:08 connolly Exp $ -->

<!-- Regarding clause 6.1, SGML Document:

        [1] SGML document = SGML document entity,
            (SGML subdocument entity |
            SGML text entity | non-SGML data entity)*

        The role of SGML document entity is filled by this DTD,
        followed by the conventional HTML data stream.
-->

<!-- DTD definitions -->

<!ENTITY % heading "H1|H2|H3|H4|H5|H6" >
<!ENTITY % list "UL|OL|DIR|MENU">
<!ENTITY % literal "XMP|LISTING">

<!ENTITY % headelement "TITLE | NEXTID | ISINDEX" >
<!ENTITY % bodyelement "P | %heading | %list | DL | HEADERS | ADDRESS | PRE | BLOCKQUOTE | %literal">
<!ENTITY % oldstyle "%headelement | %bodyelement | #PCDATA">

<!ENTITY % URL "CDATA"
        -- The term URL means a CDATA attribute
           whose value is a Universal Resource Locator,
           as defined in ftp://info.cern.ch/pub/www/doc/url3.txt
        -->

<!ENTITY % linkattributes
        "NAME NMTOKEN #IMPLIED
        HREF %URL; #IMPLIED
        TYPE NAME #IMPLIED -- type of relationship to referent data:
                PARENT, CHILD, SIBLING, NEXT, TOP,
                DEFINITION, UPDATE, ORIGINAL etc. --
        URN CDATA #IMPLIED -- universal resource number.
                unique doc id --
        TITLE CDATA #IMPLIED -- advisory only --
        METHODS NAMES #IMPLIED -- supported methods of the object:
                TEXTSEARCH, GET, HEAD, ... --
        ">

<!-- Document Element -->

<!ELEMENT HTML O O ((HEAD | BODY | %oldstyle)*, PLAINTEXT?)>
<!ELEMENT HEAD - - (TITLE? & ISINDEX? & NEXTID? & LINK*)>

<!ELEMENT TITLE - - RCDATA
        -- The TITLE element is not considered part of the flow of text.
           It should be displayed, for example as the page header or
           window title.
        -->

<!ELEMENT ISINDEX - O EMPTY
        -- WWW clients should offer the option to perform a search on
           documents containing ISINDEX.
        -->

<!ELEMENT NEXTID - O EMPTY>
<!ATTLIST NEXTID N NUMBER #REQUIRED
        -- The number should be the highest number that appears in
           any NAME attribute in the document.
        -->

<!ELEMENT LINK - O EMPTY>
<!ATTLIST LINK %linkattributes>

<!ENTITY % inline "EM | TT | STRONG | B | I | U | CODE | SAMP | KBD | KEY | VAR | DFN | CITE " >
<!ELEMENT (%inline;) - - (#PCDATA)>

<!ENTITY % hypertext "#PCDATA | %inline; | A">

<!ELEMENT BODY - - (%bodyelement|%hypertext;)*>

<!ELEMENT A - - (#PCDATA)>
<!ATTLIST A %linkattributes; >

<!ELEMENT P - O EMPTY -- separates paragraphs -->
<!ELEMENT (%heading) - - (%hypertext;)+>

<!ELEMENT DL - - (DT | DD | P | %hypertext;)*>
<!-- Content should match ((DT,(%hypertext;)+)+,(DD,(%hypertext;)+))
     But mixed content is messy.
-->
<!ATTLIST DL STYLE NAME #IMPLIED -- COMPACT, etc.-- >

<!ELEMENT DT - O EMPTY>
<!ELEMENT DD - O EMPTY>

<!ELEMENT (UL|OL) - - (%hypertext;|LI|P)+>
<!ELEMENT (DIR|MENU) - - (%hypertext;|LI)+>
<!-- Content should match ((LI,(%hypertext;)+)+)
     But mixed content is messy.
-->
<!ELEMENT LI - O EMPTY>

<!ELEMENT BLOCKQUOTE - - (%hypertext;|P)+
        -- for quoting some other source -->
<!ATTLIST BLOCKQUOTE SOURCE CDATA #IMPLIED -- URL of source -- >

<!ELEMENT ADDRESS - - (%hypertext;|P)+>

<!ELEMENT PRE - - (#PCDATA | A | B | I | U | P)+>
<!ATTLIST PRE WIDTH NUMBER #implied >

<!-- Mnemonic character entities. -->
<!ENTITY AElig "&#198;" -- capital AE diphthong (ligature) -->
<!ENTITY Aacute "&#193;" -- capital A, acute accent -->
<!ENTITY Acirc "&#194;" -- capital A, circumflex accent -->
<!ENTITY Agrave "&#192;" -- capital A, grave accent -->
<!ENTITY Aring "&#197;" -- capital A, ring -->
<!ENTITY Atilde "&#195;" -- capital A, tilde -->
<!ENTITY Auml "&#196;" -- capital A, dieresis or umlaut mark -->
<!ENTITY Ccedil "&#199;" -- capital C, cedilla -->
<!ENTITY ETH "&#208;" -- capital Eth, Icelandic -->
<!ENTITY Eacute "&#201;" -- capital E, acute accent -->
<!ENTITY Ecirc "&#202;" -- capital E, circumflex accent -->
<!ENTITY Egrave "&#200;" -- capital E, grave accent -->
<!ENTITY Euml "&#203;" -- capital E, dieresis or umlaut mark -->
<!ENTITY Iacute "&#205;" -- capital I, acute accent -->
<!ENTITY Icirc "&#206;" -- capital I, circumflex accent -->
<!ENTITY Igrave "&#204;" -- capital I, grave accent -->
<!ENTITY Iuml "&#207;" -- capital I, dieresis or umlaut mark -->
<!ENTITY Ntilde "&#209;" -- capital N, tilde -->
<!ENTITY Oacute "&#211;" -- capital O, acute accent -->
<!ENTITY Ocirc "&#212;" -- capital O, circumflex accent -->
<!ENTITY Ograve "&#210;" -- capital O, grave accent -->
<!ENTITY Oslash "&#216;" -- capital O, slash -->
<!ENTITY Otilde "&#213;" -- capital O, tilde -->
<!ENTITY Ouml "&#214;" --
        capital O, dieresis or umlaut mark -->
<!ENTITY THORN "&#222;" -- capital THORN, Icelandic -->
<!ENTITY Uacute "&#218;" -- capital U, acute accent -->
<!ENTITY Ucirc "&#219;" -- capital U, circumflex accent -->
<!ENTITY Ugrave "&#217;" -- capital U, grave accent -->
<!ENTITY Uuml "&#220;" -- capital U, dieresis or umlaut mark -->
<!ENTITY Yacute "&#221;" -- capital Y, acute accent -->
<!ENTITY aacute "&#225;" -- small a, acute accent -->
<!ENTITY acirc "&#226;" -- small a, circumflex accent -->
<!ENTITY aelig "&#230;" -- small ae diphthong (ligature) -->
<!ENTITY agrave "&#224;" -- small a, grave accent -->
<!ENTITY amp "&#38;" -- ampersand -->
<!ENTITY aring "&#229;" -- small a, ring -->
<!ENTITY atilde "&#227;" -- small a, tilde -->
<!ENTITY auml "&#228;" -- small a, dieresis or umlaut mark -->
<!ENTITY ccedil "&#231;" -- small c, cedilla -->
<!ENTITY eacute "&#233;" -- small e, acute accent -->
<!ENTITY ecirc "&#234;" -- small e, circumflex accent -->
<!ENTITY egrave "&#232;" -- small e, grave accent -->
<!ENTITY eth "&#240;" -- small eth, Icelandic -->
<!ENTITY euml "&#235;" -- small e, dieresis or umlaut mark -->
<!ENTITY gt "&#62;" -- greater than -->
<!ENTITY iacute "&#237;" -- small i, acute accent -->
<!ENTITY icirc "&#238;" -- small i, circumflex accent -->
<!ENTITY igrave "&#236;" -- small i, grave accent -->
<!ENTITY iuml "&#239;" -- small i, dieresis or umlaut mark -->
<!ENTITY lt "&#60;" -- less than -->
<!ENTITY ntilde "&#241;" -- small n, tilde -->
<!ENTITY oacute "&#243;" -- small o, acute accent -->
<!ENTITY ocirc "&#244;" -- small o, circumflex accent -->
<!ENTITY ograve "&#242;" -- small o,
        grave accent -->
<!ENTITY oslash "&#248;" -- small o, slash -->
<!ENTITY otilde "&#245;" -- small o, tilde -->
<!ENTITY ouml "&#246;" -- small o, dieresis or umlaut mark -->
<!ENTITY szlig "&#223;" -- small sharp s, German (sz ligature) -->
<!ENTITY thorn "&#254;" -- small thorn, Icelandic -->
<!ENTITY uacute "&#250;" -- small u, acute accent -->
<!ENTITY ucirc "&#251;" -- small u, circumflex accent -->
<!ENTITY ugrave "&#249;" -- small u, grave accent -->
<!ENTITY uuml "&#252;" -- small u, dieresis or umlaut mark -->
<!ENTITY yacute "&#253;" -- small y, acute accent -->
<!ENTITY yuml "&#255;" -- small y, dieresis or umlaut mark -->

<!-- deprecated elements -->

<!ELEMENT (%literal) - - CDATA>
<!ELEMENT PLAINTEXT - O EMPTY>

<!-- Local Variables: -->
<!-- mode: sgml -->
<!-- compile-command: "sgmls -s -p " -->
<!-- end: -->
]>
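
As a reading aid for the character references and entity references described under REPLACEABLE CHARACTER DATA, and for the mnemonic entities declared in the DTD above, here is a minimal Python sketch of how such references might be expanded. The function name expand_references, the small ENTITIES table (a hand-copied subset of the DTD), and the choice to leave unknown references untouched are all assumptions of this sketch; it is not a conforming SGML parser.

```python
import re

# A few of the mnemonic entities from the DTD (name -> Latin 1 code).
# Entity names are case sensitive (NAMECASE ENTITY NO), so "Ouml" and
# "ouml" are distinct keys.
ENTITIES = {"amp": 38, "lt": 60, "gt": 62, "Ouml": 214, "ouml": 246, "eacute": 233}

# "&#" digits ";" is a character reference; "&" name ";" is an entity
# reference. A name is a letter followed by letters, digits, "." or "-".
REFERENCE = re.compile(r"&(?:#([0-9]+)|([A-Za-z][A-Za-z0-9.-]*));")


def expand_references(text, entities=ENTITIES):
    """Replace character and entity references with the characters they name."""
    def replace(match):
        number, name = match.groups()
        if number is not None:
            code = int(number)
            # The document character set stops at 255; leave anything else alone.
            return chr(code) if code <= 255 else match.group(0)
        if name in entities:
            return chr(entities[name])
        return match.group(0)  # unknown entity: left as written (sketch's choice)
    return REFERENCE.sub(replace, text)


print(expand_references("Kurt G&ouml;del"))
print(expand_references("You can even represent &#60;/end> tags in RCDATA"))
```

The substitution is a single pass, so the text produced by expanding one reference is not scanned again for further references.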