<chapter id="ch-gssgml">
<?dbhtml filename="ch01.html"?>
<chapterinfo>
<pubdate>$Date: 2005/03/25 21:54:04 $</pubdate>
<releaseinfo>$Revision: 1.1 $</releaseinfo>
</chapterinfo>
<title>Getting Started<?lb?>with &SGML;/&XML;</title>
<para>
<indexterm id="getstartSGML" class="startofrange"><primary>SGML</primary>
  <secondary>getting started</secondary></indexterm>
<indexterm id="XMLgetstart" class="startofrange"><primary>XML</primary>
  <secondary>getting started</secondary></indexterm>

This chapter is intended to provide a quick introduction to structured
markup (&SGML; and &XML;). If you're already familiar with &SGML; or
&XML;, you only need to skim this chapter.
</para>
<para>
To work with DocBook, you need to understand a few basic concepts of
structured editing in general and DocBook in particular. That's
covered here. You also need some concrete experience with the way a
DocBook document is structured. That's covered in the next chapter.
</para>
<sect1 id="ch01-compare">
<title>&HTML; and &SGML; vs. &XML;</title>
<para>
<indexterm><primary>HTML</primary>
  <secondary>XML vs.</secondary></indexterm>
<indexterm><primary>Hypertext Markup Language</primary><see>HTML</see></indexterm>
<indexterm><primary>SGML</primary>
  <secondary>HTML vs.</secondary></indexterm>

This chapter doesn't assume that you know what &HTML; is, but if you
do, you have a starting point for understanding structured
markup. &HTML; (Hypertext Markup Language) is a way of marking up text
and graphics so that the most popular web browsers can interpret
them. &HTML; consists of a set of markup tags with specific
meanings. Moreover, &HTML; is a very basic type of &SGML; markup that
is easy to learn and easy for computer applications to generate. But
the simplicity of &HTML; is both its virtue and its weakness. Because
of &HTML;'s limitations, web users and programmers have had to extend
and enhance it by a series of customizations and revisions that still
fall short of accommodating current, to say nothing of future, needs.
</para>
<para>


&SGML;, on the other hand, is an international standard that describes
how markup languages are defined. &SGML; does not consist of
particular tags or the rules for their usage. &HTML; is an example of
a markup language defined in &SGML;.
</para>
<para>
<indexterm><primary>XML</primary>
  <secondary>HTML and SGML vs.</secondary></indexterm>

&XML; promises an intelligent improvement over &HTML;, and
compatibility with it is already being built into the most popular web
browsers. &XML; is not a new markup language designed to compete with
&HTML;, and it's not designed to create conversion headaches for people
with tons of &HTML; documents. &XML; is intended to alleviate
compatibility problems with browser software; it's a new, easier
version of the standard rules that govern the markup itself, or, in
other words, a new version of &SGML;. The rules of &XML; are designed
to make it easier to write both applications that interpret its type
of markup and applications that generate its markup. &XML; was
developed by a team of &SGML; experts who understood and sought to
correct the problems of learning and implementing &SGML;. &XML; is
also <emphasis>extensible</emphasis> markup, which means that it is
customizable. A browser or word processor that is &XML;-capable will
be able to read any &XML;-based markup language that an individual
user defines.
</para>
<para>
In this book, we tend to describe things in terms of &SGML;, but where
there are differences between &SGML; and &XML; (and there are only a
few), we point them out. For our purposes, it doesn't really matter
whether you use &SGML; or &XML;.
</para>
<para>
During the coming months, we anticipate that &XML;-aware web browsers
and other tools will become available. Nevertheless, it's not
unreasonable to do your authoring in &SGML; and your online publishing
in &XML; or &HTML;. By the same token, it's not unreasonable to do
your authoring in &XML;.
</para>
</sect1>
<sect1 id="s1-basic-concepts">
<title>Basic &SGML;/&XML; Concepts</title>
<para>
<indexterm id="SGMLbasicconceptch01" class="startofrange"><primary>SGML</primary>
  <secondary>basic concepts</secondary></indexterm>
<indexterm id="XMLbasicconceptch01" class="startofrange"><primary>XML</primary>
  <secondary>basic concepts</secondary></indexterm>

<indexterm><primary>structured semantic markup language</primary><see>SGML</see></indexterm>

Here are the basic &SGML;/&XML; concepts you need to grasp:</para>
<itemizedlist>
<listitem><para>structured, semantic markup</para>
</listitem>
<listitem><para>elements</para>
</listitem>
<listitem><para>attributes</para>
</listitem>
<listitem><para>entities</para>
</listitem>
</itemizedlist>
<sect2>
<title>Structured and Semantic Markup</title>
<para>
<indexterm><primary>appearance</primary>
  <secondary>SGML and</secondary></indexterm>
<indexterm><primary>structured markup</primary></indexterm>
<indexterm><primary>semantic markup</primary></indexterm>

An essential characteristic of structured markup is that it explicitly
distinguishes (and accordingly &ldquo;marks up&rdquo; within a
document) the structure and semantic content of a document. It does
not mark up the way in which the document will appear to the reader,
in print or otherwise.
</para>
<para>
In the days before word processors it was common for a typed
manuscript to be submitted to a publisher. The manuscript identified
the logical structures of the document (chapters, section titles, and
so on) but said nothing about its appearance. Working independently
of the author, a designer then developed a specification for the
appearance of the document, and a typesetter marked up and applied the
designer's format to the document.
</para>
<para>
<indexterm><primary>presentation</primary><see>appearance</see></indexterm>
<indexterm><primary>HTML</primary>
  <secondary>appearance, limitations of specification</secondary></indexterm>

Because presentation or appearance is usually based on structure and
content, &SGML; markup logically precedes and generally determines the
way a document will look to a reader. If you are familiar with strict,
simple &HTML; markup, you know that a given document that is
structurally the same can also look different on different
computers. That's because the markup does not specify many aspects of
a document's appearance, although it does specify many aspects of a
document's structure.
</para>
<para>
<indexterm><primary>text</primary>
  <secondary>formatting</secondary></indexterm>
<indexterm><primary>word processors, SGML/XML vs.</primary></indexterm>
Many writers type their text into a word processor, line-by-line and
word-for-word, italicizing technical terms, underlining words for
emphasis, or setting section headers in a font complementary to the
body text, and finally, setting the headers off with a few carriage
returns fore and aft. The format such a writer imposes on the words on
the screen imparts structure to the document by changing its
appearance in ways that a reader can more or less reliably decode.
The reliability depends on how consistently and unambiguously the
changes in type and layout are made. By contrast, an &SGML;/&XML;
markup of a section header explicitly specifies that a specific piece
of text is a section header. This assertion does not specify the
presentation or appearance of the section header, but it makes the
fact that the text is a section header completely unambiguous.
</para>
<para>
<indexterm><primary>elements</primary>
  <secondary>SGML/XML, using</secondary></indexterm>
<indexterm><primary>titles</primary>
  <secondary>top-level sections</secondary></indexterm>
<indexterm><primary>top-level sections</primary></indexterm>
<indexterm><primary>characters</primary>
  <secondary>character sets</secondary>
    <tertiary>SGML documents</tertiary></indexterm>
<indexterm><primary>ASCII character set</primary></indexterm>
<indexterm><primary>XML</primary>
  <secondary>Unicode character set</secondary></indexterm>
<indexterm><primary>Unicode character set</primary>
  <secondary>XML documents, using</secondary></indexterm>

&SGML; and &XML; use named elements, delimited by angle brackets
(&ldquo;&lt;&rdquo; and &ldquo;>&rdquo;) to identify the markup in a
document. In DocBook, a top-level section is <sgmltag class="starttag">sect1</sgmltag>, so the title of a top-level section
named <emphasis>My First-Level Header</emphasis> would be identified
like this:
</para>

<screen>&lt;sect1>&lt;title>My First-Level Header&lt;/title> </screen>

<para>Note the following features of this markup:</para>
<variablelist>
<varlistentry>
<term>Clarity</term>
<listitem><para>A title begins with <sgmltag class="starttag">title</sgmltag>
and ends with <sgmltag class="endtag">title</sgmltag>. The <sgmltag>sect1</sgmltag> also has
an ending <sgmltag class="endtag">sect1</sgmltag>, but we haven't
shown the whole section so it's not visible.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Hierarchy</term>
<listitem><para>&ldquo;My First-Level
Header&rdquo; is the title of a top-level section because it occurs
inside a title in a <sgmltag>sect1</sgmltag>. A
<sgmltag>title</sgmltag> element occurring somewhere else, say in a
<sgmltag>Chapter</sgmltag> element, would be the title of the
chapter.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Plain text</term>
<listitem><para>&SGML; documents can have varying character sets, but
most are <acronym>ASCII</acronym>. &XML; documents use the Unicode
character set. This makes &SGML; and &XML; documents highly portable
across systems and tools.</para>
</listitem>
</varlistentry>
</variablelist>
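<para>
For reference, a complete version of the fragment above might look like
the following sketch. The paragraph text is invented for illustration;
the point is that the end tag for <sgmltag>sect1</sgmltag> closes the
section only after all of its content:
</para>

<screen><![CDATA[<sect1>
<title>My First-Level Header</title>
<para>The body of the section goes here.</para>
</sect1>]]></screen>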
<para>
<indexterm><primary>appearance</primary>
  <secondary>SGML and</secondary></indexterm>
<indexterm><primary>formatting</primary>
  <secondary>SGML documents</secondary></indexterm>
<indexterm><primary>filenames</primary>
  <secondary>tags, specifying</secondary></indexterm>
<indexterm><primary>semantic content, SGML marking for</primary></indexterm>

In an &SGML; document, there is no obligatory difference between the
size or face of the type in a first-level section header and the title
of a book in a footnote or the first sentence of a body paragraph. All
&SGML; files are simple text files without font changes or special
characters.<footnote><para>Some structured editors apply style to the
document while it's being edited, using fonts and color to make the
editing task easier, but this stylistic information is not stored in
the actual &SGML;/&XML; document. Instead, it is provided by the
editing application.</para></footnote> Similarly, an &SGML; document
does not specify the words in a text that are to be set in italic,
bold, or roman type. Instead, &SGML; marks certain kinds of texts for
their semantic content. For example, if a particular word is the name
of a file, then the tags around it should specify that it is a
filename:
</para>

<screen>Many mail programs read configuration information from the
user's <sgmltag class="starttag">filename</sgmltag>.mailrc<sgmltag class="endtag">filename</sgmltag> file.</screen>

<para>
<indexterm><primary>stylesheets</primary>
  <secondary>SGML documents, specifying appearance</secondary></indexterm>
<indexterm><primary>appearance</primary>
  <secondary>structure or content vs.</secondary></indexterm>
<indexterm><primary>CSS stylesheets</primary></indexterm>
<indexterm><primary>FOSI stylesheets</primary></indexterm>
<indexterm><primary>DSSSL</primary>
  <secondary>stylesheets</secondary></indexterm>
<indexterm><primary>XSL stylesheets</primary></indexterm>
<indexterm><primary>XML</primary>
  <secondary>XSL stylesheets</secondary></indexterm>

If the meaning of a phrase is particularly audacious, it might get
tagged for boldness of thought instead of appearance. An &SGML;
document contains all the information that a typesetter needs to lay
out and typeset a printed page in the most effective and consistent
way, but it does not specify the layout or the
type.<footnote><para>The distinction between appearance or
presentation and structure or content is essential to &SGML;, but
there is a way to specify the appearance of an &SGML; document: attach
a stylesheet to it. There are several standards for such stylesheets:
<acronym>CSS</acronym>, <acronym>XSL</acronym>, <acronym>FOSI</acronym>s,
and <acronym>DSSSL</acronym>.
See <xref linkend="ch-publish"/>.</para></footnote>
</para>
<para>
<indexterm><primary>DocBook DTD</primary>
  <secondary>document type definition</secondary></indexterm>
<indexterm><primary>declarations</primary>
  <secondary>SGML documents</secondary></indexterm>
<indexterm><primary>document type definitions</primary><see>DTDs</see></indexterm>
<indexterm><primary>tags</primary>
  <secondary>names</secondary>
    <tertiary>document type definition</tertiary></indexterm>
<indexterm><primary>combination rules (DTD)</primary></indexterm>
<indexterm><primary>DTDs</primary></indexterm>
<indexterm><primary>DTDs</primary>
  <secondary>DocBook</secondary><see>DocBook DTD</see></indexterm>


Not only is the structure of an &SGML;/&XML; document explicit, but it
is also carefully controlled. An &SGML; document makes reference to a
set of declarations&mdash;a document type definition
(&DTD;)&mdash;that contains an inventory of tag names and specifies
the combination rules for the various structural and semantic features
that make up a document. What the distinctive features are and how
they should be combined is &ldquo;arbitrary&rdquo; in the sense that
almost any selection of features and rules of composition is
theoretically possible. The DocBook &DTD; chooses a particular set of
features and rules for its users.
</para>
<para>
<indexterm><primary>sections</primary>
  <secondary>ordering, DocBook DTD rules (example)</secondary></indexterm>
Here is a specific example of how the DocBook &DTD; works. DocBook
specifies that a third-level section can follow a second-level section
but cannot follow a first-level section without an intervening
second-level section.
</para>
<informaltable>
<tgroup cols="2">
<colspec colname="COLSPEC0" colwidth="2.50in"/>
<colspec colname="COLSPEC1" colwidth="2.50in"/>
<tbody>
<row>
<entry colname="COLSPEC0" valign="top"><para>This is valid:</para><screen>&lt;sect1>&lt;title>...&lt;/title>
  &lt;sect2>&lt;title>...&lt;/title>
    &lt;sect3>&lt;title>...&lt;/title>
      ...
    &lt;/sect3>
  &lt;/sect2>
&lt;/sect1>
</screen></entry>
<entry colname="COLSPEC1" valign="top"><para>This is not:</para><screen>&lt;sect1>&lt;title>...&lt;/title>
  &lt;sect3>&lt;title>...&lt;/title>
    ...
  &lt;/sect3>
&lt;/sect1>
</screen></entry>
</row>
</tbody>
</tgroup>
</informaltable>
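<para>
A rule like this is expressed in the &DTD; as a set of element
declarations, each with a content model that lists what the element may
contain. The declarations below are a deliberately simplified sketch
written in &XML; &DTD; syntax; they are not the actual DocBook
declarations, which allow many more elements. Because
<sgmltag>sect3</sgmltag> appears only in the content model of
<sgmltag>sect2</sgmltag>, a <sgmltag>sect3</sgmltag> placed directly
inside a <sgmltag>sect1</sgmltag> does not match any allowed structure:
</para>

<screen><![CDATA[<!-- Simplified for illustration; not the real DocBook content models -->
<!ELEMENT sect1 (title, para*, sect2*)>
<!ELEMENT sect2 (title, para*, sect3*)>
<!ELEMENT sect3 (title, para*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT para  (#PCDATA)>]]></screen>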
<para>
<indexterm><primary>parsers</primary>
  <secondary>validating</secondary></indexterm>
<indexterm><primary>validation</primary>
  <secondary>SGML documents</secondary></indexterm>
<indexterm><primary>DTDs</primary>
  <secondary>validating SGML documents against</secondary></indexterm>
<indexterm><primary>instance (DocBook document)</primary></indexterm>

Because an &SGML;/&XML; document has an associated &DTD; that
describes the valid, logical structures of the document, you can test
the logical structure of any particular document against the
&DTD;. This process is performed by a <firstterm>parser</firstterm>. An
&SGML; processor must begin by parsing the document and determining if
it is valid, that is, if it conforms to the rules specified in the
&DTD;. <!--<phrase role="xml">-->&XML; processors are not required to
check for validity, but it's always a good idea to check for validity
when authoring.<!--</phrase>--> Because you can test and validate the
structure of an &SGML;/&XML; document with software, a DocBook
document containing a first-level section followed immediately by a
third-level section will be identified as invalid, meaning that it's
not a valid <firstterm>instance</firstterm> or example of a document
defined by the DocBook &DTD;. Presumably, a document with a logical
structure won't normally jump from a first- to a third-level section,
so the rule is a safeguard&mdash;but not a guarantee&mdash;of good
writing, or at the very least, reasonable structure. A parser also
verifies that the names of the tags are correct and that elements
requiring an end tag have one. This means that a valid document is
also one that should format correctly, without runs of paragraphs
incorrectly appearing in bold type or similar monstrosities that
everyone has seen in print at one time or another. For more
information about &SGML;/&XML; parsers, see <xref linkend="ch-parse"/>.
</para>
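<para>
A document tells the parser which &DTD; to validate against in its
document type declaration. The following sketch assumes the &XML;
version of DocBook V4.1.2; the identifiers differ for other versions
and for &SGML;, and the titles and text are invented for illustration:
</para>

<screen><![CDATA[<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
    "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<book>
<title>An Example Book</title>
<chapter>
<title>An Example Chapter</title>
<para>Some text.</para>
</chapter>
</book>]]></screen>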
<para>
In general, adherence to the explicit rules of structure and markup in
a &DTD; is a useful and reassuring guarantee of consistency and
reliability within documents, across document sets, and over
time. This makes &SGML;/&XML; markup particularly desirable to
corporations or governments that have large sets of documents to
manage, but it is a boon to the individual writer as well.
</para>
<sect3>
<title>How can this markup help you?</title>
<para>
<indexterm><primary>semantic markup</primary>
  <secondary>presentation media, different</secondary></indexterm>
Semantic markup makes your documents more amenable to interpretation
by software, especially publishing software. You can publish a white
paper, authored as a DocBook <sgmltag>Article</sgmltag>, in the
following formats:
<indexterm><primary>articles</primary>
  <secondary>formats, listed</secondary></indexterm>
<indexterm><primary>journal articles</primary></indexterm>

</para>
<itemizedlist>
<listitem><para>On the Web in &HTML;</para>
</listitem>
<listitem><para>As a standalone document on 8&frac12;&times;11 paper</para>
</listitem>
<listitem><para>As part of a quarterly journal, in a 6&times;9 format
</para>
</listitem>
<listitem><para>In Braille</para>
</listitem>
<listitem><para>In audio</para>
</listitem>
</itemizedlist>
<para>
You can produce each of these publications from exactly the same
source document using the presentational techniques best suited to
both the content of the document and the presentation medium. This
versatility also frees the author to concentrate on the document
content. For example, as we write this book, we don't know exactly how
O'Reilly will choose to present chapter headings, bulleted lists,
&SGML; terms, or any of the other semantic features. And we don't
care. It's irrelevant; whatever presentation is chosen, the &SGML;
sources will be transformed automatically into that style.
</para>
<para>
Semantic markup can relieve the author of other, more significant
burdens as well (after all, careful use of paragraph and character
styles in a word processor document theoretically allows us to change
the presentation independently from the document). Using semantic
markup opens up your documents to a world of possibilities. Documents
become, in a loose sense, databases of information. Programs can
compile, retrieve, and otherwise manipulate the documents in
predictable, useful ways.
</para>
<para>
<indexterm><primary>links</primary>
  <secondary>SGML documents, maintaining</secondary></indexterm>
<indexterm><primary>elements</primary>
  <secondary>linking to references</secondary></indexterm>

Consider the online version of this book: almost every element name
(<sgmltag>Article</sgmltag>, <sgmltag>Book</sgmltag>, and so on) is a
hyperlink to the reference page that describes that
element. Maintaining these links by hand would be tedious and might be
unreliable, as well. Instead, every element name is marked as an
element using <sgmltag>SGMLTag</sgmltag>: a <sgmltag>Book</sgmltag> is
a <literal><sgmltag class="starttag">sgmltag</sgmltag>Book<sgmltag class="endtag">sgmltag</sgmltag></literal>.
</para>
<para>
Because each element name in this book is tagged semantically, the
program that produces the online version can determine which
occurrences of the word &ldquo;book&rdquo; in the text are actually
references to the <sgmltag>Book</sgmltag> element. The program can
then automatically generate the appropriate hyperlink when it should.
</para>
<para>
There's one last point to make about the versatility of &SGML;
documents: how much you have depends on the &DTD;. If you take a good
photo with a high-resolution lens, you can print it and copy it and
scan it and put it on the Web, and it will look good. If you start
with a low-resolution picture, it will not survive those
transformations so well. DocBook &SGML;/&XML; has this advantage over,
say, &HTML;: because DocBook has specific and unambiguous semantic and
structural markup, you can convert its documents with ease
into other presentational forms, and search them more precisely. If
you start with &HTML;, whose markup is at a lower resolution than
DocBook's, your versatility and searchability are substantially
restricted and cannot be improved.
</para>
</sect3>
<sect3>
<title>What are the shortcomings of structured authoring?</title>
<para>
There are a few significant shortcomings to structured authoring:
</para>
<itemizedlist>
<listitem><para>It requires a significant change in the authoring
process. Writing structured documents is very different from writing
with a typical word processor, and change is difficult. In particular,
authors don't like giving up control over the appearance of their
words, especially now that they have acquired it with the advent of
word processors. But many publishing companies need authors to
relinquish that control, because book design and production remain
their job, not their authors'.</para>
</listitem>
<listitem><para>Because semantics are separate from appearance, in
order to publish an &SGML;/&XML; document, a stylesheet or other tool
must create the presentational form from the structural form. Writing
stylesheets is a skill in its own right, and though not every author
among a group of authors has to learn how to write them, someone has
to.</para>
</listitem>
<listitem><para>Authoring tools for &SGML; documents tend to be
expensive. While it's not entirely unreasonable to edit
&SGML;/&XML; documents with a simple text editor, it's a bit tedious
to do so. However, there are a few free tools that are
&SGML;-aware. The widespread interest in &XML; may well produce new,
clever, and less expensive &XML; editing tools.</para>
</listitem>
</itemizedlist>
</sect3>
</sect2>
</sect1>
<sect1 id="ch01-elemattr">
<title>Elements and Attributes</title>
<para>
<indexterm><primary>elements</primary>
  <secondary>attributes</secondary></indexterm>
<indexterm><primary>attributes</primary>
  <secondary>elements and</secondary></indexterm>
<indexterm><primary>elements</primary>
  <secondary>attributes</secondary><seealso>attributes</seealso></indexterm>
<indexterm><primary>empty elements</primary></indexterm>
<indexterm><primary>end tags</primary>
  <secondary>empty elements, not requiring</secondary></indexterm>
<indexterm><primary>cross references</primary></indexterm>
<indexterm><primary>entities</primary>
  <secondary>SGML/XML markup</secondary></indexterm>

&SGML;/&XML; markup consists primarily of
<firstterm>elements</firstterm>, <firstterm>attributes</firstterm>,
and <firstterm>entities</firstterm>. Elements are the terms we have
been speaking about most, like <sgmltag>sect1</sgmltag>, that describe
a document's content and structure. Most elements are represented by pairs
of tags that
mark the start and end of the construct they surround&mdash;for
example, the &SGML; source for this particular paragraph begins with a
<sgmltag class="starttag">para</sgmltag> tag and ends with a <sgmltag class="endtag">para</sgmltag> tag. Some elements are
&ldquo;empty&rdquo; (such as DocBook's cross-reference element,
<sgmltag class="starttag">xref</sgmltag>) and require no end
tag.<footnote><para>In &XML;, this is written as
<literal>&lt;xref/></literal>, as we'll see in the section <xref linkend="ch02-typexml"/>.</para></footnote>
</para>
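<para>
As a small illustration (the sentence and the
<literal>idvalue</literal> target are invented), here is a paragraph
element containing an empty cross-reference element, written first in
&SGML; syntax and then in &XML; syntax:
</para>

<screen><![CDATA[<!-- SGML: the empty element has a start tag only -->
<para>For details, see <xref linkend="idvalue">.</para>

<!-- XML: the empty element is closed with a trailing slash -->
<para>For details, see <xref linkend="idvalue"/>.</para>]]></screen>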
<para>
<indexterm><primary>ID attribute</primary>
  <secondary>SGML start tags</secondary></indexterm>
<indexterm><primary>tags</primary>
  <secondary>identifiers (SGML)</secondary></indexterm>
<indexterm><primary>end tags</primary>
  <secondary>attributes and</secondary></indexterm>
<indexterm><primary>start tags</primary>
  <secondary>attribute ID, containing</secondary></indexterm>

Elements can, but don't necessarily, include one or more attributes,
which are additional terms that extend the function or refine the
content of a given element. For instance, in DocBook a <sgmltag class="starttag">sect1</sgmltag> start tag can contain an
identifier&mdash;an <sgmltag class="attribute">id</sgmltag>
attribute&mdash;that will ultimately allow the writer to
cross-reference it or enable a reader to retrieve it. End tags cannot
contain attributes. A <sgmltag class="starttag">sect1</sgmltag>
element with an <sgmltag class="attribute">id</sgmltag> attribute
looks like this:
</para>

<screen>&lt;sect1 id="<replaceable>idvalue</replaceable>"&gt;</screen>

<para>
<indexterm><primary>namespaces</primary>
  <secondary>XML tags</secondary></indexterm>
<indexterm><primary>tags</primary>
  <secondary>namespaces (XML)</secondary></indexterm>
<indexterm><primary>validation</primary>
  <secondary>namespace tags (XML), problems</secondary></indexterm>
<indexterm><primary>XML</primary>
  <secondary>namespaces, using</secondary></indexterm>

In &SGML;, the catalog of attributes that can occur on an element is
predefined. You cannot add arbitrary attribute names to an
element. Similarly, the values allowed for each attribute are
predefined. In &XML;, the use of <ulink url="http://www.w3.org/TR/REC-xml-names/">namespaces</ulink> may allow you
to add additional attributes to an element, but as of this writing,
there's no way to perform validation on those attributes.
</para>
<para>
<indexterm><primary>SystemItem element</primary>
  <secondary>subdividing into URL and email addresses</secondary></indexterm>
<indexterm><primary>Role attribute</primary>
  <secondary>systemitem tags, subdividing</secondary></indexterm>

The <sgmltag class="attribute">id</sgmltag> attribute is one half of a
cross reference. An <sgmltag class="attribute">idref</sgmltag>
attribute on another element, for example <sgmltag class="starttag">xref linkend="idvalue"</sgmltag>,
provides the other half. These attributes provide whatever
application might process the &SGML; source with the data needed
either to make a hypertext link or to substitute a named and/or numbered cross
reference in place of the <sgmltag class="starttag">xref</sgmltag>.
Another use for attributes is to specify subclasses of
certain elements. For instance, you can subdivide DocBook's <sgmltag class="starttag">systemitem</sgmltag> into <acronym>URL</acronym>s and
email addresses by making the content of the <sgmltag class="attribute">role</sgmltag> attribute the distinction between
them, as in <sgmltag class="starttag">systemitem role="URL"</sgmltag>
versus <sgmltag class="starttag">systemitem
role="emailaddr"</sgmltag>.
</para>
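<para>
Putting these pieces together, a sketch of both halves of a cross
reference, along with <sgmltag class="attribute">role</sgmltag> used to
distinguish kinds of <sgmltag>systemitem</sgmltag>, might look like
this. The identifiers, titles, and addresses are invented for
illustration, and the empty element is shown in &XML; syntax:
</para>

<screen><![CDATA[<sect1 id="install">
<title>Installation</title>
<para>Download the software from
<systemitem role="URL">http://www.example.com/</systemitem>
or request it from
<systemitem role="emailaddr">support@example.com</systemitem>.</para>
</sect1>

<sect1 id="configure">
<title>Configuration</title>
<para>Before you begin, complete the steps in <xref linkend="install"/>.</para>
</sect1>]]></screen>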
</sect1>
<sect1 id="s-entities"><title>Entities</title>
<para>
<indexterm><primary>entities</primary>
  <secondary>functions</secondary></indexterm>
<indexterm><primary>parsed entities</primary></indexterm>
<indexterm><primary>unparsed entities</primary></indexterm>
<indexterm><primary>names</primary>
  <secondary>assigning to data (entities)</secondary></indexterm>

Entities are a fundamental concept in &SGML; and &XML;, and can be
somewhat daunting at first. They serve a number of related, but
slightly different functions, and this makes them a little bit
complicated.
</para>
<para>
In the most general terms, entities allow you to assign a name to some
chunk of data, and use that name to refer to that data. The complexity
arises because there are two different contexts in which you can use
entities (in the &DTD; and in your documents), two types of entities
(parsed and unparsed), and two or three different ways in which the
entities can point to the chunk of data that they name.
</para>
<para>
In the rest of this section, we'll describe each of the commonly
encountered entity types. If you find the material in this section
confusing, feel free to skip over it now and come back to it later
when you're looking for more detail. We'll refer to the different
types of entities as the need arises in our discussion of DocBook.
</para>
<para>
Entities can be divided into two broad categories, <firstterm>general
entities</firstterm> and <firstterm>parameter entities</firstterm>.
Parameter entities are most often used in the &DTD;, not in documents,
so we'll describe them last. Before you can use any type of entity, it
must be formally declared. This is typically done in the document
prologue, as we'll explain in <xref linkend="ch-create"/>, but we will
show you how to declare each of the entities discussed here.
</para>
<sect2><title>General Entities</title>
<para>
<indexterm><primary>general entities</primary>
  <secondary>external and internal</secondary></indexterm>
<indexterm><primary>entities</primary>
  <secondary>general</secondary></indexterm>
In use, general entities are introduced with an ampersand (&amp;) and end with
a semicolon (;). Within the category of general entities, there are
two types: <firstterm>internal general entities</firstterm> and
<firstterm>external general entities</firstterm>.
</para>
<sect3><title>Internal general entities</title>
<para>
<indexterm><primary>internal general entities</primary></indexterm>
<indexterm><primary>names</primary>
  <secondary>text, associating with (internal general entities)</secondary></indexterm>
<indexterm><primary>text</primary>
  <secondary>entity, declaring as</secondary></indexterm>

With internal entities, you can associate an essentially arbitrary
piece of text (which may have other markup, including references to
other entities) with a name. You can then include that text by
referring to its name. For example, if your document frequently refers
to, say, &ldquo;O'Reilly &amp; Associates,&rdquo; you might declare it
as an entity:
</para>

<screen><![CDATA[<!ENTITY ora "O'Reilly &amp; Associates">]]></screen>

<para>
Then, instead of typing it out each time, you can insert it as needed
in your document with the entity reference <sgmltag class="genentity">ora</sgmltag>, simply to save time. Note that this
entity declaration includes another entity reference within it.
That's perfectly valid as long as the reference isn't directly or
indirectly recursive.
</para>
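<para>
As a sketch of where the declaration and the reference each go, the
following example assumes an &SGML; document that uses DocBook V4.1
(the public identifier differs for other versions; the title and
sentence are invented). The entity is declared in the internal subset
of the document type declaration and then referenced in the text:
</para>

<screen><![CDATA[<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V4.1//EN" [
<!ENTITY ora "O'Reilly &amp; Associates">
]>
<article>
<title>Company Profile</title>
<para>&ora; publishes books for programmers.</para>
</article>]]></screen>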
<para>
<indexterm><primary>entities</primary>
  <secondary>adding directly to DTD</secondary></indexterm>

If you find that you use a number of entities across many documents,
you can add them directly to the &DTD; and avoid having to include the
declarations in each document. See the discussion of
<filename>dbgenent.mod</filename> in <xref linkend="app-customizing"/>.
</para>
</sect3>
<sect3 id="s-egenent"><title>External general entities</title>
<para>
<indexterm><primary>external general entities</primary></indexterm>
<indexterm><primary>SGML</primary>
  <secondary>external documents, referencing (external general entities)</secondary></indexterm>
<indexterm><primary>parsers</primary>
  <secondary>external file text, inserting</secondary></indexterm>
<indexterm><primary>files</primary>
  <secondary>external, referencing</secondary></indexterm>

With external entities, you can reference other documents from within
your document. If these entities contain document text (&SGML; or
&XML;), then references to them cause the parser to insert the text of
the external file directly into your document (these are called parsed
entities). In this way, you can use entities to divide your single,
logical document into physically distinct chunks. For example, you
might break your document into four chapters and store them in
separate files. At the top of your document, you would include entity
declarations to reference the four files:
</para>

<screen><![CDATA[<!ENTITY ch01 SYSTEM "ch01.sgm">
<!ENTITY ch02 SYSTEM "ch02.sgm">
<!ENTITY ch03 SYSTEM "ch03.sgm">
<!ENTITY ch04 SYSTEM "ch04.sgm">]]></screen>

<para>
Your <sgmltag>Book</sgmltag> now consists simply of references to the
entities:
</para>

<screen>&lt;book&gt;
&amp;ch01;
&amp;ch02;
&amp;ch03;
&amp;ch04;
&lt;/book&gt;</screen>
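
<para>
Putting the two pieces together, the top-level file might look roughly
like the following sketch. The &SGML; public identifier shown is an
assumption chosen for illustration:
</para>

<screen><![CDATA[<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V4.2//EN" [
<!ENTITY ch01 SYSTEM "ch01.sgm">
<!ENTITY ch02 SYSTEM "ch02.sgm">
<!ENTITY ch03 SYSTEM "ch03.sgm">
<!ENTITY ch04 SYSTEM "ch04.sgm">
]>
<book>
&ch01;
&ch02;
&ch03;
&ch04;
</book>]]></screen>

<para>
Each chapter file then holds only the markup for that chapter (for
example, a single <sgmltag>chapter</sgmltag> element) and has no
document type declaration of its own, because its text is inserted
into a document that already has one.
</para>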

<para>
<indexterm><primary>unparsed entities</primary></indexterm>
<indexterm><primary>notations (unparsed entities)</primary></indexterm>

Sometimes it's useful to reference external files that don't contain
document text. For example, you might want to reference an external
graphic. You can do this with entities by declaring the type of data
that's in the entity using a notation (these are called unparsed
entities). For example, the following declaration declares the entity
<literal>tree</literal> as an Encapsulated PostScript image:
</para>

<screen><![CDATA[<!ENTITY tree SYSTEM "tree.eps" NDATA EPS>]]></screen>

<para>
<indexterm><primary>elements</primary>
  <secondary>entity attributes</secondary></indexterm>

Entities declared this way cannot be inserted directly into your
document. Instead, they must be used as entity attributes to elements:
</para>

<screen><![CDATA[<graphic entityref="tree"></graphic>]]></screen>

<para>
Conversely, you cannot use entities declared without a notation as the
value of an entity attribute.
</para>
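<para>
A minimal sketch of the whole arrangement follows. It assumes that the
<literal>EPS</literal> notation is already declared by the &DTD; and
uses <sgmltag>figure</sgmltag> as the document element purely for
illustration:
</para>

<screen><![CDATA[<!DOCTYPE figure PUBLIC "-//OASIS//DTD DocBook V4.2//EN" [
<!-- Unparsed entity: declared with a notation, so it is not parsed -->
<!ENTITY tree SYSTEM "tree.eps" NDATA EPS>
]>
<figure><title>A tree</title>
<!-- The entity name is given as the value of an entity attribute -->
<graphic entityref="tree"></graphic>
</figure>]]></screen>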
</sect3>
<sect3 id="s-specchar"><title>Special characters</title>
<para>
<indexterm><primary>markup</primary>
  <secondary>distinguishing from content</secondary></indexterm>
<indexterm><primary>start tags</primary>
  <secondary>beginning</secondary></indexterm>
<indexterm><primary>end tags</primary>
  <secondary>beginning</secondary></indexterm>
In order for the parser to recognize markup in your document, it must
be able to distinguish markup from content. It does this with two
special characters: &ldquo;&lt;,&rdquo; which identifies the beginning
of a start or end tag, and &ldquo;&amp;,&rdquo; which identifies the
beginning of an entity reference.<footnote>
<para>
<indexterm><primary>start characters, changing</primary></indexterm>
In &XML;, these characters are fixed. In &SGML;, it is possible to
change the markup start characters, but we won't consider that case
here. If you change the markup start characters, we assume you know
what you're doing. While we're on the subject, in &SGML;, these
characters only have their special meaning if they are followed by a
name character.
It is, in fact, valid in an <emphasis>&SGML;</emphasis> (but not an &XML;)
document to write &ldquo;O'Reilly &amp; Associates&rdquo; because the
ampersand is not followed by a name character. Don't do this, however.
<indexterm><primary>characters</primary>
  <secondary>entities</secondary>
    <tertiary>encoding as</tertiary></indexterm>
<indexterm><primary>entities</primary>
  <secondary>characters</secondary></indexterm>
<indexterm><primary>angle brackets</primary>
  <secondary>coding as entities</secondary></indexterm>
</para>
</footnote>
If you want these characters to have their literal value, they must be
encoded as entity references in your document. The entity reference
<sgmltag class="genentity">lt</sgmltag> produces a left angle bracket;
<sgmltag class="genentity">amp</sgmltag> produces the
ampersand.<footnote>
<para>
<indexterm><primary>marked sections</primary>
  <secondary>character sequence, ending</secondary></indexterm>

The sequence of characters that ends a marked section (see <xref linkend="s-ms"/>), <literal>]]&gt;</literal>, must also be encoded with at least
one entity reference if it is not being used to end a marked section.
For this purpose, you can use the entity reference <sgmltag class="genentity">gt</sgmltag> for the final right angle bracket.
</para>
</footnote>
</para>
<para>
<indexterm><primary>parsers</primary>
  <secondary>entity references, interpreting</secondary></indexterm>

If you do not encode each of these characters as its respective entity
reference, then an &SGML; parser or application is likely to
interpret them as characters introducing elements or entities (an
&XML; parser will always interpret them this way); consequently, they
won't appear as you intended. If you wish to cite text that contains
literal ampersands and less-than signs, you need to transform these
two characters into entity references before they are included in a
DocBook document. The only alternative is to incorporate text
that includes them in your document through some process that avoids
the parser.
</para>
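<para>
For instance, to cite a line of C source that contains both
characters, you would key it as shown in this sketch (the C expression
itself is only an illustration):
</para>

<screen><![CDATA[<para>
The test <literal>if (a &lt; b &amp;&amp; c)</literal> uses both
special characters.
</para>]]></screen>

<para>
After parsing, the literal text reads <literal>if (a &lt; b &amp;&amp; c)</literal>.
</para>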
<para>
<indexterm><primary>data entities</primary></indexterm>
<indexterm><primary>numeric character references</primary></indexterm>

In &SGML;, character entities are frequently declared using a third
entity category (one that we deliberately chose to overlook), called
<firstterm>data entities</firstterm>. In &XML;, these are declared using
numeric character references. Numeric character references resemble
entity references, but technically aren't the same. They have the form
<literal>&amp;#<replaceable>999</replaceable>;</literal>, in which
&ldquo;999&rdquo; is the numeric character number.
</para>
<para>
<indexterm><primary>Unicode character set</primary>
  <secondary>character numbers (XML)</secondary></indexterm>
<indexterm><primary>hexadecimal numeric character references (XML)</primary></indexterm>

In &XML;, the numeric character number is always the Unicode character
number. In addition, &XML; allows hexadecimal numeric character
references of the form
<literal>&amp;#x<replaceable>hhhh</replaceable>;</literal>. In &SGML;, the
numeric character number is a number from the document character set
that's declared in the &SGML; declaration.
</para>
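<para>
As a brief sketch, the two forms can be used interchangeably in an
&XML; document. Both references below name Unicode character 220
(hexadecimal DC), an uppercase U with an umlaut; the sample word is
only illustrative:
</para>

<screen><![CDATA[<para>
&#220;bung (decimal) and &#xDC;bung (hexadecimal) start with the
same character.
</para>]]></screen>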
<para>
<indexterm><primary>special characters, encoding as entities</primary></indexterm>

Character entities are also used to give a name to special characters
that can't otherwise be typed or are not portable across applications
and operating systems. You can then include these characters in your
document by referring to their entity name. Instead of using the often
obscure and inconsistent key combinations of your particular word
processor to type, say, an uppercase letter U with an umlaut (&Uuml;),
you type in an entity for it instead. For instance, the entity for an
uppercase letter U with an umlaut has been defined as the entity
<literal>Uuml</literal>, so you would type in <sgmltag class="genentity">Uuml</sgmltag> to reference it instead of the actual
character. The &SGML; application that eventually processes your
document for presentation will match the entity to your platform's
handling of special characters in order to render it
appropriately.
</para>
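<para>
The declarations behind such a name might look along the lines of the
following sketch, modeled on the ISO Latin 1 entity sets. Treat the
exact replacement text as an assumption; it varies between entity sets
and systems:
</para>

<screen><![CDATA[<!-- SGML: a data entity with system-specific replacement text -->
<!ENTITY Uuml SDATA "[Uuml  ]">

<!-- XML: the same name defined with a numeric character reference -->
<!ENTITY Uuml "&#xDC;">

<!-- In either case, the document refers to the character by name -->
<para>&Uuml;bung macht den Meister.</para>]]></screen>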
</sect3>
</sect2><!--general entities-->
<sect2><title>Parameter Entities</title>
<para>
<indexterm><primary>entities</primary>
  <secondary>parameter entities</secondary><see>parameter entities</see></indexterm>
<indexterm><primary>parameter entities</primary></indexterm>

Parameter entities are only recognized in markup declarations (in the
&DTD;, for example). Instead of beginning with an ampersand, they
begin with a percent sign.  Parameter entities are most frequently
used to customize the &DTD;. For a detailed discussion of this topic,
see <xref linkend="app-customizing"/>. Following are some other uses for
them.
</para>
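<para>
Before turning to those uses, here is a minimal sketch of the syntax.
The entity name <literal>chapters</literal> and the file name
<filename>chapters.ent</filename> are hypothetical; the file is assumed
to contain nothing but entity declarations:
</para>

<screen><![CDATA[<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V4.2//EN" [
<!-- Declaration: a percent sign and a space precede the name -->
<!ENTITY % chapters SYSTEM "chapters.ent">
<!-- Reference: percent sign, name, semicolon; the declarations in
     chapters.ent are read at this point -->
%chapters;
]>]]></screen>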
<sect3 id="s-ms"><title>Marked sections</title>
<para>
<indexterm><primary>marked sections</primary></indexterm>
<indexterm><primary>SGML</primary>
  <secondary>marked sections</secondary></indexterm>
<indexterm><primary>XML</primary>
  <secondary>marked sections</secondary></indexterm>

You might use a parameter entity reference in an &SGML; document in a
marked section. A marked section is a mechanism for indicating that
special processing should apply to a particular block of text.  Marked
sections are introduced by the special sequence
<literal>&lt;![<replaceable>keyword</replaceable>[</literal> and end
with <literal>]]&gt;</literal>.  In &SGML;, marked sections can appear
in both &DTD;s and document instances.  In &XML;, they're only allowed
in the &DTD;.<footnote>
<para>
<indexterm><primary>CDATA</primary>
  <secondary>marked sections</secondary></indexterm>
Actually, CDATA marked sections are allowed in an &XML; document, but
the keyword cannot be a parameter entity, and it must be typed
literally. See the examples on this page.
</para>
</footnote>
</para>
<para>
<indexterm><primary>keywords</primary>
  <secondary>marked sections</secondary></indexterm>
<indexterm><primary>INCLUDE keyword (marked section)</primary></indexterm>
<indexterm><primary>IGNORE keyword (marked section)</primary></indexterm>

The most common keywords are <literal>INCLUDE</literal>, which
indicates that the text in the marked section should be included in
the document; <literal>IGNORE</literal>, which indicates that the text
in the marked section should be ignored (it completely disappears from
the parsed document); and <literal>CDATA</literal>, which indicates
that all markup characters within that section should be ignored
except for the closing characters <literal>]]&gt;</literal>.
</para>
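<para>
For example, a <literal>CDATA</literal> marked section lets you include
text that is full of markup characters without encoding each one. In
this sketch, the C fragment is purely illustrative:
</para>

<screen>&lt;programlisting&gt;&lt;![CDATA[
if (a &lt; b &amp;&amp; c) {
    return a &amp; 0xFF;
}
]]&gt;&lt;/programlisting&gt;</screen>

<para>
Everything between the opening sequence and the closing
<literal>]]&gt;</literal> is treated as ordinary character data.
</para>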
<para>
<indexterm><primary>SGML</primary>
  <secondary>keywords as parameter entities</secondary></indexterm>
In &SGML;, these keywords can be parameter entities. For example, you
might declare the following parameter entity in your document:
</para>

<screen><![CDATA[<!ENTITY % draft "INCLUDE">]]></screen>

<para>
Then you could put the sections of the document that are only applicable
in a draft within marked sections:
</para>

<screen>&lt;![%draft;[
&lt;para>
This paragraph only appears in the draft version.
&lt;/para>
]]&gt;</screen>

<para>
When you're ready to print the final version, simply change the 
<literal>draft</literal> parameter entity declaration:
</para>

<screen><![CDATA[<!ENTITY % draft "IGNORE">]]></screen>

<para>
and publish the document. None of the draft sections will appear.
<indexterm startref="SGMLbasicconceptch01" class="endofrange"/>
<indexterm startref="XMLbasicconceptch01" class="endofrange"/>
</para>
</sect3>
</sect2>
</sect1>
<sect1 id="ch01-wheredocbook"><title>How Does DocBook Fit In?</title>
<para>
<indexterm><primary>DocBook DTD</primary>
  <secondary>history and overview</secondary></indexterm>

DocBook is a very popular set of tags for describing books, articles,
and other prose documents, particularly technical
documentation. DocBook is defined using the native &DTD; syntax of
&SGML; and &XML;. Like &HTML;, DocBook is an example of a markup
language defined in &SGML;/&XML;.
</para>
<sect2><title>A Short DocBook History</title>
<para>
DocBook is almost 10 years old. It began in 1991 as a joint project of
HaL Computer Systems and O'Reilly. Its popularity grew, and eventually
it spawned its own maintenance organization, the Davenport Group. In
mid-1998, it became a Technical Committee (<acronym>TC</acronym>) of
the Organization for the Advancement of Structured Information
Standards (<acronym>OASIS</acronym>).
</para>
<sect3><title>The HaL and O'Reilly era</title>
<para>
<indexterm><primary>Open Software Foundation</primary></indexterm>
<indexterm><primary>troff markup (UNIX documentation)</primary></indexterm>
<indexterm><primary>UNIX</primary>
  <secondary>DocBook DTD, development</secondary></indexterm>

The DocBook &DTD; was originally designed and implemented by HaL
Computer Systems and O'Reilly &amp; Associates around 1991. It was
developed primarily to facilitate the exchange of &UNIX; documentation
originally marked up in <command>troff</command>. Its design appears
to have been based partly on input from &SGML; interchange projects
conducted by the Unix International and Open Software Foundation
consortia.
</para>
<para>
<indexterm><primary>Davenport Group (DocBook maintenance)</primary></indexterm>
When DocBook <acronym>V1.1</acronym> was published, discussion about
its revision and maintenance began in earnest in the Davenport Group,
a forum created by O'Reilly for computer documentation
producers. Version 1.2 was influenced strongly by
Novell and Digital.
</para>
<para>
In 1994, the Davenport Group became an officially chartered entity
responsible for DocBook's maintenance. DocBook
<acronym>V1.2.2</acronym> was published simultaneously. The founding
sponsors of this incarnation of Davenport include the following
people:
<itemizedlist spacing="compact">
<listitem><para>Jon Bosak, Novell</para></listitem>
<listitem><para>Dale Dougherty, O'Reilly &amp; Associates</para></listitem>
<listitem><para>Ralph Ferris, Fujitsu <acronym>OSSI</acronym></para></listitem>
<listitem><para>Dave Hollander, Hewlett-Packard</para></listitem>
<listitem><para>Eve Maler, Digital Equipment Corporation</para></listitem>
<listitem><para>Murray Maloney, <acronym>SCO</acronym></para></listitem>
<listitem><para>Conleth O'Connell, HaL Computer Systems</para></listitem>
<listitem><para>Nancy Paisner, Hitachi Computer Products</para></listitem>
<listitem><para>Mike Rogers, SunSoft</para></listitem>
<listitem><para>Jean Tappan, Unisys</para></listitem>
</itemizedlist>
</para>
</sect3>
<sect3><title>The Davenport era</title>
<para>
Under the auspices of the Davenport Group, the DocBook &DTD; began to
widen its scope. It was now being used by a much wider audience, and
for new purposes, such as direct authoring with &SGML;-aware tools,
and publishing directly to paper. As the largest users of DocBook,
Novell and Sun had a heavy influence on its design.
</para>
<para>
<indexterm><primary>DocBook DTD</primary>
  <secondary>releases, rules for new versions</secondary></indexterm>

In order to help users manage change, the new Davenport charter established
the following rules for DocBook releases:
<itemizedlist>
<listitem><para>Minor versions (<quote>point releases</quote> such as
<acronym>V2.2</acronym>) could add to the markup model, but could not
change it in a backward-incompatible way. For example, a new kind of
list element could be added, but it would not be acceptable for the
existing itemized-list model to start requiring two list items inside
it instead of only one. Thus, any document conforming to version
<replaceable>n</replaceable>.0 would also conform to
<replaceable>n</replaceable>.<replaceable>m</replaceable>.</para>
</listitem>
<listitem><para>Major versions (such as <acronym>V3.0</acronym>) could
both add to the markup model and make backward-incompatible
changes. However, the changes would have to be announced in the previous
major release.</para>
</listitem>
<listitem><para>Major-version introductions must be separated by at
least a year.</para>
</listitem>
</itemizedlist>
</para>
<para>
<indexterm><primary>DocBook DTD</primary>
  <secondary>XML</secondary>
    <tertiary>XML-compliant version</tertiary></indexterm>
<indexterm><primary>XML</primary>
  <secondary>DocBook version compliant with</secondary></indexterm>

<acronym>V3.0</acronym> was released in January 1997. After that time,
although DocBook's audience continued to grow, many of the Davenport
Group stalwarts became involved in the &XML; effort, and development
slowed dramatically. The idea of creating an official &XML;-compliant
version of DocBook was discussed, but not implemented. (For more
detailed information about DocBook <acronym>V3.0</acronym> and plans
for subsequent versions, see <xref linkend="app-versions"/>.)
</para>
<para>
<indexterm><primary>OASIS</primary>
  <secondary>DocBook Technical Committee</secondary></indexterm>

The sponsors wanted to close out Davenport in an orderly way to ensure
that DocBook users would be supported. It was suggested that
<acronym>OASIS</acronym> become DocBook's new home. An
<acronym>OASIS</acronym> DocBook Technical Committee was formed in
July 1998, with Eduardo Gutentag of Sun Microsystems as chair.
</para>
</sect3>
<sect3>
<title>The <acronym>OASIS</acronym> era</title>
<para>
The <ulink url="http://www.oasis-open.org/docbook/">DocBook Technical
Committee</ulink> is continuing the work started by the
Davenport Group. The transition from Davenport to
<acronym>OASIS</acronym> has been very smooth, in part because the
core design team consists of essentially the same individuals (we all
just changed hats).
</para>
<para>
DocBook <acronym>V3.1</acronym>, published in February 1999, was the
first <acronym>OASIS</acronym> release.  It integrated a number of
changes that had been <quote>in the wings</quote> for some time.
</para>

<para>In February of 2001, <acronym>OASIS</acronym> made DocBook &SGML; V4.1 and DocBook &XML; V4.1.2
<ulink url="http://lists.oasis-open.org/archives/members/200102/msg00000.html">official
OASIS Specifications</ulink>.
</para>

<para><ulink url="http://www.oasis-open.org/docbook/specs/cs-docbook-docbook-4.2.html">Version 4.2</ulink> of the DocBook &DTD;, for both &SGML; and &XML;, was
released in July 2002.</para>

<para>
The committee continues new DocBook development to ensure
that the &DTD; continues to meet the needs of its users.  Forthcoming
and experimental work includes:
</para>

<itemizedlist>
<listitem><para>A V5.0 DTD projected for release no earlier than the end of
2002.
</para></listitem>
<listitem><para>Experimental
<ulink url="http://www.oasis-open.org/committees/relax-ng/">RELAX NG</ulink> schemas
<ulink url="http://www.oasis-open.org/docbook/relaxng">available</ulink>.</para></listitem>
<listitem><para>Experimental
<ulink url="http://www.w3.org/XML/Schema">W3C XML Schema</ulink> versions
<ulink url="http://www.oasis-open.org/docbook/xmlschema/">available</ulink>.</para></listitem>
<listitem><para>Experimental
<ulink url="http://www.xml.gr.jp/relax/">RELAX</ulink> schemas
<ulink url="http://www.oasis-open.org/docbook/relax/">available</ulink>.</para></listitem>
<listitem><para>Experimental
<ulink url="http://www.thaiopensource.com/trex/">TREX</ulink> schemas
<ulink url="http://www.oasis-open.org/docbook/trex/">available</ulink>.</para></listitem>
</itemizedlist>

<indexterm startref="XMLgetstart" class="endofrange"/>
<indexterm startref="getstartSGML" class="endofrange"/>

</sect3>
</sect2>
</sect1>
</chapter>

<!--
Local Variables:
mode: xml
sgml-parent-document: ("book.xml" "chapter")
End:
-->
