2. Schema Modularization Framework

This section is informative.

2.1. How Schema Modularization Works

2.1.1. DTDs and XML Schema

Both DTDs and XML Schema are designed to accomplish the same fundamental task: to define the structure of XML document types. In this sense both are simply different text representations for the same underlying data structures. However, Schema and DTDs differ significantly in several ways, both in structure and capabilities.

Some differences worth noting are:

Common XML features
XML Schema are XML documents themselves and therefore share many aspects of the languages they define.
Data typing
Schemas are designed with a much larger set of built-in data types than DTDs, and provide methods for creating user-defined types.
Namespaces
DTDs only partially support XML Namespaces, which are inherently a part of XML Schema.
Extension
XML Schema have a rich set of extension mechanisms including inheritance, redefinition, and substitution.
Entities
There is no mechanism in XML Schema corresponding to the use of entities for data abstraction in DTDs. In many cases the functionality of entities can be replaced through other XML-based mechanisms. However, there is currently no support for named character entity references as used in XHTML within XML Schema. In the XML Schema modules described here, named character entities for XHTML are included using a DTD.
DTDs and Document Order Dependence
A more subtle feature of modularized DTDs is their dependence on the document order; the order in which elements and entities are defined within DTD files has a large impact on language development. XML Schema are far less dependent on document order.

2.1.2. Document Data Structures

XML language definitions, regardless of their text representation, contain at least three types of data structures. When combined into a coherent and consistent whole, they form a complete language definition. These three components are content models, attribute lists, and elements.

Additional abstract data structures may be defined for use in the language definition, such as common content models or attribute groups, whose use is shared by other data structures within the language definition. The definition of these structures is the primary task of language development, and the core of the modularization framework.

2.1.3. Understanding XHTML Modularization

This schema modularization framework consists of two parts:

  1. A set of schema modules that conform to the abstract modules in XHTML
  2. A set of modularization conventions that describe how the individual modules work together, and how they can be modified or extended.

In XHTML-MOD, every object in the DTDs is represented by an XML entity. These entities are then composed into larger sets of entities and so on, resulting in a set of data abstractions that can be generalized and used modularly. These multiple levels of abstraction are tied together by the use of a specific naming convention and a set of abstract modules.

Generic classes of entities (composed of sub- and sub-sub-entities) are used to create definitions of the three components listed above. Content models, attribute lists and elements are defined separately, sometimes in separate modules, and the ordering of the modules in the DTD structure is strictly defined (due to document order dependence). They are then combined to form the resulting document type. Extensibility is accomplished through the extensive use of INCLUDE/IGNORE sections in the DTD modules. How each of these structures relates to its Schema-based counterpart is summarized in Table 1 below.

2.1.4. Mapping DTDs onto Schema

Both the DTD and schema-based modularization frameworks implement a set of formalized data structures, often in a conceptually similar way. The modularization framework described here is designed around the use of similar data structures, which can be represented (more or less) equally well in either representation. This is accomplished through the use of a straightforward mapping of data structures defined in the DTD modules onto equivalent data structures in the XML Schema language.

2.1.4.1. Content Models

In XHTML-MOD, content models for elements are defined using three classes of entities, identified through the naming conventions by the suffixes ".content", ".class", and ".mix". Each of these classes of entities is mapped onto a corresponding Schema counterpart in the following way:

".content" models - these models are used to define the contents of individual elements. For each element there is a corresponding ".content" object. In XML Schema, ".content" entities are mapped directly onto groups:

Example 1 - Content Group
DTD:
<!ENTITY % html.content "(head,body)">
Schema:
<group name="html.content">
   <sequence>
      <element ref="head" minOccurs="1"/>
      <element ref="body" minOccurs="1"/>
   </sequence>
</group>

The contents of ".content" groups are often classes or mixes.

".class" models - these models are used to define abstract classes of content models made up of either ".content" entities or other ".class" entities (or elements). In XML Schema they correspond to groups that may also contain substitution groups:

Example 2 - ".class" Group
DTD:
<!ENTITY % Misc.class "%Edit.class;
                 %Script.class;
                 %Misc.extra;">
Schema:
<!-- The referenced elements are substitution group heads, assumed to be
     declared globally with abstract="true"; the abstract attribute is
     not permitted on an element reference. -->
<group name="Misc.class">
   <choice minOccurs="0"
           maxOccurs="unbounded">
      <element ref="Edit.class"/>
      <element ref="Script.class"/>
      <element ref="Misc.extra"/>
   </choice>
</group>

".mix" models - these models correspond to content models that are mixed groupings of  ".class", ".content", and ".mix" entities and serve as abstract content models often used in common by many elements in the DTD. They correspond to groups in XML Schema:

Example 3 - ".mix" Group
DTD:
<!ENTITY % Block.mix "%Heading.class;
                       | %List.class;
                       | %Block.class;
                       %Misc.class;">
Schema:
<group name="Block.mix">
   <choice minOccurs="0"
           maxOccurs="unbounded">
      <group ref="Heading.class"/>
      <group ref="List.class"/>
      <group ref="Block.class"/>
      <group ref="Misc.class"/>
   </choice>
</group>

In addition to these three content model groupings, XHTML-MOD includes an additional grouping ".extra". These are currently omitted from the schema modules. (If needed, a developer could add them to the schema modules in a conformant way.)

2.1.4.2. Attributes and Attribute Groups

Attributes and attribute lists in DTDs correspond directly to attribute and attributeGroup elements in XML Schema. The translation from one to the other is relatively simple and straightforward. Here is an example:

Example 4 - Attribute Group
DTD:
<!ENTITY % title.attrib
"title %Text.datatype; #IMPLIED">
Schema:
<attributeGroup name="title">
   <attribute name="title" type="string"/>
</attributeGroup>

Complex attribute groups that are used by many different elements are grouped in the DTDs using entities suffixed with ".attrib". These attribute entities map directly onto attributeGroup elements in XML Schema as shown above.

2.1.4.3. Complex Types and Element Definitions

The XML Schema specification allows elements as well as attribute values to be strongly typed. In defining elements in the modularized schema, an element type is created for each element that is a complex type composed of the content model (element.content) and the attribute list (element.attlist) as shown below:

Example 5 - Element Types In Schema
<complexType name="form.type">
    <group ref="form.content"/>
    <attributeGroup ref="form.attlist"/>
</complexType>

Elements are then declared to be of the type element.type:

Example 6 - Element Definition
<element name="form" type="form.type"/>

This allows the author the greatest degree of flexibility while retaining strict type checking via XML Schema. It also allows for extension of the element via type substitution.
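One way such an extension could look is sketched below: a new type is derived from form.type by extension. The names form.extended.type and autosave are hypothetical and do not appear in the schema modules.

```xml
<!-- Hypothetical derived type: adds one attribute to form.type. -->
<complexType name="form.extended.type">
   <complexContent>
      <extension base="form.type">
         <attribute name="autosave" type="boolean"/>
      </extension>
   </complexContent>
</complexType>
```

An instance document could then select the derived type by placing xsi:type="form.extended.type" on a form element, provided the XML Schema instance namespace is declared.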

Note that in the case of an element with a mixed content model, a complexType is necessary.
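For illustration, a mixed content model might be declared as follows. The names p.type, p.content and p.attlist are assumed here, following the naming conventions of this framework.

```xml
<!-- mixed="true" allows character data between the child elements. -->
<complexType name="p.type" mixed="true">
   <group ref="p.content"/>
   <attributeGroup ref="p.attlist"/>
</complexType>
```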

In summary, each element is composed of a content model and an attribute list, which are composed into a type for that element.

2.1.4.4. Attribute and Element Redefinitions

XML Schema allows types to be derived from one another and allows simple types, complex types, groups, and attributeGroups to be redefined. In several cases modules require modification of previously declared attribute lists. This is done by using the <xsd:redefine> element to redefine the attributeGroup that needs to be modified:

Example 7 - attributeGroup Redefinition Example
<!-- new attribute to be added -->
<attributeGroup name="align.legacy.attlist">
   <attribute name="align">
      <simpleType>
         <restriction base="NMTOKEN">
            <enumeration value="left"/>
            <enumeration value="center"/>
            <enumeration value="right"/>
            <enumeration value="justify"/>
         </restriction>
      </simpleType>
   </attribute>
</attributeGroup>

<!-- add it to the caption element's attribute group; the self-reference
     carries over the original definition of caption.attlist -->
<redefine schemaLocation="xhtml-table-01.xsd">
   <attributeGroup name="caption.attlist">
      <attributeGroup ref="caption.attlist"/>
      <attributeGroup ref="align.legacy.attlist"/>
   </attributeGroup>
</redefine>

In this example, we redefine the attribute list for the caption element in the tables module to add the align attribute defined in align.legacy.attlist.

2.1.4.5. Support Structures

The modularized DTDs contain support mechanisms for XHTML. Some of these are DTD-specific and are not fully supported in XML Schema.

This modularization framework attempts to recreate these support structures to the greatest extent possible.

2.1.4.5.1. Notations

Notations are an SGML feature that allows non-SGML data within documents to be interpreted locally [CATALOG]. Notations for XHTML are preserved in the Schema modules using the notation element in a straightforward way.

Example 8 - Notations
DTD:
<!NOTATION character
   PUBLIC "-//W3C//NOTATION XHTML Datatype: Character//EN">
Schema:
<notation name="character"
   public="-//W3C//NOTATION XHTML Datatype: Character//EN"/>

2.1.4.5.2. Data Types

The strong typing mechanism in XML Schema, along with the large set of intrinsic types and the ability to create user-defined types, provides for a high level of type safety in instance documents. This feature can be used to express stricter data type constraints, such as those on attribute values, when using XML Schema for validation.
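As a hedged illustration of a stricter constraint, a length value could be limited to digits followed by an optional percent sign. The type name Length.strict and the pattern are assumptions made for this sketch; they are not the definition used in the schema modules.

```xml
<!-- Hypothetical stricter length type: "100" and "50%" match, "wide" does not. -->
<simpleType name="Length.strict">
   <restriction base="string">
      <pattern value="[0-9]+%?"/>
   </restriction>
</simpleType>
```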

Example 9 - Simple Data Types
DTD:
<!ENTITY % Length.datatype "CDATA" >
Schema:
<simpleType name="Length">
   <restriction base="string"/>
</simpleType>

2.1.4.5.3. Named Character Entities

XML Schema provides no means of duplicating XHTML's named character entity mechanism. In most cases data abstraction through entities can be dispensed with in schemas. However, in the case of named character references, no replacement method is available.

Character entities are used to represent characters that occur in document data but may not be natively supported on the user's machine, for instance the copyright symbol. XHTML makes use of three sets of named character entities: ISO Latin 1, Symbols, and Special.

A general solution for the resolution of language-specific named character entities is outside the scope of this document.

In this framework, entities are currently referenced through a DTD reference to the three individual DTD modules that define them.
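A document could pull in one of the entity sets roughly as sketched below. The public identifier is the one published for the XHTML Latin 1 entity set; the relative system identifier is an assumption about where the file is stored, and the document content is placeholder text.

```xml
<!DOCTYPE html [
   <!-- Declare the Latin 1 entity set, then reference it. -->
   <!ENTITY % xhtml-lat1
      PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN"
             "xhtml-lat1.ent">
   %xhtml-lat1;
]>
<html xmlns="http://www.w3.org/1999/xhtml">
   <head><title>Entity example</title></head>
   <body><p>&copy; Example Organization</p></body>
</html>
```

The Symbols and Special sets would be declared and referenced in the same way.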

2.1.4.6. Mapping Summary

The following table summarizes the mapping of DTD data structures onto XML Schema structures.

Table 1 - Mapping of DTD and Schema Data Structures
DTD Entity              | Use                         | Schema Element
.content                | Element content model       | group
.class                  | Abstract content model      | group
.mix                    | Abstract content model      | group
.attlist                | Attribute lists             | attributeGroup
.attrib                 | Attributes                  | attribute
.extra                  | Abstract attribute group    | attributeGroup
elements                | Element definitions         | element + complexType
attribute redefinition  | Attribute list redefinition | attributeGroup with redefine
notation                | SGML specific               | notation
datatypes               | Attribute datatypes         | simpleType
entities                | Character replacement       | DTD reference
DTD "driver"            | Framework document          | "Hub" Schema document

One further issue of note in the conversion of DTDs to XML Schema is that all elements must be defined globally. Otherwise they are not considered to be in the XHTML namespace but only "associated" [XMLSCHEMA_COMPOSITION] with it. This document does not make use of this association feature in XML Schema.
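The difference can be sketched as follows. The element and type names (title, title.type, head.type) are used only for illustration.

```xml
<!-- Global declaration: a direct child of the schema element, so the
     element belongs to the target namespace. -->
<element name="title" type="title.type"/>

<!-- Content models refer to the global declaration with ref instead of
     declaring a local element with name. -->
<complexType name="head.type">
   <sequence>
      <element ref="title"/>
   </sequence>
</complexType>
```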

2.2. Framework Conventions

This section is normative.

This modularization framework consists of a complete set of XHTML schema modules and a set of framework conventions that describe how to use them. The use of the framework conventions is required for conformance.

2.2.1. Modularized Schemas

The modularized XHTML schema uses three types of modules, which when combined comprise the entire XHTML definition.

2.2.1.1. Hub document

The Schema hub document is the base document for the schema. It contains only annotations and modules, which in turn contain <xsd:include> statements referencing other modules. The hub document corresponds to the DTD "driver" module in XHTML-MOD, but is much simpler. The hub document allows the author to modify the schema's contents by the simple expedient of commenting out modules that are not used. Note that some modules are always required in order to ensure conformance.

The (non-normative) example hub document described here contains <include> elements for two modules, named "required" and "optional". Each of these included modules is itself a container module.
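A minimal sketch of such a hub document is shown below. The two file names and the version value are assumptions made for illustration; only the overall shape (annotations plus includes of container modules) follows the description above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="http://www.w3.org/1999/xhtml"
            targetNamespace="http://www.w3.org/1999/xhtml"
            version="1.1">
   <xsd:annotation>
      <xsd:documentation>Hub document (illustrative sketch)</xsd:documentation>
   </xsd:annotation>

   <!-- Container module for the modules required for conformance
        (hypothetical file name) -->
   <xsd:include schemaLocation="xhtml-required-01.xsd"/>

   <!-- Container module for the optional modules
        (hypothetical file name) -->
   <xsd:include schemaLocation="xhtml-optional-01.xsd"/>
</xsd:schema>
```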

2.2.1.2. Container Modules

Module containers, reasonably enough, include other modules. Modules and their containers are organized according to function. Including the hub document, which is a special case of a module container, there are ten included module containers.

2.2.1.3. Element modules

In addition to the module containers listed above, there are around forty schema modules which contain only element definitions and their associated attribute and content model definitions. By convention, Schema modularizations may contain either <include> statements or element definitions but not both.

2.2.2. Module Naming

In order to easily identify the contents of any particular schema module, it is useful to provide here a module naming convention syntax. This syntax also provides a simple means of distinguishing modules based on their language version, which may improve maintainability of the modules themselves.

The module naming convention adopted here is the same in almost all respects as that used in XHTML-MOD.

Schema modules for XHTML should have names that identify the language, describe the module's contents, and give the module's version number.

Modules used in this modularization framework must have names that conform to the following syntax:

Example 10 - Schema Module Naming Convention
Pattern
languagename-filecontentsdescription-versionnumber.xsd
Example
xhtml-table-01.xsd

Exceptions to this rule are made for the Schema hub modules whose names are the same as above but may omit the content description syllable for brevity.

Names of hub modules may omit the leading zero in the version number, but should include the minor version number.

Example: xhtml-1.1.xsd

In the case where a hub module contains elements or attributes from external namespaces, the name(s) of the external module(s) should be appended to the base language name using the "+" character.

Example: xhtml+fml-1.0.xsd

This module naming convention is intended also to comply with the required use of the media type in [XHTMLMIME].

2.2.3. Module Hierarchy Structure

In order to establish a physical structure for the composition of the Schema modules that corresponds to the abstract modules in XHTML, a module hierarchy structure has been used to organize the physical modules. The hierarchy structure looks like this:

Table 2 - Schema Module Hierarchy Structure
xhtml/
xhtml/req/
xhtml/req/framework/
xhtml/req/core/
xhtml/req/core/text/
xhtml/opt/
xhtml/opt/pres/
xhtml/opt/legacy/
xhtml/opt/legacy/misc/
xhtml/opt/legacy/frames/

These correspond to the divisions of XHTML into abstract modules described in detail in Section 3.2. The hierarchy structure is intended to match the abstract module structure as closely as possible. This feature is not present in DTD modularization, and is not required for Schema modularization. It does, however, allow the developer to organize the modules in accordance with their hierarchical structure. The directories listed in Table 2 also correspond exactly to the module container modules in this framework.

2.2.4. Names for Data Structures

The consistent use of naming conventions is important for the maintenance and development of complex software applications. 

Adhering to these conventions provides numerous benefits to developers.

With few exceptions, the naming conventions used in XHTML-MOD are preserved in this framework.

The naming convention in XHTML-MOD uses suffixing of object names to indicate functionality, as described below.

2.2.4.1. Attributes

Abstract attribute groups and attribute lists are suffixed with the ".attrib" and ".attlist" suffixes respectively.

2.2.4.2. Content models

Three different suffixes are used in content model names. They are ".content" for element content models, and ".class" or ".mix" for abstract content models.

2.2.4.3. Elements

Element names are not suffixed in XHTML-MOD. This document uses the notion of element types, which are complexTypes used to define elements and are suffixed with ".type". The ".type" suffix was used in XHTML-MOD for attribute data types. This is superfluous in XML Schema (since attribute types are arguments to the "type" attribute) and so the suffix is used in a different way in this framework.

2.2.5. Module Structure

This document establishes a convention for the internal structure of XHTML Schema modules. This convention provides a consistent and predictable way of organizing schema modules internally. This convention applies also to the hub document, which is itself simply a module of modules, albeit a somewhat specialized one.

Each schema module is composed of several components, some of which are required for functional reasons and some of which provide metadata as a convenience to the author. Not every component is included in every module.

2.2.5.1. Schema Element

Each module begins with an <xsd:schema> root element (after the optional XML declaration and DOCTYPE).

2.2.5.1.1. Use of Version Attribute

In the XHTML schema modules, the version number for the specific language being defined (e.g. "1.1") is used as the default value of the version attribute on the schema element.

2.2.5.1.2. Qualified names

This framework sets the elementFormDefault attribute on the schema root element to "unqualified". Elements within the html namespace do not need to use a namespace prefix.
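Taken together, the version and qualification conventions give a root element along the following lines. The namespace declarations and the version value "1.1" are assumptions made for this sketch.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="http://www.w3.org/1999/xhtml"
            targetNamespace="http://www.w3.org/1999/xhtml"
            version="1.1"
            elementFormDefault="unqualified">
   <!-- annotation block and module contents go here -->
</xsd:schema>
```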

2.2.5.2. Annotation Block

After the root element each module contains an annotation element containing several documentation sections briefly describing the purpose of the module.

2.2.5.2.1. Module Description

This is an annotation element that contains a short description of the module and its purpose.

2.2.5.2.2. Versioning Block

An annotation element containing authoring and versioning information for the module should always be included.

2.2.5.2.3. Copyright

The standard W3C copyright statement is included in each module through the use of an include element. An exception is the hub document, which contains the full copyright text.

2.2.5.2.4. Documentation

This is a module specific documentation element providing detailed information about the module's contents, its organization, and any noteworthy items of interest to developers.
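An annotation block of this kind might be laid out as sketched below. All text inside the documentation elements is placeholder content, and the copyright section is omitted here because its inclusion mechanism is module-specific.

```xml
<xsd:annotation>
   <!-- Module description -->
   <xsd:documentation>
      Short description of the module and its purpose (placeholder).
   </xsd:documentation>
   <!-- Versioning block -->
   <xsd:documentation>
      Authoring and versioning information (placeholder).
   </xsd:documentation>
   <!-- Module-specific documentation -->
   <xsd:documentation>
      Notes on the module's contents and organization (placeholder).
   </xsd:documentation>
</xsd:annotation>
```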

2.2.5.3. 3. Module elements

Module elements contain include statements, import statements, or other modules (or comments). They must precede any other definitions in the module.

2.2.5.4. 4. Content model groups

These include groups with names ending in ".content", ".class", or ".mix".

2.2.5.5. 5. Attributes and Attribute groups

These are suffixed with either ".attrib" or ".attlist".

2.2.5.6. 6. Element type definitions

These are complexType elements defining each element's type.

2.2.5.7. 7. Element definitions

These define individual elements in the module.

Additional constraints on the internal structure of schema modules are:

Each module must contain either include statements for other modules or data structure definitions, but not both.

Each module must include at least sections 1 and 2 above (the schema element and the annotation block), as well as either section 3 or some combination of sections 4-7.
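The ordering described above can be sketched with a small element module. This is an illustration under stated assumptions, not one of the actual schema modules: the choice of the hr element, its empty content model, and its single attribute are made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- 1. Schema element -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="http://www.w3.org/1999/xhtml"
            targetNamespace="http://www.w3.org/1999/xhtml"
            version="1.1"
            elementFormDefault="unqualified">

   <!-- 2. Annotation block -->
   <xsd:annotation>
      <xsd:documentation>Hypothetical module defining one element</xsd:documentation>
   </xsd:annotation>

   <!-- 3. Module elements: none, because this module holds definitions -->

   <!-- 4. Content model groups -->
   <xsd:group name="hr.content">
      <xsd:sequence/>
   </xsd:group>

   <!-- 5. Attributes and attribute groups -->
   <xsd:attributeGroup name="hr.attlist">
      <xsd:attribute name="title" type="xsd:string"/>
   </xsd:attributeGroup>

   <!-- 6. Element type definitions -->
   <xsd:complexType name="hr.type">
      <xsd:group ref="hr.content"/>
      <xsd:attributeGroup ref="hr.attlist"/>
   </xsd:complexType>

   <!-- 7. Element definitions -->
   <xsd:element name="hr" type="hr.type"/>
</xsd:schema>
```

A container module would instead keep sections 1 and 2 and replace sections 4-7 with include statements.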

2.2.6. Namespace Conventions

The handling of namespaces in XML Schema is entirely different from that in XHTML-MOD. Namespaces are integral to XML Schema and their use in modularization arises naturally from the schema syntax.

One convention chosen for this framework is that the names of elements and attributes in the modules are unqualified, i.e. no namespace prefix is required for XHTML elements.

This is set by using the value of "unqualified" on the elementFormDefault attribute of the xsd:schema element.

2.2.7. Documentation Conventions

A consistent commenting convention has been imposed on the modules described here. The purpose of a commenting convention is to allow for generating documentation from the comments (as well as general comprehension). Documentation elements containing Annotation-level comments are assumed to be of the highest importance and should be used to denote information about the module itself, and for important notes for developers.

Module-level comments are denoted as usual with SGML comment delimiters "<!--" and "-->". By means of this convention, modules can become self-documenting. Tools for extracting these comments and formatting them suitably may (hopefully) be developed in the future.