XML processor profiles
NOTE-xml-proc-profiles-20140206
W3C Working Group Note, 06 February 2014
Henry S. Thompson, University of Edinburgh, ht@inf.ed.ac.uk
Norman Walsh, MarkLogic Corporation, norman.walsh@marklogic.com
James Fuller, Webcomposite S.R.O, jim.fuller@webcomposite.com
This specification defines several XML processor
profiles, each of which defines how any given XML
document should be processed, both operationally and in
terms of what information must be made available to applications. It is intended as a resource for other specifications, which can by
a single normative reference establish precisely what input processing they
require as well as what information they require.
This section describes the status of this document at
the time of its publication. Other documents may supersede this
document. A list of current W3C publications and the latest revision
of this technical report can be found in the
W3C technical reports index
at http://www.w3.org/TR/.
This Note describes several possible profiles of XML that might be
useful to authors of other specifications. It also attempts to classify
some axes along which profiles might occur.
This document
is a product of the
XML Processing Model
Working Group which is part of the W3C
XML Activity.
Comments on this document should be sent to the
public mailing list
public-xml-processing-model-comments@w3.org (public
archives are available).
Earlier efforts by this working group focused on making this a
Recommendation track document. In the intervening years, some of this
work has been overtaken by events. Many of the sorts of XML languages
that might have found value in the profiles described herein would
today be more likely expressed in JSON or some other format.
One impetus for this document was to publish a normative
description of the XML processing model. Experience
suggests that the profiles in this document are simultaneously too
numerous and not numerous enough.
It is clear that
XML has more than one processing model, but it is not clear that
there is community consensus on what those models are or even by
what axes they should be classified.
Though the Working Group has decided not to continue this document
on the Recommendation track, we have decided to publish it as a Note
in the hopes that the classifications it does provide may prove
useful.
Publication as a Working Group Note does not imply endorsement
by the W3C Membership. This is a draft document and may be updated,
replaced or obsoleted by other documents at any time. It is
inappropriate to cite this document as other than work in
progress.
This document was produced by a group operating under the 5
February 2004 W3C Patent Policy. W3C maintains a public
list of any patent disclosures made in connection with the
deliverables of the group; that page also includes instructions for
disclosing a patent. An individual who has actual knowledge of a
patent which the individual believes contains Essential
Claim(s) must disclose the information in accordance with section
6 of the W3C Patent Policy.
Original done at TPAC09
Additional profile, name change, FPWD tweaks by HST: 2010-04, 05
Per , minor edits in response to comments, also remove stale use of 'default' and 'model'
Added new profile(s?) NDW: 2010-06-24
Renamed most profiles, started on invariants section HST: 2010-07-01
Adopted pgrosso's base URI and bibliography suggestions, try to address pgrosso's concerns in section 3, be a bit more expansive wrt impl-dep HST: 2010-09-22
Corrected one typo, added caveat wrt xml:id HST: 2010-10-12
Added caveat to invariant claim HST: 2011-01-31
Reverted caveat, and tried to address the data-model construction requirements issue head-on HST: 2011-02-15
Renamed and articulated some classes, changed status of notations and unparsed entities, elt content whitespace, per feedback and telcon discussion HST: 2011-02-17
Adopted some but not all of Liam Quin's suggestions: http://lists.w3.org/Archives/Public/public-xml-processing-model-wg/2011Mar/0003.html
Prepared for 2nd Last Call WD HST: 2011-03-30
Removed diff markup, added section on validation HST: 2011-07-21
Some editorial fixes per minutes of 2011-10-06 HST: 2011-10-13
Changes per minutes of 2011-10-13, to address inter alia LCC 3 from Liam Quin HST: 2011-10-13
'implement' all diffs since 2011-04-12 WD by removing diff markup HST: 2011-11-24
Changes per actions 1, 4--8 in f2f minutes of 2011-10-31 HST: 2011-11-24
Remaining actions from f2f, incl. pretty substantial terminology changes HST: 2011-11-28
Actions from f2f JF: 2011-12-15
Simplification of invariants section, per f2f and subsequent discussion with AM HST: 2011-12-15
Added figure, coalesced short profile descriptions, other bits per recent feedback HST: 2011-12-15
Changed short prof. description layout per NW suggestion HST: 2011-12-16
Added notes wrt validation, in response to CMSMCQ HST: 2012-09-13
Introduction
Few specifications are implemented in their entirety, in exactly the
same way, by every implementor. Many specifications contain optional
features or areas of acknowledged variation and some implementors
choose to ignore required features that aren't needed by the community
they serve, choosing to trade conformance for other benefits.
In the case of XML, there are not only optional features
in the XML
Recommendation itself, but there are a whole family of additional
specifications which an implementor may choose to support or ignore.
In principle, there are an enormous number of possible variations. In
practice, there are dependencies between the specifications that limit
the number of possible variations and implementors aren't motivated to
implement completely arbitrary selections.
The XML Information Set
gave the community a vocabulary for discussing
the information items passed by an XML processor
to an application. This specification
gives the community a vocabulary for describing common
sets of higher level features by
describing profiles, collecting specific sets of features
drawn from the family of specifications, and providing names for them.
One goal of this work is to help establish a lower bound on the number
and nature of features supported. The ability to communicate by sending XML documents back and forth
is predicated on the notion that we have the same understanding of
those documents. While we might
wish for the richest possible understanding, that's not likely to be
supported by the widest range of implementations. Establishing a few
basic profiles, we hope, provides a foundation on which other
specifications can build.
Background
The XML specification defines an XML processor
as "a software module…used to read XML documents and provide
access to their content and structure…on behalf of another module,
called the application."
XML applications are often defined in terms of
operations on instances of XML data models such as
the XPath 1.0 data model or the XQuery 1.0 and XPath 2.0 Data Model (XDM),
or on information identified by terms in the
XML Information Set vocabulary.
Such definitions have suffered to some extent from an uncertainty
inherent in using that kind of foundation, in that the kind of
processing which XML processors carry out on XML documents, as well as
the amount of information they provide to applications as a result, is
flexible to a certain extent. Some of this flexibility stems from the
XML specification itself, which is not always explicit about what
information must be passed from processor to application, and which
also leaves open the possibility of reading and interpreting external
entities, or not. Another kind of flexibility has arisen from the
growth of the XML family of specifications: if the input document
includes uses of XInclude, for instance, the XML processor
may or may not perform the indicated inclusions.
This specification addresses this issue by defining several XML processor
profiles, each of which defines how any given XML
document should be processed, both operationally and in
terms of what information must be made available to applications. It is intended as a resource for other specifications, which can by
a single normative reference establish precisely what input processing they
require as well as what information they require.
The profiles presented here are designed for use with respect
to static outcomes, that is, to the result of XML processing as (if) produced by a
batch process.
They do not attempt to address the question of the
preservation or lack thereof of information itself, or of information
invariants, in the course of incremental construction or in the face of
piecemeal modification.
The profiles defined here are appropriate for processing both XML 1.0 and XML 1.1 documents. References to XML or XML Namespaces below should be understood as references to 1.0 or 1.1 as required by the relevant document or application.
Terminology
The key words
must, must not, required,
shall, shall not, should,
should not, recommended, may,
and optional in this specification are to be interpreted
as described in RFC 2119.
A
base URI
is an absolute URI against which relative URIs are
resolved; this specification assumes that base URIs are established
and used as specified in .
The term
implementation-defined indicates an aspect that may
differ between implementations, but must be specified by the
implementor for each particular implementation.
The term
implementation-dependent indicates an aspect that may
differ between implementations, is not specified by this or any W3C
specification, and is not required to be specified by the implementor
for any particular implementation.
The term
profile refers to a named collection of items and properties
that must be made available to the application.
XML processor profiles
The profile
definitions which follow all assume that the starting point is a well-formed and namespace well-formed XML document. This specification does not consider documents
that are not namespace well-formed. Documents which are not well-formed are not XML.
Each profile is defined in terms of conformance requirements on processors
with respect to various XML-family specifications, and in terms of requirements
on the information they provide to applications. Information provision requirements are specified by
reference to classes of information items and properties, as further defined in the section Classes of Information.
It is the information itself which is required, not
the particular packaging of it implied by the items and properties
used to define those information classes. Processors typically package
information in terms of more-or-less standardized data models
or application program interfaces (APIs). How the information
required for conformance to a particular profile defined below is
conveyed by a data model or API need not correspond point-for-point to
the Infoset terminology. For example, a data model may
define element content as an array of strings and not as an array
of characters. That does not prevent it from
conforming to the requirements expressed below in terms of the XML Information Set's Character Information Items, for
example requirement (3) of the basic XML processor profile.
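The point about packaging can be illustrated with a small sketch. It assumes Python's standard `xml.sax` module as the processor; the handler class name `TextChunks` and the sample document are hypothetical. The parser is free to deliver character data as one or several strings, yet the sequence of characters, which is what the Character Information Items describe, can be recovered from any such packaging.

```python
import xml.sax


class TextChunks(xml.sax.ContentHandler):
    """Collect character data exactly as the parser chooses to deliver it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # Each call delivers a string; chunk boundaries are parser-specific.
        self.chunks.append(content)


handler = TextChunks()
xml.sax.parseString(b"<p>Hello, <b>XML</b> world &amp; more</p>", handler)

# The chunk boundaries are a packaging detail; the character sequence is not.
text = "".join(handler.chunks)
characters = list(text)
```

However the strings happen to be split, `characters` holds one entry per character of the document's text in document order.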
The four profiles defined here are increasingly rich
in terms of the kinds of processing performed and the amount of information
provided to applications, starting from a profile very close to what many XML
processors already do in their minimal configuration:
The Basic profile adds only support for xml:base processing to
the minimum expected of all processors, in order to allow for
correct resolution of relative URIs;
The Id profile adds xml:id processing in order to identify
IDs in the possible absence of complete attribute type declaration
information;
The External Declarations
profile adds mandatory external
markup declaration processing in order to guarantee all
information-affecting declarations are processed;
The Full profile adds xi:include processing, in order to
transclude linked infosets as parsed XML or as text, recursively as
required.
The precise nature of each of these profiles is described in the sections which follow.
The basic XML processor profile
To conform to the basic profile an XML processor must
Process the document as required of conformant
non-validating XML processors while not reading any external markup declarations;
Maintain the base URI of each element in
conformance with XML Base;
Accurately provide to the application the information in the
document corresponding to information items and properties in
classes Core, Signal, Decl and ImplDef.
Since the XML specification
requires validating processors to read the external
subset, it follows that a processor which validates cannot be
conforming to this profile, nor to the id XML processor profile
defined below.
If an XML document which specifies standalone="no" in its XML
Declaration is processed with either this profile or the id XML processor profile,
defined below, the resulting infoset may be lacking items that the author deemed
significant. This is not an error, because checking the standalone declaration
is a validity constraint.
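Requirement (2), maintaining the base URI of each element, can be sketched as follows. This is a minimal illustration under stated assumptions, not a conformant implementation: it uses Python's `xml.etree.ElementTree`, treats `urllib.parse.urljoin` as a stand-in for reference resolution, and the document URI, the sample document and the helper name `base_uris` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree reports xml:base under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def base_uris(elem, inherited_base):
    """Yield (tag, base URI) for elem and its descendants in document order."""
    # An xml:base value is resolved against the base URI inherited from the
    # parent; an element without xml:base simply keeps the inherited value.
    base = urljoin(inherited_base, elem.get(XML_BASE, ""))
    yield elem.tag, base
    for child in elem:
        yield from base_uris(child, base)


DOCUMENT_URI = "http://example.org/root.xml"  # hypothetical document location
SOURCE = """<doc xml:base="http://example.org/a/">
  <sec xml:base="b/"><p/></sec>
  <sec><p xml:base="/c/d.xml"/></sec>
</doc>"""

result = list(base_uris(ET.fromstring(SOURCE), DOCUMENT_URI))
```

Each element inherits its parent's base URI unless it carries its own xml:base attribute, in which case that value is resolved against the inherited one.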
The id XML processor profile
To conform to the id profile an XML processor must
Process the document as required of conformant
non-validating XML processors while not reading any external markup declarations;
Maintain the base URI of each element in
conformance with XML Base;
Perform ID type assignment for all xml:id attributes as
required by xml:id by reporting their
attribute type Infoset property as ID to the application;
Accurately provide to the application the information in the
document corresponding to information items and properties in
classes Core, Signal, Decl and ImplDef.
This profile, like the basic XML processor profile, reads only declarations in
the internal subset. This means that types, such as ID, that appear in declarations
in the internal subset will be processed, while such declarations in the external
subset will not.
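The two points above, recognizing xml:id attributes without any declaration and honoring declarations in the internal subset, can be sketched with Python's `xml.etree.ElementTree`. This is an illustration only: ElementTree does not expose the [attribute type] property, so the hypothetical helper `id_index` simply treats every xml:id attribute as an ID and collapses whitespace in its value, and the sample document is invented for the purpose.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The internal subset declares a default for the "kind" attribute;
# xml:id is not declared anywhere.
SOURCE = """<!DOCTYPE list [
  <!ATTLIST item kind CDATA "plain">
]>
<list>
  <item xml:id="first"/>
  <item xml:id="  second " kind="special"/>
  <item/>
</list>"""


def id_index(root):
    """Map each xml:id value (whitespace-collapsed) to its element."""
    index = {}
    for elem in root.iter():
        raw = elem.get(XML_ID)
        if raw is not None:
            index[" ".join(raw.split())] = elem
    return index


root = ET.fromstring(SOURCE)
ids = id_index(root)
# The attribute default comes from the internal subset declaration.
kinds = [item.get("kind") for item in root.findall("item")]
```

Here the default for `kind` is supplied because its declaration sits in the internal subset; had the same declaration been in an external subset, a processor working under this profile would not have read it.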
The external declarations XML processor profile
To conform to the external declarations profile an XML processor must
Process the document as required of conformant
non-validating XML processors while reading and processing all external markup declarations (as
specified in the discussion of non-validating processors in the XML specification);
Maintain the base URI of each element in
conformance with XML Base;
Perform ID type assignment for all xml:id attributes as
required by xml:id by reporting their
attribute type Infoset property as ID to the application;
Accurately provide to the application the information in the
document corresponding to information items and properties in
classes Core, Extended
and ImplDef.
Conformance to this profile, or to the full XML processor profile defined below, neither requires nor excludes
validation. Both profiles leave it open to specifications which cite them
to forbid, allow or require validation.
A non-validating processor (see 5.1 Validating and Non-Validating
Processors in the XML specification) conformant to this profile gives the
complete infoset of a well-formed XML document. In the absence of well-formedness
and validity errors, a validating processor using this profile gives the
complete infoset of a valid XML document.
The full XML processor profile
To conform to the full profile an XML processor must
Process the document as required of conformant
non-validating XML processors while reading and
processing all external markup declarations (as
specified in the discussion of non-validating processors in the XML specification);
Maintain the base URI of each element in
conformance with XML Base;
Perform ID type assignment for all xml:id attributes as
required by xml:id by reporting their
attribute type Infoset property as ID to the application;
Recursively replace all include elements in the XInclude
namespace, and carry out namespace, xml:base and xml:lang fixup of the result, as required for
conformance to XInclude;
Accurately provide to the application the information in the
document corresponding to information items and properties in
classes Core, Extended
and ImplDef.
The following pipeline implements the full XML processor profile when executed by a
conformant XProc processor which
Processes its input as required by point 1 above;
Recognizes and reports the ID type of all xml:id attributes in
conformance with xml:id.
XProc pipeline which implements the full processor profile
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc">
<p:xinclude fixup-xml-base="true" fixup-xml-lang="true"/>
</p:pipeline>
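For readers without an XProc processor to hand, the inclusion step itself can be sketched with Python's standard `xml.etree.ElementInclude` module. This is a rough illustration and not an implementation of the full profile: that module does not perform the xml:base and xml:lang fixup the profile requires, and the in-memory `RESOURCES` table, the `loader` function and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory resources standing in for retrievable documents.
RESOURCES = {
    "chapter.xml": "<chapter><title>One</title></chapter>",
    "notice.txt": "Draft only",
}


def loader(href, parse, encoding=None):
    """Return a parsed element for parse="xml", plain text for parse="text"."""
    data = RESOURCES[href]
    return ET.fromstring(data) if parse == "xml" else data


root = ET.fromstring(
    '<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="chapter.xml"/>'
    '<note><xi:include href="notice.txt" parse="text"/></note>'
    "</doc>"
)

# Replace each xi:include element in place with the content it points to.
ElementInclude.include(root, loader=loader)
remaining = root.findall(".//{http://www.w3.org/2001/XInclude}include")
```

After the call, the first include element has been replaced by the parsed `chapter` element and the second by the text of the notice, which mirrors the "as parsed XML or as text" distinction drawn for the Full profile.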
Classes of Information
For the profile definitions above and the invariants below, we
categorize the information expressed in XML documents, which
may be made available to applications, into a number of
(overlapping) classes. What follows is a complete tabulation of all the
information items and their properties from the XML Information Set, annotated
with one or more class labels.
The glosses which follow immediately below here are explanatory: the
actual class definitions are given in the subsequent table.
Core: Items and properties which are fundamental for
all XML applications and so must be provided by all profiles.
Extended: Items and properties which depend on declarations and so
must be provided by the external declarations and full profiles only.
These items and properties may be absent if the basic or id profiles are used.
Signal: Items and properties which are relevant only when entity declarations are not available and so must be provided by the basic and id profiles only.
Decl: Items and properties which depend on declarations.
For the basic and id profiles, they will not be provided if the relevant declaration
is in an unprocessed external entity, or is after the first reference to an external entity
which is not processed.
Validated: Items and properties which will be present for
validating processors, but for which support by non-validating processors is
implementation-defined. Non-validating processors must document whether they
provide this information to applications or not.
ImplDef: Items and properties for which support is
implementation-defined. Processors must document whether they
provide this information to applications or not.
The tabulation which follows defines the information classes by
enumerating their membership in terms of information items and their
properties—each class contains all and only those items and properties
against which its name appears below.
Document Information Item
  the item itself: Core
  [children]: ImplDef
  [document element]: Core
  [notations]: Extended, Decl
  [unparsed entities]: Extended, Decl
  [base URI]: Core
  [character encoding scheme]: Core
  [standalone]: Core
  [version]: Core
  [all declarations processed]: Core

Element Information Item
  the item itself: Core
  [namespace name]: Core
  [local name]: Core
  [prefix]: Core
  [children]: Core
  [attributes]: Core
  [namespace attributes]: Core
  [in-scope namespaces]: Core
  [base URI]: Core
  [parent]: Core

Attribute Information Item
  the item itself: Core
  [namespace name]: Core
  [local name]: Core
  [prefix]: Core
  [normalized value]: Extended, Decl
  [specified]: Core
  [attribute type]: Extended, Decl
  [references] to Element Information Items, i.e. for attributes of types IDREF and IDREFS: Extended, Decl
  [references] to Notation and Unparsed Entity Information Items, i.e. for attributes of types ENTITY, ENTITIES and NOTATION: ImplDef
  [owner element]: Core

Processing Instruction Information Item
  the item itself: Core
  [target]: Core
  [content]: Core
  [base URI]: Core
  [notation]: ImplDef
  [parent]: Core

Unexpanded Entity Reference Information Item
  This type of information item will not occur at all if standalone="yes" is specified and is correct.
  the item itself: Signal
  all properties: Signal

Character Information Item
  the item itself: Core
  [character code]: Core
  [element content whitespace]: Validated
  [parent]: Core

Comment Information Item
  the item itself: Core
  [content]: Core
  [parent]: Core

Document Type Declaration Information Item
  the item itself: ImplDef
  all properties: ImplDef

Unparsed Entity Information Item
  the item itself: Extended, Decl
  all properties: Extended, Decl

Notation Information Item
  the item itself: Extended, Decl
  all properties: Extended, Decl

Namespace Information Item
  the item itself: Core
  [prefix]: Core
  [namespace name]: Core
Relations and Invariants
Whenever a
document is processed in conformance with one of the profiles defined
above, the information made available to applications will
be guaranteed to have certain properties. The relation between the profiles and information classes
defined above is summarized in the illustration below (PNG,SVG), then the sub-sections which follow describe
this in terms of invariants with respect to the information made available.
Note: in an effort to maintain consistent
relationships in the diagram, the label for the inner-most circle,
around “Full Profile”, has been omitted. It should be read as if it was
labeled “Perform XInclude processing”.
The following table summarizes the properties associated with each profile.
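Profile: properties that apply
Basic Profile: process with non-validating XML processor; maintain base URI; signal+decl class; core+impldef class
ID Profile: process with non-validating XML processor; maintain base URI; signal+decl class; core+impldef class; ID type assignment
External Decl Profile: process with non-validating XML processor; maintain base URI; core+impldef class; ID type assignment; Extended class
Full Profile: process with non-validating XML processor; maintain base URI; core+impldef class; ID type assignment; Extended class; perform XInclude processing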
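The relationship between the profiles can also be written down as plain sets, which makes the differences between any two profiles easy to compute. The sketch below is an assumption-laden restatement of the profile definitions given earlier, not part of this Note; the feature names and the helpers `added` and `dropped` are hypothetical.

```python
# Requirements shared by every profile.
COMMON = {"non_validating_processing", "maintain_base_uri", "core_impldef_classes"}

# Each profile as the set of requirements it places on a processor.
PROFILES = {
    "basic": COMMON | {"signal_decl_classes"},
    "id": COMMON | {"signal_decl_classes", "id_type_assignment"},
    "external_declarations": COMMON | {"id_type_assignment", "extended_class"},
    "full": COMMON | {"id_type_assignment", "extended_class", "xinclude_processing"},
}


def added(poorer, richer):
    """Requirements the richer profile has that the poorer one lacks."""
    return sorted(PROFILES[richer] - PROFILES[poorer])


def dropped(poorer, richer):
    """Requirements of the poorer profile that the richer one replaces."""
    return sorted(PROFILES[poorer] - PROFILES[richer])
```

For example, `added("external_declarations", "full")` yields only the XInclude step, while `dropped("id", "external_declarations")` shows that the Signal and Decl classes give way to the Extended class once external declarations are read.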
Information invariants within a given profile
Every instance of processing a given namespace-well-formed XML
document in conformance with the
same profile will make available
exactly the same information with respect to the information items and
properties which
that profile is required to provide accurately, as tabulated above.
Information variation between profiles
In comparing two cases when a given namespace-well-formed XML
document is processed in conformance with
two different profiles, the information made available will in some cases (depending on
the specifics of the document in question) differ with respect to the following information items and
properties (leaving aside the items and
properties classified as implementation-defined above):
Between basic and richer profiles
[normalized value],
[attribute type],
[references]—These properties may vary for xml:id attributes
And all the differences listed in the next two sections.
Between id and richer profiles
Where an id processor reports an Unexpanded Entity
Reference, richer ones will report the entity expansion, that is, they will report
some number of information items and their associated properties. For this reason,
the information reported from an id processor may differ from that reported by
a processor conforming to a richer profile with respect to any or all of
Element, Attribute, Character, Comment, Namespace, Processing Instruction and
Unexpanded Entity Reference Information Items.
With respect to [normalized value],
[specified],
[attribute type] and
[references] where an id processor has not processed the relevant
declaration, but a richer one has.
And all the differences listed in the next section.
Between external declarations and full profiles
Parallel to the case for expanding entity references in the previous
section, XInclude processing in conformance with the full profile may replace
some (XInclude) Element Information Items reported by processing in conformance
to other profiles with some number of different
Element, Attribute, Character, Comment,
Namespace and Processing Instruction Information Items.
Other profiles (non-normative)
The profiles defined here can be used as a starting point for the definition of further profiles. For example, the media type registrations for stylesheet languages applicable to XML such as application/xslt+xml or text/css might define a profile specifying appropriate <?xml-stylesheet type="[their media type]"…?> processing in addition to the processing required by .
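As an illustration of the kind of additional processing such a profile might specify, the sketch below collects the stylesheet references that match one media type. It is only a sketch: it assumes Python's `xml.dom.minidom`, uses a simplified regular expression for pseudo-attributes that ignores character and entity references, and the function name `stylesheet_links` and the sample document are hypothetical.

```python
import re
from xml.dom import minidom

# Simplified pseudo-attribute syntax: name="value" or name='value'.
PSEUDO_ATTR = re.compile(r"""([A-Za-z]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def stylesheet_links(source, media_type):
    """Return href values of xml-stylesheet PIs whose type is media_type."""
    doc = minidom.parseString(source)
    links = []
    # Only processing instructions at the document level are examined.
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            attrs = {name: dq or sq
                     for name, dq, sq in PSEUDO_ATTR.findall(node.data)}
            if attrs.get("type") == media_type:
                links.append(attrs.get("href"))
    return links


SOURCE = (
    '<?xml-stylesheet type="text/css" href="style.css"?>'
    '<?xml-stylesheet type="application/xslt+xml" href="view.xsl"?>'
    "<doc/>"
)
css_links = stylesheet_links(SOURCE, "text/css")
xslt_links = stylesheet_links(SOURCE, "application/xslt+xml")
```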
Validation (Non-normative)
Specifying desired information outcomes is not sufficient to completely
determine XML processor behavior. In particular, if validation is performed
and errors detected, the result may be no outcome at all.
A range of schema languages and approaches to validation
exist. Some may provide for additional information items and/or properties
which are not addressed by this specification. Also, the validation-dependent [element content whitespace] property of Character Information Items
may only be
reliably provided in conjunction with some approaches
to validation, specifically DTD validation.
Furthermore, not all of the profiles defined above can be combined with all forms of validation: in particular, DTD
validation requires that all external markup declarations be read and processed, and so cannot be required in conjunction with the basic or id XML processor profiles.
Accordingly, specifications referencing this one should also specify
whether validation is forbidden, optional or required, with respect to which
schema language(s) with what validation control settings, if
any. If the full XML processor profile is involved,
careful consideration is required as to whether validation is to happen before XInclude processing, or after, or both.
Specifying validation
Given the number of XML validation technologies available, and the
constraints on where in the process they can occur, specifying patterns of
required, allowed or forbidden validation may not be straight-forward.
To enable a degree of consistency in this area, specifications are
recommended to consult the following diagrams, and express
their requirements in this area with reference to them:
Examples of recommended wording:
Conforming implementations must process XML
documents and make information available as required by the id XML processor
profile, with no non-DTD validation (A).
... the id XML processor profile, with validation (A) using XML Schema 1.0 followed optionally by validation (B) using Schematron.
... the external declarations XML processor profile, with DTD
validation (A) followed by validation (B) using XML Schema 1.1 with support for the lightweight type-aware subset of the PSVI.
... the full XML processor profile, with DTD
validation (A), then XInclude, then
validation (D) using Relax NG.
References
XML Information Set,
World Wide Web Consortium. Most recent edition (the second) is dated
04 Feb 2004, John Cowan and Richard Tobin, Editors.
The latest version
is available at http://www.w3.org/TR/xml-infoset/.
RFC 2119: Key words for use in RFCs to Indicate Requirement Levels.
Internet Engineering Task Force, 1997.
RFC 3986: Uniform Resource Identifier (URI): Generic Syntax.
Internet Engineering Task Force, 2005.
XProc: An XML Pipeline Language,
Norman Walsh, Alex Milowski, and Henry S. Thompson, Editors.
World Wide Web Consortium, 11 May 2010.
This version is http://www.w3.org/TR/2010/REC-xproc-20100511/.
The latest version
is available at http://www.w3.org/TR/xproc/.
XML Path Language (XPath) 2.0
(Second Edition), Anders Berglund et al., Editors. World Wide Web
Consortium, 14 December 2010. This version is
http://www.w3.org/TR/2010/REC-xpath20-20101214/. The latest version is available at http://www.w3.org/TR/xpath20/.
xml:id Version 1.0,
Norman Walsh, Daniel Veillard, and Jonathan Marsh, Editors.
World Wide Web Consortium, 09 Sep 2005.
This version is http://www.w3.org/TR/2005/REC-xml-id-20050909/.
The latest version
is available at http://www.w3.org/TR/xml-id/.
XML Inclusions (XInclude) Version 1.0 (Second Edition),
David Orchard, Jonathan Marsh, and Daniel Veillard, Editors.
World Wide Web Consortium, 15 Nov 2006.
This version is http://www.w3.org/TR/2006/REC-xinclude-20061115/.
The latest version
is available at http://www.w3.org/TR/xinclude/.
Extensible Markup Language (XML) 1.0 (Fifth Edition),
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, et al., Editors.
World Wide Web Consortium, 26 Nov 2008.
This version is http://www.w3.org/TR/2008/REC-xml-20081126/.
The latest version
is available at http://www.w3.org/TR/xml/.
Extensible Markup Language (XML) 1.1 (Second Edition),
Tim Bray, John Cowan, Jean Paoli, et al., Editors.
World Wide Web Consortium, 16 Aug 2006.
This version is http://www.w3.org/TR/2006/REC-xml11-20060816/.
The latest version
is available at http://www.w3.org/TR/xml11/.
Namespaces in
XML 1.0 (Third Edition), Tim Bray, Dave Hollander, Richard
Tobin, and Andrew Layman, Editors. World Wide Web Consortium,
8 Dec 2009. This version is
http://www.w3.org/TR/2009/REC-xml-names-20091208/. The latest version is available
at http://www.w3.org/TR/xml-names/.
Namespaces in XML 1.1 (Second Edition), Tim Bray, Dave Hollander, Andrew Layman, and Richard Tobin, Editors. World Wide Web Consortium, 16 Aug 2006. This version is http://www.w3.org/TR/2006/REC-xml-names11-20060816/. The latest version is available at http://www.w3.org/TR/xml-names11/.
XML Base (Second Edition),
Jonathan Marsh, Editor. World Wide Web Consortium, 28 January 2009.
This version is http://www.w3.org/TR/2009/REC-xmlbase-20090128/.
The latest version
is available at http://www.w3.org/TR/xmlbase/.
XML Path Language (XPath) Version 1.0,
James Clark and Steven DeRose, Editors.
World Wide Web Consortium, 16 Nov 1999.
This version is http://www.w3.org/TR/1999/REC-xpath-19991116/.
The latest version
is available at http://www.w3.org/TR/xpath/.
XQuery 1.0 and XPath 2.0 Data Model (XDM),
Ashok Malhotra, Jonathan Marsh, Norman Walsh, et al., Editors.
World Wide Web Consortium, 14 Dec 2010.
This version is http://www.w3.org/TR/2010/REC-xpath-datamodel-20101214/.
The latest version
is available at http://www.w3.org/TR/xpath-datamodel/.