Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
XML Normalization defines a means by which XML parsers can produce normalized output of any parsed document. This normalized form is similar to that produced by Canonicalized XML 1.1 [XML-C14N11], though the two are not interchangeable. Its intent is also different from that of Canonicalized XML 1.1: it exists primarily to assist clients of XML parser APIs such as SAX [SAX] to ensure that they are provided XML data in a predefined representation, whether as events or DOM nodes.
Any XML document is part of a set of XML documents that are logically equivalent within an application context, but which vary in physical representation based on syntactic changes permitted by XML 1.0 [XML10] and Namespaces in XML 1.0 [XML-NAMES]. This specification describes a method by which parsers can generate XML events or DOM nodes according to a normalized form that accounts for the permissible changes. It also allows for external specification of certain attributes of this normalized form.
The aim of this standard is to define a means by which a low-overhead streaming XML parser can output events in a manner which can be anticipated by a client of the parser, thus reducing that client's need for additional logic to handle variations in representation. It also provides a supplemental guide to implementing the same algorithm for DOM parsers. It is not intended to provide a canonicalized form of a document as defined by Canonical XML 1.1 [XML-C14N11], and has some incompatibilities with that standard, though its output is frequently similar. However, two semantically equivalent documents will produce similar output when processed using the same normalization parameters and algorithm.
Normalization for Streaming XML Parsers is applicable to XML 1.0. It is not defined for XML 1.1.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document was published by the XML Security Working Group as an Editor's Draft. If you wish to make comments regarding this document, please send them to public-xmlsec@w3.org (subscribe, archives). All comments are welcome.
Publication as an Editor's Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MUST, MUST NOT, REQUIRED, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this specification are to be interpreted as described in [RFC2119].
See [XML-NAMES] for the definition of QName.
Since the XML 1.0 Recommendation [XML10] and the Namespaces in XML 1.0 Recommendation [XML-NAMES] define multiple syntactic methods for expressing the same information, XML applications tend to take liberties with changes that have no impact on the information content of the document. XML normalization is designed to be useful to applications that wish to process an XML document with regard to a predetermined semantic representation, allowing clients of a stream or event parser to delegate the handling of differing representations of semantically identical XML documents to the parser itself.
For example, a representation may make use of a well-known XML namespace prefix or it may use one of its own devising. The algorithm defined in this specification can be used to translate those prefixes while parsing, such that the client API need not anticipate multiple prefixes, nor need to manually compare potentially long namespace URIs at every step. This also applies to any XPath or QName values contained within the document.
Another example allows a client to instruct the parser to ignore certain subtrees, or to only return certain subtrees, and whether to report them as DOM elements or as raw text. For example, an XML-RPC request might consist of a document fragment containing protocol information and a document fragment containing response data. This specification allows a stream or event parser client to request that only one of these fragments is parsed and reported; it may also request that the raw text content of the other fragment be reported as a single block of text which can then be fed into a less-able parser further back in the chain. This can provide a performant alternative to the use of XPath expressions in some simple use cases.
Two XML documents may have differing information content that is
nonetheless logically equivalent within a given application context. Although
two XML documents are equivalent (aside from limitations given in this section)
if their normalized forms are identical, it is not a goal of this work to establish
a method such that two XML documents are equivalent if and only if their
normalized forms are identical. Such a method is unachievable, in part due to
application-specific rules such as those governing unimportant whitespace and
equivalent data (e.g. <color>black</color> versus <color>rgb(0,0,0)</color>). There are also equivalencies
established by other W3C Recommendations and Working Drafts. Accounting for
these additional equivalence rules is beyond the scope of this work. They can
be applied by the application or become the subject of future
specifications.
The normalized form of an XML document may not be completely operational within the application context, though the circumstances under which this occurs are unusual.
The difficulties arise due to the loss of the following information not available in the data model:
In the first case, the loss of external unparsed entity references and the notations that bind them to applications means that normalized forms cannot properly distinguish among XML documents that incorporate unparsed data via this mechanism. This is an unusual case precisely because most XML processors currently discard the document type declaration, which discards the notation, the entity's binding to a URI, and the attribute type that binds the attribute value to an entity name. For documents that must be subjected to more than one XML processor, the XML design typically indicates a reference to unparsed data using a URI in the attribute value.
In the second case, the loss of attribute types can affect the normalized
form in different ways depending on the type. Attributes of type ID cease to
be ID attributes. Hence, any XPath expressions that refer to the normalized
form using the id() function cease to operate. The attribute
types ENTITY and ENTITIES are not part of this case; they are covered in the
first case above. Attributes of enumerated type and of type ID, IDREF,
IDREFS, NMTOKEN, NMTOKENS, and NOTATION fail to be appropriately constrained
during future attempts to change the attribute value if the normalized form
replaces the original document during application processing. Applications can
avoid the difficulties of this case by ensuring that an appropriate document
type declaration is prepended prior to using the normalized form in further XML
processing. This is likely to be an easy task since attribute lists are
usually acquired from a standard external DTD subset, and any entity and
notation declarations not also in the external DTD subset are typically
constructed from application configuration information and added to the
internal DTD subset.
Normalization for Streaming XML Parsers solves many of the major issues that have been identified by implementers with Canonical XML 1.0 [XML-C14N] and 1.1 [XML-C14N11]. It thus provides a better alternative to the use of canonicalization algorithms for the purposes outlined in this specification.
Canonicalization will be slow if the implementation uses the Canonical XML 1.1 specification as a formula without any attempt at optimization. This specification rectifies this problem by incorporating lessons learned from the implementation of that specification. Most mature canonicalization implementations solve the performance problem by inspecting the signature first, to see if it can be canonicalized using a simple tree walk algorithm whose performance is similar to regular XML serialization. If not, they fall back to the expensive nodeset-based algorithm.
The use cases that cannot be addressed by the simple tree walk algorithm are mostly edge cases. This specification restricts the input to the normalization algorithm so that implementations can always use the simple tree walk algorithm. This restriction is what makes this specification suitable for direct use as part of a stream or event parser.
C14N 1.x uses an "XPath 1.0 Nodeset" to describe a document subset. This is the root cause of the performance problem and can be solved by not using a nodeset. This specification does not use a nodeset, visits each node exactly once, and only visits the nodes that are being normalized.
A streaming implementation is required to be able to process very large documents without holding them all in memory; it should be able to process documents one chunk at a time.
Whitespace handling in parser clients frequently means trimming all node contents. This specification provides a means for a parser to perform this duty internally depending on input from the parser client, and for such processing to be done in an intelligent manner with regards to QNames and XPaths in content. Specifically it uses three techniques for normalizing text content:
xsi:type attribute,
C14N 1.x algorithms are complex and depend on a full XPath library. This increases the work required for scripting languages to make use of them as an XML document pre-processing tool. This specification addresses this issue by not using the complex nodeset model, and therefore not relying completely on XPath.
The input to the normalization algorithm consists of an XML document subset and a set of options. The XML document subset can be expressed in two ways, with a DOM model or a Stream model.
In the DOM model the XML subset is expressed as:
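An Inclusion list, which is either a document node D or a list of one or more element nodes E1, E2, … En. (If any element node Ei is a descendant of another Ej, then that element node Ei is ignored.)
An optional Exclusion list, which is a list of zero or more element nodes E1, E2, … Em and a list of zero or more attribute nodes A1, A2, … AM. These attribute nodes cannot be namespace declarations or attributes in the xml namespace.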
The XML subset consists of all the nodes in the Inclusion list and their descendants, minus all the nodes that are in the Exclusion list and their descendants.
The element nodes in the Inclusion list are also referred to as apex nodes.
Note: This input model is a very limited form of the generic XPath Nodeset that was the input model for Canonical XML 1.x. It is designed to be simple and allow for a high performance algorithm, while still supporting the most essential use cases. Specifically:
This model does not support re-inclusion; i.e. all the exclusions are applied after all the inclusions. It is effectively a simplified form of the XPath Filter 2 model [XMLDSIG-XPATH-FILTER2] with one intersect followed by one optional subtract operation. Re-inclusion complicates the normalization algorithm, especially in the areas of namespace and XML attribute inheritance.
Exclusion is limited to complete subtrees and attribute nodes. Other kinds of nodes (text, comment, PI) cannot be excluded.
Attribute exclusion is also limited, such that namespace declaration and attributes from the XML namespace cannot be excluded.
Some examples of subsets that were permitted in Canonical XML 1.x, but not in this new version:
The DOM model of XML Normalization does not support direct input of an octet stream; the Stream model exists for that purpose. The transformation of such a stream into the input model required for DOM processing by this specification is application-specific and should be defined in specifications that reference or make use of this one.
In the Stream model, the XML subset is again expressed as an Inclusion List and an Exclusion List.
For streaming, however, nodes are identified using a set of simple XPath paths. An empty XPath in
the Inclusion list SHALL be interpreted as referring to the document's root element as though its
value were /. An empty XPath in the Exclusion list SHALL be ignored.
Specifically, only absolute XPaths are allowed, and only if they are comprised of element names and QNames. In addition, the following special characters and wildcards are permitted:
// to allow for selection of deeply-nested elements.
* to allow for any single unnamed element.
The parser MUST treat the inclusion of any other XPath components as an error, including:
Self-node (.) and parent-node (..) references.
The purpose of this is to limit the description of included/excluded nodes such that they can be easily compared against a stack of node names or QNames assembled by the parser to keep track of its current location in the document.
Since XPath 1.0 [XPATH] requires that any namespaced elements be identified by QName, and since the normalization algorithm provides a means to rewrite namespace prefixes, the XPaths used as input MUST use the rewritten prefix values.
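The restricted path syntax above lends itself to a very small matcher. The following Python sketch is illustrative only and is not part of this specification: the function names `parse_path` and `path_selects` are hypothetical, the step pattern is an ASCII-only approximation of the QName production, and treating an empty path as selecting only the root element is one reading of the rule above.

```python
import re
from typing import List

# ASCII-only approximation of a QName step; the real NCName production
# in [XML-NAMES] admits many more Unicode characters.
_STEP = re.compile(r"(\*|[A-Za-z_][\w.-]*(:[A-Za-z_][\w.-]*)?)")
DESCENDANT = "//"  # marker kept in the step list for a '//' separator


def parse_path(xpath: str) -> List[str]:
    """Split a restricted absolute path into steps, rejecting anything else."""
    if xpath in ("", "/"):
        return []  # refers to the document's root element
    steps: List[str] = []
    expect_separator = True
    for token in re.findall(r"//|/|[^/]+", xpath):
        if expect_separator:
            if token == DESCENDANT:
                steps.append(DESCENDANT)
            elif token != "/":
                raise ValueError("only absolute paths are allowed: %r" % xpath)
            expect_separator = False
        else:
            if not _STEP.fullmatch(token):
                raise ValueError("unsupported XPath component: %r" % token)
            steps.append(token)
            expect_separator = True
    if not expect_separator:
        raise ValueError("path ends with a separator: %r" % xpath)
    return steps


def _matches(steps: List[str], stack: List[str]) -> bool:
    if not steps:
        return not stack
    head, rest = steps[0], steps[1:]
    if head == DESCENDANT:
        # '//' may skip zero or more levels of the element stack.
        return any(_matches(rest, stack[i:]) for i in range(len(stack) + 1))
    if not stack or (head != "*" and head != stack[0]):
        return False
    return _matches(rest, stack[1:])


def path_selects(xpath: str, stack: List[str]) -> bool:
    """True if the element at the top of the parser's name stack is selected."""
    steps = parse_path(xpath)
    if not steps:
        return len(stack) == 1
    return _matches(steps, stack)
```

A parser would call `path_selects` with its current stack of (rewritten) QNames each time a start-element token is encountered, once for each path in the Inclusion and Exclusion lists.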
Instead of separate algorithms for each variant of normalization, this specification takes the approach of a single algorithm subject to a variety of parameters that change its behavior to address specific use cases.
The following dictionaries define the logical parameters supported by this
algorithm. The actual serialization that expresses the parameters in
use may be defined as appropriate to specific applications of this
specification (e.g., the <ds:CanonicalizationMethod> element in [XMLDSIG-CORE2]).
dictionary QNameAware {
    DOMString Name;
};
QNameAware Members
Name of type DOMString
The NCName name of an element or attribute.
dictionary Element : QNameAware {
    DOMString NS;
};
Element Members
NS of type DOMString
dictionary QualifiedAttribute : QNameAware {
    DOMString NS;
};
QualifiedAttribute Members
NS of type DOMString
dictionary UnqualifiedAttribute : QNameAware {
    DOMString ParentName;
    DOMString ParentNS;
};
UnqualifiedAttribute Members
ParentNS of type DOMString
ParentName of type DOMString
The NCName of this attribute's parent element.
dictionary XPath : QNameAware {
    DOMString NS;
};
XPath Members
NS of type DOMString
dictionary Parameters {
    boolean IgnoreComments = true;
    boolean TrimTextNodes = true;
    object PrefixRewrite = "none";
    QNameAware[] QNameAware = [];
    array[QNameAware] ReturnCharacters = [];
};
Parameters Members
IgnoreComments of type boolean, defaulting to true
PrefixRewrite of type object, defaulting to "none"
With a string value of "none", prefixes are left unchanged. With a string value of "sequential", prefixes are changed to "n0", "n1", "n2" … except the special prefixes xml and xmlns, which are left unchanged. With a value of type HashMap, prefixes are rewritten only for namespaces whose URIs are defined in the enumeration, except for xml and xmlns as described above.
QNameAware of type array of QNameAware, defaulting to []
ReturnCharacters of type array[QNameAware], defaulting to []
TrimTextNodes of type boolean, defaulting to true
If an element has an xml:space="preserve" attribute, then text node descendants of that element are not trimmed regardless of the value of this parameter.
All of these parameters MUST be implemented.
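As a non-normative illustration, the parameter set could be modelled in Python roughly as follows. The class and field names (`QNameAwareEntry`, `Parameters`, `prefix_rewrite` and so on) are hypothetical, and the use of a plain dictionary from namespace URI to prefix for the HashMap form of PrefixRewrite is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class QNameAwareEntry:
    # kind is one of "Element", "QualifiedAttribute",
    # "UnqualifiedAttribute" or "XPath" (the dictionaries defined above).
    kind: str
    name: str
    ns: Optional[str] = None
    parent_name: Optional[str] = None
    parent_ns: Optional[str] = None


@dataclass
class Parameters:
    ignore_comments: bool = True
    trim_text_nodes: bool = True
    # "none", "sequential", or a map of namespace URI -> predefined prefix.
    prefix_rewrite: Union[str, Dict[str, str]] = "none"
    qname_aware: List[QNameAwareEntry] = field(default_factory=list)
    return_characters: List[QNameAwareEntry] = field(default_factory=list)

    def __post_init__(self) -> None:
        if isinstance(self.prefix_rewrite, str):
            if self.prefix_rewrite not in ("none", "sequential"):
                raise ValueError("unknown PrefixRewrite value: %r" % self.prefix_rewrite)
        else:
            for uri, prefix in self.prefix_rewrite.items():
                # A predefined prefix of "" would promote the namespace
                # to the default namespace, which is an error.
                if prefix == "":
                    raise ValueError("empty predefined prefix for %r" % uri)
```

Validating the parameters once, at construction time, keeps the per-node processing free of configuration checks.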
In the XML Canonicalization space there were two separate canonicalization algorithms: Inclusive Canonicalization [XML-C14N11]
and Exclusive Canonicalization [XML-EXC-C14N]. The major difference between these two algorithms is the treatment of namespace
declarations and inherited attributes in the xml: namespace.
But in the current version of Canonical XML 2.0, Inclusive canonicalization has been removed completely.
Exclusive canonicalization has been far more popular than inclusive, because of its "portability" property: if a subdocument is signed with exclusive canonicalization, and then this subdocument is moved off to a different XML context, the signature on that subdocument still remains valid. Inclusive canonicalization doesn't have this portability property; however, it has an advantage over exclusive canonicalization 1.0 when it comes to QNames in content.
Exclusive canonicalization 1.0 only emits namespace declarations that it considers to be visibly utilized, so if there is a QName embedded in
a text node or an attribute node, it doesn't recognize it. For example, in the attribute xsi:type="xsd:string", the "xsd"
prefix is embedded in the content, and so Exclusive canonicalization 1.0 will not consider the "xsd" prefix to be visibly utilized and
hence will not emit the xsd namespace declaration. Not emitting the declaration makes it susceptible to certain wrapping attacks. Exclusive
canonicalization 1.0 offers the "InclusiveNamespace" mechanism to deal with these kinds of prefixes. Any prefixes mentioned in this list
will be treated inclusively, i.e. their namespace declarations will be emitted even if they are not used.
XML Normalization addresses the shortcomings of Exclusive Canonicalization 1.0 with the QNameAware parameter. This parameter
can be used to list element or attribute nodes that are expected to have QNames. XML Normalization will scan for prefixes in these
elements and attributes and consider them to be visibly utilized too. Since this is a superior approach, no equivalent to Inclusive
canonicalization is defined in this specification.
The algorithm for prefix scanning doesn't cover all kinds of prefix embedding. For example if a text node's value is a space separated list of QNames, this algorithm will not detect the prefixes of these QNames. It will only detect two kinds of embedding:
Inclusive canonicalization also preserves the values of xml: attributes in context; it looks at the ancestors of the
subdocument being processed, and collects the value of any inheritable xml attributes, specifically xml:lang, xml:space
and xml:base, from these ancestor elements and emits them at the root of the subdocument.
Exclusive canonicalization does not do this as this violates the portability requirement. Likewise, XML Normalization ignores
these attributes as well.
The basic normalization process consists of traversing the tree and outputting octets for each node. In DOM mode, this is literally an ordered tree traversal, while in Stream mode the traversal involves the parsing and posting of events for each element and node as it is encountered in the input stream.
Input: The XML subset consisting of an Inclusion list and an Exclusion list.
Processing for DOM mode
Sort the inclusion list by document order. If the inclusion list is a document node D there is nothing to sort. Otherwise
remove all element nodes Ei that are descendants of some other element node
in the inclusion list. Then sort the remaining element nodes E1, E2, … En by document order.
For each element node Ei or document node D in
the sorted list, do a depth first traversal to visit all the
descendant nodes in the Ei subtree, and
normalize each one of them in-place. While traversing, if the current
node is an element and that element is in the exclusion list, prune
the traversal, i.e. skip over that element and all its
descendants.
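The DOM-mode steps above can be sketched in Python as follows. This is a non-normative illustration that uses `xml.etree.ElementTree` in place of a DOM; the function name `plan_traversal` is hypothetical, attribute exclusion is left out, and the function only returns the elements that would be visited rather than normalizing them.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List


def plan_traversal(root: ET.Element,
                   inclusion: Iterable[ET.Element],
                   exclusion: Iterable[ET.Element]) -> List[ET.Element]:
    """Return the elements to normalize, in the order they are visited."""
    # Document order and parent links are not stored by ElementTree,
    # so both are computed once from a full walk of the tree.
    order = {el: i for i, el in enumerate(root.iter())}
    parent = {child: p for p in root.iter() for child in p}
    included = set(inclusion)
    excluded = set(exclusion)

    def has_included_ancestor(el: ET.Element) -> bool:
        el = parent.get(el)
        while el is not None:
            if el in included:
                return True
            el = parent.get(el)
        return False

    # Drop elements nested inside another included element, then sort
    # the remaining apex nodes by document order.
    apexes = sorted((e for e in included if not has_included_ancestor(e)),
                    key=order.__getitem__)

    visited: List[ET.Element] = []

    def walk(el: ET.Element) -> None:
        if el in excluded:
            return  # prune: skip this element and all its descendants
        visited.append(el)
        for child in el:
            walk(child)

    for apex in apexes:
        walk(apex)
    return visited
```

Because every element is reached through exactly one apex node, each node is visited at most once, which is the property the simple tree walk relies on.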
Processing for Stream mode
During traversal of each node (or upon encountering each token type), normalize the value depending on its type as follows:
If the PrefixRewrite parameter is sequential or predefined,
the element's QName will be written with the changed prefix.
If the element is identified by the ReturnCharacters parameter, then the source
octet-stream for this element is used to replace the element node with a CDATA node. In Stream
mode, all text encountered from the start of the start-element token to the end of the
corresponding end-element token is reported as a CDATA block. In neither case is any
normalization applied to the identified element or its content.
For attribute nodes, the string value is modified by replacing all ampersands (&) with &amp;, all open angle brackets (<) with &lt;,
all quotation mark characters with &quot;, and the whitespace characters #x9, #xA, and #xD, with character references.
The character references are written in uppercase hexadecimal with no leading zeroes
(for example, #xD is represented by the character reference &#xD;).
If parameter PrefixRewrite is sequential or predefined
and the attribute name has a namespace prefix, the prefix is changed to the rewritten prefix.
Also with prefix rewriting enabled, the attribute content is treated specially if the attribute is
among those enumerated for the QNameAware parameter. If so, the QName value of the
attribute is rewritten with the new prefix.
For text nodes, all ampersands are replaced by &amp;,
all open angle brackets (<) are replaced by &lt;, all closing
angle brackets (>) are replaced by &gt;, and all #xD characters are replaced by &#xD;.
If parameter TrimTextNodes is true and there is no xml:space="preserve"
declaration in context, trim the leading and trailing whitespace. E.g. trim <A> <B/> to <A><B/>
and trim <A> this is text </A> to <A>this is text</A>. Whitespace
is as defined in [XML10], i.e. it consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.
A DOM parser might split up a long text node into multiple adjacent text nodes, and a Stream parser might report multiple consecutive text tokens, some of which may be empty. Be aware when trimming whitespace in such cases; the net result should be equivalent to doing so as if the adjacent text nodes were concatenated.
When any element is treated as character data due to the effects of the
ReturnCharacters parameter, the resulting text node/event SHALL NOT be normalized
according to these rules.
If parameter PrefixRewrite is sequential or predefined and
if the parent element node is among those enumerated for the QNameAware parameter, then
the QName value of the text node is rewritten with the new prefix.
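The escaping and trimming rules for attribute and text nodes can be sketched in Python as follows. This is a non-normative illustration: the function names are hypothetical, and the entity and character references used (`&amp;`, `&lt;`, `&gt;`, `&quot;`, `&#x9;`, `&#xA;`, `&#xD;`) are assumed to follow the same convention as Canonical XML 1.1.

```python
from typing import Iterable

XML_WHITESPACE = " \t\r\n"  # #x20, #x9, #xD, #xA as defined in [XML10]


def escape_attribute_value(value: str) -> str:
    """Escape an attribute value. '&' is handled first so that the
    ampersands introduced by later replacements are not escaped again."""
    for raw, escaped in (
        ("&", "&amp;"), ("<", "&lt;"), ('"', "&quot;"),
        ("\t", "&#x9;"), ("\n", "&#xA;"), ("\r", "&#xD;"),
    ):
        value = value.replace(raw, escaped)
    return value


def escape_text(value: str) -> str:
    """Escape the string value of a text node."""
    for raw, escaped in (
        ("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"), ("\r", "&#xD;"),
    ):
        value = value.replace(raw, escaped)
    return value


def normalize_text_chunks(chunks: Iterable[str],
                          trim: bool = True,
                          preserve_space: bool = False) -> str:
    """Treat adjacent text nodes or tokens as one string, then trim
    (unless xml:space="preserve" is in context) and escape once."""
    text = "".join(chunks)
    if trim and not preserve_space:
        text = text.strip(XML_WHITESPACE)
    return escape_text(text)
```

Joining the chunks before trimming is the simplest way to obtain the required result. A streaming implementation that cannot buffer a whole text run might instead hold back only trailing whitespace until the next token shows whether the run has ended.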
Although some XML models such as DOM don't distinguish namespace declarations from attributes, Normalization needs to treat them separately. In this document, attribute nodes that are actually namespace declarations are referred to as "namespace nodes"; other attributes are called "attribute nodes".
As part of the normalization process, while traversing the subtree, use the following algorithm to look at all the namespace declarations in an element, and decide which ones to output.
The following concepts are used in Namespace processing:
In DOM, there is no special node for namespace declarations, they are just present as regular attribute nodes. An "explicit" namespace declaration is an attribute node whose prefix is "xmlns" and whose localName is the prefix being declared.
DOM also allows declaring a namespace "implicitly", i.e. if a new DOM element or attribute is
constructed using the createElementNS and createAttributeNS methods, then
DOM adds a namespace declaration automatically when serializing the document.
The default namespace is declared by xmlns="...". To make the algorithm simpler this will be treated
as a namespace declaration whose prefix value is "", i.e. an empty string.
An element E in the document subset
visibly utilizes a namespace declaration, i.e. a namespace prefix P and bound value V,
if any of the following conditions are true:
The element E itself has a qualified name that uses the prefix P.
(Note that if an element does not have a prefix, it visibly utilizes the default namespace.)
The element E is among those enumerated for the QNameAware
parameter, and the QName value of the element uses the prefix P (or, lacking a prefix,
it visibly utilizes the default namespace).
The element E is among those enumerated for the QNameAware parameter,
and is listed as an XPathElement. The value of the element is to be interpreted as
an XPath 1.0 expression and any prefixes used in this XPath expression are considered to be
visibly utilized.
An attribute A of that element has a qualified name that uses the prefix
P, and that attribute is not in the exclusion list. (Note that unlike elements, if an
attribute doesn't have a prefix, that means it is a locally scoped attribute. It does NOT mean that
the attribute visibly utilizes the default namespace.)
An attribute A of that element is among those enumerated for the QNameAware
parameter, and the QName value of the attribute uses the prefix P (or, lacking a prefix,
it visibly utilizes the default namespace).
When the parameter PrefixRewrite="sequential" or PrefixRewrite="predefined"
is set, all the prefixes except "xml" are rewritten to new prefixes. In the normalized output
there is a one-to-one mapping between namespace URIs and rewritten prefixes. E.g. if, in the input
document fragment, a particular prefix is bound to many different namespace URIs in different
parts of the document, during normalization this prefix will be rewritten to different prefixes,
one rewritten prefix for each different namespace URI. Similarly, if many prefixes in the input
document are bound to the same namespace URI, all of these prefixes will be normalized to the
same rewritten prefix.
With PrefixRewrite="sequential" the prefixes are rewritten to "n0", "n1", "n2", … etc.
With PrefixRewrite="predefined" the prefix for any namespace in the predefined set is replaced
using the value provided by the input set.
Prefix Rewriting also considers QNames in content, and during normalization the prefixes in these QNames are also rewritten.
With PrefixRewrite="sequential", the normalized output will never have a
default namespace, as that is also rewritten into a "nN" style prefix. With PrefixRewrite="predefined"
the default namespace is rewritten with an explicit prefix only if one has been specified in the input set.
Note that when using predefined it is not possible to promote a namespace to the default by
supplying a prefix of "" (the empty string); this is an error.
Initialization: For sequential prefix rewriting maintain a counter N.
This counter should be set to 0 at the beginning of the normalization process.
Also maintain a map of namespace URI to rewritten prefixes; this map should be initialized
to empty.
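The counter and map described above can be sketched in Python as follows. This is a non-normative illustration: the class name `PrefixRewriter` is hypothetical, and the predefined set is assumed to be supplied as a dictionary from namespace URI to prefix.

```python
from typing import Dict, Union

XML_NS = "http://www.w3.org/XML/1998/namespace"


class PrefixRewriter:
    """Holds the state needed for prefix rewriting across one document."""

    def __init__(self, mode: Union[str, Dict[str, str]] = "none") -> None:
        if isinstance(mode, str) and mode not in ("none", "sequential"):
            raise ValueError("unknown PrefixRewrite value: %r" % mode)
        self.mode = mode
        self.counter = 0                 # the counter N, starting at 0
        self.by_uri: Dict[str, str] = {}  # namespace URI -> rewritten prefix

    def rewrite(self, prefix: str, uri: str) -> str:
        """Return the prefix to use in the output for this namespace URI."""
        if prefix in ("xml", "xmlns") or uri == XML_NS or self.mode == "none":
            return prefix
        if self.mode == "sequential":
            # One rewritten prefix per namespace URI, allocated on first use.
            if uri not in self.by_uri:
                self.by_uri[uri] = "n%d" % self.counter
                self.counter += 1
            return self.by_uri[uri]
        # Predefined set: unknown namespaces keep their original prefix.
        return self.mode.get(uri, prefix)
```

Keying the map on the namespace URI rather than on the original prefix is what yields the one-to-one mapping between URIs and rewritten prefixes: two prefixes bound to the same URI share a rewritten prefix, and one prefix bound to two URIs receives two.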
The following steps need to be executed at every Element node E.
Step 1: Create a list of visibly utilized prefixes.
If the element E itself has a qualified name that uses the prefix P, then P is visibly utilized. Note that if E does not have a prefix, it visibly utilizes the default namespace.
If an attribute A of that element E has a qualified name that uses the prefix P, and that attribute is not in the exclusion list, then P is visibly utilized. Note that, unlike elements, if an attribute doesn't have a prefix, that means it is a locally scoped attribute. It does NOT mean that the attribute visibly utilizes the default namespace.
If there is a QNameAware parameter, check whether E or one of its attributes is enumerated in it as follows:
If there is an Element subchild, whose Name and NS attributes match E's localname and namespace respectively, then E is expected to have a single text node child containing a QName. Extract the prefix from this QName, and consider this prefix as visibly utilized.
If there is a QualifiedAttr subchild, whose Name and NS attributes match the localname and namespace, respectively, of one of E's qualified attributes, then that attribute is expected to contain a QName. Extract the prefix from the QName and consider this prefix as visibly utilized.
If there is an UnqualifiedAttr subchild, whose Name attribute matches the name of one of E's unqualified attributes, and its ParentName and ParentNS attributes match E's localname and namespace respectively, then that attribute is expected to contain a QName. Extract the prefix from the QName and consider this prefix as visibly utilized.
If there is an XPathElement subchild, whose Name and NS attributes match E's localname and namespace respectively, then E is expected to have a single text node child containing an XPath 1.0 expression. Extract the prefixes from this XPath by using the following algorithm. All of these extracted prefixes should be considered as visibly utilized.
Search for single colons (:) in the XPath expression, but do not consider single colons inside quoted strings. Double colons are used for axes, e.g. in self::node(), "self:" is not a prefix, but an axis name.
Go backwards from each single colon and extract the NCName match, e.g. in /soap : Body, extract the "soap". The NCName production is defined in [XML-NAMES].
As an example in regular expression notation, quoted strings can first be removed with s/"[^"]*"//g and s/'[^']*'//g. Removing the quoted strings eliminates false positives in the next step.
The prefixes can then be matched with m/([\w-_.]+)?\s*:(?!:)/
Note that prefixes follow the NCName production, i.e. they consist of alphanumeric characters, hyphens, underscores or dots, but cannot start with a digit, hyphen or dot. In an NCName, the allowed alphanumeric characters are not just ASCII, but any Unicode alphanumeric characters. However, the regular expression provided here is a very simplified form of the NCName production.
If the PrefixRewrite parameter is set to sequential, each of the prefixes found in the above steps needs to be replaced by a new prefix. For efficiency, consider combining this searching for prefixes step with the subsequent replacing prefixes step.
Create a list containing the namespace declarations for these visibly utilized prefixes. Remove the "xml" prefix from this list if present.
XML Normalization never emits the declaration for the xml or xmlns prefixes. As mentioned in [XML-NAMES], a valid XML document should
never have the declaration for xmlns, so XML Normalization should never
encounter this declaration. A valid XML document can optionally declare the
xml prefix, but if present it MUST be bound to http://www.w3.org/XML/1998/namespace. XML Normalization SHOULD ignore this
declaration.
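The extraction of prefixes from an XPath expression described in Step 1 can be sketched in Python as follows. This is a non-normative illustration: the function name `xpath_prefixes` is hypothetical, and the name pattern is an ASCII-only simplification of the NCName production that, unlike the simplified expression given in the text, also refuses names starting with a digit, hyphen or dot.

```python
import re
from typing import List

# Single- or double-quoted string literals, removed before scanning.
_QUOTED = re.compile(r"\"[^\"]*\"|'[^']*'")

# A name, optional whitespace, then a single colon. The lookahead
# rejects '::' (axis separators) and the lookbehind makes sure the
# match starts at the beginning of a name.
_PREFIX = re.compile(r"(?<![\w.-])([A-Za-z_][\w.-]*)\s*:(?!:)")


def xpath_prefixes(expression: str) -> List[str]:
    """Return the distinct namespace prefixes used in an XPath 1.0
    expression, in order of first appearance."""
    stripped = _QUOTED.sub("", expression)
    prefixes: List[str] = []
    for match in _PREFIX.finditer(stripped):
        prefix = match.group(1)
        if prefix not in prefixes:
            prefixes.append(prefix)
    return prefixes
```

For an expression such as `/soap : Body/ds:Signature[@Id="a:b"]/self::node()`, the quoted `"a:b"` is removed first, `self::` is rejected by the lookahead, and the prefixes `soap` and `ds` remain.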
Step 2: If the PrefixRewrite parameter is set to sequential or predefined, then
compute new prefixes for all the namespace declarations in the list from Step 1, as
follows:
With PrefixRewrite="sequential", assign a new prefix value "nN" to each prefix, and then increment the value of counter
N. The counter should be set to 0 at the beginning of the
normalization process. (E.g. if the value of this counter was 5 when the traversal
reached this element, and this element had 3 prefixes to be output, then use the
prefixes "n5", "n6", "n7" and set the counter to 8 after that.)
With PrefixRewrite="predefined", look in the input set for the
namespace's URI. If a match is found, assign the prefix from the match. Otherwise,
the prefix remains unchanged.
Step 3: Filter the list to remove prefixes that have already been output.
If a prefix has already been output by one of E's ancestors, say Ej,
and has not been redeclared since then to a different value, i.e. not been redeclared by an element
between Ej and E, then remove it from this list.
Step 4: Sort this list of namespace declarations in lexicographic (ascending) order
of prefixes. In case of prefix rewriting, sort by rewritten prefixes, not original prefixes.
Note that the default namespace declaration has no prefix, so it is considered lexicographically least.
Step 5: Output each of these namespace nodes, as specified in the Processing model.
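Steps 3 to 5 can be sketched in Python as follows. This is a non-normative illustration: the function name `namespace_declarations` and its two dictionary arguments are hypothetical, the prefixes passed in are assumed to have been rewritten already (Step 2), and escaping of the namespace URI is omitted for brevity.

```python
from typing import Dict, List


def namespace_declarations(utilized: Dict[str, str],
                           already_output: Dict[str, str]) -> List[str]:
    """Return the namespace declarations to emit for one element.

    utilized       -- prefix -> URI for the prefixes visibly utilized by
                      the element ("" stands for the default namespace)
    already_output -- prefix -> URI currently in effect from declarations
                      output by the element's ancestors
    """
    # Step 3: drop declarations an ancestor has already output with the
    # same value; xml and xmlns are never emitted.
    pending = {
        prefix: uri
        for prefix, uri in utilized.items()
        if prefix not in ("xml", "xmlns") and already_output.get(prefix) != uri
    }
    # Step 4: sort by prefix; the empty prefix sorts first.
    declarations = []
    for prefix in sorted(pending):
        name = "xmlns:" + prefix if prefix else "xmlns"
        # Step 5: serialize the namespace node.
        declarations.append('%s="%s"' % (name, pending[prefix]))
    return declarations
```

The caller is expected to copy `already_output`, update the copy with the declarations just emitted, and pass that copy down to the element's children, so that a declaration made in one subtree does not suppress the same declaration in a sibling subtree.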
The following XML snippet is used to illustrate the various PrefixRewrite options.
<wsse:Security xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"> <wsse:UserName wsu:Id="i1"> ... </wsse:UserName> <wsse:Timestamp wsu:Id="i2"> ... </wsse:Timestamp> </wsse:Security>
PrefixRewrite="none"
<wsse:Security xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"> <wsse:UserName xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" wsu:Id="i1"> ... </wsse:UserName> <wsse:Timestamp xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" wsu:Id="i2"> ... </wsse:Timestamp> </wsse:Security>
Note how the "wsu" prefix declaration is present in wsse:Security, but is not utilized there. Normalization will "push the declaration down" into <UserName> and <Timestamp>, where it is really used; i.e. the wsu declaration will be output twice, once in <UserName> and again in <Timestamp>, as shown above.
PrefixRewrite="sequential"
<n0:Security xmlns:n0="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"> <n0:UserName xmlns:n1="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" n1:Id="i1"> ... </n0:UserName> <n0:Timestamp xmlns:n1="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" n1:Id="i2"> ... </n0:Timestamp> </n0:Security>
Now observe what happens with sequential prefix rewriting: the "wsse" prefix is rewritten to "n0" and the "wsu" prefix is rewritten to "n1".
PrefixRewrite="predefined"
Using the following predefined namespace prefixes:
http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd = "secutil"
<wsse:Security xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"> <wsse:UserName xmlns:secutil="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" secutil:Id="i1"> ... </wsse:UserName> <wsse:Timestamp xmlns:secutil="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" secutil:Id="i2"> ... </wsse:Timestamp> </wsse:Security>
Note that the "wsu" prefix was rewritten to "secutil" while the "wsse" prefix remained unchanged.
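The three PrefixRewrite modes shown in these examples can be sketched in Python as follows. The class `PrefixRewriter` and its members are hypothetical names introduced for illustration. As a simplification, the sketch assigns sequential prefixes in the order `rewrite` is called, whereas the DOM pseudocode in this document first sorts the new namespace URIs of each element.

```python
WSSE = "http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"
WSU = "http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"


class PrefixRewriter:
    """Hypothetical helper: picks the output prefix for a namespace URI."""

    def __init__(self, mode="none", predefined=None):
        self.mode = mode                    # "none", "sequential" or "predefined"
        self.predefined = predefined or {}  # URI -> prefix, used in "predefined" mode
        self.counter = 0                    # next N for "nN" prefixes
        self.rewritten = {}                 # URI -> prefix already assigned

    def rewrite(self, prefix, uri):
        if self.mode == "none":
            return prefix
        if uri in self.rewritten:
            return self.rewritten[uri]
        if self.mode == "sequential":
            new_prefix = f"n{self.counter}"
            self.counter += 1
        else:
            # "predefined": fall back to the original prefix when no match exists
            new_prefix = self.predefined.get(uri, prefix)
        self.rewritten[uri] = new_prefix
        return new_prefix


seq = PrefixRewriter("sequential")
print(seq.rewrite("wsse", WSSE), seq.rewrite("wsu", WSU))  # n0 n1

pre = PrefixRewriter("predefined", {WSU: "secutil"})
print(pre.rewrite("wsse", WSSE), pre.rewrite("wsu", WSU))  # wsse secutil
```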
Namespace declarations are not considered attributes; they are processed separately as namespace nodes.
Processing the attributes of an element E consists of the following steps:
If the PrefixRewrite parameter is sequential, modify the QName of the attribute name to use the new prefix, i.e. one of n0, n1, n2, etc. Do not do this for the xml prefix, as this is not changed during prefix rewriting.
If the attribute is listed in the QNameAware parameter, then change the QName in that attribute value to use the new prefix.
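The attribute-name rewriting step can be sketched in Python as below. The function name and its dictionary arguments are assumptions for illustration: `namespace_context` maps original prefixes to URIs and `rewritten_prefixes` maps URIs to new prefixes.

```python
def rewrite_attribute_name(qname, namespace_context, rewritten_prefixes):
    """Return the attribute QName with its prefix rewritten, if applicable."""
    prefix, sep, local = qname.partition(":")
    if not sep or prefix == "xml":
        # Unprefixed names and the xml prefix are never rewritten.
        return qname
    uri = namespace_context[prefix]                   # original prefix -> URI
    new_prefix = rewritten_prefixes.get(uri, prefix)  # URI -> rewritten prefix
    return f"{new_prefix}:{local}"


print(rewrite_attribute_name("wsu:Id", {"wsu": "urn:util"}, {"urn:util": "n1"}))  # n1:Id
print(rewrite_attribute_name("xml:lang", {}, {}))                                # xml:lang
```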
This section is non-normative.
This section presents an IDL representation of the normalization algorithm for DOM parsers, with function descriptions in the form of pseudocode.
The DOM normalization algorithm consists of two components: a HashMap, which is a simple dictionary mapping namespace URIs to prefixes; and an interface representing the normalizer functionality itself.
HashMap type
This section is non-normative.
[Constructor]
interface HashMap {
readonly attribute unsigned long count;
getter DOMString valueForKey ([TreatNullAs = EmptyString] DOMString key);
setter void setValueForKey ([TreatNullAs = EmptyString] DOMString key, DOMString? value);
void removeAll ();
};
This section is non-normative.
count
of type unsigned long, readonly
removeAll
void
setValueForKey
A null value removes the entry from the map.
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
key | DOMString | ✘ | ✘ | |
value | DOMString | ✔ | ✘ |
setter void
valueForKey
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
key | DOMString | ✘ | ✘ |
getter DOMString
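A minimal Python sketch of a map with the behaviour described for this interface is shown below. The snake_case method names are an adaptation, and returning `None` for a missing key is an assumption, since the IDL does not say what the getter yields in that case.

```python
class HashMap:
    """Sketch of the HashMap interface on top of a Python dict."""

    def __init__(self):
        self._entries = {}

    @property
    def count(self):
        return len(self._entries)

    def value_for_key(self, key):
        key = "" if key is None else key  # TreatNullAs=EmptyString
        return self._entries.get(key)     # assumption: None when absent

    def set_value_for_key(self, key, value):
        key = "" if key is None else key
        if value is None:
            self._entries.pop(key, None)  # a null value removes the entry
        else:
            self._entries[key] = value

    def remove_all(self):
        self._entries.clear()


m = HashMap()
m.set_value_for_key("urn:a", "n0")
m.set_value_for_key(None, "default")  # stored under the empty-string key
m.set_value_for_key("urn:a", None)    # removes "urn:a"
print(m.count, m.value_for_key(""))   # 1 default
```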
DOMNormalizer Interface
[Constructor]
interface DOMNormalizer {
readonly attribute unsigned int prefixCounter;
readonly attribute HashMap rewrittenPrefixes;
attribute Parameters properties;
attribute DOMString[] outputPrefixes;
void normalize (object<> inclusionList, object<> exclusionList);
void normalizeSubtree (object node);
void processNode (object node, HashMap namespaceContext);
void processDocument (object documentNode, HashMap namespaceContext);
void processElement (object elementNode, HashMap namespaceContext);
void processText (object textNode, HashMap namespaceContext);
void processComment (object commentNode, HashMap namespaceContext);
void addNamespaces (object elementNode, HashMap namespaceContext);
DOMString[] processNamespaces (object elementNode, HashMap namespaceContext);
};
outputPrefixes
of type array of DOMString
prefixCounter
of type unsigned int, readonly
The counter used to generate new prefixes in sequential mode. It is initialized to zero.
properties
of type Parameters
rewrittenPrefixes
of type HashMap, readonly
A map of uri -> rewrittenPrefix. It is initialized to empty. Finding out the rewritten prefix for an original prefix is a two-step lookup: first look up the URI for the original prefix in the namespaceContext hash table, then look up the rewritten prefix for the URI in the rewrittenPrefixes hash table.
addNamespaces
properties.ReturnCharacters
array).
Pseudocode:
addNamespaces(element, namespaceContext) {
    for each explicit and implicit namespace declaration in the element {
        if namespaceContext already has this prefix with the same URI {
            do nothing
        } else if namespaceContext already has this prefix with a different URI {
            update the namespaceContext hash table with the new prefix -> URI mapping
            if this prefix exists in outputPrefixes
                remove it
        } else if namespaceContext doesn't have this prefix {
            add the new prefix -> URI mapping to the namespaceContext
        }
    }
}
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
elementNode | object | ✘ | ✘ | |
namespaceContext | HashMap | ✘ | ✘ |
void
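The addNamespaces pseudocode can be expressed as a short Python sketch. It assumes the element's declarations arrive as a prefix-to-URI dictionary and that outputPrefixes is held as a set; these representations and the snake_case names are illustrative choices, not part of this specification.

```python
def add_namespaces(declarations, namespace_context, output_prefixes):
    """Merge one element's namespace declarations into the running context.

    declarations:      prefix -> URI declared on the element
    namespace_context: prefix -> URI in scope (updated in place)
    output_prefixes:   set of prefixes already output (updated in place)
    """
    for prefix, uri in declarations.items():
        if namespace_context.get(prefix) == uri:
            continue  # same prefix and URI: nothing to do
        if prefix in namespace_context:
            # Redeclared with a different URI: it must be output again later.
            output_prefixes.discard(prefix)
        namespace_context[prefix] = uri


context = {"p": "urn:1"}
already_output = {"p"}
add_namespaces({"p": "urn:2", "q": "urn:3"}, context, already_output)
print(context, already_output)  # {'p': 'urn:2', 'q': 'urn:3'} set()
```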
normalize
Pseudocode:
normalize(list of subtrees, list of exclusion elements and attributes) {
    put the exclusion elements and attributes in a hash table for easier lookup
    sort the subtrees by document order
    for each subtree {
        normalizeSubtree(subtree)
    }
}
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
inclusionList | object<> | ✘ | ✘ | |
exclusionList | object<> | ✘ | ✘ |
void
normalizeSubtree
Pseudocode:
normalizeSubtree(node) {
    if (node is the document node or a document root element) {
        // (whole document is being processed, no ancestors to worry about)
        processNode(node)
    } else {
        starting from the element, walk up the tree to collect a list of ancestors
        for each of this node's ancestor elements, starting with the document root
        but not including the element itself
            addNamespaces(element)
        processNode(node)
    }
}
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
node | object | ✘ | ✘ |
void
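The ancestor walk in normalizeSubtree can be sketched with Python's standard `xml.dom.minidom`. The function name `ancestor_namespace_context` is hypothetical, and the sketch only collects the namespace context; it does not perform the rest of the normalization.

```python
from xml.dom import minidom


def ancestor_namespace_context(element):
    """Collect prefix -> URI declarations from the ancestors of `element`."""
    ancestors = []
    node = element.parentNode
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        ancestors.append(node)
        node = node.parentNode
    context = {}
    # Apply declarations from the document root downwards so that
    # nearer ancestors override more distant ones.
    for ancestor in reversed(ancestors):
        attrs = ancestor.attributes
        for i in range(attrs.length):
            attr = attrs.item(i)
            if attr.name == "xmlns":
                context[""] = attr.value
            elif attr.name.startswith("xmlns:"):
                context[attr.name[len("xmlns:"):]] = attr.value
    return context


doc = minidom.parseString(
    '<a xmlns:p="urn:1"><b xmlns:p="urn:2" xmlns:q="urn:3"><c/></b></a>'
)
c = doc.getElementsByTagName("c")[0]
ns_context = ancestor_namespace_context(c)
print(ns_context)
```

In this example the declaration of `p` on `b` overrides the one on `a`.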
processComment
Pseudocode:
processComment(commentNode, namespaceContext) {
    if properties.IgnoreComments
        remove the node from the DOM
}
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
commentNode | object | ✘ | ✘ | |
namespaceContext | HashMap | ✘ | ✘ |
void
processDocument
Pseudocode:
processDocument(document, namespaceContext) {
    for (each child node) {
        processNode(child, namespaceContext)
    }
}
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
documentNode | object | ✘ | ✘ | |
namespaceContext | HashMap | ✘ | ✘ |
void
processElement
Pseudocode:
processElement(elementNode, namespaceContext) {
    if elementNode exists in the exclusion hash table
        return
    if elementNode is listed in properties.ReturnCharacters {
        serialize elementNode as UTF-8 text
        replace elementNode with a text node containing that text
        return
    }
    make copies of namespaceContext and outputPrefixes in the stack
    // (by copying, any changes made can be undone when this function returns)
    nsToBeOutputList = processNamespaces(elementNode)
    attributeList = []
    if (properties.PrefixRewrite != "none") {
        determine the namespace for the element and update its prefix
        according to namespaceContext and rewrittenPrefixes
        elementNode.namespace.prefix = new prefix value
    }
    for each of the namespaces in the nsToBeOutputList
        add appropriate "xmlns" attribute to attributeList
    for each non-namespace attribute in the element {
        replace/apply namespace prefix according to properties.PrefixRewrite
        if the element is in properties.QNameAware
            adjust prefixes within its content as appropriate
        add attribute to attributeList
    }
    elementNode.attributes = attributeList
    loop through all child nodes and call processNode(child, copy(namespaceContext))
    remove namespace prefixes in nsToBeOutputList from outputPrefixes
}
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
elementNode | object | ✘ | ✘ | |
namespaceContext | HashMap | ✘ | ✘ |
void
processNamespaces
Pseudocode:
processNamespaces(element) {
    addNamespaces(element)
    create a list of visibly utilized prefixes - visiblePrefixes, which includes
        a) the prefix used by the element itself
        b) the prefixes used by all the qualified attributes of the element
        c) the prefix embedded in the attribute value of any QName aware attributes
        d) the prefix embedded in any text node child, if QName aware
    if properties.PrefixRewrite != "none" {
        newNamespaceURIs = [] // empty list
        for each prefix in visiblePrefixes
            get the URI for this prefix from the namespaceContext hash table
            check if the URI already exists in the rewrittenPrefixes hash table
            if it does not
                add the URI to newNamespaceURIs
        sort the newNamespaceURIs list in lexical order
        if properties.PrefixRewrite = "sequential" {
            for each URI in the newNamespaceURIs list
                assign a prefix "nN" where N is the value of prefixCounter
                increment prefixCounter by 1
                add the mapping URI -> nN into the rewrittenPrefixes hash table
        } else if properties.PrefixRewrite is a HashMap {
            for each URI in the newNamespaceURIs list
                look up the prefix for this URI in properties.PrefixRewrite
                if there is a prefix
                    add the mapping URI -> prefix into rewrittenPrefixes
        }
    }
    nsToBeOutputList = [] // empty hash table
    for each prefix in visiblePrefixes {
        find the URI that this prefix maps to in the namespaceContext hash table
        if properties.PrefixRewrite != "none"
            convert this prefix to rewrittenPrefix, by using the URI to look up
            the rewrittenPrefix in the rewrittenPrefixes hash table
        if this prefix (original or rewritten) does not exist in outputPrefixes
            add this prefix to outputPrefixes
            add the prefix -> URI mapping into the nsToBeOutputList hash table
    }
    sort the nsToBeOutputList by the prefix
    return nsToBeOutputList
}
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
elementNode | object | ✘ | ✘ | |
namespaceContext | HashMap | ✘ | ✘ |
DOMString[]
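Items a) and b) of the visibly utilized prefix list can be sketched in Python as follows. The function name is hypothetical, QName-aware values (items c and d) are left out, and treating the `xml` prefix as never needing a declaration is an assumption of this sketch.

```python
def visibly_used_prefixes(element_qname, attribute_qnames):
    """Return the sorted prefixes used by an element name and its attributes."""

    def prefix_of(qname):
        prefix, sep, _ = qname.partition(":")
        return prefix if sep else None

    used = set()
    # An unprefixed element name uses the default namespace, keyed as "".
    used.add(prefix_of(element_qname) or "")
    for name in attribute_qnames:
        prefix = prefix_of(name)
        # Unprefixed attributes are in no namespace; xmlns attributes are
        # declarations rather than uses; xml is skipped by assumption.
        if prefix and prefix not in ("xml", "xmlns"):
            used.add(prefix)
    return sorted(used)


print(visibly_used_prefixes("wsse:UserName", ["wsu:Id", "xml:lang", "plain"]))
# ['wsse', 'wsu']
```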
processNode
Pseudocode:
processNode(node, namespaceContext) {
    call the appropriate function - processDocument, processElement, processText, ...
    depending on the node type.
}
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
node | object | ✘ | ✘ | |
namespaceContext | HashMap | ✘ | ✘ |
void
processText
Pseudocode:
processText(textNode, namespaceContext) {
    if this text node is outside document root
        return
    in the text, replace all ampersands by &amp;, all open angle brackets (<) by &lt;,
    all closing angle brackets (>) by &gt;, and all #xD characters by &#xD;
    if properties.TrimTextNodes is true and there is no xml:space="preserve" declaration in scope {
        if previous node was not a text node
            trim leading whitespace
        if next node is not a text node
            trim trailing whitespace
    }
    if properties.PrefixRewrite != "none" and this text node is a child of a QName aware element {
        search for embedded prefixes, and replace with rewritten prefixes
    }
    replace the text content of the node with the modified text
}
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
textNode | object | ✘ | ✘ | |
namespaceContext | HashMap | ✘ | ✘ |
void
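The escaping and trimming performed by processText can be sketched in Python. The function name and keyword arguments are hypothetical, QName prefix rewriting is omitted, and the xml:space check is reduced to the caller passing `trim=False`.

```python
XML_WHITESPACE = " \t\r\n"


def normalize_text(text, trim=False, prev_is_text=False, next_is_text=False):
    """Escape markup-significant characters, then optionally trim."""
    # The ampersand must be replaced first so that the entities introduced
    # by the later replacements are not escaped a second time.
    text = (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace("\r", "&#xD;")
    )
    if trim:
        if not prev_is_text:
            text = text.lstrip(XML_WHITESPACE)
        if not next_is_text:
            text = text.rstrip(XML_WHITESPACE)
    return text


print(normalize_text("  a < b & c\r ", trim=True))  # a &lt; b &amp; c&#xD;
```

Because escaping happens before trimming, as in the pseudocode, a carriage return is already encoded as `&#xD;` by the time whitespace is trimmed and so is not removed.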
This section is non-normative.
Unlike DOM parsers, which represent an XML document as a tree of nodes, streaming parsers represent an XML document as a stream of events such as "start-element", "end-element", "text", etc. A document subset can also be represented as a stream of events. This stream of events is in exactly the same order as a tree walk, so the same approach can also be used to normalize an event stream. Below is a description of the SAX2 [SAX] event-handler interface, with comments on the application of normalization to the generated events.
Since this algorithm and that employed for StAX [XML-PARSER-STAX] rely on much the same parsing events, we leave the application of this algorithm to a 'pull' parser up to the reader.
ElementStack Type
This section is non-normative.
The ElementContext dictionary is used to store information about a single element. One of these is pushed onto the stack during processing of a startElement() event, and it is removed while processing the corresponding endElement() event.
dictionary ElementContext {
HashMap namespaceContext = [];
DOMString[] outputPrefixes = [];
DOMString elementQName = "";
DOMString localName = "";
DOMString prefix = "";
boolean isQNameAware = false;
};
ElementContext Members
This section is non-normative.
elementQName
of type DOMString, defaulting to ""
isQNameAware
of type boolean, defaulting to false
localName
of type DOMString, defaulting to ""
The local name component of the elementQName property.
namespaceContext
of type HashMap, defaulting to []
outputPrefixes
of type array of DOMString, defaulting to []
prefix
of type DOMString, defaulting to ""
The prefix component of the elementQName property.
The ElementStack interface implements a basic stack of ElementContext dictionaries. Its push() operation duplicates some of the properties of the current top-of-stack object for you.
[Constructor]
interface ElementStack {
unsigned int count ();
ElementContext push (DOMString QName);
ElementContext top ();
void pop (DOMString QName);
};
count
unsigned int
pop
Checks the top ElementContext to ensure that it matches the given QName, and removes it from the stack if it matches. If it does not match, a DOMException is raised.
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
QName | DOMString | ✘ | ✘ |
void
push
Duplicates the ElementContext on the top of the stack and replaces its elementQName, localName, and prefix properties based on the provided QName parameter. The new object is placed on top of the stack and returned.
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
QName | DOMString | ✘ | ✘ |
ElementContext
top
Returns the top ElementContext without modifying the stack.
ElementContext
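A minimal Python sketch of the ElementContext dictionary and the ElementStack interface follows. It makes several assumptions: the namespace context and output prefixes are copied on push, isQNameAware is reset to false for each new element, and a `ValueError` stands in for the DOMException raised by pop().

```python
from dataclasses import dataclass, field


@dataclass
class ElementContext:
    namespace_context: dict = field(default_factory=dict)
    output_prefixes: list = field(default_factory=list)
    element_qname: str = ""
    local_name: str = ""
    prefix: str = ""
    is_qname_aware: bool = False


class ElementStack:
    def __init__(self):
        self._items = []

    def count(self):
        return len(self._items)

    def top(self):
        return self._items[-1]

    def push(self, qname):
        prefix, _, local = qname.rpartition(":")
        base = self._items[-1] if self._items else ElementContext()
        # Copy inherited state so that changes made for this element are
        # discarded automatically when it is popped.
        context = ElementContext(
            namespace_context=dict(base.namespace_context),
            output_prefixes=list(base.output_prefixes),
            element_qname=qname,
            local_name=local,
            prefix=prefix,
        )
        self._items.append(context)
        return context

    def pop(self, qname):
        if not self._items or self._items[-1].element_qname != qname:
            raise ValueError(f"end tag {qname!r} does not match the open element")
        self._items.pop()


stack = ElementStack()
stack.push("wsse:Security").namespace_context["wsse"] = "urn:secext"
child = stack.push("wsse:UserName")
print(child.prefix, child.local_name, child.namespace_context)
# wsse UserName {'wsse': 'urn:secext'}
```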
The following interface describes some events defined by the SAX2 parser specification. Any events not enumerated below are unchanged by this algorithm.
interface SAXEvents {
const readonly int StartDocument = 1;
const readonly int EndDocument = 2;
const readonly int StartElement = 3;
const readonly int EndElement = 4;
const readonly int Characters = 5;
const readonly int IgnorableWhitespace = 6;
const readonly int ProcessingInstruction = 7;
const readonly int Comment = 8;
const readonly int CDATABlock = 9;
const readonly int StartPrefixMapping = 10;
const readonly int EndPrefixMapping = 11;
};
CDATABlock
of type readonly int
block.
Characters
of type readonly int
&amp; will be replaced by the resulting & character, and so on.
Comment
of type readonly int
A comment such as <!-- A Comment --> was parsed. The event contains the text content of the comment, i.e. A Comment.
EndDocument
of type readonly int
EndElement
of type readonly int
EndPrefixMapping
of type readonly int
xmlns attribute has been closed.
IgnorableWhitespace
of type readonly int
ProcessingInstruction
of type readonly int
A processing instruction such as <?name param1="1" param2="2"?> has been parsed. The event provides the name component along with the remaining characters as a single character string (i.e. param1="1" param2="2").
StartDocument
of type readonly int
StartElement
of type readonly int
StartPrefixMapping
of type readonly int
xmlns attribute and has mapped a prefix to a URI.
Below is a partial definition of a SAX2 event handler interface. The documentation for each event defines how the parser should normalize the parameters for that event.
Note that handling of characters when TrimTextNodes is true involves buffering each Characters event until the next event arrives. If the next event is not also Characters, then the buffered text has trailing whitespace trimmed and its event is posted to the client. If TrimTextNodes is false, then no buffering occurs.
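The buffering behaviour described above can be sketched in Python. The class `CharacterBuffer`, its `emit` callback, and the `flush` method are hypothetical names. Skipping the event when nothing remains after trimming is an assumption, as the text does not say whether an empty event is posted.

```python
XML_WHITESPACE = " \t\r\n"


class CharacterBuffer:
    """Holds back one Characters event so trailing whitespace can be trimmed."""

    def __init__(self, emit, trim_text_nodes=True):
        self.emit = emit      # callback that receives the posted text
        self.trim = trim_text_nodes
        self.pending = None   # the buffered Characters text, if any

    def characters(self, text):
        if not self.trim:
            self.emit(text)   # no buffering when trimming is disabled
            return
        if self.pending is None:
            text = text.lstrip(XML_WHITESPACE)  # start of a text node
        else:
            self.emit(self.pending)             # interior chunk: no trimming
        self.pending = text

    def flush(self):
        """Call whenever an event other than Characters arrives."""
        if self.pending is not None:
            trimmed = self.pending.rstrip(XML_WHITESPACE)
            if trimmed:  # assumption: skip events left empty by trimming
                self.emit(trimmed)
            self.pending = None


posted = []
buffer = CharacterBuffer(posted.append)
buffer.characters("  hello ")
buffer.characters("world  ")
buffer.flush()
print(posted)  # ['hello ', 'world']
```

Only the most recent chunk is held back, so memory use stays bounded by the size of one Characters event.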
[Constructor]
interface SAX2Normalizer {
attribute ElementStack elementStack;
attribute Parameters normalizationParameters;
attribute char[] currentCharacters;
attribute HashMap pendingNamespaces;
attribute int rewriteCounter;
attribute HashMap rewrittenPrefixes;
void postStartPrefixMappingEvent (DOMString prefix, DOMString uri);
void postStartElementEvent (DOMString uri, DOMString localName, DOMString qName, object[] attrList);
void postEndElementEvent (DOMString uri, DOMString localName, DOMString qName);
void postIgnorableWhitespace (char[] text);
void postComment (char[] comment);
void postCDATA (char[] data);
void postCharacters (char[] text);
};
currentCharacters
of type array of char
When normalizationParameters.TrimTextNodes is true, the text for a Characters event is first placed into this variable. The event function is passed these characters once the following event has been received. In this way, the parser can determine whether to trim whitespace from the end of the string without accumulating the entire text block in memory.
elementStack
of type ElementStack
normalizationParameters
of type Parameters
pendingNamespaces
of type HashMap
rewriteCounter
of type int
When normalizationParameters.PrefixRewrite is "sequential", this attribute is used to generate the new, numbered prefixes. It is initialized to zero.
rewrittenPrefixes
of type HashMap
normalizationParameters.PrefixRewrite.
postCDATA
Characters
event.
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
data | char[] | ✘ | ✘ |
void
postCharacters
If TrimTextNodes is enabled, the characters are buffered in case trailing whitespace needs to be trimmed, based on the type of the next event.
Pseudocode:
void postCharacters(text) {
    if normalizationParameters.TrimTextNodes is true {
        if currentCharacters is empty { // better: if previous event was not EndElement, Characters, or CDATA
            // start of a text node
            trim leading whitespace
        } else {
            output any buffered characters (no trimming)
            currentCharacters := []
        }
    }
    replace all instances of "&" with "&amp;"
    replace all instances of "<" with "&lt;"
    replace all instances of ">" with "&gt;"
    replace all carriage returns ('\r') with "&#xD;"
    replace all tabs ('\t') with "&#x9;"
    if normalizationParameters.TrimTextNodes is true {
        currentCharacters := text
    } else {
        post the event immediately: characters(text)
    }
}
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
text | char[] | ✘ | ✘ |
void
postComment
If IgnoreComments is true, does not post the event.
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
comment | char[] | ✘ | ✘ |
void
postEndElementEvent
Pseudocode:
void postEndElementEvent(uri, localName, qName) {
    trim and post any buffered characters
    context := elementStack.top()
    elementStack.pop(qName) // throws an exception if qNames do not match
    if normalizationParameters.PrefixRewrite is not "none" {
        prefix := rewrittenPrefixes(uri)
        qName := prefix + ":" + localName
    }
    post event: endElement(uri, localName, qName)
}
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
uri | DOMString | ✘ | ✘ | |
localName | DOMString | ✘ | ✘ | |
qName | DOMString | ✘ | ✘ |
void
postIgnorableWhitespace
If TrimTextNodes is true, does not post the event.
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
text | char[] | ✘ | ✘ |
void
postStartElementEvent
Pseudocode:
void postStartElementEvent(uri, localName, qName, attrList) {
    trim and post any buffered characters
    if normalizationParameters.ReturnCharacters references this element {
        postEvent(CDATABlock, element outer XML)
        skip processing of element subtree and EndElement event
        return
    }
    context := elementStack.push(qName)
    for each [prefix, nsUri] pair in pendingNamespaces {
        if context.namespaceContext(prefix) does not match nsUri {
            context.namespaceContext(prefix) := nsUri
            context.outputPrefixes(prefix) := null // remove from outputPrefixes
        }
    }
    pendingNamespaces.removeAll()
    for each xmlns or xmlns:prefix attribute in attrList {
        remove attribute from attrList
    }
    if element is QName aware
        context.isQNameAware = true
    // get a HashMap of prefix -> uri
    // this also rewrites contents of QNameAware attributes
    usedNamespaces := visiblyUsedNamespaces(context, attrList)
    if qName has a prefix and normalizationParameters.PrefixRewrite is not "none" {
        prefix := element prefix
        if rewrittenPrefixes(uri) is not null {
            prefix := rewrittenPrefixes(uri)
        } else if normalizationParameters.PrefixRewrite is "sequential" {
            prefix := "nN" where N is the value of rewriteCounter
            increment rewriteCounter
            rewrittenPrefixes(uri) := prefix
        } else if normalizationParameters.PrefixRewrite is a HashMap and it contains a value for the uri {
            prefix := normalizationParameters.PrefixRewrite(uri)
            rewrittenPrefixes(uri) := prefix
        }
        qName := prefix + ":" + localName
    }
    append any default attributes for the element to attrList
    for each [name, value] in attrList {
        if name has a prefix other than 'xml' and normalizationParameters.PrefixRewrite is not "none" {
            // all prefixes have been enumerated by now
            split name into prefix and local
            attrUri := context.namespaceContext(prefix)
            if rewrittenPrefixes(attrUri) is not null {
                prefix := rewrittenPrefixes(attrUri)
                name := prefix + ":" + local // replace name in attrList
            }
        }
        normalize attribute value
    }
    for each [prefix, nsUri] pair in usedNamespaces {
        if prefix is an empty string {
            insert new attribute with name "xmlns" and value nsUri at start of attributes
        } else {
            insert new attribute with name "xmlns:" + prefix and value nsUri at start of attributes
        }
    }
    post event: startElement(uri, localName, qName, attrList)
}
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
uri | DOMString | ✘ | ✘ | |
localName | DOMString | ✘ | ✘ | |
qName | DOMString | ✘ | ✘ | |
attrList | object[] | ✘ | ✘ |
void
postStartPrefixMappingEvent
Parameter | Type | Nullable | Optional | Description |
---|---|---|---|---|
prefix | DOMString | ✘ | ✘ | |
uri | DOMString | ✘ | ✘ |
void
Dated references below are to the latest known or appropriate edition of the referenced work. The referenced works may be subject to revision, and conformant implementations may follow, and are encouraged to investigate the appropriateness of following, some or all more recent editions or replacements of the works cited. It is in each case implementation-defined which editions are supported.