Copyright © 2010 W3C ® ( MIT , ERCIM , Keio ), All Rights Reserved. W3C liability , trademark and document use rules apply.
Canonical XML Version 2.0 is a major rewrite of Canonical XML Version 1.1 and Exclusive Canonical XML 1.0 to address issues around performance, streaming, hardware implementation, robustness, minimizing attack surface, determining what is signed and more. It also incorporates an update to Exclusive Canonicalization, effectively combining the inclusive and exclusive canonicalization algorithms into a single 2.0 algorithm that takes the canonicalization mode as a parameter.
Any XML document is part of a set of XML documents that are logically equivalent within an application context, but which vary in physical representation based on syntactic changes permitted by XML 1.0 [ XML10 ] and Namespaces in XML 1.0 [ XML-NAMES ]. This specification describes a method for generating a physical representation, the canonical form, of an XML document that accounts for the permissible changes. Except for limitations regarding a few unusual cases, if two documents have the same canonical form, then the two documents are logically equivalent within the given application context. Note that two documents may have differing canonical forms yet still be equivalent in a given context based on application-specific equivalence rules for which no generalized XML specification could account.
Canonical XML Version 2.0 is applicable to XML 1.0. It is not defined for XML 1.1.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is a W3C Working Draft of "Canonical XML Version 2.0".
This document is expected to be further updated based on both Working Group input and public comments.
A diff-marked version of this specification that highlights changes against the previous version is available.
Major changes in this version:
ignoreDTD and expandEntities have been removed.
The xmlBaseAncestors, xmlIdAncestors, xmlLangAncestors and xmlSpaceAncestors parameters have been combined into XmlAncestors.
The parameter xsiTypeAware has been generalized to a QNameAware parameter.
This document was published by the XML Security Working Group as a Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-xmlsec@w3.org ( subscribe , archives ). All feedback is welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy . W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .
The key words " must ", " must not ", " required ", " shall ", " shall not ", " should ", " should not ", " recommended ", " may ", and " optional " in this document are to be interpreted as described in RFC 2119 [ RFC2119 ].
See [ XML-NAMES ] for the definition of QName.
Since the XML 1.0 Recommendation [ XML10 ] and the Namespaces in XML 1.0 Recommendation [ XML-NAMES ] define multiple syntactic methods for expressing the same information, XML applications tend to take liberties with changes that have no impact on the information content of the document. XML canonicalization is designed to be useful to applications that require the ability to test whether the information content of a document or document subset has been changed. This is done by comparing the canonical form of the original document before application processing with the canonical form of the document result of the application processing.
For example, a digital signature over the canonical form of an XML document or document subset would allow the signature digest calculations to be oblivious to changes in the original document's physical representation, provided that the changes are defined to be logically equivalent by the XML 1.0 or Namespaces in XML 1.0. During signature generation, the digest is computed over the canonical form of the document. The document is then transferred to the relying party, which validates the signature by reading the document and computing a digest of the canonical form of the received document. The equivalence of the digests computed by the signing and relying parties (and hence the equivalence of the canonical forms over which they were computed) ensures that the information content of the document has not been altered since it was signed.
Note: Although not stated as a requirement on implementations, nor formally proved to be the case, it is the intent of this specification that if the text generated by canonicalizing a document according to this specification is itself parsed and canonicalized according to this specification, the text generated by the second canonicalization will be the same as that generated by the first canonicalization.
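The digest comparison described above can be sketched in Python. This is a minimal illustration, not part of the specification: it assumes Python 3.8 or later, whose standard library function `xml.etree.ElementTree.canonicalize` follows the final Canonical XML 2.0 Recommendation rather than this draft, and the helper name `canonical_digest`, the sample documents, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
from xml.etree.ElementTree import canonicalize


def canonical_digest(xml_text: str) -> str:
    """Return the SHA-256 hex digest of the canonical form of xml_text."""
    canonical = canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same information content, different physical representation:
# attribute order, quote style and empty-element syntax differ.
original = '<doc b="2" a="1"><item/></doc>'
received = "<doc a='1'  b='2'><item></item></doc>"
# Different information content: the value of attribute b has changed.
tampered = '<doc a="1" b="3"><item/></doc>'

digests_match = canonical_digest(original) == canonical_digest(received)
tampering_detected = canonical_digest(original) != canonical_digest(tampered)

# Canonicalizing an already canonical document should leave it unchanged.
once = canonicalize(original)
twice = canonicalize(once)
```

The relying party only needs the received document and the signed digest; the physical differences between `original` and `received` do not affect the comparison.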
Two XML documents may have differing information content that is nonetheless logically equivalent within a given application context. Although two XML documents are equivalent (aside from limitations given in this section) if their canonical forms are identical, it is not a goal of this work to establish a method such that two XML documents are equivalent if and only if their canonical forms are identical. Such a method is unachievable, in part due to application-specific rules such as those governing unimportant whitespace and equivalent data (e.g. <color>black</color> versus <color>rgb(0,0,0)</color>). There are also equivalencies established by other W3C Recommendations and Working Drafts. Accounting for these additional equivalence rules is beyond the scope of this work. They can be applied by the application or become the subject of future specifications.
The canonical form of an XML document may not be completely operational within the application context, though the circumstances under which this occurs are unusual. This problem may be of concern in certain applications since the canonical form of a document and the canonical form of the canonical form of the document are equivalent. For example, in a digital signature application, it cannot be established whether the operational original document or the non-operational canonical form was signed because the canonical form can be substituted for the original document without changing the digest calculation. However, the security risk only occurs in the unusual circumstances described below, which can all be resolved or at least detected prior to digital signature generation.
The difficulties arise due to the loss of the following information not available in the data model: base URI information (the first case below), external unparsed entity references and notations (the second case), and attribute types (the third case).
In the first case, note that a document containing a relative URI [URI] is only operational when accessed from a specific URI that provides the proper base URI. In addition, if the document contains external general parsed entity references to content containing relative URIs, then the relative URIs will not be operational in the canonical form, which replaces the entity reference with internal content (thereby implicitly changing the default base URI of that content). Both of these problems can typically be solved by adding support for the xml:base attribute [XMLBASE] to the application, then adding appropriate xml:base attributes to the document element and all top-level elements in external entities. In addition, applications often have an opportunity to resolve relative URIs prior to the need for a canonical form. For example, in a digital signature application, a document is often retrieved and processed prior to signature generation. The processing should create a new document in which relative URIs have been converted to absolute URIs, thereby mitigating any security risk for the new document.
In the second case, the loss of external unparsed entity references and the notations that bind them to applications means that canonical forms cannot properly distinguish among XML documents that incorporate unparsed data via this mechanism. This is an unusual case precisely because most XML processors currently discard the document type declaration, which discards the notation, the entity's binding to a URI, and the attribute type that binds the attribute value to an entity name. For documents that must be subjected to more than one XML processor, the XML design typically indicates a reference to unparsed data using a URI in the attribute value.
In the third case, the loss of attribute types can affect the canonical form in different ways depending on the type. Attributes of type ID cease to be ID attributes. Hence, any XPath expressions that refer to the canonical form using the id() function cease to operate. The attribute types ENTITY and ENTITIES are not part of this case; they are covered in the second case above. Attributes of enumerated type and of type ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, and NOTATION fail to be appropriately constrained during future attempts to change the attribute value if the canonical form replaces the original document during application processing. Applications can avoid the difficulties of this case by ensuring that an appropriate document type declaration is prepended prior to using the canonical form in further XML processing. This is likely to be an easy task since attribute lists are usually acquired from a standard external DTD subset, and any entity and notation declarations not also in the external DTD subset are typically constructed from application configuration information and added to the internal DTD subset.
While these limitations are not severe, it would be possible to resolve them in a future version of XML canonicalization if, for example, a new version of XPath were created based on the XML Information Set [XML-INFOSET] currently under development at the W3C.
Canonical XML 2.0 solves many of the major issues that have been identified by implementers with Canonical XML 1.0 [XML-C14N] and 1.1 [XML-C14N11].
A major factor in performance issues noted in XML Signature is often Canonical XML 1.1 processing. Canonicalization will be slow if the implementation uses the Canonical XML 1.1 specification as a formula without any attempt at optimization. This specification rectifies this problem by incorporating lessons learned from implementation into the specification.
Most mature canonicalization implementations solve the performance problem by inspecting the signature first, to see if it can be canonicalized using a simple tree walk algorithm whose performance is similar to regular XML serialization. If not, they fall back to the expensive nodeset-based algorithm.
The use cases that cannot be addressed by the simple tree walk algorithm are mostly edge use cases. This specification restricts the input to the canonicalization algorithm, so that implementations can always use the simple tree walk algorithm.
C14N 1.x uses an "XPath 1.0 Nodeset" to describe a document subset. This is the root cause of the performance problem and can be solved by not using a nodeset. This version of the specification does not use a nodeset, visits each node exactly once, and only visits the nodes that are being canonicalized.
A streaming implementation is required to be able to process very large documents without holding them all in memory; it should be able to process documents one chunk at a time.
Whitespace handling was a common cause of signature breakage. XML libraries allow one to "pretty print" an XML document, and most people wrongly assume that the white space introduced by pretty printing will be removed by canonicalization, but that is not the case. This specification adds three techniques to improve robustness:
C14N 1.x algorithms are complex and depend on a full XPath library. This increases the work required for scripting languages to use XML Signatures. This specification addresses this issue by not using the complex nodeset model, and therefore not relying completely on XPath; it also introduces a minimal canonicalization mode.
The input to the canonicalization algorithm consists of an XML document subset and a set of options. The XML document subset can be expressed in two ways, with a DOM model or a Stream model.
In the DOM model the XML subset is expressed as:
An Inclusion list: either a document node D or a list of one or more element nodes E1, E2, ... En. (If an element node Ei is a descendant of another Ej, then that element node Ei is ignored.)
An Exclusion list: a list of zero or more element nodes E1, E2, ... Em and a list of zero or more attribute nodes A1, A2, ... AM.
xml
namespace.
The element nodes in the Inclusion list are also referred to as apex nodes.
Note: This input model is a very limited form of the generic XPath Nodeset that was the input model for Canonical XML 1.x. It is designed to be simple and allow for a high performance algorithm, while still supporting the most essential use cases. Specifically:
Note: Canonical XML 2.0, unlike earlier versions, does not support direct input of an octet stream. The transformation of such a stream into the input model required by this specification is application-specific and should be defined in specifications that reference or make use of this one.
Instead of separate algorithms for each variant of canonicalization, this specification takes the approach of a single algorithm subject to a variety of parameters that change its behavior to address specific use cases.
The following is a list of the logical parameters supported by this algorithm. The actual serialization that expresses the parameters in use may be defined as appropriate to specific applications of this specification (e.g., the <ds:CanonicalizationMethod> element in [XMLDSIG-CORE2]).
Name | Values | Description | Default |
---|---|---|---|
ExclusiveMode | true or false | whether to do inclusive or exclusive dealing of namespaces. In exclusive mode the | false |
InclusiveNamespace | space separated list of prefixes | list of prefixes to be treated inclusively. Special token #default indicates the default namespace. | empty |
IgnoreComments | true or false | whether to ignore comments during canonicalization | true |
TrimTextNodes | true or false | whether to trim (i.e. remove leading and trailing whitespace) all text nodes when canonicalizing. Adjacent text nodes must be coalesced prior to trimming. If an element has an xml:space="preserve" attribute, then text | false |
Serialization | | whether to do the normal XML serialization ( http://www.w3.org/2010/xml-c14n2#serializeXML ), or do an EXI serialization ( http://www.w3.org/2010/xml-c14n2#serializeEXI ), which is useful if the original document to be | |
PrefixRewrite | none, sequential, derived | with none, prefixes are not changed; with sequential, prefixes are changed to n1, n2, n3 ...; and with derived, each prefix is changed to nSuffix, where the suffix is derived by doing a digest of the namespace URI. | none |
SortAttributes | true or false | whether the attributes need to be sorted before canonicalization. In some environments the order of attributes changes in transit so sorting is important. | true |
XmlAncestors | | xml:lang and xml:space ) and combine the xml:base i.e. similar to | |
QNameAware | | | |
The defaults are chosen for equivalence to Canonical XML 1.1 with comments ignored.
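As an illustration only, the parameters whose values and defaults are fully stated in the table above could be modelled as a small configuration object. The class name `C14N2Parameters`, the field names, and the helper `parse_inclusive_namespace` are hypothetical; Serialization, XmlAncestors and QNameAware are left out because their values are not fully recoverable from the table, and "digest" is accepted alongside "derived" only because later sections of this document use that name for the same mode.

```python
from dataclasses import dataclass


def parse_inclusive_namespace(value: str) -> tuple:
    """Split a space separated prefix list; "#default" maps to the empty prefix."""
    return tuple("" if token == "#default" else token for token in value.split())


@dataclass(frozen=True)
class C14N2Parameters:
    exclusive_mode: bool = False        # ExclusiveMode
    inclusive_namespace: tuple = ()     # InclusiveNamespace (empty by default)
    ignore_comments: bool = True        # IgnoreComments
    trim_text_nodes: bool = False       # TrimTextNodes
    prefix_rewrite: str = "none"        # PrefixRewrite
    sort_attributes: bool = True        # SortAttributes

    def __post_init__(self):
        # Assumption: "derived" (table) and "digest" (later text) name the same mode.
        allowed = {"none", "sequential", "derived", "digest"}
        if self.prefix_rewrite not in allowed:
            raise ValueError(f"unsupported PrefixRewrite value: {self.prefix_rewrite!r}")


defaults = C14N2Parameters()
exclusive = C14N2Parameters(
    exclusive_mode=True,
    inclusive_namespace=parse_inclusive_namespace("#default ds"),
)
```

Making the object immutable keeps one parameter set fixed for the duration of a canonicalization run, which matches the way the algorithm below consults the parameters at every node.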
Implementations may not support all of these parameters. We have identified the following profiles.
Name | Objective | Supported parameters | Unsupported parameter |
---|---|---|---|
| | ExclusiveMode=true/false, InclusiveNamespace, IgnoreComments=true/false, SortAttributes=true and XmlAncestors=inherit/none. | TrimTextNodes=false, Serialization=Xml, PrefixRewrite=none, QNameAware="" |
| | ExclusiveMode=true, XmlAncestors=none and SortAttributes=true. The input to Canonicalization should only be a single complete subtree identified by ID. There is no XPath involved in this profile and hence no associated complexities on visible utilization of prefixes in | InclusiveNamespace="", IgnoreComments=true, TrimTextNodes=false, Serialization=Xml, PrefixRewrite=none, QNameAware="" |
| Note "SortAttributes" and "XmlAncestors" may be difficult to support Streaming canonicalization proposal ) | | |
The basic canonicalization process consists of traversing the tree and outputting octets for each node.
Input: The XML subset consisting of an Inclusion list and an Exclusion list.
Processing:
If the Inclusion list is a document node D, there is nothing to sort. Otherwise remove all element nodes Ei that are descendants of some other element node in the Inclusion list. Then sort the remaining element nodes E1, E2, ... En by document order.
For each element node Ei or document node D in the sorted list, do a depth first traversal to visit all the nodes in the Ei subtree, and canonicalize each one of them.
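The preparation of the Inclusion list can be sketched as follows. This is a hedged illustration using Python's `xml.etree.ElementTree`; the function name `normalize_inclusion_list` and the sample tree are hypothetical, and the Exclusion list is not handled.

```python
import xml.etree.ElementTree as ET


def normalize_inclusion_list(root, included):
    """Drop elements nested inside other included elements, then sort in document order."""
    # root.iter() yields elements in document order (depth first, pre-order).
    order = {element: index for index, element in enumerate(root.iter())}
    # ElementTree has no parent pointers, so build a child -> parent map once.
    parent = {child: element for element in root.iter() for child in element}
    chosen = set(included)

    def has_included_ancestor(element):
        node = parent.get(element)
        while node is not None:
            if node in chosen:
                return True
            node = parent.get(node)
        return False

    apexes = [element for element in included if not has_included_ancestor(element)]
    return sorted(apexes, key=lambda element: order[element])


root = ET.fromstring("<r><a><b/></a><c/><d><e/></d></r>")
a, b = root.find("a"), root.find("a/b")
c, e = root.find("c"), root.find("d/e")

# b is a descendant of a, so it is ignored; the rest are sorted a, c, e.
apex_nodes = normalize_inclusion_list(root, [e, c, b, a])
```

Each remaining apex node is then the starting point of one depth first traversal, so every node in the subset is visited exactly once.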
During traversal of each subtree, generate the canonicalized text depending on the node type as follows:
Element nodes: an open angle bracket (<), the element QName, the result of processing the namespaces, the result of processing the attributes, a close angle bracket (>), traverse the child nodes of the element, an open angle bracket (<), a forward slash (/), the element QName, and a close angle bracket (>). Note: if the prefix rewriting parameter is set, the QNames are written with the rewritten prefixes.
Attribute nodes: the string value is modified by replacing all ampersands (&) with &amp;, all open angle brackets (<) with &lt;, all quotation mark characters with &quot;, and the whitespace characters #x9, #xA, and #xD, with character references. The character references are written in uppercase hexadecimal with no leading zeroes (for example, #xD is represented by the character reference &#xD;). If the prefix rewriting parameter is set, and the attribute name has a namespace prefix, the prefix is changed to the rewritten prefix. Also with prefix rewriting enabled, the attribute content is treated specially if the attribute is among those enumerated for the QNameAware option. If so, the QName or [CURIE] value of the attribute is rewritten with the new prefix.
Namespace nodes: a namespace node N is output in the same way as an attribute node.
Text nodes: all ampersands are replaced by &amp;, all open angle brackets (<) are replaced by &lt;, all closing angle brackets (>) are replaced by &gt;, and all #xD characters are replaced by &#xD;. If TrimTextNodes is true and there is no xml:space="preserve" declaration in scope, leading and trailing whitespace is trimmed, e.g. trim <A> <B/> to <A><B/> and trim <A> this is text </A> to <A>this is text</A>. Note: The DOM parser might have split up a long text node into multiple adjacent text nodes, some of which may be empty. Be aware when trimming the leading and trailing whitespace in such cases; the net result should be equivalent to doing so as if the adjacent text nodes were concatenated. If the prefix rewriting parameter is set, and if the parent element node is among those enumerated for the QNameAware option, then the QName or CURIE value of the text node is rewritten with the new prefix.
Processing instruction (PI) nodes: the opening PI symbol (<?), the PI target name of the node, a leading space and the string value if it is not empty, and the closing PI symbol (?>). If the string value is empty, then the leading space is not added. Also, a trailing #xA is rendered after the closing PI symbol for PI children of the root node with a lesser document order than the document element, and a leading #xA is rendered before the opening PI symbol of PI children of the root node with a greater document order than the document element.
Comment nodes: the opening comment symbol (<!--), the string value of the node, and the closing comment symbol (-->). Also, a trailing #xA is rendered after the closing comment symbol for comment children of the root node with a lesser document order than the document element, and a leading #xA is rendered before the opening comment symbol of comment children of the root node with a greater document order than the document element. (Comment children of the root node represent comments outside of the top-level document element and outside of the document type declaration).
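The character replacement rules for attribute values and text nodes, together with the trimming rule and the PI rule, can be sketched in Python as below. The function names are hypothetical and the sketch covers only string handling: prefix rewriting, QNameAware processing, and the #xA rendering around the document element are not modelled.

```python
def escape_attribute_value(value: str) -> str:
    """Escape an attribute value; the ampersand must be replaced first."""
    return (
        value.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace('"', "&quot;")
        .replace("\t", "&#x9;")
        .replace("\n", "&#xA;")
        .replace("\r", "&#xD;")
    )


def serialize_text(chunks, trim=False, preserve_space=False) -> str:
    """Coalesce adjacent text nodes, optionally trim, then escape the result."""
    text = "".join(chunks)  # adjacent text nodes are treated as one
    if trim and not preserve_space:
        text = text.strip(" \t\r\n")  # XML whitespace characters only
    return (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace("\r", "&#xD;")
    )


def serialize_pi(target: str, value: str) -> str:
    """Render a processing instruction; no space is added for an empty value."""
    return f"<?{target} {value}?>" if value else f"<?{target}?>"


attr = escape_attribute_value('a&b<"c"\n')
trimmed = serialize_text([" this is", " text ", ""], trim=True)
escaped = serialize_text(["a < b && c > d\r"])
```

Trimming happens on the concatenated text rather than on each chunk, so a parser that splits one long text node into several pieces produces the same output as one that does not.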
Note: although some XML models such as DOM don't distinguish namespace declarations from attributes, canonicalization needs to treat them separately. In this document, attribute nodes that are actually namespace declarations are referred to as "namespace nodes"; other attributes are called "attribute nodes".
In some cases, particularly for signed XML in protocol applications, there is a need to canonicalize a subdocument in such a way that it is substantially independent of its XML context. This is because, in protocol applications, it is common to envelope XML in various layers of message or transport elements, to strip off such enveloping, and to construct new protocol messages, parts of which were extracted from different messages previously received. If the pieces of XML in question are signed, they need to be canonicalized in a way such that these operations do not break the signature but the signature still provides as much security as can be practically obtained.
As a simple example of the type of problem that changes in XML context can cause for signatures, consider the following document:
<n1:elem1 xmlns:n1="http://b.example"> content </n1:elem1>
this is then enveloped in another document:
<n0:pdu xmlns:n0="http://a.example"> <n1:elem1 xmlns:n1="http://b.example"> content </n1:elem1> </n0:pdu>
The first document above is in canonical form. But assume that document is enveloped as in the second case. The subdocument with elem1 as its apex node can be extracted from this second case with an XPath expression such as:
/descendant::n1:elem1
The result of performing inclusive canonicalization to the resulting xml subset is the following (except for line wrapping to fit this document):
<n1:elem1 xmlns:n0="http://a.example" xmlns:n1="http://b.example"> content </n1:elem1>
Note that the n0 namespace has been included by inclusive canonicalization because it includes namespace context. This change would break a signature over elem1 based on the first version.
As a more complete example of the changes in canonical form that can occur when the enveloping context of a document subset is changed, consider the following document:
<n0:local xmlns:n0="foo:bar" xmlns:n3="ftp://example.org"> <n1:elem2 xmlns:n1="http://example.net"> <n3:stuff xmlns:n3="ftp://example.org"/> </n1:elem2> </n0:local>
And the following which has been produced by changing the enveloping of elem2:
<n2:pdu xmlns:n1="http://example.com" xmlns:n2="http://foo.example"> <n1:elem2 xmlns:n1="http://example.net"> <n3:stuff xmlns:n3="ftp://example.org"/> </n1:elem2> </n2:pdu>
Assume an xml subset produced from each case by applying the following XPath expression:
/descendant::n1:elem2
Applying inclusive canonicalization to the xml subset produced from the first document yields the following serialization:
<n1:elem2 xmlns:n0="foo:bar" xmlns:n1="http://example.net" xmlns:n3="ftp://example.org"> <n3:stuff></n3:stuff> </n1:elem2>
However, although elem2 is represented by the same octet sequence in both pieces of external XML above, the Canonical XML version of elem2 from the second case would be (except for line wrapping so it will fit into this document) as follows:
<n1:elem2 xmlns:n1="http://example.net" xmlns:n2="http://foo.example"> <n3:stuff xmlns:n3="ftp://example.org"></n3:stuff> </n1:elem2>
Note that the change in context has resulted in lots of changes in the subdocument as serialized by the inclusive canonicalization. In the first example, n0 had been included from the context and the presence of an identical n3 namespace declaration in the context had elevated that declaration to the apex of the canonicalized form. In the second example, n0 has gone away but n2 has appeared, and n3 is no longer elevated. But not all context changes have effect. In the second example, the presence at ancestor nodes of the n1 prefix namespace declaration has no effect because of existing declarations at the elem2 node.
On the other hand, using Exclusive canonicalization with XmlAncestors="none", the physical form of elem2 as extracted by the XPath expression above is (except for line wrapping so it will fit into this document) as follows:
<n1:elem2 xmlns:n1="http://example.net"> <n3:stuff xmlns:n3="ftp://example.org"></n3:stuff> </n1:elem2>
in both cases.
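The difference between the two modes at an apex node can be reduced to a small selection step, sketched below in Python. The function name `namespaces_to_output` and its arguments are hypothetical; the sketch assumes the in-scope declarations and the visibly utilized prefixes have already been computed, and it ignores the InclusiveNamespace parameter, declarations already output by ancestors, and prefix rewriting.

```python
def namespaces_to_output(in_scope, visibly_utilized, exclusive):
    """Render the namespace declarations for an apex element.

    in_scope maps prefix -> namespace URI ("" is the default namespace);
    visibly_utilized is the set of prefixes used by the element or its attributes.
    """
    if exclusive:
        selected = {p: uri for p, uri in in_scope.items() if p in visibly_utilized}
    else:
        selected = dict(in_scope)
    # Sorted by prefix; the empty prefix (default namespace) sorts first.
    return "".join(
        f' xmlns:{prefix}="{uri}"' if prefix else f' xmlns="{uri}"'
        for prefix, uri in sorted(selected.items())
    )


# elem1 from the first example, after being enveloped in n0:pdu.
in_scope = {"n0": "http://a.example", "n1": "http://b.example"}
used = {"n1"}

inclusive_decls = namespaces_to_output(in_scope, used, exclusive=False)
exclusive_decls = namespaces_to_output(in_scope, used, exclusive=True)
```

In inclusive mode the enveloping n0 declaration leaks into the output, while exclusive mode emits only n1, so the serialization of elem1 is the same before and after enveloping.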
As part of the canonicalization process, while traversing the subtree, use the following algorithm to look at all the namespace declarations in an element, and decide which ones to output.
The following concepts are used in Namespace processing:
createElementNS and createAttributeNS methods, then DOM adds a namespace declaration automatically when serializing the document.
xmlns="..." . To make the algorithm simpler this will be treated as a namespace declaration whose prefix value is "", i.e. an empty string.
An element E in the document subset visibly utilizes a namespace declaration, i.e. a namespace prefix P and bound value V, if any of the following conditions are true:
The element E itself has a qualified name that uses the prefix P. (Note: if an element does not have a prefix, that means it visibly utilizes the default namespace.)
The element E is among those enumerated for the QNameAware option, and the QName or CURIE value of the element uses the prefix P (or, lacking a prefix, it visibly utilizes the default namespace).
An attribute A of that element has a qualified name that uses the prefix P, and that attribute is not in the exclusion list. (Note: unlike elements, if an attribute doesn't have a prefix, it does not visibly utilize the default namespace.)
An attribute A of that element is among those enumerated for the QNameAware option, and the QName or CURIE value of the attribute uses the prefix P (or, lacking a prefix, it visibly utilizes the default namespace).
IncludedXPath and ExcludedXPath attributes in an XML Signature 2.0 Transform. Any prefixes used in this XPath expression are considered to be visibly utilized.
Step 1: At first determine the namespaces to be output for an element E.
Build the list of namespace declarations for E by looking at both implicit and explicit namespace declarations in this element and its ancestors.
If a namespace declaration in this list has already been output for one of E's ancestors, say Ej, and has not been redeclared since then to a different value, i.e. not been redeclared by an element between Ej and E, then remove it from this list.
A prefix is treated in exclusive mode with ExclusiveMode="true" and this prefix being absent from parameter InclusiveNamespaces. For the prefixes that are to be treated in exclusive mode, check if the prefix is visibly utilized by E, and if it is not then remove it.
Step 2: If the PrefixRewrite option is set to other than "none", then compute new prefixes for all the namespace declarations in this list, except the prefixes starting with "xml", as follows:
With PrefixRewrite="sequential" sort this list of namespace declarations by URI. Then assign a new prefix value "nN" to each prefix, incrementing the value of N for every prefix. The counter should be set to 0 in the beginning of the canonicalization. (E.g. if the value of this counter was 5 when the traversal reached this element, and this element had 3 prefixes to be output, then use the prefixes "n5", "n6", "n7" and set the counter to 8 after that).
With PrefixRewrite="digest" assign new prefix values "nD" to each prefix in this list where D is SHA1 digest of the URI,
The "sequential" mode of prefix rewriting has the advantage of a smaller canonicalization output than the "digest" mode, but the downside is that it may result in different namespace prefixes in different contexts, see the example below. With the "digest" mode the namespace prefixes will be identical across documents and contexts.
Note: with prefix rewriting the default namespace is never output, i.e. it is also rewritten into a new prefix.
Note: with exclusive canonicalization namespace declarations are output only when they are utilized; this may lead to one declaration being output multiple times, and if the PrefixRewrite parameter is set to sequential, it may be rewritten to a different value every time.
Step 3: If SortAttributes="true", which is the default, then sort this list of namespaces as follows:
In case of PrefixRewrite="none" sort the namespace declarations in lexicographic (ascending) order of prefixes (the default namespace declaration has no prefix, so it is lexicographically least).
In case of PrefixRewrite="sequential" or PrefixRewrite="digest" sort them in ascending order of namespace URI.
Step 4: Output each of these namespace nodes, as specified in the Processing model .
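The two prefix rewriting modes of Step 2 can be sketched as follows. The class `SequentialRewriter` and the function `digest_prefix` are hypothetical names. The text above only says that D is the SHA1 digest of the URI, so the UTF-8 encoding of the URI and the lowercase hexadecimal rendering of the digest are assumptions made for this sketch.

```python
import hashlib


class SequentialRewriter:
    """Assigns n0, n1, n2 ... in URI order; the counter lives for one canonicalization."""

    def __init__(self):
        self.counter = 0  # set to 0 at the beginning of the canonicalization

    def rewrite(self, uris):
        """Return a URI -> new prefix map for the declarations output at one element."""
        mapping = {}
        for uri in sorted(uris):
            mapping[uri] = f"n{self.counter}"
            self.counter += 1
        return mapping


def digest_prefix(uri: str) -> str:
    """Return "n" followed by the SHA-1 digest of the URI (hex form is an assumption)."""
    return "n" + hashlib.sha1(uri.encode("utf-8")).hexdigest()


rewriter = SequentialRewriter()
rewriter.counter = 5  # pretend five prefixes were already assigned earlier in the walk
first = rewriter.rewrite(["urn:c", "urn:a", "urn:b"])

# The same URI output again at a later element receives a different prefix.
second = rewriter.rewrite(["urn:a"])
```

The sequential mode depends on traversal state, which is why one namespace can appear under several prefixes, whereas `digest_prefix` is a pure function of the URI and gives the same prefix in every document and context.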
<wsse:Security xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"> <wsse:UserName wsu:Id="i1"> ... </wsse:UserName> <wsse:Timestamp wsu:Id="i2"> ... </wsse:Timestamp> </wsse:Security>
PrefixRewrite="none"
<wsse:Security xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"> <wsse:UserName xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" wsu:Id="i1"> ... </wsse:UserName> <wsse:Timestamp xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" wsu:Id="i2"> ... </wsse:Timestamp> </wsse:Security>
Note how the "wsu" prefix declaration is present in wsse:Security, but is not utilized. So exclusive canonicalization will "push the declaration down" into <UserName> and <Timestamp> where it is really used, i.e. the wsu declaration will be output twice, once in <UserName> and once in <Timestamp>, as shown above.
PrefixRewrite="sequential"
<n0:Security xmlns:n0="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"> <n0:UserName xmlns:n1="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" n1:Id="i1"> ... </n0:UserName> <n0:Timestamp xmlns:n2="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" n2:Id="i2"> ... </n0:Timestamp> </n0:Security>
Now observe what happens with sequential prefix rewriting: the wsu namespace is emitted twice, but each time with a different prefix, "n1" and "n2", as shown above.
PrefixRewrite="digest"
<n533be3d902dc7f54d5027ddd5917639d584e9d38:Security xmlns:n533be3d902dc7f54d5027ddd5917639d584e9d38="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"> <n533be3d902dc7f54d5027ddd5917639d584e9d38:UserName xmlns:ne2891a804ace8fbcc4a500f1dbc94cf01e38e023="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" ne2891a804ace8fbcc4a500f1dbc94cf01e38e023:Id="i1"> ... </n533be3d902dc7f54d5027ddd5917639d584e9d38:UserName> <n533be3d902dc7f54d5027ddd5917639d584e9d38:Timestamp xmlns:ne2891a804ace8fbcc4a500f1dbc94cf01e38e023="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" ne2891a804ace8fbcc4a500f1dbc94cf01e38e023:Id="i2"> ... </n533be3d902dc7f54d5027ddd5917639d584e9d38:Timestamp> </n533be3d902dc7f54d5027ddd5917639d584e9d38:Security>
With digest prefix rewriting the wsu namespace is emitted twice as well, but with the same prefix every time. The downside is that the prefixes are very long.
Note: namespace declarations are not considered as attributes, they are processed separately as namespace nodes.
Processing the attributes of an element E consists of the following steps:
If E is an apex node, then examine all of E's ancestors for the nearest occurrences of simple inheritable attributes in the xml namespace, such as xml:lang and xml:space, that are not already present in E's attributes. Then temporarily add these attributes to E's attribute list. (Do this step only if the parameter XmlAncestors is set to "inherit".)
The xml:base attribute is not a simple inheritable attribute and requires special processing beyond a simple redeclaration. Collect the values of xml:base for all of E's ancestors, starting with the document root element and including E itself, into an ordered list. If there are two or more values in the list, then combine them two at a time, starting from the beginning, using the join-URI-references function. E.g. if the list has X1, X2, ... Xm, then join X1 and X2 first, then join the result with X3, and so on. (Do this step only if the parameter XmlAncestors is set to "inherit".)
If the PrefixRewrite option is set to a value other than "none", modify the QNames of the attribute names to use the new prefixes.
xsi:type and QNameAware
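The two ancestor-related steps above can be sketched in a few lines. This is only an illustration under stated assumptions: attributes are modelled as plain dictionaries, ancestors are listed root first, and the `join` argument is a caller-supplied stand-in for the join-URI-references function, which is not implemented here. All function names are hypothetical.

```python
from functools import reduce


def inherit_simple_xml_attrs(own_attrs, ancestor_attrs, names=("xml:lang", "xml:space")):
    """Add the nearest ancestor value of each simple inheritable attribute.

    own_attrs: dict of E's own attributes.
    ancestor_attrs: list of attribute dicts, document root first, parent last.
    """
    result = dict(own_attrs)
    for name in names:
        if name in result:
            continue  # E's own value is kept
        for attrs in reversed(ancestor_attrs):  # nearest ancestor first
            if name in attrs:
                result[name] = attrs[name]
                break
    return result


def combine_xml_base(values, join):
    """Combine xml:base values two at a time, starting from the beginning.

    values: xml:base values, document root first, E last.
    join: stand-in for join-URI-references (an assumption, supplied by the caller).
    """
    if not values:
        return None
    return reduce(join, values)  # join(join(X1, X2), X3) ...


attrs = inherit_simple_xml_attrs(
    {"id": "e"},
    [{"xml:lang": "en", "xml:space": "preserve"}, {"xml:lang": "fr"}],
)
order = combine_xml_base(["X1", "X2", "X3"], lambda a, b: f"({a}+{b})")
```

The placeholder `join` used in the usage lines only records the order of combination; a real implementation would apply the modified RFC 3986 resolution described for join-URI-references.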
The join-URI-References function takes xml:base attribute values from all the ancestor elements and combines them to create a value for an updated xml:base attribute. A simple method for doing this is similar to that found in sections 5.2.1, 5.2.2 and 5.2.4 of RFC 3986 with the following modifications:
"abc/" and "../" should result in ""
"../" and "../" are combined as "../../" and the result is "../../"
".." and ".." are combined as "../../" and the result is "../../"
Exclusive Canonicalization may be used as a canonicalization algorithm in XML Digital Signature [ XMLDSIG-CORE2 ], via the <ds:CanonicalizationMethod> element.
Canonical XML 2.0 supports a set of parameters, as enumerated in Canonicalization Parameters. All parameters are optional and have default values. When used in conjunction with the <ds:CanonicalizationMethod> element, each parameter is expressed with a dedicated child element. They can be present in any order. A schema definition for each parameter follows:
Schema Definition:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.w3.org/2010/xml-c14n2"
           targetNamespace="http://www.w3.org/2010/xml-c14n2"
           version="0.1" elementFormDefault="qualified">
  <xs:element name="ExclusiveMode" type="xs:boolean"/>
  <xs:element name="InclusiveNamespaces">
    <xs:complexType>
      <xs:attribute name="PrefixList">
        <xs:simpleType>
          <xs:list itemType="xs:string"/>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="IgnoreComments" type="xs:boolean"/>
  <xs:element name="TrimTextNodes" type="xs:boolean"/>
  <xs:element name="Serialization">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="XML"/>
        <xs:enumeration value="EXI"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="PrefixRewrite">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="none"/>
        <xs:enumeration value="sequential"/>
        <xs:enumeration value="digest"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="SortAttributes" type="xs:boolean"/>
  <xs:element name="XmlAncestors">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="none"/>
        <xs:enumeration value="inherit"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="QNameAware">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element ref="Element"/>
        <xs:element ref="QualifiedAttr"/>
        <xs:element ref="UnqualifiedAttr"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>
  <xs:element name="Element">
    <xs:complexType>
      <xs:attribute name="Name" type="xs:NCName" use="required"/>
      <xs:attribute name="NS" type="xs:anyURI"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="QualifiedAttr">
    <xs:complexType>
      <xs:attribute name="Name" type="xs:NCName" use="required"/>
      <xs:attribute name="NS" type="xs:anyURI"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="UnqualifiedAttr">
    <xs:complexType>
      <xs:attribute name="Name" type="xs:NCName" use="required"/>
      <xs:attribute name="ParentName" type="xs:NCName" use="required"/>
      <xs:attribute name="ParentNS" type="xs:anyURI"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
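As an illustration of "each parameter is expressed with a dedicated child element", the sketch below builds a <ds:CanonicalizationMethod> element with Python's standard xml.etree.ElementTree. Two things are assumptions rather than statements from this section: the value used for the Algorithm attribute (the parameter namespace URI is reused as a placeholder) and the helper name. Only parameters with simple text content are covered; InclusiveNamespaces and QNameAware need attributes or child elements and are left out.

```python
import xml.etree.ElementTree as ET

DS_NS = "http://www.w3.org/2000/09/xmldsig#"
C14N2_NS = "http://www.w3.org/2010/xml-c14n2"
# Assumption: placeholder algorithm identifier, not confirmed by this section.
ALGORITHM = C14N2_NS


def build_canonicalization_method(params):
    """Build ds:CanonicalizationMethod with one child element per parameter.

    params: dict mapping a parameter element name to its text value.
    """
    method = ET.Element(f"{{{DS_NS}}}CanonicalizationMethod", {"Algorithm": ALGORITHM})
    for name, value in params.items():
        child = ET.SubElement(method, f"{{{C14N2_NS}}}{name}")
        child.text = value
    return method


method = build_canonicalization_method(
    {"ExclusiveMode": "true", "PrefixRewrite": "sequential"}
)
serialized = ET.tostring(method, encoding="unicode")
```

Since the parameters may appear in any order, a consumer of such an element would look children up by name rather than by position.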
This section presents the entire canonicalization algorithm in pseudo-code. It is not normative.
canonicalize(list of subtrees, list of exclusion elements and attributes, properties)
{
    put the exclusion elements and attributes in a hash table for easier lookup
    sort the subtrees by document order
    for each subtree
        canonicalizeSubtree(subtree)
}
Canonicalize an individual subtree.
For efficiency the routines below maintain two contexts:
namespaceContext is a hash table of prefix -> (uri, hasBeenOutput, newPrefix).
xmlattribContext is a hash table of name -> value.
canonicalizeSubtree(node)
{
    initialize namespaceContext to contain the default prefix, mapped to an empty URI, with hasBeenOutput set to true
    if (node is the document node or the document root element)
    {
        // (whole document is being processed, no ancestors to worry about)
        call processNode(node, namespaceContext)
    }
    else
    {
        starting from the element, walk up the tree to collect a list of ancestors
        for each of these ancestor elements, starting with the document root but not including the element itself
            addNamespaces(ancestorElem, namespaceContext)

        initialize xmlattribContext to empty
        for each of these ancestor elements, starting with the document root and also including the element itself
            addXMLAttributes(ancestorElem, xmlattribContext)
        if there are any attributes in xmlattribContext
            temporarily add/replace these XML attributes in node

        processNode(node, namespaceContext)

        restore the original XML attributes
    }
}
processNode(node, namespaceContext)
{
    call the appropriate function (processDocument, processElement, processText, ...) depending on the node type
}
processDocument(document, namespaceContext)
{
    loop through all child nodes and call processNode(child, namespaceContext)
}
processElement(element, namespaceContext)
{
    if this element exists in the exclusion hash table
        return

    make a copy of xmlattribContext and namespaceContext
    // (by copying, any changes made can be undone when this function returns)

    nsToBeOutputList = processNamespaces(element, namespaceContext)

    output('<')
    if PrefixRewrite is sequential or digest, temporarily modify the QName to have the new prefix value as determined from the namespaceContext
    output(element QName)

    for each of the namespaces in the nsToBeOutputList
        output this namespace declaration

    sort the non-namespace attributes by URI first, then by attribute name
    output each of these attributes with its original QName, or with a modified QName if PrefixRewrite is sequential or digest

    output('>')

    loop through all child nodes and call processNode(child, namespaceContext)

    output('</')
    output(element QName)
    output('>')

    restore xmlattribContext and namespaceContext
}
processText(textNode)
{
    if this text node is outside the document root
        return
    in the text, replace all ampersands by &amp;, all open angle brackets (<) by &lt;, all closing angle brackets (>) by &gt;, and all #xD characters by &#xD;
    if TrimTextNodes is true and there is no xml:space="preserve" declaration in scope
        trim leading and trailing space
    output(text)
}
Note: The DOM parser might have split up a long text node into multiple adjacent text nodes, some of which may be empty. In that case be careful when trimming the leading and trailing space: the net result should be the same as if the adjacent text nodes had been concatenated into one.
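The replacement and trimming steps of processText can be sketched as below. This follows the order given in the pseudo-code (replace first, then trim) and operates on a single string, so adjacent DOM text nodes are assumed to have been concatenated already, as the note above requires. The function name, the parameter names and the choice of space, tab and line feed as the characters to trim are assumptions made for illustration.

```python
def canonical_text(text, trim=False, preserve_space=False):
    """Escape a text node's value and optionally trim it.

    trim: the TrimTextNodes setting.
    preserve_space: True if xml:space="preserve" is in scope.
    """
    # The ampersand must be replaced first, because the other
    # replacements introduce ampersands of their own.
    escaped = (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace("\r", "&#xD;")
    )
    if trim and not preserve_space:
        # Assumption: leading/trailing space, tab and line feed are trimmed.
        escaped = escaped.strip(" \t\n")
    return escaped


escaped = canonical_text("a & b < c > d\r")
trimmed = canonical_text("  x  ", trim=True)
kept = canonical_text("  x  ", trim=True, preserve_space=True)
```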
processPI(piNode)
{
    if the node comes after the document element
        output('#xA')
    output('<?')
    output(the PI target name of the node)
    output(a leading space)
    output(the PI string value)
    output('?>')
    if the node comes before the document element
        output('#xA')
}
processComment(commentNode)
{
    if IgnoreComments is true
        return
    if the node comes after the document element
        output('#xA')
    output('<!--')
    output(string value of node)
    output('-->')
    if the node comes before the document element
        output('#xA')
}
addNamespaces(element, namespaceContext)
{
    for each of the explicit and implicit namespace declarations in the element
    {
        if there is already a declaration for this prefix, and this declaration is different from the existing declaration
            overwrite the URI, and set hasBeenOutput to false
        if there is no entry for this prefix
            add an entry for this prefix with this URI, and set hasBeenOutput to false
    }
}
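One possible in-memory shape for namespaceContext and the addNamespaces routine is sketched below. The dataclass, its field names and the representation of an element's declarations as a prefix-to-URI dictionary are assumptions made for illustration, not structures defined by this document.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class NsEntry:
    """One namespaceContext entry: prefix -> (uri, hasBeenOutput, newPrefix)."""
    uri: str
    has_been_output: bool = False
    new_prefix: Optional[str] = None


def add_namespaces(declarations: Dict[str, str], context: Dict[str, NsEntry]) -> None:
    """Merge an element's namespace declarations into the context."""
    for prefix, uri in declarations.items():
        entry = context.get(prefix)
        if entry is None:
            # No entry for this prefix yet: add one that still needs output.
            context[prefix] = NsEntry(uri)
        elif entry.uri != uri:
            # The prefix is re-bound to a different URI: it must be output again.
            entry.uri = uri
            entry.has_been_output = False


# The default prefix starts out mapped to an empty URI and marked as output.
context = {"": NsEntry("", has_been_output=True)}
add_namespaces({"a": "urn:a"}, context)
```

Because the entries are mutable, the copy that processElement makes of the context would need to be a deep copy in this representation, so that changes made while processing a subtree can be undone.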
processNamespaces(element, namespaceContext)
{
    addNamespaces(element, namespaceContext)

    initialize nsToBeOutputList to an empty list

    for each prefix in the namespaceContext for which hasBeenOutput is false
    {
        if ExclusiveMode and this prefix is not in the inclusiveNamespacesList
        {
            if the prefix is visibly utilized by this element
                add the prefix to the nsToBeOutputList and set hasBeenOutput to true
        }
        else
            add the prefix to the nsToBeOutputList and set hasBeenOutput to true
    }

    if (PrefixRewrite is none)
    {
        sort the nsToBeOutputList by the prefix
    }
    else if (PrefixRewrite is sequential)
    {
        sort the nsToBeOutputList by URI
        assign new prefix values "nN" to each prefix in this nsToBeOutputList, where N represents an incremented counter value, i.e. n0, n1, n2 ...
        // the counter should be set to 0 at the beginning of the canonicalization
        // note: prefix numbers are assigned in the order that the prefixes are present in nsToBeOutputList
    }
    else if (PrefixRewrite is digest)
    {
        sort the nsToBeOutputList by URI
        assign new prefix values "nD" to each prefix in this nsToBeOutputList, where D represents the SHA1 digest of the URI represented as a hex string
    }
    return nsToBeOutputList
}
addXMLAttributes(element, xmlattribContext)
{
    for each of the xml: attributes of this element
    {
        case xml:lang attribute
            if XmlAncestors is inherit then store this attribute value, else do nothing
        case xml:space attribute
            if XmlAncestors is inherit then store this attribute value, else do nothing
        case xml:base attribute
            if XmlAncestors is inherit
                if there is a previous value of xml:base
                    do a "join-URI-References" to combine the previous value and the new value, and store the result
                else
                    store this attribute value
            else
                do nothing
    }
}
Unlike DOM parsers, which represent an XML document as a tree of nodes, streaming parsers represent an XML document as a stream of events such as "start-element", "end-element" and "text". A document subset can also be represented as a stream of events. This stream of events is in exactly the same order as a tree walk, so the above canonicalization algorithm can also be used to canonicalize an event stream.
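The claim that an event stream arrives in tree-walk order can be checked with a small sketch using Python's standard xml.etree.ElementTree, chosen here only as a convenient example of a parser that offers both a tree and an event interface. Only start-element and end-element events are compared; the function names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def tree_walk(elem, out):
    """Record start/end events by recursively walking an element tree."""
    out.append(("start", elem.tag))
    for child in elem:
        tree_walk(child, out)
    out.append(("end", elem.tag))


def event_stream(xml_bytes):
    """Record start/end events as reported by the streaming parser."""
    source = io.BytesIO(xml_bytes)
    return [(event, elem.tag) for event, elem in ET.iterparse(source, events=("start", "end"))]


doc = b"<a><b/><c><d/></c></a>"
walked = []
tree_walk(ET.fromstring(doc), walked)
streamed = event_stream(doc)
```

Both lists contain the same events in the same order, which is why a canonicalizer written as a tree walk can be driven by parser events instead.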
The following informative table outlines example results of the modified Remove Dot Segments algorithm described in join-URI-references.
Input | Output |
no/.././/pseudo-netpath/seg/file.ext | pseudo-netpath/seg/file.ext |
no/..//.///pseudo-netpath/seg/file.ext | pseudo-netpath/seg/file.ext |
yes/no//..//.///pseudo-netpath/seg/file.ext | yes/pseudo-netpath/seg/file.ext |
no/../yes | yes |
no/../yes/ | yes/ |
no/../yes/no/.. | yes/ |
../../no/../.. | ../../../ |
no/../.. | ../ |
no/.. | |
no/../ | |
/a/b/c/./../../g | /a/g |
mid/content=5/../6 | mid/6 |
../../.. | ../../../ |
no/../../ | ../ |
..yes/..no/..no/..no/../../../..yes | ..yes/..yes |
..yes/..no/..no/..no/../../../..yes/ | ..yes/..yes/ |
../.. | ../../ |
../../../ | ../../../ |
. | |
./ | |
./. | |
//no/.. | / |
../../no/.. | ../../ |
../../no/../ | ../../ |
yes/no/../ | yes/ |
yes/no/no/../.. | yes/ |
yes/no/no/no/../../.. | yes/ |
yes/no/../yes/no/no/../.. | yes/yes/ |
yes/no/no/no/../../../yes | yes/yes |
yes/no/no/no/../../../yes/ | yes/yes/ |
/no/../ | / |
/yes/no/../ | /yes/ |
/yes/no/no/../.. | /yes/ |
/yes/no/no/no/../../.. | /yes/ |
../../..no/.. | ../../ |
../../..no/../ | ../../ |
..yes/..no/../ | ..yes/ |
..yes/..no/..no/../.. | ..yes/ |
..yes/...no/..no/..no/../../.. | ..yes/ |
..yes/..no/../..yes/..no/..no/../.. | ..yes/..yes/ |
/..no/../ | / |
/..yes/..no/../ | /..yes/ |
/..yes/..no/..no/../.. | /..yes/ |
/..yes/..no/..no/..no/../../.. | /..yes/ |
/ | / |
/. | / |
/./ | / |
/./. | / |
/././ | / |
/.. | / |
/../.. | / |
/../../.. | / |
/../../.. | / |
//.. | / |
//..//.. | / |
//..//..//.. | / |
/./.. | / |
/./.././.. | / |
/./.././.././.. | / |
. | |
./ | |
./. | |
.. | ../ |
../ | ../ |
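The rows of the table above can be reproduced with the following sketch. It is not the normative algorithm and was derived only from the rows shown: empty and "." segments are dropped, ".." removes the preceding ordinary segment, leading ".." segments are kept for relative paths and discarded for absolute ones, and a non-empty result keeps a trailing slash when the input ended in "/", "." or "..". The function name is hypothetical.

```python
def remove_dot_segments_modified(path):
    """Reduce a path in a way that matches the example rows of the table."""
    absolute = path.startswith("/")
    raw = path.split("/")
    # Inputs ending in "/", "." or ".." denote a directory-like result.
    ends_like_dir = raw[-1] in ("", ".", "..")

    out = []
    for segment in raw:
        if segment in ("", "."):
            continue  # collapse repeated slashes and drop "."
        if segment == "..":
            if out and out[-1] != "..":
                out.pop()  # cancel the preceding ordinary segment
            elif not absolute:
                out.append("..")  # keep leading ".." on relative paths
            # on absolute paths, ".." at the root is discarded
        else:
            out.append(segment)

    result = ("/" if absolute else "") + "/".join(out)
    if out and ends_like_dir:
        result += "/"
    return result
```

Segments such as "..yes" or "...no" are ordinary names, since only the exact segments "." and ".." are treated specially.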
Dated references below are to the latest known or appropriate edition of the referenced work. The referenced works may be subject to revision, and conformant implementations may follow, and are encouraged to investigate the appropriateness of following, some or all more recent editions or replacements of the works cited. It is in each case implementation-defined which editions are supported.