<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE spec PUBLIC "-//W3C//DTD Specification V2.6//EN"
               "http://www.w3.org/2002/xmlspec/dtd/2.6/xmlspec.dtd" [
  <!-- ================================================================ -->
  <!ENTITY draft.day "12">
  <!ENTITY draft.month "07">
  <!ENTITY draft.monthname "July">
  <!ENTITY draft.year "2007">
  <!ENTITY iso6.doc.date "&draft.year;-&draft.month;-&draft.day;">
  <!ENTITY http-ident "http://www.w3.org/2001/tag/doc/xmlChunkEquality">
]>
<spec w3c-doctype='other'>
<?CVS $Id: xmlChunkEquality.xml,v 1.3 2007/07/12 16:47:42 NormanWalsh Exp $?>
<header>
<title>XML Chunk Equality</title>
<w3c-designation>&http-ident;-&iso6.doc.date;</w3c-designation>
<w3c-doctype>[ABANDONED] TAG Finding</w3c-doctype>
<pubdate><day>&draft.day;</day>
<month>&draft.monthname;</month>
<year>&draft.year;</year>
</pubdate>
<publoc>
<loc href="&http-ident;.html">&http-ident;.html</loc>
</publoc>
<altlocs>
<loc href="&http-ident;.xml">XML</loc>
</altlocs>
<latestloc>
<loc href="&http-ident;.html">&http-ident;.html</loc>
</latestloc>
<authlist>
<author><name>Norman Walsh</name>
<affiliation>Sun Microsystems, Inc.</affiliation>
<email href="mailto:Norman.Walsh@Sun.COM">Norman.Walsh@Sun.COM</email></author>
</authlist>

<abstract>
<p>This finding attempts to provide an answer for the question “when
are two XML chunks equal?” Many applications will require their own special-purpose
algorithms, but this finding provides one solution that attempts to balance
utility and complexity.</p>
</abstract>

<status>
<p><emph>This Finding has been abandoned by the TAG.</emph> The issue will
be closed without further action.</p>

<p>Since this issue was opened, the XSL and XML Query Working Groups
have published <a href="http://www.w3.org/TR/xpath-functions/">XQuery 1.0
and XPath 2.0 Functions and Operators</a>. This document includes
a definition for <code>fn:deep-equal</code> which compares two chunks
of XML. The TAG concludes that this definition suffices for the general
case that this Finding was attempting to address.</p>

</status>
<pubstmt>
<p>Chicago, Vancouver, Mountain View, et al.: World-Wide Web Consortium,
Draft TAG Finding, 2004.</p>
</pubstmt>
<sourcedesc>
<p>Created in electronic form.</p>
</sourcedesc>
<langusage>
<language id="EN">English</language>
</langusage>
<revisiondesc>
<slist>
<sitem>2002-04-30: Published draft</sitem>
</slist>
</revisiondesc>
</header>
<body>

<div1 id="introduction">
<head>Introduction</head>

<p>This finding attempts to provide an answer for the question “when
are two XML chunks equal?” Taken narrowly, a chunk of XML is an information
item (a document, element, attribute, etc.). Taken broadly, it is a set
or sequence of information items (a set of documents, a sequence of elements,
a heterogeneous sequence of items, etc.).</p>

<p>Many applications will require their own
special-purpose algorithms, but this finding provides one general solution
that attempts to balance utility and complexity.</p>

<p>Different applications can have very different notions of what constitutes
identity or equality:</p>

<ul>
<item><p>A digital signature application may need a canonical, bit-for-bit identical
lexical representation of both the data and the markup.</p>
</item>
<item><p>A language runtime system may need to know that two variables refer to
the same data structure in memory.</p>
</item>
<item><p>A semantic inference application may need to know that two representations
have the same URI.</p>
</item>
<item><p>A message passing application may need to know whether two
distinct messages are “the same,” that is, whether they are structurally equivalent.</p>
</item>
</ul>

<p>It is the last class of equality that this finding attempts to
address. Given two distinct XML structures, can we decide whether they
convey “the same information”?</p>

<p>We describe this equality in terms of the <bibref ref="xml-infoset"/>.</p>

</div1>

<div1 id="infoset-equality">
<head>Infoset Equality</head>

<p>We define chunk equality in terms of the <bibref ref="xml-infoset"/>.
Similar definitions could be built on top of the <bibref ref="xml-schema1"/>
Post-Schema Validation Infoset (PSVI), the <bibref ref="xpath1"/> data model,
or the <bibref ref="xpath2-data-model"/>. We choose the infoset because it
is a common abstraction for XML specifications.</p> 

<p>A few general notes about how the comparisons are performed:</p>

<ul>
<item>
<p>Information items of different types (for example, an element and an
attribute, or a comment and a processing instruction) are never the same.</p>
</item>
<item>
<p>Ordered lists (such as the [children] property) are compared
pairwise and in order. In other words, two ordered lists "A" and "B"
are the same if and only if the first item of "A" is the same as the
first item of "B", the second item of "A" is the same as the second
item of "B", etc. It follows that they can only be the same if they
are the same length.</p>
</item>
<item>
<p>Unordered lists (such as the [attributes] property) are compared
pairwise but without respect to order. In other words, two unordered
lists "A" and "B" are the same if and only if there exists a set of
pairs of items, one from each list, such that the two items in each
pair are equal and every item from "A" and "B" appears in exactly one
pair. It follows that they can only be the same if they are the same
length.</p>
</item>

<item><p>XML Base. If the infosets being compared were constructed by an
application that claims conformance to the XML Base recommendation,
then the xml:base attribute is excluded from attribute comparisons.
</p>
<p>In this specification, the base URI is not considered significant.</p>
</item>

<item>
<p>Natural Language. The xml:lang attribute is not treated specially in
the Infoset but is intended to have a scoped effect much like the
base URI. This intention is made explicit in this specification.</p>

<p>If the infosets being compared were constructed by an application
that provides application semantics for xml:lang, then the
application must be able to determine whether or not two elements or
attributes have the same language.</p>

<p>If the infosets being compared were constructed by an application
that does not provide special semantics for xml:lang, then two
elements or attributes have the same language if they have the same
inherited value for xml:lang.</p>

<p>The inherited value for xml:lang is the value of xml:lang on the
element in question or the value from the closest ancestor. In XPath
terms: <code>(ancestor-or-self::*/@xml:lang)[last()]</code></p>

<p>Languages are compared case insensitively.</p>
</item>

<item>
<p>XML Space. This specification does not extend any special status to the
<att>xml:space</att> attribute, nor does it treat whitespace marked as
[element content whitespace] in any special way.
</p>
</item>

<item>
<p>When two information items are compared:</p>

  <ul>
  <item><p>Properties with the value "no value" are equal.</p></item>
  <item><p>Properties with the value "unknown" are not equal.</p></item>
  </ul>
</item>
</ul>
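<p>The ordered and unordered list comparisons described above can be
sketched in Python as follows. This is an illustrative sketch only, not part
of the finding: the function names and the <code>item_equal</code> callback are
assumptions introduced here, and the greedy pairing in
<code>unordered_equal</code> assumes that <code>item_equal</code> is symmetric
and transitive.</p>

<eg><![CDATA[
```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def ordered_equal(a: Sequence[T], b: Sequence[T],
                  item_equal: Callable[[T, T], bool]) -> bool:
    """Compare two ordered lists pairwise and in order."""
    # Lists of different length can never be the same.
    if len(a) != len(b):
        return False
    return all(item_equal(x, y) for x, y in zip(a, b))


def unordered_equal(a: Sequence[T], b: Sequence[T],
                    item_equal: Callable[[T, T], bool]) -> bool:
    """Compare two unordered lists without respect to order."""
    if len(a) != len(b):
        return False
    remaining = list(b)
    for x in a:
        # Pair x with the first unused item of b that equals it.
        for i, y in enumerate(remaining):
            if item_equal(x, y):
                del remaining[i]
                break
        else:
            # No partner is left for x, so no complete pairing exists.
            return False
    return True
```
]]></eg>

<p>In this sketch, every item ends up in exactly one pair because each match
removes the paired item from further consideration.</p>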

<div2 id="infosets">
<head>Infosets</head>

<p>Two infosets are equal if and only if their root information items are equal.</p>

<p>The comparison explicitly ignores the XML version. The XML version has an
impact on infoset construction (with respect to line-feed normalization, for example),
but it is not necessary to consider it in infoset comparison. Element and attribute
names and element content are the same if they consist of the same characters,
regardless of how those characters were encoded.</p>

</div2>

<div2 id="document-information-items">
<head>Document Information Items</head>

<p>Two document information items are equal if the following properties
are equal:</p>

<ul>
<item><p>[children]</p></item>
<item><p>[document element]</p></item>
<item><p>[all declarations processed]</p></item>
<!--<item><p>[base uri]</p></item>-->
</ul>
</div2>

<div2 id="element-information-items">
<head>Element Information Items</head>

<p>Two element information items are equal if they have the same language
and the following properties are equal:</p>

<ul>
<item><p>[namespace name]</p></item>
<item><p>[local name]</p></item>
<item><p>[children]</p></item>
<item><p>[attributes], exclusive of xml:lang</p></item>
<!--<item><p>[base uri]</p></item>-->
</ul>
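<p>As a rough illustration, the element rules above can be approximated with
Python's <code>xml.etree.ElementTree</code>. This is a sketch under stated
assumptions, not a conforming implementation: the function name
<code>elements_equal</code> is hypothetical, ElementTree discards comments and
processing instructions by default, it does not expose [attribute type], and
<code>xml:base</code> is not given any special treatment here.</p>

<eg><![CDATA[
```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under its expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def elements_equal(a, b, lang_a="", lang_b=""):
    """Approximate element equality; lang_a/lang_b carry inherited xml:lang."""
    # Inherited language: the element's own xml:lang, else the ancestor's.
    lang_a = a.get(XML_LANG, lang_a)
    lang_b = b.get(XML_LANG, lang_b)
    # Languages are compared case insensitively.
    if lang_a.lower() != lang_b.lower():
        return False
    # .tag holds "{namespace}local", covering [namespace name] and [local name].
    if a.tag != b.tag:
        return False
    # [attributes], exclusive of xml:lang; dict comparison ignores order.
    attrs_a = {k: v for k, v in a.attrib.items() if k != XML_LANG}
    attrs_b = {k: v for k, v in b.attrib.items() if k != XML_LANG}
    if attrs_a != attrs_b:
        return False
    # [children]: leading text, then each child element and the text after it.
    if (a.text or "") != (b.text or ""):
        return False
    if len(a) != len(b):
        return False
    return all(
        (x.tail or "") == (y.tail or "") and elements_equal(x, y, lang_a, lang_b)
        for x, y in zip(a, b)
    )
```
]]></eg>

<p>Passing the inherited language down through the recursion plays the role
of the <code>ancestor-or-self</code> lookup, since ElementTree elements do not
keep a reference to their parent.</p>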
</div2>

<div2 id="attribute-information-items">
<head>Attribute Information Items</head>

<p>Two attribute information items are equal if they have the same
language and the following properties are equal:</p>

<ul>
<item><p>[namespace name]</p></item>
<item><p>[local name]</p></item>
<item><p>[normalized value]</p></item>
<item><p>[attribute type]</p></item>
</ul>
</div2>

<div2 id="processing-instruction-information-items">
<head>Processing Instruction Information Items</head>

<p>Two processing instruction information items are equal if the
following properties are equal:</p>

<ul>
<item><p>[target]</p></item>
<item><p>[content]</p></item>
<!--<item><p>[base uri]</p></item>-->
</ul>
</div2>

<div2 id="unexpanded-entity-reference-information-items">
<head>Unexpanded Entity Reference Information Items</head>

<p>Two unexpanded entity reference information items are equal if the
following properties are equal:</p>

<ul>
<item><p>[name]</p></item>
<item><p>[system identifier]</p></item>
<item><p>[public identifier]</p></item>
</ul>
</div2>

<div2 id="character-information-items">
<head>Character Information Items</head>

<p>Two character information items are equal if the following properties
are equal:</p>

<ul>
<item><p>[character code]</p></item>
<item><p>[element content whitespace]</p></item>
</ul>
</div2>

<div2 id="comment-information-items">
<head>Comment Information Items</head>

<p>Two comment information items are equal if the following properties
are equal:</p>

<ul>
<item><p>[content]</p></item>
</ul>
</div2>

<div2 id="document-type-declaration-information-items">
<head>Document Type Declaration Information Items</head>

<p>Two document type declaration information items are equal if the
following properties are equal:</p>

<ul>
<item><p>[system identifier]</p></item>
<item><p>[public identifier]</p></item>
<item><p>[children]</p></item>
</ul>
</div2>

<div2 id="unparsed-entity-information-items">
<head>Unparsed Entity Information Items</head>

<p>Two unparsed entity information items are equal if the following
properties are equal:</p>

<ul>
<item><p>[name]</p></item>
<item><p>[system identifier]</p></item>
<item><p>[public identifier]</p></item>
<item><p>[notation name]</p></item>
</ul>
</div2>
</div1>

<div1 id="custom">
<head>Customizing the Comparison</head>

<p>The algorithm described in <specref ref="infoset-equality"/> is
very conservative. It could be made more permissive with the addition
of a few parameters. For example, parameters could adjust the algorithm
to do any or all of the following:</p>

<ul>
<item><p>Ignore processing instructions.</p></item>
<item><p>Ignore comments.</p></item>
<item><p>Ignore the document type declaration.</p></item>
</ul>
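<p>A minimal Python sketch of such parameters is shown here, again using
<code>xml.etree.ElementTree</code>. The names <code>parse_keeping_all</code>,
<code>child_items</code>, <code>chunks_equal</code>,
<code>ignore_comments</code> and <code>ignore_pis</code> are hypothetical. The
sketch assumes Python 3.8 or later (for the <code>insert_comments</code> and
<code>insert_pis</code> options of <code>TreeBuilder</code>), it does not model
the document type declaration because ElementTree does not expose it, and it
omits the language handling described earlier.</p>

<eg><![CDATA[
```python
import xml.etree.ElementTree as ET


def parse_keeping_all(text):
    """Parse text, keeping comments and processing instructions in the tree."""
    builder = ET.TreeBuilder(insert_comments=True, insert_pis=True)
    return ET.fromstring(text, parser=ET.XMLParser(target=builder))


def child_items(elem, ignore_comments, ignore_pis):
    """Return the children as a list of text strings and child nodes."""
    items = []

    def add_text(text):
        # Merge adjacent text so that dropping a node leaves no seam.
        if text:
            if items and isinstance(items[-1], str):
                items[-1] += text
            else:
                items.append(text)

    add_text(elem.text)
    for child in elem:
        skip = (ignore_comments and child.tag is ET.Comment) or \
               (ignore_pis and child.tag is ET.ProcessingInstruction)
        if not skip:
            items.append(child)
        add_text(child.tail)
    return items


def chunks_equal(a, b, ignore_comments=False, ignore_pis=False):
    """Compare two nodes, optionally ignoring comments and/or PIs."""
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    if a.tag is ET.Comment or a.tag is ET.ProcessingInstruction:
        return a.text == b.text
    items_a = child_items(a, ignore_comments, ignore_pis)
    items_b = child_items(b, ignore_comments, ignore_pis)
    if len(items_a) != len(items_b):
        return False
    for x, y in zip(items_a, items_b):
        if isinstance(x, str) or isinstance(y, str):
            if x != y:
                return False
        elif not chunks_equal(x, y, ignore_comments, ignore_pis):
            return False
    return True
```
]]></eg>

<p>With both options left at their defaults the sketch stays conservative:
a comment or processing instruction in one chunk must be matched by an equal
one at the same position in the other.</p>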

<p>Even if the algorithm remains conservative, applications can
influence the results by choosing how the infoset is constructed.
There is no single, normative way to construct an infoset.</p>

</div1>

</body>
<back>
<div1 id="references">
<head>References</head>

<blist>
<bibl id="rfc2119" href="http://www.ietf.org/rfc/rfc2119.txt" key="RFC
2119">S. Bradner. <titleref>Key words for use in RFCs to Indicate
Requirement Levels</titleref>. IETF. March, 1997.</bibl>

<bibl id="xml-schema1" key="XML Schema Part 1: Structures">
<titleref href="http://www.w3.org/TR/xmlschema-1/">XML Schema Part 1:
Structures</titleref>, Henry S. Thompson, David Beech, Murray Maloney,
and Noah Mendelsohn, Editors. World Wide Web Consortium, 02 May 2001.
This version is http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/.
The <a href="http://www.w3.org/TR/xmlschema-1/">latest version</a> is
available at http://www.w3.org/TR/xmlschema-1/.</bibl>

<bibl id="xpath2-data-model" key="XQuery 1.0 and XPath 2.0 Data Model">
<titleref href="http://www.w3.org/TR/xpath-datamodel/">XQuery 1.0 and
XPath 2.0 Data Model</titleref>, Marton Nagy, Norman Walsh, Mary
Fernández, <emph>et al.</emph>, Editors. World Wide Web Consortium,
12 Nov 2003. This version is
http://www.w3.org/TR/2003/WD-xpath-datamodel-20031112/. The <a
href="http://www.w3.org/TR/xpath-datamodel/">latest version</a> is
available at http://www.w3.org/TR/xpath-datamodel/.</bibl>

<bibl id="xml-infoset" key="XML Information Set (Second Edition)">
<titleref href="http://www.w3.org/TR/xml-infoset">XML Information Set
(Second Edition)</titleref>, John Cowan and Richard Tobin, Editors.
World Wide Web Consortium, 04 Feb 2004. This version is
http://www.w3.org/TR/2004/REC-xml-infoset-20040204. The <a
href="http://www.w3.org/TR/xml-infoset">latest version</a> is
available at http://www.w3.org/TR/xml-infoset.</bibl>

<bibl id="xpath1" key="XML Path Language (XPath) Version 1.0">
<titleref href="http://www.w3.org/TR/xpath">XML Path Language (XPath)
Version 1.0</titleref>, James Clark and Steven DeRose, Editors. World
Wide Web Consortium, 16 Nov 1999. This version is
http://www.w3.org/TR/1999/REC-xpath-19991116. The <a
href="http://www.w3.org/TR/xpath">latest version</a> is available at
http://www.w3.org/TR/xpath.</bibl>
</blist>
</div1>

<inform-div1 id="examples">
<head>Examples</head>

<p>This appendix provides a few examples to help to clarify what we
mean by “the same information.” Unless otherwise stated, there are no
unshown, in-scope namespace bindings in any of these examples.</p>

<eg>&lt;element-one/&gt;

attr="value"</eg>

<p>These information items are different: they are not the same kind of
information item.</p>

<eg>&lt;element-one/&gt;

&lt;element-two/&gt;</eg>

<p>These elements are different: they have different [local name]s.</p>

<eg>&lt;element xmlns="http://example.org/ns-one"/&gt;

&lt;element xmlns="http://example.org/ns-two"/&gt;</eg>

<p>These elements are different: they have different [namespace name]s.</p>

<eg>&lt;element attr1="value1"/&gt;

&lt;element attr1="value1" attr2="value2"/&gt;</eg>

<p>These elements are different: they have different [attributes].</p>

<eg>&lt;element attr1="value1"/&gt;

&lt;element attr1="a different value"/&gt;</eg>

<p>These elements are different: they have different attribute values.</p>

<eg>&lt;element attr1="value1" attr2='value2'/&gt;

&lt;element attr2="value2" attr1="value1"/&gt;</eg>

<p>These elements are the same: attribute quoting and order are insignificant
in the infoset.</p>

<eg>&lt;element xmlns="http://example.org/ns"/&gt;

&lt;x:element xmlns:x="http://example.org/ns"/&gt;</eg>

<p>These elements are the same: namespace prefix bindings on element and
attribute names are not significant in the infoset.</p>

<eg>&lt;x:element xmlns:x="http://example.org/ns" attr="x:name"/&gt;

&lt;y:element xmlns:y="http://example.org/ns" attr="y:name"/&gt;</eg>

<p>These elements are different: they have different attribute values.
The infoset does not have attribute values of type “QName”, so it is not possible
to determine if the attribute in this case actually contains a QName or if it
just contains different characters. This specification compares the characters.</p>

<eg>&lt;element&gt;Montréal&lt;/element&gt;

&lt;element&gt;Montr&amp;#233;al&lt;/element&gt;</eg>

<p>These elements are the same: encoding differences are not significant.</p>

<eg>&lt;element xml:lang="en-US"&gt;
  &lt;element&gt;Some content.&lt;/element&gt;
&lt;/element&gt;

&lt;element xml:lang="en-US"&gt;
  &lt;element xml:lang="en-us"&gt;Some content.&lt;/element&gt;
&lt;/element&gt;</eg>

<p>These elements are the same: all of the elements have the same content
and are in the same language. The inner element in the first case inherits
its language from its parent, and languages are compared case insensitively.</p>

<eg>&lt;element xsi:type="xs:double"&gt;3.0&lt;/element&gt;

&lt;element xsi:type="xs:double"&gt;3&lt;/element&gt;</eg>

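<p>These elements are different (assuming that the <code>xsi</code> and
<code>xs</code> prefixes are bound to the usual XML Schema namespaces).
The comparison is based on the infoset, not on
properties of the PSVI, even if the content might be the same under some
other interpretations.</p>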

<eg>&lt;element&gt;
  &lt;element2&gt;Some content.&lt;/element2&gt;
&lt;/element&gt;

&lt;element&gt;&lt;element2&gt;Some content.&lt;/element2&gt;&lt;/element&gt;</eg>

<p>These elements are different. In the first case, the <code>element</code> has
whitespace character children before and after <code>element2</code>; in the
second, <code>element2</code> is its only child.</p>

<eg>&lt;element&gt;Some content.&lt;/element&gt;

&lt;element&gt;Some
content.&lt;/element&gt;</eg>

<p>These elements are different. New line characters are not normalized in
element content.</p>

<eg>&lt;element attr="one two"&gt;Some content.&lt;/element&gt;

&lt;element attr="one
two"&gt;Some content.&lt;/element&gt;</eg>

<p>These elements are the same. New line characters and whitespace
are normalized in attribute content.</p>
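<p>Several of the examples above depend on what the XML parser reports when
the infoset is constructed. The following Python sketch, which assumes the
standard <code>xml.etree.ElementTree</code> parser and uses the hypothetical
helper names <code>same_attributes</code> and <code>same_text</code>, checks
only the attributes and the leading text of each root element. It is meant
as an informal way to explore these cases, not as a test suite for this
finding.</p>

<eg><![CDATA[
```python
import xml.etree.ElementTree as ET


def same_attributes(xml_a, xml_b):
    """Compare the attribute sets reported for the two root elements."""
    return ET.fromstring(xml_a).attrib == ET.fromstring(xml_b).attrib


def same_text(xml_a, xml_b):
    """Compare the leading character content of the two root elements."""
    return ET.fromstring(xml_a).text == ET.fromstring(xml_b).text


# Attribute quoting and order.
print(same_attributes('<element attr1="value1" attr2=\'value2\'/>',
                      '<element attr2="value2" attr1="value1"/>'))

# A literal character versus a character reference.
print(same_text('<element>Montréal</element>',
                '<element>Montr&#233;al</element>'))

# A newline inside an attribute value versus a space.
print(same_attributes('<element attr="one two"/>',
                      '<element attr="one\ntwo"/>'))

# A newline inside element content versus a space.
print(same_text('<element>Some content.</element>',
                '<element>Some\ncontent.</element>'))
```
]]></eg>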

</inform-div1>

</back>
</spec>

