W3C

XML Chunk Equality

[ABANDONED] TAG Finding 12 July 2007

This version:
http://www.w3.org/2001/tag/doc/xmlChunkEquality.html
Latest version:
http://www.w3.org/2001/tag/doc/xmlChunkEquality.html
Editor:
Norman Walsh, Sun Microsystems, Inc. <Norman.Walsh@Sun.COM>

This document is also available in these non-normative formats: XML.


Abstract

This finding attempts to provide an answer for the question “when are two XML chunks equal?” Many applications will require their own special-purpose algorithms, but this finding provides one solution that attempts to balance utility and complexity.

Status of this Document

This Finding has been abandoned by the TAG. The issue will be closed without further action.

Since this issue was opened, the XSL and XML Query Working Groups have published XQuery 1.0 and XPath 2.0 Functions and Operators. That document defines fn:deep-equal, which compares two chunks of XML. The TAG concludes that this definition suffices for the general case that this Finding was attempting to address.

Table of Contents

1 Introduction
2 Infoset Equality
    2.1 Infosets
    2.2 Document Information Items
    2.3 Element Information Items
    2.4 Attribute Information Items
    2.5 Processing Instruction Information Items
    2.6 Unexpanded Entity Reference Information Items
    2.7 Character Information Items
    2.8 Comment Information Items
    2.9 Document Type Declaration Information Items
    2.10 Unparsed Entity Information Items
3 Customizing the Comparison

Appendices

A References
B Examples (Non-Normative)


1 Introduction

This finding attempts to provide an answer for the question “when are two XML chunks equal?” Taken narrowly, a chunk of XML is an information item (a document, element, attribute, etc.). Taken broadly, it is a set or sequence of information items (a set of documents, a sequence of elements, a heterogeneous sequence of items, etc.).

Many applications will require their own special-purpose algorithms, but this finding provides one general solution that attempts to balance utility and complexity.

Different applications can have very different notions of what constitutes identity or equality:

It is the latter class of equality that this finding attempts to address. Given two distinct XML structures, can we decide whether they convey “the same information”?

We describe this equality in terms of the [XML Information Set (Second Edition)].

2 Infoset Equality

We define chunk equality in terms of the [XML Information Set (Second Edition)]. Similar definitions could be defined on top of the [XML Schema Part 1: Structures] Post-Schema Validation Infoset (PSVI), the [XML Path Language (XPath) Version 1.0] data model, or the [XQuery 1.0 and XPath 2.0 Data Model]. We choose the infoset because it is a common abstraction for XML specifications.

A few general notes about how the comparisons are performed:

2.1 Infosets

Two infosets are equal if and only if their root information items are equal.

The comparison explicitly ignores the XML version. The XML version has an impact on infoset construction (with respect to line-feed normalization, for example), but it is not necessary to consider it in infoset comparison. Element and attribute names and element content are the same if they contain the same characters, regardless of how they were encoded.

2.2 Document Information Items

Two document information items are equal if the following properties are equal:

  • [children]

  • [document element]

  • [all declarations processed]

2.3 Element Information Items

Two element information items are equal if they have the same language and the following properties are equal:

  • [namespace name]

  • [local name]

  • [children]

  • [attributes], exclusive of xml:lang
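
As a rough illustration, the element rule could be sketched in Python on top of the standard library's xml.etree.ElementTree. This is an approximation under stated assumptions, not an infoset implementation: ElementTree drops comments and processing instructions by default, exposes character children as text and tail strings, and does not report [attribute type]. The helper name element_equal is hypothetical, and comparing language tags case-insensitively is an assumption taken from the xml:lang example in appendix B.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def element_equal(a, b, lang_a="", lang_b=""):
    """Approximate the element rule: same language, same expanded name,
    same attributes (ignoring xml:lang), and pairwise-equal children."""
    # Language is inherited from the ancestors unless xml:lang overrides it.
    lang_a = a.get(XML_LANG, lang_a)
    lang_b = b.get(XML_LANG, lang_b)
    if lang_a.lower() != lang_b.lower():  # assumption: case-insensitive tags
        return False
    # ElementTree stores [namespace name] and [local name] as "{uri}local".
    if a.tag != b.tag:
        return False
    attrs_a = {k: v for k, v in a.attrib.items() if k != XML_LANG}
    attrs_b = {k: v for k, v in b.attrib.items() if k != XML_LANG}
    if attrs_a != attrs_b:  # dict comparison ignores attribute order
        return False
    # Character children surface as .text and .tail strings.
    if (a.text or "") != (b.text or "") or len(a) != len(b):
        return False
    return all(
        (ca.tail or "") == (cb.tail or "") and element_equal(ca, cb, lang_a, lang_b)
        for ca, cb in zip(a, b)
    )
```

For example, element_equal(ET.fromstring('<e a="1" b="2"/>'), ET.fromstring("<e b='2' a='1'/>")) is true, because attribute order and quoting do not survive parsing.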

2.4 Attribute Information Items

Two attribute information items are equal if they have the same language and the following properties are equal:

  • [namespace name]

  • [local name]

  • [normalized value]

  • [attribute type]
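
The attribute rule can be sketched with a small record type. The class AttributeItem and its field names are hypothetical stand-ins for the infoset properties listed above; the language field models the language in scope for the attribute, and comparing it case-insensitively is an assumption. Note that [attribute type] takes part in the comparison, so two attributes with identical names and values still differ if one is declared as an ID and the other is CDATA.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttributeItem:
    namespace_name: Optional[str]
    local_name: str
    normalized_value: str
    attribute_type: Optional[str]  # e.g. "CDATA" or "ID"; None when unknown
    language: str = ""             # language in scope for the attribute


def attribute_equal(a: AttributeItem, b: AttributeItem) -> bool:
    """Same language and equal values for the four listed properties."""
    return (
        a.language.lower() == b.language.lower()
        and a.namespace_name == b.namespace_name
        and a.local_name == b.local_name
        and a.normalized_value == b.normalized_value
        and a.attribute_type == b.attribute_type
    )


cdata = AttributeItem(None, "attr", "value", "CDATA")
as_id = AttributeItem(None, "attr", "value", "ID")
```

Here attribute_equal(cdata, as_id) is false even though the names and values match.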

2.5 Processing Instruction Information Items

Two processing instruction information items are equal if the following properties are equal:

  • [target]

  • [content]

2.6 Unexpanded Entity Reference Information Items

Two unexpanded entity reference information items are equal if the following properties are equal:

  • [name]

  • [system identifier]

  • [public identifier]

2.7 Character Information Items

Two character information items are equal if the following properties are equal:

  • [character code]

  • [element content whitespace]

2.8 Comment Information Items

Two comment information items are equal if the following properties are equal:

  • [content]

2.9 Document Type Declaration Information Items

Two document type declaration information items are equal if the following properties are equal:

  • [system identifier]

  • [public identifier]

  • [children]

2.10 Unparsed Entity Information Items

Two unparsed entity information items are equal if the following properties are equal:

  • [name]

  • [system identifier]

  • [public identifier]

  • [notation name]
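
Sections 2.5 through 2.10 share one pattern: two items of the same kind are equal when a fixed list of properties is equal. A table-driven sketch of that pattern follows. Representing an information item as a dictionary with a "kind" key is a hypothetical encoding chosen for brevity, and item_equal is an illustrative name, not part of any library.

```python
# Properties compared for each kind of item, following sections 2.5 to 2.10.
COMPARED_PROPERTIES = {
    "processing instruction": ("target", "content"),
    "unexpanded entity reference": ("name", "system identifier", "public identifier"),
    "character": ("character code", "element content whitespace"),
    "comment": ("content",),
    "document type declaration": ("system identifier", "public identifier", "children"),
    "unparsed entity": ("name", "system identifier", "public identifier", "notation name"),
}


def item_equal(a, b):
    """Compare two items encoded as dicts with a "kind" key (hypothetical encoding)."""
    if a["kind"] != b["kind"]:
        return False  # different kinds of item are never equal
    for prop in COMPARED_PROPERTIES[a["kind"]]:
        va, vb = a.get(prop), b.get(prop)
        if prop == "children":
            # [children] is an ordered list, compared item by item.
            va, vb = va or [], vb or []
            if len(va) != len(vb):
                return False
            if not all(item_equal(x, y) for x, y in zip(va, vb)):
                return False
        elif va != vb:
            return False
    return True
```

Properties that are not listed for a kind, such as the [notation] of a processing instruction, simply play no part in the result.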

3 Customizing the Comparison

The algorithm described in 2 Infoset Equality is very conservative. It could be made more permissive with the addition of a few parameters. For example, parameters could adjust the algorithm to do any or all of the following:

Even if the algorithm remains conservative, applications can influence the results by choosing how the infoset is constructed. There is no single, normative way to construct an infoset.
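
One way an application might influence the result through infoset construction is to discard whitespace-only text before comparing. The sketch below uses Python's xml.etree.ElementTree as a stand-in for an infoset; drop_whitespace_only_text and shape are hypothetical helpers, and this policy is only one illustrative choice, not something this finding prescribes.

```python
import xml.etree.ElementTree as ET

XML_WHITESPACE = " \t\r\n"


def drop_whitespace_only_text(elem):
    """Remove text made up solely of XML whitespace, in place, and return elem."""
    for node in elem.iter():
        if node.text is not None and not node.text.strip(XML_WHITESPACE):
            node.text = None
        if node.tail is not None and not node.tail.strip(XML_WHITESPACE):
            node.tail = None
    return elem


def shape(elem):
    """A comparable summary of an element: name, attributes, text, children."""
    return (
        elem.tag,
        sorted(elem.attrib.items()),
        elem.text or "",
        [(shape(child), child.tail or "") for child in elem],
    )


indented = ET.fromstring("<element>\n  <element2>Some content.</element2>\n</element>")
compact = ET.fromstring("<element><element2>Some content.</element2></element>")

differ_as_parsed = shape(indented) != shape(compact)
same_after_stripping = shape(drop_whitespace_only_text(indented)) == shape(compact)
```

As parsed, the two fragments differ because of the indentation; after the whitespace-only text is dropped, they compare equal.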

A References

RFC 2119
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. IETF. March, 1997. (See http://www.ietf.org/rfc/rfc2119.txt.)
XML Schema Part 1: Structures
XML Schema Part 1: Structures, Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn, Editors. World Wide Web Consortium, 02 May 2001. This version is http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/. The latest version is available at http://www.w3.org/TR/xmlschema-1/.
XQuery 1.0 and XPath 2.0 Data Model
XQuery 1.0 and XPath 2.0 Data Model, Marton Nagy, Norman Walsh, Mary Fernández, et al., Editors. World Wide Web Consortium, 12 Nov 2003. This version is http://www.w3.org/TR/2003/WD-xpath-datamodel-20031112/. The latest version is available at http://www.w3.org/TR/xpath-datamodel/.
XML Information Set (Second Edition)
XML Information Set (Second Edition), John Cowan and Richard Tobin, Editors. World Wide Web Consortium, 04 Feb 2004. This version is http://www.w3.org/TR/2004/REC-xml-infoset-20040204. The latest version is available at http://www.w3.org/TR/xml-infoset.
XML Path Language (XPath) Version 1.0
XML Path Language (XPath) Version 1.0, James Clark and Steven DeRose, Editors. World Wide Web Consortium, 16 Nov 1999. This version is http://www.w3.org/TR/1999/REC-xpath-19991116. The latest version is available at http://www.w3.org/TR/xpath.

B Examples (Non-Normative)

This appendix provides a few examples to help to clarify what we mean by “the same information.” Unless otherwise stated, there are no unshown, in-scope namespace bindings in any of these examples.

<element-one/>

attr="value"

These information items are different: they are not the same kind of information item.

<element-one/>

<element-two/>

These elements are different: they have different [local name]s.

<element xmlns="http://example.org/ns-one"/>

<element xmlns="http://example.org/ns-two"/>

These elements are different: they have different [namespace name]s.

<element attr1="value1"/>

<element attr1="value1" attr2="value2"/>

These elements are different: they have different [attributes].

<element attr1="value1"/>

<element attr1="a different value"/>

These elements are different: they have different attribute values.

<element attr1="value1" attr2='value2'/>

<element attr2="value2" attr1="value1"/>

These elements are the same: attribute quoting and order are insignificant in the infoset.

<element xmlns="http://example.org/ns"/>

<x:element xmlns:x="http://example.org/ns"/>

These elements are the same: namespace prefix bindings on element and attribute names are not significant in the infoset.

<x:element xmlns:x="http://example.org/ns" attr="x:name"/>

<y:element xmlns:y="http://example.org/ns" attr="y:name"/>

These elements are different: they have different attribute values. The infoset does not have attribute values of type “QName”, so it is not possible to determine if the attribute in this case actually contains a QName or if it just contains different characters. This specification compares the characters.
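
The contrast between prefixes in names and prefixes in attribute values can be observed with Python's xml.etree.ElementTree, used here only as a convenient stand-in for an infoset. The parser resolves the prefix in the element name but leaves the attribute value as plain characters.

```python
import xml.etree.ElementTree as ET

x = ET.fromstring('<x:element xmlns:x="http://example.org/ns" attr="x:name"/>')
y = ET.fromstring('<y:element xmlns:y="http://example.org/ns" attr="y:name"/>')

# The element names resolve to the same namespace name and local name.
same_name = x.tag == y.tag
# The attribute values are compared as character strings, so they differ.
same_attr = x.get("attr") == y.get("attr")
```

Here same_name is true and same_attr is false, so the elements as a whole are different.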

<element>Montréal</element>

<element>Montr&#233;al</element>

These elements are the same: encoding differences are not significant.
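
A short check of this case, again using Python's xml.etree.ElementTree as a stand-in for an infoset. The third fragment, serialized as ISO-8859-1, is an added assumption to show that the document's character encoding makes no difference either.

```python
import xml.etree.ElementTree as ET

# "\u00e9" is the character é, written as an escape to keep this source ASCII.
literal = ET.fromstring("<element>Montr\u00e9al</element>")
char_ref = ET.fromstring("<element>Montr&#233;al</element>")
latin1 = ET.fromstring(
    "<?xml version='1.0' encoding='ISO-8859-1'?>"
    "<element>Montr\u00e9al</element>".encode("iso-8859-1")
)

# All three parse to the same sequence of characters.
all_same = literal.text == char_ref.text == latin1.text
```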

<element xml:lang="us-EN">
  <element>Some content.</element>
</element>

<element xml:lang="us-EN">
  <element xml:lang="us-en">Some content.</element>
</element>

These elements are the same: all of the elements have the same content and are in the same language.

<element xsi:type="xs:double">3.0</element>

<element xsi:type="xs:double">3</element>

These elements are different. The comparison is based on the infoset, not on properties of the PSVI, even if the content might be the same under some other interpretations.
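
The difference between the infoset view and a typed view can be shown in a few lines of Python. The namespace declarations for the xsi and xs prefixes are added here as an assumption so that the fragments parse on their own, and float() stands in for a schema-aware interpretation of xs:double.

```python
import xml.etree.ElementTree as ET

# Assumed bindings for the prefixes used in the example above.
NS = (
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xmlns:xs="http://www.w3.org/2001/XMLSchema"'
)
a = ET.fromstring(f'<element {NS} xsi:type="xs:double">3.0</element>')
b = ET.fromstring(f'<element {NS} xsi:type="xs:double">3</element>')

infoset_equal = a.text == b.text              # character-by-character comparison
typed_equal = float(a.text) == float(b.text)  # a value-based interpretation
```

The character comparison reports a difference, while the value-based comparison does not; this finding uses the former.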

<element>
  <element2>Some content.</element2>
</element>

<element><element2>Some content.</element2></element>

These elements are different. In the first case, the element has three children; in the second, it has only one.

<element>Some content.</element>

<element>Some
content.</element>

These elements are different. New line characters are not normalized in element content.

<element attr="one two">Some content.</element>

<element attr="one
two">Some content.</element>

These elements are the same. New line characters and whitespace are normalized in attribute content.
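
The last two examples can be checked together with Python's xml.etree.ElementTree, which relies on a conforming parser for attribute-value normalization. This is a sketch for illustration; the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

one_line = ET.fromstring('<element attr="one two">Some content.</element>')
two_lines = ET.fromstring('<element attr="one\ntwo">Some\ncontent.</element>')

# The parser replaces the line break in the attribute value with a space.
same_attr = one_line.get("attr") == two_lines.get("attr")
# The line break in element content is kept, so the content differs.
same_text = one_line.text == two_lines.text
```

Here same_attr is true and same_text is false.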