W3C

How should the problem of identifying ID semantics in XML languages be addressed in the absence of a DTD?

TAG finding, 30 November 2004

This version:
http://www.w3.org/2001/tag/doc/xmlIDSemantics-32-20041130.html
Latest version:
http://www.w3.org/2001/tag/doc/xmlIDSemantics-32.html
Previous version:
http://www.w3.org/2001/tag/doc/xmlIDSemantics-32-20040512.html
Editor:
Chris Lilley, W3C

Abstract

The architecture of the Web benefits from being able to label or point to information at a granularity finer than a complete resource. For XML Media Types, the identifier mechanism has to date been the declaration of identifiers (IDs) using DTD or Schema mechanisms which are, however, optional for conformant XML processors. There is thus an issue when it is desired to have ID-like functionality for parsers which do not fetch an external DTD or Schema, or in the complete absence of a DTD or Schema. This document addresses the issue xmlIDSemantics-32: How should the problem of identifying ID semantics in XML languages be addressed in the absence of a DTD?

Status of this document

This document has been developed for discussion by the W3C Technical Architecture Group and other interested parties. This finding addresses issue xmlIDSemantics-32. It collects together some possible solutions and discusses their strengths and weaknesses. It was updated in the light of the Last Call Working Draft of xml:id.

At their 12 May 2004 meeting, the TAG approved this finding. Publication of this finding does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time.

Additional TAG findings, both approved and in draft state, may also be available. The TAG expects to incorporate this and other findings into a Web Architecture Document that will be published according to the process of the W3C Recommendation Track.

Please send comments on this finding to the publicly archived TAG mailing list www-tag@w3.org (archive).


0. Introduction

Validation and typing are separable but often conflated concepts. IDness is (in XML 1.0, and without other schema languages) a consequence of parsing a DTD, not of validation. All XML implementations are required to be able to parse the internal DTD subset. They may, optionally, fetch and parse the external DTD subset.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ATTLIST foo partnum ID #IMPLIED> ]>
<foo  partnum="i54321" bar="toto"/>

This instance is well formed, but is not valid and cannot be validated because there are undeclared elements and attributes. However, the partnum attribute on foo is of type ID.
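The point that IDness follows from reading the declaration, not from validating, can be illustrated with a non-validating parser. The following is a minimal sketch, assuming Python's standard expat binding; the function name find_ids is hypothetical, and the sketch is one possible illustration, not a normative procedure.

```python
import xml.parsers.expat

DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ATTLIST foo partnum ID #IMPLIED> ]>
<foo partnum="i54321" bar="toto"/>"""


def find_ids(data):
    """Return {ID value: element name}, using only the internal DTD subset."""
    id_attrs = set()  # (element name, attribute name) pairs declared as ID
    ids = {}

    def on_attlist(elname, attname, type_, default, required):
        # Called for each attribute declared in an ATTLIST declaration.
        if type_ == "ID":
            id_attrs.add((elname, attname))

    def on_start(name, attrs):
        # Record the value of any attribute that was declared to be an ID.
        for attname, value in attrs.items():
            if (name, attname) in id_attrs:
                ids[value] = name

    parser = xml.parsers.expat.ParserCreate()  # expat does not validate
    parser.AttlistDeclHandler = on_attlist
    parser.StartElementHandler = on_start
    parser.Parse(data, True)
    return ids


print(find_ids(DOC))
```

The undeclared bar attribute and the missing element declaration do not stop the parse, since no validation is attempted; only partnum is reported as an ID.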

The concept of IDness, which exists in well formed documents, should be distinguished from the three validation constraints that XML places on IDs:

Validity constraint: ID
Values of type ID must match the Name production. A name must not appear more than once in an XML document as a value of this type; i.e., ID values must uniquely identify the elements which bear them.
Validity constraint: One ID per Element Type
No element type may have more than one ID attribute specified.
Validity constraint: ID Attribute Default
An ID attribute must have a declared default of #IMPLIED or #REQUIRED.

Clearly, these constraints only apply in the case of a validatable and validated document.

validity constraint
[Definition: A rule which applies to all valid XML documents. Violations of validity constraints are errors; they must, at user option, be reported by validating XML processors.]

Well-formed XML, in consequence, can have IDs (at the expense of a possibly large internal DTD subset), but there is a desire to have widely interoperable XML identifiers even in the absence of any DTD.

1. What is the problem?

1.1 Consistency: definitions

Different specifications use different definitions of ID-ness. This is easily fixed with a W3C REC that might be entitled "Definition of element identifiers in XML." All future specs could point at that explicitly and most people would interpret the older ones as pointing at it implicitly. It could even explicitly override the obsolete definitions.

1.2 Modernity: Schemas and the Infoset

DTDs are losing popularity. The schema specification implies that schemas can invoke "ID-ness" but is it explicit? Are the infoset annotations caused by a schema declaration of "ID-ness" equivalent to the ones caused by a DTD declaration of "ID-ness"? If yes, then every specification built on the infoset would pick these up "for free". If no, then they wouldn't.

1.3 Freedom from external resources

It is not always appropriate or efficient to fetch a DTD or schema in order to figure out what the IDs of elements are. In some cases, such as the use of XML in protocols, external resources may be prohibited due to efficiency or security concerns. Therefore, it might make sense for there to be an "inline" way to declare an element's ID (either on the element itself, or through some kind of idAttr indirection).

2. Evidence of breakage

Many XML-related specs depend on knowing which attributes are of type ID - but all do it differently. Some specs assume that all documents with DTDs have IDs (untrue), or that parsers only fetch the DTD for HTML but not XML, or that the implementation has some 'built in' knowledge of particular namespaces and the attributes associated with them that are of type ID.

Some implementations assume that any attribute called id is of type ID; or conversely, assume it for HTML but no longer recognise the IDness on the same HTML document when using their well-formed XML parser instead of their 'real-world HTML' parser.

The result is confusion for developers when implementing multiple W3C specifications, and a lack of interoperability for the well-formed, non-validated case. Since document authors cannot indicate, in the instance, that they want or need validation, the content creation community cannot itself fix the problem.

2.1 DOM Level 2

Access to a specific element node in the Document Object Model is frequently by way of the getElementById method on the document object. Thus, it relies on the ID. The DOM Level 2 specification is vague on exactly how an implementation may "know" that a particular attribute is of type ID, which could include recognizing a "well known" namespace.

Note: The DOM implementation must have information that says which attributes are of type ID. Attributes with the name "ID" are not of type ID unless so defined. Implementations that do not know whether attributes are of type ID or not are expected to return null.
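The behaviour described in this note can be observed in practice. The following is a minimal sketch, assuming Python's xml.dom.minidom as the DOM implementation; the document and element names are made up for illustration. Note that setIdAttribute is a DOM Level 3 method, used here only to show what changes once the implementation is told which attribute is an ID.

```python
from xml.dom import minidom

# No DTD is present, so nothing tells the DOM that "id" is of type ID.
doc = minidom.parseString('<doc><item id="a1"/></doc>')
before = doc.getElementById("a1")

# The application asserts IDness itself (DOM Level 3, not Level 2).
item = doc.getElementsByTagName("item")[0]
item.setIdAttribute("id")
after = doc.getElementById("a1")

print(before, after is item)
```

In this sketch the first lookup returns null (None), as the note expects, and the second lookup finds the element only because the application supplied the type information out of band.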

2.2 CSS2

Stylistic overrides on a per-element basis generally use an ID selector, as a preferable alternative to a style attribute on the element. The CSS spec is correct that in XML 1.0, documents without a DTD do not have IDs. It implies, or at least leaves unspecified, that documents with a DTD (whether internal, external, or both) do have IDs. It also does not cover the case of a document that has no DTD but does have a W3C XML Schema that declares some attributes to be of type ID.

Note. In XML 1.0 [XML10], the information about which attribute contains an element's IDs is contained in a DTD. When parsing XML, UAs do not always read the DTD, and thus may not know what the ID of an element is. If a style sheet designer knows or suspects that this will be the case, he should use normal attribute selectors instead: [name=p371] instead of #p371. However, the cascading order of normal attribute selectors is different from ID selectors. It may be necessary to add an "!important" priority to the declarations: [name=p371] {color: red ! important}. Of course, elements in XML 1.0 documents without a DTD do not have IDs at all.

2.3 XHTML 1.0

XHTML 1.0 is a conversion of HTML 4 into XML syntax, for delivery to existing tag-soup HTML browsers. It thus differentiates between the processing of an XHTML 1.0 document served as text/html and the same document served as 'generic XML'. The cited text is probably intended to differentiate between the newer 'id' attribute and the older 'name' attribute on the 'a' element; however, it seems to imply special processing for one use (pointing) of IDs, leaving the other uses (styling, DOM, etc.) unspecified, and is open to misinterpretation.

When a user agent processes an XHTML document as generic XML, it shall only recognize attributes of type ID (i.e. the id attribute on most XHTML elements) as fragment identifiers.

2.4 SOAP

For performance and security reasons, SOAP does not permit the use of internal or external DTD subsets. Thus, the attributes of type ID do not exist, except in the human-readable prose of the specification, which describes the type in terms of a Post Schema Validation Infoset (PSVI).

The type of the id attribute information item is xs:ID. The value of the id attribute information item is a unique identifier that can be referred to by a ref attribute information item (see 3.1.5.2 ref Attribute Information Item).

3. Validity-based solutions

These solutions address the problem by insisting that validation be used to create IDness and that documents which are not valid have no IDs.

3.1 Require DTD validation for IDness

This solution re-affirms, XML 1.0 to the contrary, that IDness is, or should be, or was always intended to be, a result of validation. If there is no validation, there are no IDs and any specification that allows or uses IDness - CSS, DOM, XPath, XHTML, SVG, whatever - is wrong and should be amended to not allow this.

This solution enforces consistency by removing functionality; it does not actually solve the problem as stated. It merely declares that any breakage that occurs is someone else's problem, and it invalidates a large amount of existing usage. It makes the 'well formed' class of XML significantly less useful.

3.2 Require Full W3C XML Schema validation of all instances

A fully validating XML processor will, almost as a side effect, result in all attributes of type ID being so noted in the Post Schema Validation Infoset.

This solution does get around the need for a DTD. It also uses an existing mechanism that is starting to see deployment. However, it is a somewhat heavyweight solution just to get IDs and is thus unlikely to see significant uptake in areas such as mobile devices or XML messaging. It thus risks further severing those application areas from the 'desktop Web'. Essentially, precisely those areas that are looking for IDness without DTDs are the ones least likely to find mandatory Schema validation an acceptable solution.

Note: Is it true that W3C XML Schema processing is always validation as well?

4. Schema or DTD based solutions

These solutions address the problem by insisting that IDness only comes from DTD or Schema processing (irrespective of whether that processing is related to validation).

4.1 Require use of internal DTD subset

Forcing all ID declarations to be in the internal DTD subset would ensure that all XML processors, validating or not, would assign the same types for ID attributes. However, the flat list of ATTLIST declarations would conflict somewhat with the modular and extensible structure of many DTDs, and adding the sometimes cumbersome machinery needed to deal with namespace prefixes in DTDs would also limit the utility of this solution. It would not solve the possible conflicts with types assigned by Schema processing and does not address the case where internal or external DTD subsets are not to be used.

The fatal flaw of this solution is that, while improving clarity for spec writers and authors and perhaps improving interoperability, it does not provide a satisfactory answer for the question as posed, in other words in the absence of a DTD.

This option was previously, and inaccurately, described as "Require DTD validation of all instances".

4.2 Add #ALL from SGML TC2

This solution uses a much shorter syntactic form from the latest version of SGML. It allows 'common' attributes that exist on all elements to be declared with just a single declaration, for example:

<!ATTLIST #ALL id ID #IMPLIED>

This would allow much shorter internal DTD subsets in the common case where all attributes of type ID have the same name. However, it would break well-formedness backwards compatibility with XML 1.0 and 1.1. It is not clear how well this solution would work in a multiple namespace document where all the ID attributes for one namespace had one name, but all the ID attributes in a different namespace had a different name. It also does not directly address the question as posed, in other words in the absence of a DTD, although it does make the pain of using an internal subset more bearable.
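The backwards compatibility concern can be checked against an existing XML 1.0 parser. The following is a minimal sketch, assuming Python's standard expat binding; the helper name is_well_formed and the sample document are hypothetical.

```python
import xml.parsers.expat

# Hypothetical document using the SGML-style #ALL shorthand.
DOC = """<!DOCTYPE root [
<!ATTLIST #ALL id ID #IMPLIED> ]>
<root id="r1"/>"""


def is_well_formed(text):
    """Return True if an XML 1.0 parser accepts the text without error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


print(is_well_formed(DOC))                          # the #ALL form
print(is_well_formed(DOC.replace("#ALL", "root")))  # ordinary ATTLIST
```

An existing parser rejects the #ALL form because an element name is required at that position, whereas the same declaration naming the root element is accepted. Deployed XML 1.0 processors would therefore fail on documents using the new syntax.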

4.3 Require Minimal W3C XML Schema processing of all instances

XML Schema also provides for a "Minimally conformant Schema processor" which might provide a lightweight option. Such a schema processor need not accept arbitrary schemas as input, instead it may be hardwired to apply a single schema such as this one:

  <xs:schema targetNamespace="http://www.w3.org/XML/1998/namespace"
             xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:attribute name="id" type="xs:ID"/>
  </xs:schema>
  

An application applying this mechanism would likely be indistinguishable in its effects from an application just looking for xml:id attributes and applying IDness to them. To enable applications to share a "built-in schema" such as this one, a standardized name for the ID attribute is necessary. xml:id (see Reserved attribute solutions below) is one way to standardise such an attribute without fear of name clashes from other namespaces.

An advantage of this approach is that it does not conflict with existing validation-based ID detection. If a DTD is present, the application may recognize IDs declared therein, and the author of the DTD should control whether ID attributes are spelled "xml:id" or some other way. If a Schema is present, the application need not apply the built-in xml:id schema, and the author of the Schema can control whether ID attributes are spelled "xml:id" or some other way. Since the ID-consuming application is responsible for applying the xml:id schema when no other validation mechanism exists, a new version of XML is unnecessary. The option, in sum, unreserves the xml:id attribute name, while encouraging DTD and Schema authors to use this spelling in their schemas, and encouraging applications to apply a minimal built-in schema when full validation is not performed.

A disadvantage is that, when full validation is performed, the xml:id attributes may end up having a different type. It would not be considered appropriate for another spec to re-define the semantics of xml:base, for example, so it is unclear what advantage is to be gained by allowing redeclaration of an xml:id.

5. Reserved attribute solutions

This class of solution addresses the problem by making certain attributes always be of type ID. Other attributes could still be of this type, of course. A possible corollary is that other attributes are *only* of type ID if declared to be so in a DTD or Schema.

5.1 Steal the string "id"

If an attribute is called 'id' it is of type ID. If you want an interoperable IDness then you must call your attribute id. If you previously had attributes called id and you don't want them to be of type ID, change their names.

This is a somewhat brutal solution, especially for any content that already has attributes called 'id' that are not of type ID. On the other hand, a very large percentage of existing XML usage does indeed call its ID attributes 'id'. This solution is an XML language change.

5.2 Steal the string "id" if undeclared

In well formed content that does not have a DTD, or that has a partial DTD used for decoration (declaring ID, declaring attribute defaults, etc.), if an attribute is called id and has not been declared in the DTD, it is of type ID. This is slightly better than the 'steal all occurrences' option, in that it does not change any pre-existing declarations. However, it does result in different processing depending on whether the DTD is read or not. At minimum, it would need authoring guidelines that warn against the consequences of a DTD declaring any 'id' attribute to be of a type other than ID. In practice, this is therefore the same as the previous option: it steals all unqualified attributes called 'id'.

5.3 Add a predeclared id attribute of type ID to the xml namespace

The obvious solution to avoiding a clash with existing usage is to use a namespace-qualified attribute name; and the obvious solution for avoiding a dependence on a DTD is to use the reserved xml namespace, which does not need declaration. A new xml:id attribute (added in the same way that xml:base added a predeclared attribute to the existing xml:lang and xml:space attributes), predeclared to be of type ID (either using the DTD or, preferably, the schema definitions of IDness), could be used by any XML vocabulary that wanted interoperability even in the case that DTDs were not being read. Because it is predeclared, it could not clash with whatever a DTD or schema said, so processing would be identical for validating parsers, non-validating but DTD-reading parsers, and non-DTD-reading parsers.

This solution is easy to understand. A disadvantage is that XML vocabulary specifications would need to be revised to use this new attribute name, and the XML specification itself (or a supplemental specification, similar to that for xml:base) would need to describe the functionality. Older content would be no better off, but no worse off either. It preserves backward compatibility and does not affect well-formedness. Other attribute names, of type ID, may also be used alongside it (but validation is needed to enforce the constraint that only one attribute may be of type ID on a given element). This is similar to the "Require Minimal W3C XML Schema processing of all instances" option, but the minimal schema for xml:id is always applied, even if other schemas are present, and cannot be overridden.
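An application that recognizes a predeclared xml:id attribute needs no DTD or schema at all. The following is a minimal sketch, assuming Python's xml.etree.ElementTree; the function name index_xml_ids and the sample document are hypothetical, and the sketch ignores any value normalization or name checking that a real xml:id processor would perform.

```python
import xml.etree.ElementTree as ET

# The xml prefix is always bound to this namespace; no declaration is needed.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def index_xml_ids(root):
    """Map each xml:id value to its element; reject duplicate values."""
    index = {}
    for elem in root.iter():
        value = elem.get(XML_ID)
        if value is None:
            continue
        if value in index:
            raise ValueError(f"duplicate xml:id value: {value!r}")
        index[value] = elem
    return index


root = ET.fromstring('<doc><part xml:id="p1"/><part xml:id="p2" id="x"/></doc>')
ids = index_xml_ids(root)
print(sorted(ids))
```

The unqualified id attribute on the second part element is left alone; under this option only a DTD or schema could make it an ID.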

6. Inline declaration solutions

This class of solutions addresses the problem by adding a new, inline declaration mechanism to XML that can be used without DTDs or as a lightweight replacement for some uses of DTDs. Clearly these are all XML language changes. On the whole they do not affect well-formedness.

6.1 Add an inline, per-instance ID declaration method

In the same way that xml:base added a predeclared attribute to the existing xml:lang and xml:space attributes, add another one called xml:idAttr. It takes as value the local name of an attribute. All attributes of that name in the per-element partition become of type ID. It may only be used on the root element of the instance. For example:

<root xml:idAttr="foo" foo="x1">
    <subelement foo="a2"/>
</root>

This is fairly easy to explain, as it uses existing XML instance syntax. It does not inadvertently affect any existing content, and minimises the effect on revisions of XML grammars that want to make use of it - no renaming of existing attributes is required. A disadvantage is that the inline declaration might conflict with what the DTD or the Schema says about the type of this attribute. This would at minimum need to be addressed by authoring guidelines. The restriction to the root element hinders composability, which is a pity since a potential benefit of an inline declaration syntax is an increase in composability.

A possible enhancement is to accept either a local name or a QName; if it is a QName, then resolve it to a (namespace URI, local name) pair on the element that has xml:idAttr, and then all attributes with that local name in that namespace are of type ID.
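The per-instance option, in its basic local-name form, could be processed as follows. This is a minimal sketch, assuming Python's xml.etree.ElementTree; xml:idAttr is the hypothetical attribute proposed in this section, not part of any XML specification, and the function name index_by_root_declaration is likewise made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute from this option; not defined by any XML specification.
XML_IDATTR = "{http://www.w3.org/XML/1998/namespace}idAttr"

DOC = """<root xml:idAttr="foo" foo="x1">
    <subelement foo="a2"/>
</root>"""


def index_by_root_declaration(root):
    """Treat the unprefixed attribute named by the root's xml:idAttr as an ID."""
    name = root.get(XML_IDATTR)
    if name is None:
        return {}  # no inline declaration, so no IDs from this mechanism
    # Unprefixed attribute names are stored without a namespace, so this
    # lookup only matches attributes in the per-element partition.
    return {e.get(name): e for e in root.iter() if e.get(name) is not None}


ids = index_by_root_declaration(ET.fromstring(DOC))
print(sorted(ids))
```

Both x1 and a2 are found without any DTD. A document lacking the declaration on its root element yields no IDs from this mechanism.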

6.2 Add an inline, per subtree ID declaration method

In the same way that xml:base added a predeclared attribute to the existing xml:lang and xml:space attributes, add another one called xml:idAttr. It takes as value the local name of an attribute. All attributes of that name in the per-element partition, on that element and its children become of type ID.

It can be used on any element. It can also take the value "" in which case, no attributes on that element or its children are declared to be of type ID (used when composing multiple namespaces). Such declarations are scoped; a new declaration replaces, rather than adding to, the currently in-scope ID attribute. This avoids the situation where an element could have two attributes both of type ID.

A possible enhancement is to accept either a local name or a QName; if it is a QName, then resolve it to a (namespace URI, local name) pair on the element that has xml:idAttr, and then all attributes with that local name in that namespace are of type ID.
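The scoping rules of the per-subtree option (a new declaration replaces the one in scope, and an empty value switches IDs off) can be sketched as a recursive walk. This is a minimal sketch, assuming Python's xml.etree.ElementTree and assuming that "children" means all descendants; xml:idAttr remains a hypothetical attribute and index_scoped a made-up name.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute from this option; not defined by any XML specification.
XML_IDATTR = "{http://www.w3.org/XML/1998/namespace}idAttr"

DOC = """<a xml:idAttr="id" id="a1">
  <b xml:idAttr="name" id="ignored" name="b1"><c name="c1"/></b>
  <d xml:idAttr="" id="none"/>
  <e id="e1"/>
</a>"""


def index_scoped(elem, in_scope=None, index=None):
    """Collect IDs, honouring the xml:idAttr declaration currently in scope."""
    if index is None:
        index = {}
    declared = elem.get(XML_IDATTR)
    if declared is not None:
        # A new declaration replaces the inherited one; "" means no IDs.
        in_scope = declared or None
    if in_scope is not None and elem.get(in_scope) is not None:
        index[elem.get(in_scope)] = elem
    for child in elem:
        index_scoped(child, in_scope, index)
    return index


ids = index_scoped(ET.fromstring(DOC))
print(sorted(ids))
```

Within the b subtree only name is an ID, so the id attribute there is ignored; d has no IDs at all; e falls back to the declaration on a. At no point does an element have two attributes of type ID.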

7. Discussion

Requiring DTD validation to get IDs is too big a retrogressive step; it essentially throws away well formedness as a concept and also XML namespaces, and needlessly conflates validation with decoration.

Requiring Full W3C XML Schema validation to get IDs is probably too big a forwards step; it adds a lot of machinery to get a simple but crucial step forward and needlessly conflates validation with decoration. However, the use of a predeclared schema coupled with mandatory, minimally-conforming Schema processing could mitigate this disadvantage somewhat.

It would seem preferable that, if an inline declaration method is chosen, W3C XML Schema be revised so that the behavior of documents that use the inline declaration and also use a W3C XML Schema is consistent with regard to the attribute declared of type ID in the instance, whether the Schema is used or not. In other words, an implicit declaration in the instance is the same in the PSVI as if the attribute had been declared of type ID in the Schema, except for that part of the PSVI that traces which Schema provided the rule; that part would report that the instance provided the rule.

The fact that a type ID declared in a DTD is not exactly equivalent to a type ID declared in a W3C XML Schema is noted, but not directly addressed in this discussion. Some solutions would give a DTD IDness, other solutions would give a Schema IDness, and yet other solutions could choose the appropriate definition or perhaps give both definitions. In general, defining identity at the infoset level and then working from that back to syntax has been suggested as a sound way to approach the problem, and is the approach taken by an xml:id processor when informing the application of ID assignment.

Requiring specifications that create identity (DTD, Schema, DOM Level 3 Validation) to reference that single definition would also help. Further, the very ability of Schemas to allow conflicting property assignment has itself been seen as a problem.

Most (but not all) attributes called id are of type ID. Most (but not all) attributes of type ID are called id. 100% of single-namespace documents could be brought into conformance with the inline, per-instance ID declaration method by adding a single attribute to the root element. 99% of them would be brought into conformance by adding:

xml:idAttr="id"

to the root element. Crucially, the 1% that do not are still catered for, simply by declaring whatever the attribute is called - a big advantage over other options.

For full generality for multi-namespace documents, the scoped solution would give the greatest expressive power. However, there is a dislike of scoped solutions among some XML practitioners (which is odd considering that xml:base and xml:lang are both scoped).

Given that many specifications choose to use a single name for all their id attributes, and that specifications are periodically revised, then a phased changeover to xml:id as the preferred ID attribute name could yield benefits, if the specification for xml:id encouraged its use and if, say, all W3C XML specifications were expected to change over to using xml:id for their document-scope identifiers (ie, unless they had a particular reason for a different semantic). The more specifications that changed over, the less the drawbacks of not having a scoped syntax would be in practice.

Some people have suggested that this problem is a specific example of a more general problem: that 'well formed' is a broken concept and should be removed or deprecated, to be replaced with two other conformance classes. One is full infoset processing, where DTDs (including external subsets and external entities) are always fetched, but validation is not done. The second class would require conforming processors to not read the internal DTD subset if present, and to not fetch any external DTD subset. This would give an increase in interoperability, or at least consistency, particularly if document instances had some means of declaring which type of processing they expected or if there were some external XML processing or pipelining work, perhaps as a result of work on XML packaging. The TAG does not take a position on this wider scope, but merely notes that even were these extra two conformance classes to exist, the problem of getting IDness in the lowest (no DTD processing) class would still not be solved.

The desire of some people to have new ID-like types that are, for example, unique among elements of a given type, or unique in a particular namespace, or unique in a given subtree, is also not addressed in this document, although it is appreciated that some of these new types would aid in composability. In particular, wrapper formats such as SOAP encoding benefit from having identifiers that are not in scope for the whole document, and which are only unique among attributes of the same name, such as soap-enc:id.

Other limitations of existing XML IDness not addressed in this finding are the restriction to alphanumeric keys (so id="42" is currently invalid), the use of a string basetype (even without the XML NAME constraint, id="00042" will not match idref="42"), or the restriction to single keys. Some may feel that short-term work that does not address these issues delays the day when a more radical revision of the XML identity mechanism is standardized.

8. Conclusions

This document is a summary made available by the TAG; it is hoped that it lists all the solutions that have been proposed, and their advantages and disadvantages, which should help any group chartered to deal with this problem. The TAG was unhappy with the current situation and was pleased to see further discussion on this happening in the XML Core WG, and the publication of a requirements document and a working draft. The TAG reviewed this document at Last Call, and would like to see convergence on the solution proposed by XML Core, followed by its widespread adoption.

9. Acknowledgments

The contribution of TAG members and of subscribers to the www-tag mailing list is acknowledged. Textual contributions by Paul Grosso, Noah Mendelsohn, Murata Makoto, Jonathan Marsh, and Paul Prescod have improved this draft finding.

10. References