How should the problem of identifying ID semantics in XML languages be addressed in the absence of a DTD?

What is the problem

Validation and typing are separable but often conflated concepts. IDness is a consequence of parsing a DTD, not of validation. All XML implementations are required to be able to parse the internal DTD subset. They may, optionally, fetch and parse the external DTD subset.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ATTLIST foo partnum ID #IMPLIED> ]>
<foo  partnum="i54321" bar="toto"/>

This instance is well formed, but is not valid and cannot be validated because there are undeclared elements and attributes. However, the partnum attribute on foo is of type ID.
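The distinction can be observed with any non-validating parser that reads the internal subset. The following is a minimal sketch, assuming Python's standard-library expat bindings; the names DOC and declared_id_attributes are illustrative and not part of any specification. It collects the attributes declared to be of type ID without performing validation.

```python
import xml.parsers.expat

# The instance from the text: well formed, not valid, but partnum is of type ID.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ATTLIST foo partnum ID #IMPLIED> ]>
<foo partnum="i54321" bar="toto"/>"""


def declared_id_attributes(xml_bytes):
    """Return the set of (element, attribute) pairs declared as type ID."""
    id_attrs = set()

    def on_attlist(elname, attname, attr_type, default, required):
        # Called once for each attribute declared in an ATTLIST declaration.
        if attr_type == "ID":
            id_attrs.add((elname, attname))

    parser = xml.parsers.expat.ParserCreate()  # expat does not validate
    parser.AttlistDeclHandler = on_attlist
    parser.Parse(xml_bytes, True)
    return id_attrs


print(declared_id_attributes(DOC))
```

The parse succeeds even though foo and bar are undeclared, because no validity constraint is checked; only the type information from the ATTLIST declaration is recorded.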

The concept of IDness, which exists in well formed documents, should be distinguished from the three validation constraints that XML places on IDs:

Validity constraint: ID

Values of type ID must match the Name production. A name must not appear more than once in an XML document as a value of this type; i.e., ID values must uniquely identify the elements which bear them.

Validity constraint: One ID per Element Type

No element type may have more than one ID attribute specified.

Validity constraint: ID Attribute Default

An ID attribute must have a declared default of #IMPLIED or #REQUIRED.

Clearly, these constraints only apply in the case of a validatable and validated document.

validity constraint

[Definition: A rule which applies to all valid XML documents. Violations of validity constraints are errors; they must, at user option, be reported by validating XML processors.]

Many XML-related specs depend on knowing which attributes are of type ID - but all do it differently. Some specs assume that all documents with DTDs have IDs (untrue), or that parsers only fetch the DTD for HTML but not XML, or that the implementation has some 'built in' knowledge of particular namespaces and the attributes associated with them that are of type ID.

Some implementations assume that any attribute called id is of type ID; or conversely, assume it for HTML but no longer recognise the IDness on the same HTML document when using their well-formed XML parser instead of their 'real-world HTML' parser.

The result is confusion for developers when implementing multiple W3C specifications, and a lack of interoperability for the well-formed, non-validated case. Since document authors cannot indicate, in the instance, that they want or need validation, the content creation community cannot themselves fix the problem.

Evidence of breakage

The specifications below illustrate the inconsistency: each depends on knowing which attributes are of type ID, and each determines it differently.

DOM Level 2

Access to a specific element node in the Document Object Model is frequently by way of the getElementById method on the document object. Thus, it relies on the ID. The DOM Level 2 spec is vague on exactly how an implementation may "know" that a particular attribute is of type ID - including recognizing a "well known" namespace.

Note: The DOM implementation must have information that says which attributes are of type ID. Attributes with the name "ID" are not of type ID unless so defined. Implementations that do not know whether attributes are of type ID or not are expected to return null.
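As an illustration of this behaviour in one implementation, here is a small sketch using Python's xml.dom.minidom. The sample documents and variable names are invented for this example, and other DOM implementations may behave differently.

```python
from xml.dom import minidom

# Same content twice: once with no DTD, once with an internal subset
# that declares the id attribute of item to be of type ID.
NO_DTD = '<doc><item id="a1"/></doc>'
WITH_DTD = """<!DOCTYPE doc [
<!ATTLIST item id ID #IMPLIED> ]>
<doc><item id="a1"/></doc>"""

# Without a declaration, the attribute named "id" is not of type ID,
# so the lookup has nothing to match against.
without_decl = minidom.parseString(NO_DTD).getElementById("a1")

# With the internal subset, the parser records the declared type.
with_decl = minidom.parseString(WITH_DTD).getElementById("a1")

print(without_decl)
print(with_decl.tagName if with_decl is not None else None)
```

In this implementation the first lookup returns null (None) and the second returns the item element, even though the element content of the two documents is identical.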

CSS2

Stylistic overrides on a per-element basis generally use an ID selector, as a preferable alternative to a style attribute on the element. The CSS spec is correct that in XML 1.0, documents without a DTD do not have IDs. It implies, or at least leaves unspecified, that documents with a DTD (whether internal, external, or both) do have IDs. It also does not cover the case of a document that has no DTD but does have a W3C XML Schema that declares some attributes to be of type ID.

Note. In XML 1.0 [XML10], the information about which attribute contains an element's IDs is contained in a DTD. When parsing XML, UAs do not always read the DTD, and thus may not know what the ID of an element is. If a style sheet designer knows or suspects that this will be the case, he should use normal attribute selectors instead: [name=p371] instead of #p371. However, the cascading order of normal attribute selectors is different from ID selectors. It may be necessary to add an "!important" priority to the declarations: [name=p371] {color: red ! important}. Of course, elements in XML 1.0 documents without a DTD do not have IDs at all.

XHTML 1.0

XHTML 1.0 is a conversion of HTML 4 into XML syntax, for delivery to existing tag-soup HTML browsers. It thus differentiates between the processing of an XHTML 1.0 document served as text/html and the same document served as 'generic XML'. The cited text is probably intended to differentiate between the newer 'id' attribute and the older 'name' attribute on the 'a' element; however, it seems to imply special processing for one use (pointing) of IDs, leaving unspecified the other uses (styling, DOM, etc.) and is open to misinterpretation.

When a user agent processes an XHTML document as generic XML, it shall only recognize attributes of type ID (i.e. the id attribute on most XHTML elements) as fragment identifiers.

SOAP

For performance and security reasons, SOAP does not permit the use of internal or external DTD subsets. Thus, attributes of type ID do not exist, except in the human-readable prose of the specification, which describes the type in terms of a Post Schema Validation Infoset (PSVI).

The type of the id attribute information item is xs:ID. The value of the id attribute information item is a unique identifier that can be referred to by a ref attribute information item (see 3.1.5.2 ref Attribute Information Item).

Validity-based solutions

These solutions address the problem by insisting that validation be used to create IDness and that documents which are not valid have no IDs.

Require DTD validation for IDness

This solution re-affirms, XML 1.0 to the contrary, that IDness is, or should be, or was always intended to be, a result of validation. If there is no validation, there are no IDs and any specification that allows or uses IDness - CSS, DOM, XPath, XHTML, SVG, whatever - is wrong and should be amended to not allow this.

This solution enforces consistency by removing functionality; it does not actually solve the problem as stated. It merely declares that any breakage that occurs is someone else's problem, and it invalidates a large amount of existing usage. It makes the 'well formed' class of XML significantly less useful.

Require W3C XML Schema validation of all instances

A fully validating XML processor will, almost as a side effect, result in all attributes of type ID being so noted in the Post Schema Validation Infoset.

This solution does get around the need for a DTD. It also uses an existing mechanism that is starting to see deployment. However, it is a somewhat heavyweight solution just to get IDs and is thus unlikely to see significant uptake in areas such as mobile devices or XML messaging. It thus risks further severing those application areas from the 'desktop Web'. Essentially, precisely those areas that are looking for IDness without DTDs are the ones least likely to find mandatory Schema validation an acceptable solution.

Note: Is it true that Schema processing is always validation as well?

Schema or DTD based solutions

These solutions address the problem by insisting that IDness only comes from DTD or Schema processing (irrespective of whether that processing is related to validation).

Require use of internal DTD subset

Forcing all ID declarations to be in the internal DTD subset would ensure that all XML processors, validating or not, would assign the same types for ID attributes. However, the flat list of ATTLIST declarations would conflict somewhat with the modular and extensible structure of many DTDs, and adding the sometimes cumbersome machinery needed to deal with namespace prefixes in DTDs would also limit the utility of this solution. It would not solve the possible conflicts with types assigned by Schema processing and does not address the case where internal or external DTD subsets are not to be used.

The fatal flaw of this solution is that, while improving clarity for spec writers and authors and perhaps improving interoperability, it does not provide a satisfactory answer to the question as posed, in other words in the absence of a DTD.

This option was previously, and inaccurately, described as "Require DTD validation of all instances".

Add #ALL from SGML TC2

This solution uses a much shorter syntactic form from the latest version of SGML. It allows 'common' attributes that exist on all elements to be declared with just a single declaration, for example:

<!ATTLIST #ALL id ID #IMPLIED>

This would allow much shorter internal DTD subsets in the common case where all attributes of type ID have the same name. However, it would break well-formedness backwards compatibility with XML 1.0 and 1.1. It is not clear how well this solution would work in a multiple namespace document where all the ID attributes for one namespace had one name, but all the ID attributes in a different namespace had a different name. It also does not directly address the question as posed, in other words in the absence of a DTD, although it does make the pain of using an internal subset more bearable.
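The backwards-compatibility cost can be checked against an existing XML 1.0 parser. The sketch below assumes Python's xml.etree.ElementTree (built on expat); the helper name is_well_formed and the sample documents are illustrative only.

```python
import xml.etree.ElementTree as ET

# An internal subset that uses the proposed #ALL keyword.
ALL_DECL = """<!DOCTYPE root [
<!ATTLIST #ALL id ID #IMPLIED> ]>
<root id="r1"/>"""

# The same instance with an ordinary XML 1.0 declaration.
PLAIN_DECL = """<!DOCTYPE root [
<!ATTLIST root id ID #IMPLIED> ]>
<root id="r1"/>"""


def is_well_formed(xml_text):
    """Return True if an XML 1.0 parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(ALL_DECL))
print(is_well_formed(PLAIN_DECL))
```

An XML 1.0 parser rejects the #ALL form because #ALL is not a Name, so documents using it would be unreadable by deployed processors rather than merely losing their ID typing.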

Reserved attribute solutions

This class of solution addresses the problem by making certain attributes always be of type ID. Other attributes could still be of this type, of course. A possible corollary of this is that other attributes are *only* of type ID if declared to be so in a DTD or Schema.

Steal the string "id"

If an attribute is called 'id' it is of type ID. If you want an interoperable IDness then you must call your attribute id. If you previously had attributes called id and you don't want them to be of type ID, change their names.

This is a somewhat brutal solution, especially for any content that already has attributes called 'id' that are not of type ID. On the other hand, a very large percentage of existing XML usage does indeed call its ID attributes 'id'. This solution is an XML language change.
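What a processor would do under this option can be sketched as follows, using Python's xml.etree.ElementTree. The function name index_by_id_name and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def index_by_id_name(root):
    """Map the value of every unqualified 'id' attribute to its element."""
    index = {}
    for elem in root.iter():
        value = elem.get("id")  # matches only attributes with no namespace
        if value is not None:
            # Keep the first element seen; a repeated value would break
            # the uniqueness that IDs are supposed to provide.
            index.setdefault(value, elem)
    return index


doc = ET.fromstring('<order id="o1"><line id="l1"/><line ref="o1"/></order>')
index = index_by_id_name(doc)
print(sorted(index))
```

No declaration is consulted at all, which is what makes the approach interoperable, and also what makes it brutal: any existing attribute called id is swept in regardless of what its vocabulary intended.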

Steal the string "id" if undeclared

In well formed content that does not have a DTD, or that has a partial DTD used for decoration (declaring ID, declaring attribute defaults, etc.), if an attribute is called id and has not been declared in the DTD, it is of type ID. This is slightly better than the 'steal all occurrences' option, in that it does not change any pre-existing declarations. However, it does result in different processing depending on whether the DTD is read or not. At minimum, it would need authoring guidelines that warn against the consequences of the DTD saying that the type of any 'id' attribute is of any other type than ID. In practice, therefore, this is the same as the previous option - it steals all unqualified attributes called 'id'.
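The rule can be sketched with Python's expat bindings, which report declarations from the internal subset. This is only one possible reading of the option; the function name ids_if_undeclared and the sample document are hypothetical.

```python
import xml.parsers.expat


def ids_if_undeclared(xml_bytes):
    """Map ID values to element names: declared IDs plus undeclared 'id'."""
    declared = {}  # (element, attribute) -> declared type
    found = {}     # ID value -> name of the element bearing it

    def on_attlist(elname, attname, attr_type, default, required):
        # In XML 1.0 the first declaration of an attribute is binding.
        declared.setdefault((elname, attname), attr_type)

    def on_start(name, attrs):
        for attname, value in attrs.items():
            attr_type = declared.get((name, attname))
            undeclared_id = attr_type is None and attname == "id"
            if attr_type == "ID" or undeclared_id:
                found.setdefault(value, name)

    parser = xml.parsers.expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.StartElementHandler = on_start
    parser.Parse(xml_bytes, True)
    return found


# note/@id is declared CDATA, so it is left alone; the other two are stolen.
DOC = b"""<!DOCTYPE doc [
<!ATTLIST note id CDATA #IMPLIED> ]>
<doc id="d1"><note id="n1"/><item id="i1"/></doc>"""

print(ids_if_undeclared(DOC))
```

A processor that skipped the internal subset would also treat n1 as an ID, which is the dependence on DTD reading described above.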

Add a predeclared id attribute to the xml namespace

The obvious solution to avoiding a clash with existing usage is to use a namespace-qualified attribute name; and the obvious solution for avoiding a dependence on a DTD is to use the reserved xml namespace, which does not need declaration. A new xml:id attribute (in the same way that xml:base added a predeclared attribute to the existing xml:lang and xml:space attributes), which is predeclared to be of type ID, could be used by any XML vocabulary that wanted interoperability even in the case that DTDs were not being read. Because it is predeclared, it could not clash with whatever a DTD or schema said, so processing would be identical for validating parsers, non-validating but DTD-reading parsers, and non-DTD-reading parsers.

This solution is easy to understand. A disadvantage is that XML vocabulary specifications would need to be revised to use this new attribute name, and the XML specification itself (or a supplemental specification, similar to that for xml:base) would need to describe the functionality. Older content would be no better off, but no worse off either. It preserves backward compatibility and does not affect well-formedness.
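A sketch of how an application might look up elements under this proposal, using Python's xml.etree.ElementTree. The helper index_by_xml_id and the sample document are illustrative; the only fixed fact relied on is that the xml prefix is bound to http://www.w3.org/XML/1998/namespace without declaration.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_ID = f"{{{XML_NS}}}id"  # ElementTree's expanded name for xml:id


def index_by_xml_id(root):
    """Map each xml:id value to the element that bears it."""
    index = {}
    for elem in root.iter():
        value = elem.get(XML_ID)
        if value is not None:
            index.setdefault(value, elem)
    return index


# The unqualified id attribute is untouched by the proposal.
doc = ET.fromstring(
    '<doc><part xml:id="p1" id="not-an-id"/><part xml:id="p2"/></doc>'
)
index = index_by_xml_id(doc)
print(sorted(index))
```

The document needs neither a DTD nor a namespace declaration, and existing unqualified id attributes keep whatever meaning their vocabulary gave them.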

Inline declaration solutions

This class of solutions addresses the problem by adding a new, inline declaration mechanism to XML that can be used without DTDs or as a lightweight replacement for some uses of DTDs. Clearly these are all XML language changes. On the whole they do not affect well formedness.

Add an inline, per-instance ID declaration method

In the same way that xml:base added a predeclared attribute to the existing xml:lang and xml:space attributes, add another one called xml:idAttr. It takes as value the local name of an attribute. All attributes of that name in the per-element partition become of type ID. It may only be used on the root element of the instance. For example:

<root xml:idAttr="foo" foo="x1">
    <subelement foo="a2"/>
</root>

This is fairly easy to explain, as it uses existing XML instance syntax. It does not inadvertently affect any existing content, and minimises the effect on revisions of XML grammars that want to make use of it - no renaming of existing attributes is required. A disadvantage is that the inline declaration might conflict with what the DTD or the Schema says about the type of this attribute. This would at minimum need to be addressed by authoring guidelines. The restriction to the root element hinders composability, which is a pity since a potential benefit of an inline declaration syntax is an increase in composability.

A possible enhancement is to accept either a local name or a QName; if it is a QName, then resolve it to a (namespace URI, local name) pair on the element that has xml:idAttr, and then all attributes with that local name in that namespace are of type ID.
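The per-instance rule, in its local-name form only, might be processed roughly as follows. This is a sketch using Python's xml.etree.ElementTree; xml:idAttr is the proposed attribute from the text, while the function name index_per_instance is an assumption of this example, and the QName enhancement is not handled.

```python
import xml.etree.ElementTree as ET

# Expanded name of the proposed xml:idAttr attribute.
XML_ID_ATTR = "{http://www.w3.org/XML/1998/namespace}idAttr"


def index_per_instance(root):
    """Index elements by the attribute named in the root's xml:idAttr."""
    id_name = root.get(XML_ID_ATTR)
    if not id_name:
        return {}  # no inline declaration, so no IDs from this mechanism
    index = {}
    for elem in root.iter():
        # A bare name matches only unqualified attributes, that is, those
        # in the per-element partition.
        value = elem.get(id_name)
        if value is not None:
            index.setdefault(value, elem)
    return index


doc = ET.fromstring(
    '<root xml:idAttr="foo" foo="x1"><subelement foo="a2"/></root>'
)
index = index_per_instance(doc)
print(sorted(index))
```

Only the root element is consulted for the declaration, which mirrors the restriction described above and shows why a fragment moved into another document would lose its IDs.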

Add an inline, per subtree ID declaration method

In the same way that xml:base added a predeclared attribute to the existing xml:lang and xml:space attributes, add another one called xml:idAttr. It takes as value the local name of an attribute. All attributes of that name in the per-element partition, on that element and its children become of type ID.

It can be used on any element. It can also take the value "" in which case, no attributes on that element or its children are declared to be of type ID (used when composing multiple namespaces). Such declarations are scoped; a new declaration replaces, rather than adding to, the currently in-scope ID attribute. This avoids the situation where an element could have two attributes both of type ID.
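The scoping rules can be sketched as a recursive walk. This example uses Python's xml.etree.ElementTree; the function name index_scoped and the sample document are hypothetical, and it assumes that "children" in the description means all descendants.

```python
import xml.etree.ElementTree as ET

# Expanded name of the proposed xml:idAttr attribute.
XML_ID_ATTR = "{http://www.w3.org/XML/1998/namespace}idAttr"


def index_scoped(elem, in_scope=None, index=None):
    """Index elements by whichever ID attribute name is in scope."""
    if index is None:
        index = {}
    declared = elem.get(XML_ID_ATTR)
    if declared is not None:
        # A new declaration replaces the inherited one; the empty string
        # means that no attribute is of type ID in this subtree.
        in_scope = declared or None
    if in_scope is not None:
        value = elem.get(in_scope)
        if value is not None:
            index.setdefault(value, elem)
    for child in elem:
        index_scoped(child, in_scope, index)
    return index


doc = ET.fromstring(
    '<a xml:idAttr="id" id="a1">'
    '<b xml:idAttr="name" name="b1" id="ignored"/>'
    '<c xml:idAttr="" id="c1"/>'
    '<d id="d1"/>'
    '</a>'
)
index = index_scoped(doc)
print(sorted(index))
```

Because at most one attribute name is in scope at any element, no element can end up with two attributes of type ID, which is the property the replacement rule is meant to guarantee.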

Discussion

Requiring DTD validation to get IDs is too big a retrogressive step; it essentially throws away well formedness as a concept and also XML namespaces, and needlessly conflates validation with decoration.

Requiring W3C XML Schema validation to get IDs is probably too big a forwards step; it adds a lot of machinery to get a simple but crucial step forward and needlessly conflates validation with decoration. However, it is possible that creative use of xsi:type in the instance could mitigate this disadvantage somewhat.

It would seem preferable that, if an inline declaration method is chosen, W3C XML Schema be revised so that the behavior of documents that use the inline declaration and also use a W3C XML Schema is consistent with regard to the attribute declared of type ID in the instance, whether the Schema is used or not (in other words, an implicit declaration in the instance is the same in the PSVI as if the attribute had been declared of type ID in the Schema, except for the part of the PSVI that traces which Schema provided the rule - that part would report that the instance provided the rule).

Most (but not all) attributes called id are of type ID. Most (but not all) attributes of type ID are called id. 100% of single-namespace documents could be brought into conformance with the inline, per-instance ID declaration method by adding a single attribute to the root element. 99% of them would be brought into conformance by adding:

xml:idAttr="id"

to the root element. Crucially, the 1% that do not are still catered for, simply by declaring whatever the attribute is called - a big advantage over other options.

For full generality for multi-namespace documents, the scoped solution would give the greatest expressive power.

The fact that a type ID declared in a DTD is not exactly equivalent to a type ID declared in a W3C XML Schema is noted, but not directly addressed in this discussion. Some solutions would give a DTD IDness, other solutions would give a Schema IDness, and yet other solutions could choose the appropriate definition or perhaps give both definitions.

Some people have suggested that this problem is a specific example of a more general problem: that 'well formed' is a broken concept and should be removed or deprecated, to be replaced with two other conformance classes. One is full infoset processing, where DTDs (including external subsets and external entities) are always fetched, but validation is not done. The second class would require conforming processors to not read the internal DTD subset if present, and to not fetch any external DTD subset. This would give an increase in interoperability, or at least consistency, particularly if document instances had some means of declaring which type of processing they expected or if there were some external XML processing or pipelining work, perhaps as a result of work on XML packaging. The TAG does not take a position on this wider scope, but merely notes that even were these extra two conformance classes to exist, the problem of getting IDness in the lowest (no DTD processing) class would still not be solved.

The desire of some people to have new ID-like types that are, for example, unique among elements of a given type, or unique in a particular namespace, or unique in a given subtree, is also not addressed in this document, although it is appreciated that some of these new types would aid in composability.

Conclusions

This document would benefit from more discussion. No conclusion is presented. This document is a summary made available by the TAG; it is hoped that it lists all the solutions that have been proposed, and their advantages and disadvantages, which should help any group chartered to deal with this problem. The TAG is unhappy with the current situation and would like to see further discussion and convergence on a solution.