Draft TAG Blog Entry on Version Labels in Documents

(This is a draft of a possible blog entry to be issued by the TAG. It is provided in fulfillment of an action item assigned to me at the June 2007 TAG face-to-face meeting. Noah.)

Good practice: Version information

A data format specification SHOULD provide for version information.

So, it's always a good idea when you design a language or data format to include something like a version attribute or some code, probably near the beginning of the document, to indicate what version of the language is being used.

What does a version identifier convey?

In fact, do we even agree on what it means to put something like a language version marker on a document? Let's imagine a simple XML language designed for setting down recipes. In the first version of the language, the markup looks like this:
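A minimal sketch of what such a version 1.0 document might look like. Only the <recipe>, <ingredients>, <steps> and <step> element names and the version attribute come from this entry; the <ingredient> element and the recipe content are illustrative assumptions.

```xml
<!-- Hypothetical version 1.0 recipe; <ingredient> and the text content are assumed -->
<recipe version="1.0">
  <ingredients>
    <ingredient>2 eggs</ingredient>
    <ingredient>1 tablespoon butter</ingredient>
  </ingredients>
  <steps>
    <step>Beat the eggs.</step>
    <step>Melt the butter in a pan and cook the eggs.</step>
  </steps>
</recipe>
```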

The allowed markup in version 1.0 of the recipe language is just what's shown above: an outer <recipe> containing <ingredients> and <steps>, etc. Eventually it's decided that it would be useful to provide optional pictures for ingredients or steps. So in version 2 of the language we can do things like:
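A sketch of a version 2.0 document, assuming that pictures are attached through an optional picture attribute (the attribute name is taken from later in this entry; the <ingredient> element and the example.org picture URLs are hypothetical):

```xml
<!-- Hypothetical version 2.0 recipe; picture URLs and <ingredient> are assumed -->
<recipe version="2.0">
  <ingredients>
    <ingredient picture="http://example.org/pictures/eggs.jpg">2 eggs</ingredient>
    <ingredient>1 tablespoon butter</ingredient>
  </ingredients>
  <steps>
    <step>Beat the eggs.</step>
    <step picture="http://example.org/pictures/pan.jpg">Melt the butter in a pan and cook the eggs.</step>
  </steps>
</recipe>
```

Because the picture attribute is optional, some elements here carry it and others do not.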

Question: let's imagine that version 2 of the language, the one that supports the optional pictures, has been out for a while, but I still want to write a simple recipe with no pictures:
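Such a picture-free recipe might be sketched as follows. The "?" is only a placeholder for the value in question, and the <ingredient> element and recipe content are again assumptions:

```xml
<!-- Hypothetical recipe using no version 2 features; "?" marks the open question -->
<recipe version="?">
  <ingredients>
    <ingredient>1 cup rice</ingredient>
    <ingredient>2 cups water</ingredient>
  </ingredients>
  <steps>
    <step>Bring the water to a boil.</step>
    <step>Add the rice, cover, and simmer until tender.</step>
  </steps>
</recipe>
```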

What's the best value to put in the version attribute? I know that version 2.0 is the latest version of the recipe language. In fact, that's the only version of the specification I have next to me, so maybe I should use that? There's a problem, though. That version="2.0" marker might not work with software that's written to version 1.0, and in fact, my document would otherwise be a fine 1.0 recipe document.

So, maybe I should label it 1.0? Unfortunately, that's a bit hard for me. In general, I would need to know the specifications for every version of the language that's ever existed, so I could pick the oldest one that's OK for my document. If the language has been revised a lot, that's going to be difficult.

Maybe the version attribute should take a list of versions, and I should put in both 1.0 and 2.0? That could be helpful, but I probably won't want to go back and fix it up if someone adds another backwards-compatible change to create a version 3.0 of the language, and at the very least I'll still need to know about all the versions that existed when I wrote my document.
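A list-valued version attribute could be sketched like this. The space-separated syntax is purely an assumption made for illustration; nothing in this entry defines how such a list would be written:

```xml
<!-- Hypothetical: the author asserts the document is valid under both versions -->
<recipe version="1.0 2.0">
  <ingredients>
    <ingredient>1 cup rice</ingredient>
  </ingredients>
  <steps>
    <step>Cook the rice.</step>
  </steps>
</recipe>
```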

The best answer is probably different depending on the language, how often it's revised, whether revisions tend to maintain backwards compatibility, etc.

Is having some sort of version identifier always a good idea?

The Good Practice Note quoted above says that a specification SHOULD "provide for version information", but we've just seen that it isn't always clear what a version indicator in a document actually tells the receiver. Is it still good advice to insist that every instance carry one?

As shown above, it's common for the same instance document to be legal in many versions of a language. As long as such documents are likely to have the same or sufficiently compatible meanings per the different versions, then it may be better to omit any indication of version in the instance, and leave it to the receiving software to decide whether the document can be processed. After all, with the second recipe above, the receiver will soon enough discover that it can or can't process picture attributes, and if not, it either will or won't know that they can be safely ignored. Version attributes can be helpful in giving early warning of incompatibilities, or as a crosscheck for catching errors, but they're usually not essential to correct operation.

One important exception is in the case where the language is likely to change in incompatible ways. If the same document means different things in different versions of a language, then it's very important to indicate which version the author had in mind when creating the document. Putting that version indicator into the document itself is one good way to do it. So maybe the right advice is:

Good practice (revised): Version information

If a language or data format will change in incompatible ways, then indicate the language version used for each instance.

Are namespaces a good way to identify language versions?

If version identifiers aren't always a good bet, what about namespaces? Many modern languages allow the creation of globally unique names, identifiers, tags, etc. In XML this is done through the use of namespaces. In RDF, it's done by using URIs as identifiers.

Sometimes it's appropriate to use new identifiers for each version of a language, and mechanisms like namespaces can make that easier:
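A sketch of how this might look, using the two namespace names that appear in this entry. The r1 and r2 prefixes, the <ingredient> element, the picture URL, and the choice to mix both namespaces in one document are illustrative assumptions:

```xml
<!-- Hypothetical: elements unchanged since version 1 keep the version 1 namespace -->
<r1:recipe xmlns:r1="http://example.org/recipeLanguage1"
           xmlns:r2="http://example.org/recipeLanguage2">
  <r1:ingredients>
    <r1:ingredient>2 eggs</r1:ingredient>
  </r1:ingredients>
  <r1:steps>
    <r1:step>Beat the eggs.</r1:step>
    <!-- Only the version 2 step element is allowed to carry a picture -->
    <r2:step picture="http://example.org/pictures/pan.jpg">Cook the eggs in a buttered pan.</r2:step>
  </r1:steps>
</r1:recipe>
```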

In this example, the element with expanded name {http://example.org/recipeLanguage2, step} allows a picture attribute, but {http://example.org/recipeLanguage1, step} does not.

A full discussion of the pros and cons of using namespaces this way is beyond the scope of this note. One important advantage of using namespaces is that they can be easily applied not just to the root element for the language as a whole, but to mixtures of compound document markup, in which each sublanguage evolves with its own namespaces. Also, because namespace names are URIs, you can use the Web itself to get information about them.

Namespaces do have drawbacks. Imagine if there were 50 different namespaces for a language just because 50 separate bugs had been fixed in different errata. Would you republish all the markup in 50 namespaces? Would each document have lots of namespaces, with each element named with the last namespace in which it had been revised? Namespaces can be very useful for designating language versions, but there's no one idiom that's right for all languages. We note that most widely deployed tag-based languages for the Web (HTML, XML Schema, XSLT) have chosen either to use the same namespace(s) across multiple versions, or in the case of some flavors of HTML, not to use namespaces at all.

Conclusions

So, the TAG is having second thoughts about the suggestion that all data formats SHOULD provide for version identification. Sometimes it's a good thing to do, but sometimes not. Perhaps the right advice will be what's proposed in the revised Good Practice Note above. In any case, the TAG has been working for several years on a finding that will explore in detail many issues relating to versioning, and version attributes are likely to be among the topics covered. In the meantime, we thought we'd take the opportunity to signal that we're not so sure that the advice in the Architecture Document is as good as we thought.

By the way, TAG member David Orchard has covered some of the same topics, as well as many others relating to versioning, in his personal blog. See for example [Dave, please suggest some appropriate links.] Dave is also the principal author of the TAG's draft finding on versioning. New drafts of that come out every few months, and we're hoping to have something more or less complete, well, real soon now.

Version Identifiers Reconsidered
