[Editorial Draft] Extending and Versioning Languages: Terminology

Draft TAG Finding 13 November 2007

This version:
http://www.w3.org/2001/tag/doc/versioning-20071113.html ( xml )
Latest version:
Previous versions:
Unapproved Editors Drafts: http://www.w3.org/2001/tag/doc/versioning-20070917.html, http://www.w3.org/2001/tag/doc/versioning-20070704.html, http://www.w3.org/2001/tag/doc/versioning-20070518.html, http://www.w3.org/2001/tag/doc/versioning-20070326.html, http://www.w3.org/2001/tag/doc/versioning-20061212.html, http://www.w3.org/2001/tag/doc/versioning-20060726.html, http://www.w3.org/2001/tag/doc/versioning-20060717.html, http://www.w3.org/2001/tag/doc/versioning-20060710.html, http://www.w3.org/2001/tag/doc/versioning-20031116.html, http://www.w3.org/2001/tag/doc/versioning-20031003.html
David Orchard, BEA Systems, Inc. <David.Orchard@BEA.com>


This document provides terminology for discussing language versioning. Separate documents contain versioning strategies and XML-language-specific discussion.

Status of this Document

This document has been developed for discussion by the W3C Technical Architecture Group. It does not yet represent the consensus opinion of the TAG.

Publication of this finding does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time.

Additional TAG findings, both approved and in draft state, may also be available. The TAG expects to incorporate this and other findings into a Web Architecture Document that will be published according to the process of the W3C Recommendation Track.

Please send comments on this finding to the publicly archived TAG mailing list www-tag@w3.org (archive).

Table of Contents

1 Introduction
    1.1 Terminology
        1.1.1 Compatibility
        1.1.2 Detailed example of Defined and Accept Text set
        1.1.3 Partial Understanding
        Agents that are Consumers and Producers
        1.1.4 Divergent Understanding and Compatibility
        1.1.5 Open or Closed systems
        1.1.6 Compatibility of languages vs compatibility of applications
2 Conclusion
3 References
4 Acknowledgements


A Change Log (Non-Normative)

1 Introduction

The evolution of languages by adding, deleting, or changing syntax or semantics is called versioning. Making versioning work in practice is one of the most difficult problems in computing. Arguably, the Web rose dramatically in popularity because HTML and HTTP provide effective support for extensibility and versioning. Both systems provide explicit extensibility points and rules for understanding extensions that enable their decentralized extension and versioning.

This finding describes terminology of languages and their versioning.

1.1 Terminology

Suggested terminology for describing languages, producers, consumers, information, constraints, syntax, text, extensibility, compatibility, etc. follows. Let us consider an example. Two or more systems need to exchange information about people's names. Names may not be the perfect choice of example for internationalization reasons, but the example resonates strongly with a very large audience. The Name Language is created so that documents conformant to the Name Language can be exchanged between computer applications. Documents in our context are called texts. [Definition: Text is a sequence of characters or bits.] [Definition: A producer is an agent that creates text.] Continuing our example, Fred is a producer of Name Language text. [Definition: An Act of Production is the creation of text.] A producer produces text with the intent of conveying information. When Fred creates the text, that is an act of production. [Definition: A consumer is an agent that consumes text.] We will use Barney and Wilma as consumers of text. [Definition: An Act of Consumption is the processing of text of a language.] Wilma and Barney consume the text separately from each other in two different consumption events. A consumer is impacted by the instance that it consumes. That is, it interprets that instance and bases future processing, in part, on the information that it believes was present in that instance. Text can be consumed many times, by many consumers, and have many different impacts.

Example 1: Name examples.

name="Dave Orchard" 

<span class="fn">Dave Orchard</span>


[Definition: A Language consists of a set of texts, any constraints on the information, and the mapping between texts and information.] Any particular text may or may not have membership in a language. Indeed, a particular text may be a member of many languages, and there will typically be many different texts that are members of a given language. The texts of a language are the units of exchange between a producer and consumer. [Definition: When a text is the outermost unit of exchange, we call it a document.] Documents, in turn, may use smaller languages internally. For example, a purchase order language may use a name language to represent names. The texts may have regular partitions or divisions in the language. For example, XML has elements, indicated by markup, and attributes, placed inside the markup and indicated by quotes. [Definition: A component is a partition of the language that is a sub-part of the texts and carries a specific meaning.] XML elements and attributes are components. In many languages, [Definition: a term is a component.]

The Name Language consists of a text set that has three terms: name, family, and given. The Name Language specifies syntactic constraints: that a name consists of a given and a family, and that given and family are strings. [Definition: A language may have a syntax that determines the set of strings in the language.] The syntax can be defined intensionally, in machine-processable syntax languages such as XML Schema, microformats, BNF, or regular expressions; extensionally, by just listing the texts that are legal; in human-readable textual descriptions such as HTML descriptions; or by being embodied in software. The total set of syntax constraints on a language determines the strings that qualify for membership in the language.

[Definition: Information is the result of processing, manipulating and organizing a text in a way that adds to the knowledge of the consumer.] (Ednote: cribbed from Wikipedia.) [Definition: The information set of a language is the set of results of processing, manipulating and organizing each of the texts in a language.] In the Name Language, given and family provide information about the given and family names of people. [Definition: A binding is the relationship between the items in the text set and the corresponding items in the information set.] Any potential consumption conveys information from text according to the language's binding. In our example, the binding is obvious and trivial, but in many languages the binding is not. Two languages may have the exact same strings but different information derived from them. In general, the intended information of a vocabulary term is scoped by the language in which the term is found. However, there is some expectation that terms drawn from a given vocabulary will convey consistent information across all languages in which they are used. Confusion often arises when terms have inconsistent information across languages. The Name terms might be used in other languages, but it is generally expected that they will still be "the same" in some meaningful sense.
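As a minimal sketch of how syntax and binding fit together, the following Python assumes an XML rendering of the Name Language (a name element containing a given and a family child). That rendering and the function names is_name_text and bind_name are illustrative assumptions, not something this finding prescribes.

```python
import xml.etree.ElementTree as ET


def is_name_text(text: str) -> bool:
    """Syntax check: a name holds exactly a given followed by a family."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    return root.tag == "name" and [child.tag for child in root] == ["given", "family"]


def bind_name(text: str) -> dict:
    """Binding: map a text in the text set to an item in the information set."""
    if not is_name_text(text):
        raise ValueError("text is not a member of the (assumed) Name Language")
    root = ET.fromstring(text)
    return {"given": root.findtext("given"), "family": root.findtext("family")}


# Fred produces a text; a consumer applies the binding to obtain information.
info = bind_name("<name><given>Dave</given><family>Orchard</family></name>")
```

Keeping the syntax check separate from the binding mirrors the distinction drawn above: two languages could share is_name_text yet differ in bind_name.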

These terms and their relationships are shown below:

Diagram of language terms

The diagram shows a simplification where the producer and consumer are relationships from agent rather than separate entities.

We say that Fred engages in an Act of Production that results in a Name Instance with respect to Name Language V1. The Name Instance is in the set of Name V1 Texts, that is the set of strings in the Name Language V1. The production of the Name Instance has the intent of conveying Information, which we call Information 1. This is shown below:

Production instance

We say that Barney engages in an Act of Consumption of a Name Instance with respect to Name Language V1. The consumption of the Name Instance has the impact of conveying Information 1. This is shown below:

Production and consumption instance

Versioning is an issue that affects almost all applications eventually. Whether it is a processor styling documents in batches to produce PDF files, Web services engaged in financial transactions, or HTML browsers, the language and instances will likely change over time. The versioning policies for a language, particularly whether the language is mutable or immutable, should be specified by the language owner. Versioning is closely related to extensibility, as extensible languages may allow instances that are allowed by the language but whose terms are not defined by the language. Applications may receive strings of a language that were produced using a version of the language different from the version(s) that the receiver was expecting.

If a Name Language V2 exists, with its set of strings and Information set, Wilma may consume the same Name Instance but with respect to the Name Language V2 and have impact of Information 2. Name Language V2 relates to V1 by relationship r2, which is forwards compatible when language V1 allows language V2 instances, and backwards compatible when language V2 allows language V1 instances. Similarly, Information 2 - as conveyed by Consumption 2 - relates to Information 1 - as conveyed by Consumption 1 - by relationship r1.

Production and 2 Consumptions Instance

Extensibility is a property that enables evolvability of software. It is perhaps the biggest contributor to loose coupling in systems as it enables the independent and potentially compatible evolution of languages. [Definition: A language is Extensible if the syntax of a language allows information that is not defined in the current version of the language and provides mappings from documents in any extended set to documents that are already defined.] The Name Language is extensible if it can include terms that aren't defined in the language, like a new middle term, and specifies that the middle term should be ignored.

1.1.1 Compatibility

As languages evolve, it is possible to speak of backwards and forwards compatibility. A language change is backwards compatible if consumers of the revised language can correctly process all instances of the unrevised language. Backwards compatibility means that a newer version of a consumer can be rolled out in a way that does not break existing producers. A producer can send a text per the unrevised version of a language to a consumer that understands the revised language and still have the text successfully processed. A software example is a word processor at version 5 being able to read and process version 4 documents. A schema example is a schema at version 5 being able to validate version 4 documents. In the case of Web services, this means that new Web services consumers, ones designed for the new version, will be able to process all messages in the old language.

A language change is forwards compatible if consumers of the unrevised language can correctly process all instances of the revised language. Forwards compatibility means that a newer version of a producer can be deployed in a way that does not break existing consumers. A producer can send a text per the revised version of a language to a consumer that understands the unrevised language and still have the text successfully processed. Of course, the older consumer will not implement any new behavior. A software example is a word processor at version 4 being able to read and process version 5 documents. A schema example is a schema at version 4 being able to validate version 5 documents. In the case of Web services, this means that existing Web service consumers, designed for a previous version of the language, will be able to process all messages in the new language.

In general, backwards compatibility means that existing texts can be used by updated consumers, and forwards compatibility means that newer texts can be used by existing consumers. Another way of thinking of this is in terms of message exchanges. A backward compatible change to a language enables consumers of the updated language to be deployed without having to update producers. A forward compatible change to a language allows producers of the updated language to be deployed without having to update consumers, shown below:

Example 2: Evolution of Producers and/or Consumers
Versioning Graphic

We need to be more precise about which parts of a language are compatible with which other parts. Every language has a Defined Text set, which contains only the texts explicitly defined by the language syntax constraints. Typically, a language will define a mapping from each of the definitions to information conveyed by instances of those definitions. Each language has an Accept Text set, which contains the texts that are allowed by the language constraints. Typically, the Accept Text set contains texts that are not in the Defined Text set. The texts that are in the Accept Text set but not in the Defined Text set form the Unknown Text set. The mapping of the Unknown Text set to a Defined Text for purposes of determining the Information is specified by the language. For example, consider a language whose syntax says that a name consists of a given followed by a family followed by anything. A text that consists of a name with only a given and a family falls in both the Defined Text set and the Accept Text set. A text that consists of a name with a given, a family, and an extension such as a middle falls in the Accept Text set and the Unknown Text set but not the Defined Text set. By definition, the Accept Text set is a superset of the Defined Text set, and the Defined Text set and the Unknown Text set are disjoint.
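The three sets can be illustrated with a small classifier. This is only a sketch: it assumes an XML rendering of the "given followed by family followed by anything" syntax, and the function name classify and its return labels are hypothetical.

```python
import xml.etree.ElementTree as ET


def classify(text: str) -> str:
    """Place a candidate text in the Defined, Unknown, or neither Text set."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "rejected"
    tags = [child.tag for child in root]
    # The Accept Text set: a name that starts with given, then family.
    if root.tag != "name" or tags[:2] != ["given", "family"]:
        return "rejected"
    # Exactly given and family is Defined; anything after them is Unknown.
    return "defined" if len(tags) == 2 else "unknown"


plain = "<name><given>Dave</given><family>Orchard</family></name>"
extended = "<name><given>Dave</given><family>Orchard</family><middle>M</middle></name>"
broken = "<name><family>Orchard</family></name>"
```

Texts classified as "defined" or "unknown" together make up the Accept Text set; the two labels never overlap, which matches the disjointness stated above.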

We have discussed backwards and forwards compatibility in general, but there are other flavours of compatibility, based upon compatibility between the Accept Text set, Defined Text set and Information conveyed. Syntactic compatibility is compatibility with respect to the Texts only, not the information conveyed. Because languages have Accept and Defined Text sets, some producers will adhere to the Defined Text set, and others will adhere to the Accept Text set. Compatibility with Producers that produce only Defined Text sets is called "strict" compatibility. Compatibility with Producers that may produce Texts in the Accept Text Set, either in the Defined Text set or the Unknown Text set, is called "full" compatibility.

A more precise definition of compatibility is with respect to the texts, that is, whether all the texts in one language are also texts in another language. Another precise form of compatibility is with respect to the information conveyed, that is, whether the information conveyed by a text in one language is conveyed by the same text interpreted in another language. The texts could be compatible while the information conveyed is not. For example, the same text could mean different and incompatible things in the different languages. Most systems have different layers of software, each of which can view a text differently and affect compatibility. For example, the XML Schema PSVI view is different from the actual text. We can also differentiate between language compatibility and application compatibility. While it is often the case that they are directly related, sometimes they are not; that is, two languages may be compatible but an application might be incompatible with one of them.

We provide mathematical definitions of a text's compatibility based upon our terminology. Ednote: there are two alternatives for the validity rules provided; reader preference is solicited.

  • Let L1 and L2 be Languages, where L2 is introduced "after" L1.

  • Let T be a text.

  • T is in L1 iff (T is valid per L1 | T is in L1's set of Texts).

  • Let I1 be the information conveyed by Text T per language L1.

  • Let I2 be the information conveyed by Text T per language L2.

  • Text T is "completely compatible" with language L2 if and only if I1 is compatible with I2 and (T is valid per L2 | T is in L2's set of Texts).

  • I1 is compatible with I2 if none of the information in I1 replaces or contradicts any information in I2.

  • Text T is incompatible if any of the information in I2 is wrong (i.e. replaces a value in I1 with a different one) | (T is invalid per L2 | T is not in L2's set of Texts).

We can also provide mathematical definitions of language compatibility:

  • L2 is "fully backwards compatible" with L1 if every text in L1 Accept Text set is fully compatible with L2.

  • L2 is "strictly backwards compatible" with L1 if every text in L1 Defined Text set is fully compatible with L2.

  • L2 is "strictly backwards incompatible" with L1 if any text in L1 Defined Text set is incompatible with L2.

  • L1 is "fully forwards compatible" with L2 if every text in L2 Accept Text set is fully compatible with L1.

  • L1 is "strictly forwards compatible" with L2 if every text in L2 Defined Text set is fully compatible with L1.

  • L1 is "forwards incompatible with L2" if any text in L2 is incompatible with L1.

  • And combined together using the term completely compatible to mean backwards and forwards compatible is: L1 is completely strictly compatible with L2 if every text in L2 Defined Text set is fully compatible with L1 AND if every text in L1 Defined Text set is fully compatible with L2.
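The text-level and language-level definitions above can be exercised over small finite sets. The sketch below makes several assumptions that are not in this finding: information is modelled as a dictionary, "does not replace or contradict" is read as "no shared key has a different value", texts are space-separated strings, and the names Language, binder, v1, v2 and v3 are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, NamedTuple

Info = Dict[str, str]


class Language(NamedTuple):
    defined: FrozenSet[str]      # Defined Text set
    accept: FrozenSet[str]       # Accept Text set (superset of defined)
    bind: Callable[[str], Info]  # information conveyed by an accepted text


def text_compatible(text: str, source: Language, target: Language) -> bool:
    """True if text, written per source, is compatible with target."""
    if text not in target.accept:
        return False
    i_source, i_target = source.bind(text), target.bind(text)
    # No value conveyed under source may be replaced by a different one.
    return all(i_target[k] == v for k, v in i_source.items() if k in i_target)


def strictly_backwards_compatible(l2: Language, l1: Language) -> bool:
    return all(text_compatible(t, l1, l2) for t in l1.defined)


def fully_backwards_compatible(l2: Language, l1: Language) -> bool:
    return all(text_compatible(t, l1, l2) for t in l1.accept)


def strictly_forwards_compatible(l1: Language, l2: Language) -> bool:
    return all(text_compatible(t, l2, l1) for t in l2.defined)


def fully_forwards_compatible(l1: Language, l2: Language) -> bool:
    return all(text_compatible(t, l2, l1) for t in l2.accept)


def binder(*terms: str) -> Callable[[str], Info]:
    """Bind whitespace-separated words to the given terms, ignoring extras."""
    return lambda text: dict(zip(terms, text.split()))


both = frozenset({"Dave Orchard", "Dave Orchard M"})
v1 = Language(frozenset({"Dave Orchard"}), both, binder("given", "family"))
v2 = Language(both, both, binder("given", "family", "middle"))
# v3 accepts the same strings but reads them family-first.
v3 = Language(both, both, binder("family", "given", "middle"))
```

Here v2 is fully backwards compatible with v1 and v1 is fully forwards compatible with v2, while v3 is strictly backwards incompatible with v1 even though every v1 text is also a v3 text: the texts match but the information conveyed does not.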

The relationships between L1 and L2 Defined Text sets and Accept Text sets for a compatible change are shown below:

Example 3: Compatible evolution of L1 to L2 showing Defined and Accept Text sets
Versioning Language bullet diagram

We can draw a few conclusions. Given that L2 is strictly backwards incompatible with L1 if any text in L1 Defined Text set is incompatible with L2, the only way that L2 can be backwards compatible with L1 is if the L2 Defined Text set is a superset of the L1 Defined Text set. Roughly, that means the addition of optional items in L2. Given that L1 is "fully forwards compatible" with L2 if every text in L2 Accept Text set is fully compatible with L1, the only way that L1 can be forwards compatible with L2 is if the L1 Accept Text set is a superset of the L2 Accept Text set. Roughly, that means L1 allows all of L2 and more. It is this superset relationship that is a key to forwards compatibility: the allowing of texts by L1 that will become defined in L2.

Compatibility can be restated in terms of superset/subset relationships.

  • [Definition: Language L2 is strictly backwards compatible with Language L1 if L2 Defined Text set > (superset) L1 Defined Text Set AND every text in L1 Defined Text set is compatible with L2.]

  • [Definition: Language L1 is strictly forwards compatible with Language L2 if L1 Accept Text set > (superset) Language L2 Accept Text set AND every text in L2 Accept Text set is compatible with L1.]

  • Language L2 is completely strictly compatible with Language L1 if L1 Accept Text set > (superset) Language L2 Accept Text set > (superset) L2 Defined Text set > (superset) L1 Defined Text Set AND every text in L1 Defined Text set is compatible with L2 AND every text in L2 Accept Text set is compatible with L1.
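The set-relationship half of these definitions maps directly onto set comparison. The following sketch checks only the superset chain; it deliberately leaves out the "every text is compatible" clauses, which also depend on the information conveyed. The function name and the sample texts are hypothetical.

```python
def satisfies_superset_chain(l1_defined, l1_accept, l2_defined, l2_accept) -> bool:
    """Check L1 Accept >= L2 Accept >= L2 Defined >= L1 Defined."""
    return l1_accept >= l2_accept >= l2_defined >= l1_defined


l1_defined = {"given family"}
l1_accept = {"given family", "given family middle", "given family suffix"}

# L2 promotes one previously Unknown text into its Defined Text set.
l2_defined = {"given family", "given family middle"}
l2_accept = set(l1_accept)

# L3 reorders the terms, so none of its texts were accepted by L1.
l3_defined = {"family given"}
l3_accept = {"family given"}

compatible = satisfies_superset_chain(l1_defined, l1_accept, l2_defined, l2_accept)
incompatible = satisfies_superset_chain(l1_defined, l1_accept, l3_defined, l3_accept)
```

Passing this check is necessary but not sufficient for complete strict compatibility, since two languages can share texts and still convey contradictory information.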

We have shown that forwards and backwards compatibility is achievable through extensibility where the Accept Text set allows extensions to the Defined Text set, and compatible versioning is a process of gradually increasing the Defined Text Set, reducing or not changing the Accept Text Set, and ensuring the information conveyed is compatible. If ever the set relationships defined earlier do not hold, then the versions are not compatible.


An article on xml.com describes this theory of compatibility and provides graphical representation of the set relationships, at http://www.xml.com/pub/a/2006/12/20/a-theory-of-compatible-versions.html. Should part/all of the article be included in this document?

Composition

Many languages are compound languages consisting of multiple languages. For example, a purchase order language could use the name language for names. The forwards, backwards and full compatibility definitions are true for composition of languages because the composed language's Defined and Accept Text sets are incorporated into the compound language's Defined and Accept Text sets. For example, the purchase order language Accept Text set is the Accept Text set defined by just the Purchase Order language plus the Accept Text set of any of the items defined OR used by the Purchase Order language, which includes the Accept Text set of the name language.

1.1.2 Detailed example of Defined and Accept Text set

The following is a detailed example of the use of the Defined and Accept Text sets. The example is motivated by HTML and an extension, styled by CSS. We will use two mythical languages, PHTML and PCSS (pretend HTML and pretend CSS). PHTML is a language of well-formed XML documents that must have a root tag <PHTML>. A few HTML-like tags, notably <P> for paragraph and <BODY> for body, are defined in the PHTML language version 1 specification. As with HTML, PHTML allows for the appearance of arbitrary tags such as <BANANA> not named explicitly in the specification. PHTML defines any document containing such an extension tag to have the same semantics as a similar document from which that tag has been deleted. Thus, per the PHTML specification, the following two documents have the same meaning:

Example 4: doc1.phtml: PHTML without Banana element
<PHTML>
  <BODY>
    <P>Versioning is hard.</P>
  </BODY>
</PHTML>


Example 5: doc2.phtml: PHTML with Banana element
<PHTML>
  <BODY>
    <BANANA><P>Versioning is hard.</P></BANANA>
  </BODY>
</PHTML>

The invariant in our definitions of Defined Text set and Accept Text set is that every document in the Accept Text set conveys the same meaning as some particular document in the Defined Text set. doc1.phtml above is in the Defined Text set for PHTML, because all of its content has a meaning supplied directly by the specification. doc2.phtml is in the Accept Text set; that document too is in the PHTML language, but the semantics of doc2.phtml are defined by means of its equivalence to a Defined Text Set document, doc1.phtml.
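The equivalence rule can be sketched as a function that computes a document's meaning with extension tags deleted. The markup of the two sample documents, the set DEFINED_TAGS, and the function names are assumptions made for illustration; the finding does not give a complete PHTML specification.

```python
import xml.etree.ElementTree as ET

# Tags assumed to be defined by PHTML version 1 (hypothetical list).
DEFINED_TAGS = {"PHTML", "BODY", "P"}


def content(elem):
    """List an element's content, deleting unknown tags but keeping what they hold."""
    items = []
    if elem.text and elem.text.strip():
        items.append(elem.text.strip())
    for child in elem:
        if child.tag in DEFINED_TAGS:
            items.append((child.tag, content(child)))
        else:
            items.extend(content(child))  # tag deleted, content kept
        if child.tail and child.tail.strip():
            items.append(child.tail.strip())
    return items


def meaning(text: str):
    """Map any accepted document to the structure of its Defined Text equivalent."""
    root = ET.fromstring(text)
    return (root.tag, content(root))


doc1 = "<PHTML><BODY><P>Versioning is hard.</P></BODY></PHTML>"
doc2 = "<PHTML><BODY><BANANA><P>Versioning is hard.</P></BANANA></BODY></PHTML>"
same_meaning = meaning(doc1) == meaning(doc2)
```

Under these assumptions the document with the BANANA wrapper and the one without it reduce to the same structure, which is the sense in which an Accept Text set document borrows its meaning from a Defined Text set document.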

Continuing our example further, we allow our pretend CSS language to style the markup in PHTML documents, and crucially, the styles can be applied to <BANANA> elements as well as to paragraphs.

Example 6: doc3.phtml: PHTML with Banana element and PCSS Style
<PHTML>
  <HEAD>
    <STYLE type="text/pcss">
      P {font-size: 120%}
      BANANA {color:yellow}
    </STYLE>
  </HEAD>
  <BODY>
    <BANANA><P>Versioning is hard.</P></BANANA>
  </BODY>
</PHTML>

The paragraph will have a large font and will be yellow. PHTML as redefined by PCSS is a different language than PHTML on its own, and all of the legal strings (texts) in that new language are in its Defined Text set -- the Accept Text set is the empty set. The PCSS specification is the one that gives a non-vacuous meaning to <BANANA> elements. In the presence of PCSS, a document with a <BANANA> is no longer equivalent to one without. Thus, according to PHTML as redefined by PCSS, all PHTML documents are in the Defined Text set, and none are in the Accept Text set.

Now imagine that the PHTML language is revised such that <BANANA> elements are added to the language, and this is the only change to the language. The element has a restriction that it cannot contain <PHTML>, <HEAD>, or <BODY> elements. We will call the new language PHTML2. The PHTML2 Defined Text set is a superset of the PHTML Defined Text set because it has added the <BANANA> element. The analysis of the Accept Text set size is more complicated. If PHTML does not allow <PHTML>, <HEAD>, or <BODY> elements to be children of <BODY> elements, then PHTML2 does not effectively add any restrictions. This is because a PHTML document with <PHTML>, <HEAD>, or <BODY> elements as children of <BANANA> elements is considered equivalent to a PHTML document with <PHTML>, <HEAD>, or <BODY> elements as children of <BODY> elements. If PHTML does not allow such children, then the PHTML2 Accept Text set is equal to the PHTML Accept Text set. Given a PHTML2 Defined Text set that is a superset of the PHTML Defined Text set, and PHTML2 and PHTML Accept Text sets that are equal, PHTML2 is fully compatible with PHTML. On the other hand, if PHTML does allow such children, then the PHTML2 Accept Text set is a subset of the PHTML Accept Text set. Given a PHTML2 Defined Text set that is a superset of the PHTML Defined Text set and a PHTML2 Accept Text set that is a subset of the PHTML Accept Text set, PHTML2 is strictly compatible with PHTML.

A producer of PHTML documents will be able to generate documents that are consumable by a PHTML2 receiver if the PHTML documents do not have any syntax that is now illegal under PHTML2. PHTML producers will only produce PHTML documents that are not consumable by PHTML2 receivers if the PHTML document contains <BANANA> elements, the language revision disallows certain <BANANA> children, and the PHTML document has <BANANA> elements that violate that restriction. In the example documents shown above, doc2.phtml and doc3.phtml are valid PHTML2 documents regardless of whether PHTML2 allows or disallows <PHTML>, <HEAD>, or <BODY> as children of <BANANA> elements.

1.1.3 Partial Understanding

So far, we have defined compatibility over all possible texts of a language and we’ve been discussing compatibility of the entire Defined and Accept Text sets. However, there are many scenarios where a consumer may consume only part of the information set. Such partial understanding affects the Text set used and the Information conveyed. Partial understanding usually results in a subset of the information being conveyed, because only part of the information is processed and understood by the consumer. Interestingly, such partial understanding consists of supersetting of the Accept Text Set and a parallel subsetting of the Defined Text Set. This is because the process of extracting a part of the text means that extra content, even that which was illegal under the earlier version’s syntax, becomes part of the Accept Text Set.

An example is an application that looks only at given names and ignores everything else. My favourite example of this is a "Baby Name" Wizard. The application may use an XPath expression to extract the given from inside a name. The result is effectively a different version of the Name Language, which we will call the Given Name Language. We define the Accept Text set for the Given Name Language to be anything followed by a given followed by anything, or (*,given,*). The Defined Text set for the Given Name Language is given. The information set for the Given Name Language contains just given. Because the Given Name Language syntax is more relaxed than that of the Name Language V1, an addition of the middle name between the given and family is a compatible change for the Given Name Language. There are a variety of other now-acceptable names in the Given Name Language.
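A consumer of the Given Name Language might look like the sketch below. It assumes an XML rendering of names and uses the limited XPath support in Python's xml.etree.ElementTree; the function name given_name and the sample texts are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def given_name(text: str) -> Optional[str]:
    """Extract only the given name, ignoring everything around it."""
    root = ET.fromstring(text)
    node = root.find(".//given")  # first given element below the root
    return node.text if node is not None else None


v1_text = "<name><given>Dave</given><family>Orchard</family></name>"
with_middle = "<name><given>Dave</given><middle>M</middle><family>Orchard</family></name>"
with_more = "<name><title>Mr</title><given>Dave</given><family>Orchard</family></name>"
```

All three texts yield the same information for this consumer, which is why inserting a middle between given and family is a compatible change for the Given Name Language even though it is not one for the Name Language V1.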

The definitions of compatibility and language with respect to versioning need not change to deal with partial understanding. Partial understanding of a language can be thought of as the creation of a new Language L1' that is fully compatible with Language L1. This is true if L1' Accept Text set > (superset) Language L1 Accept Text set > (superset) L1 Defined Text set > (superset) L1' Defined Text Set AND every text in L1' Defined Text set is fully compatible with L1 AND every text in L1 Accept Text set is fully compatible with L1'.

Interestingly, defining a language for partial understanding is creating a language L1' whose Defined Text set is a subset of the Defined Text set of L1, such that L1 is compatible with L1'. There may be many different versions that are all partial understandings of a language. We call these related languages "flavours". It may be very difficult for a language designer to know how many different language flavours are in existence. However, a language designer can sometimes use the different flavours to their advantage in designing for a mixture of compatible and incompatible changes. Some changes could be compatible with some flavours but incompatible with others. It may be very useful to have some changes be compatible with some flavours, since consumers of those flavours do not need to be updated or changed.

It is crucial to point out that a consumer of the language should not produce a partial version of the language. A consumer may have relaxed the restrictions on the consuming side, but no producer should do so on the production side. If a flavour of a language were also used for production, the producer would have to create an instance that is valid according to the Language L1 rules, not the Language L1' rules. Perhaps the only exception is a producer that is guaranteed to be producing for compatible flavours. Typically this is not the case, and it is hard to determine, so the safest course is to produce according to the Language L1 rules.

We have shown how relaxing the syntax constraints on a language when consuming instances of it can turn an otherwise incompatible change into a compatible change. We have also shown that abiding by the language syntax constraints when producing instances is the safest course. The Internet robustness principle states this more eloquently: "be conservative in what you do, be liberal in what you accept from others" [tcp].

We will call this style of versioning the "liberal" style of versioning. The "liberal" style of versioning is codified in:

Good Practice

Use Least Partial Languages for "liberal" versioning: Consumers should use a flavour of a language that has the least amount of understanding.

The flavor of a language that implies the smallest amount of understanding, which is the smallest Defined Text set, will also be the most liberal and have the largest possible Accept Text set.

The "liberal" style of versioning has a drawback, however. It can lead to fragile software that is hard to evolve, because the "liberal"ness is difficult to code. In addition, it does not force producers to be correct in what they produce and can lead to a vicious cycle of complexity.


More Information on pros/cons needed. Perhaps change best practices to positions, then make best practices for the different scenarios, ie. Use Conservative Versioning for languages that are only processed by machines, Use Liberal Versioning for languages processed typically by humans or....

There is an opposite style of versioning that says the most effective way of evolving is to force producers to be correct by having strict consumers. We will call this the "conservative" style of versioning. The "conservative" style of versioning is codified in:

Good Practice

Use Only Full Languages for "conservative" versioning: Consumers should fully use and validate a language.

The greatest amount of understanding of a language will find the largest number of errors in producers.
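The difference between the two consumer styles shows up when a producer is faulty. The sketch below assumes an XML rendering of the Name Language and a producer that wrongly emits family before given; the function names and the sample text are hypothetical.

```python
import xml.etree.ElementTree as ET


def conservative_consume(text: str) -> dict:
    """Validate the full language and reject anything that deviates."""
    root = ET.fromstring(text)
    tags = [child.tag for child in root]
    if root.tag != "name" or tags != ["given", "family"]:
        raise ValueError(f"invalid Name text: unexpected structure {tags}")
    return {"given": root[0].text, "family": root[1].text}


def liberal_consume(text: str) -> dict:
    """Use the least understanding needed: only the given name."""
    root = ET.fromstring(text)
    return {"given": root.findtext(".//given")}


# A faulty producer emits the terms in the wrong order.
faulty = "<name><family>Orchard</family><given>Dave</given></name>"

liberal_result = liberal_consume(faulty)
try:
    conservative_consume(faulty)
    fault_detected = False
except ValueError:
    fault_detected = True
```

The liberal consumer keeps working and the producer's fault goes unnoticed; the conservative consumer surfaces the fault immediately, at the cost of rejecting a text it could have partially understood.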

Whether "liberal" or "conservative" versioning is in use by consumers, the advice to producers is always the same:

Good Practice

Produce only full languages: Producers should produce the complete version of a language. They should never produce partial flavors.

"Liberal" consumers may allow correct operation with producers that aren't fully compatible with a full language. "Conservative" consumers will be less tolerant of faults in producers.

EdNote: I think related to principle of least power (http://www.w3.org/2001/tag/doc/leastPower) . The lower the power of the language, the easier to have partial understanding? For example, XPath is "lower power" than Java processing the DOM.

Agents that are Consumers and Producers

Our compatibility definitions are about the relationship between texts and different languages' Text sets. We have avoided compatibility definitions based upon processors. However, many messaging specifications, such as Web Services, utilize both inputs and outputs. Such processors are both consumers and producers of texts. Using our definitions of compatibility, a Web service that updates its output message language is considered to be a newer producer, because it is sending a text according to the newer version of the message. Conversely, revising the input message language makes a service a newer consumer, because it is consuming messages that are in the newer version of the language.

All systems that include inputs and outputs must consider both when making changes and determining compatibility. For full compatibility, a revised consumer or producer must ensure any input message changes are backwards compatible and any output message changes are forwards compatible.
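The rule for agents that both consume and produce can be sketched with finite sets of texts. This is a simplification and an assumption: it models each language version only by its Defined Text set and Accept Text set and ignores the information that each text conveys. The `Language` class, the function names, and the sample texts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    """A toy language version modelled as two finite sets of texts."""
    defined: frozenset  # texts the version fully defines
    accept: frozenset   # texts a consumer of this version will process


def backwards_compatible(old: Language, new: Language) -> bool:
    """A newer consumer accepts every text an older producer may send."""
    return old.defined <= new.accept


def forwards_compatible(old: Language, new: Language) -> bool:
    """An older consumer accepts every text a newer producer may send."""
    return new.defined <= old.accept


def service_revision_is_compatible(old_in, new_in, old_out, new_out) -> bool:
    """Input changes must be backwards compatible, output changes forwards compatible."""
    return backwards_compatible(old_in, new_in) and forwards_compatible(old_out, new_out)


# Hypothetical input and output message languages of a service, before and after revision.
in_v1 = Language(frozenset({"order"}), frozenset({"order"}))
in_v2 = Language(frozenset({"order", "order+note"}), frozenset({"order", "order+note"}))
out_v1 = Language(frozenset({"ok"}), frozenset({"ok", "ok+id"}))
out_v2 = Language(frozenset({"ok", "ok+id"}), frozenset({"ok", "ok+id"}))

print(service_revision_is_compatible(in_v1, in_v2, out_v1, out_v2))
```

In this sketch the revision is compatible only because existing clients of the output language already accept the text `ok+id`. An output revision that introduced a text outside the older Accept Text set would fail the forwards compatibility check.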

1.1.4 Divergent Understanding and Compatibility

Our treatise so far has described a fairly straightforward evolution of a language, from a first version to a next version. However, extensibility and interoperability are closely related. It is an axiom in computing that the lower the optionality (which includes extensibility), the higher the chance of interoperability. Each and every place where extensibility is allowed in a language is also a place where a lack of interoperability can arise. Interoperability problems can arise, for example, when producers and consumers do not agree on which version of a language is being used in a text.

Even though a language has a formal definition and extensibility model, there is no guarantee that software that processes it will implement it exactly. Differences in understanding between software agents are a significant source of divergence. A classic example of this is the so-called "tag soup" problem in HTML. Many of the applications commonly used to process HTML, particularly browsers, have an Accept Text set that is larger than the formal definition of the HTML language. For example, many situations of interleaved opening and closing of elements in HTML are processed without generating an error. This ensures that the user experience, at least in the short term, is of higher quality. However, it causes long-term interoperability problems when the illegal texts are copied by mechanisms such as "view source". The reason is that the more undocumented strings there are in an Accept Text set, the more difficult it is to achieve interoperability. The more liberal an agent is in accepting texts, that is, the more it enlarges its Accept Text set beyond the definition of the language, the more difficult interoperability becomes, because not every agent may have the same Accept Text set. Further, not every agent may have the same Defined Text set. For example, there could be different DOMs for the same text.
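The effect of divergent Accept Text sets can be illustrated with a small set calculation. The agents and texts below are invented for illustration; real HTML processors are far more complex. Only texts in the intersection of all Accept Text sets are safe to exchange between every agent.

```python
# Hypothetical formal definition: a single well-nested text.
formal = {"<b><i>x</i></b>"}

# Hypothetical agents whose Accept Text sets extend the formal definition differently.
accept = {
    "agent_a": formal | {"<b><i>x</b></i>"},           # tolerates interleaved closing tags
    "agent_b": formal | {"<b><i>x</b></i>", "<b><i>x"},  # also tolerates missing closing tags
    "agent_c": formal,                                  # accepts only the formal definition
}

# Texts every agent accepts: the only ones that interoperate everywhere.
interoperable = set.intersection(*accept.values())

# Texts accepted by some agents but not by all: the source of divergence.
at_risk = set.union(*accept.values()) - interoperable

print(sorted(interoperable))
print(sorted(at_risk))
```

In this sketch the interoperable set collapses back to the formal definition, while each undocumented text accepted by only some agents becomes an interoperability risk as soon as a producer starts to emit it.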

At the other extreme is XML. XML allows almost no extensibility in its constructs. Name characters, tag closures, attribute quoting and allowed attribute values are all tightly fixed. This has increased interoperability between implementations of XML. However, it has also made it very difficult to move to XML 1.1, because the lack of extensibility makes almost all changes incompatible. The XML language design was very specifically trying to avoid the HTML "tag soup" problem, and it has arguably done that, at the cost of an inability to version. These two extremes of extensibility design exist because of deliberate design choices. The trade-off between extensibility, interoperability and the Accept Text set was planned in advance. Language designers should do the same with their languages.


Which references to support this? TAG issue: http://www.w3.org/2001/tag/issues#TagSoupIntegration-54

Good Practice

Analyze Trade-offs for Language: Language designers should analyze the trade-offs between extensibility, interoperability, and the language's actual Accept Text set.

1.1.5 Open or Closed systems

The cost of changes that are not backward or forward compatible is often very high. Much of the software that uses the affected language must be updated to support the newer incompatible version. The cost of such changes is related to whether the system in question is open or closed.

[Definition: A closed system is one in which all of the producers and consumers are more-or-less tightly connected and under the control of a single organization.] Closed systems can often provide integrity constraints across the entire system. A traditional database is a good example of a closed system: all of the database schemas are known at once, all of the tables are known to conform to the appropriate schema, and all of the elements in each row are known to be valid for the schema to which the table conforms.

From a versioning perspective, it might be practical, in a closed system, to say that a new version of a particular language is being introduced into the system at a specific time. At that time, all of the data that conforms to the previous version of the language is migrated to conform to the new version as part of the upgrade process.
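A closed-system upgrade of this kind can be sketched as a single migration pass over all stored data. The record layout, the `version` field, and the change from version 1 to version 2 (splitting `name` into `given` and `family`) are hypothetical and chosen only for illustration.

```python
def migrate_v1_to_v2(record: dict) -> dict:
    """Hypothetical change: version 2 splits 'name' into 'given' and 'family'."""
    given, _, family = record["name"].partition(" ")
    return {"version": 2, "given": given, "family": family}


def upgrade_store(store: list) -> list:
    """Migrate every version 1 record so that the whole store conforms to version 2."""
    migrated = [migrate_v1_to_v2(r) if r.get("version") == 1 else r for r in store]
    # In a closed system the organization can enforce this constraint everywhere at once.
    if not all(r.get("version") == 2 for r in migrated):
        raise ValueError("store contains records that are neither version 1 nor version 2")
    return migrated


store = [
    {"version": 1, "name": "Pat Smith"},
    {"version": 2, "given": "Lee", "family": "Jones"},
]
print(upgrade_store(store))
```

After the pass, no consumer in the system needs to understand version 1 any longer, which is what makes an incompatible change affordable in a closed system.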

[Definition: An open system is one in which some producers and consumers are loosely connected or are not controlled by the same organization. The internet is a good example of an open system.]

In an open system, it is often not practical to handle language evolution with universal, simultaneous, atomic upgrades to all of the affected software components. Existing producers and consumers, who are outside the immediate control of the organization that is publishing an updated language, may continue to use the previous version for a considerable period of time.
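In an open system a consumer typically has to keep processing the previous version alongside the new one. The following sketch assumes a hypothetical `version` field in each text and a hypothetical change in which version 2 replaces a single `name` field with `given` and `family`; neither comes from any real language.

```python
def read_name(text: dict) -> tuple:
    """Return (given, family) from a text in either supported version."""
    version = text.get("version", 1)  # assumption: texts without a marker are version 1
    if version == 1:
        given, _, family = text["name"].partition(" ")
        return given, family
    if version == 2:
        return text["given"], text["family"]
    raise ValueError(f"unsupported version: {version}")


# Producers outside the publisher's control may send either version for a long time.
print(read_name({"name": "Pat Smith"}))
print(read_name({"version": 2, "given": "Pat", "family": "Smith"}))
```

The cost of the open system shows up as the extra branch: the consumer carries code for every version that producers may still be using.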

Finally, it's important to remember that systems evolve over time and have different requirements at different stages in their life cycle. Often, early versions of systems will operate as if they are closed. During initial development, when the earliest versions of a language are under construction, it may be valuable to pursue a much more aggressive, draconian versioning strategy. Once a system is more widely deployed in production, it tends to behave more as an open system. There is likely to be an expectation of stability in the language it provides. Consequently, it may be necessary to proceed with more caution and to be prepared to provide forwards and backwards compatibility for changes.

1.1.6 Compatibility of languages vs compatibility of applications

From NoahM: The draft is on pretty firm ground when it talks about the information that can be determined from a given input text per some particular language L. I think there are important compatibility statements we can and should make at just that level (see suggestions above), and we should separate them from statements about the compatibility of a particular pair of applications that may communicate using the language. Both are important to include, I think, but they should be in separate chapters, one building on the other. Once you've cleanly told a story about which information can be reliably communicated when sender and receiver interpret using different language versions, you can go on to tell a separate story about whether the applications can indeed work well together. To illustrate what I mean, here are examples at each of the two levels.

Language level incompatibility: Consider a situation in which the same input connotes different information in one version of a language or another. Without reference to any particular application, we can say that the languages are in that respect incompatible. For example, we might imagine a version of a language in which array indexing is 1-based, and a later version in which 0-based indexing is used; the information conveyed by any particular array reference is clearly in some sense incompatible, regardless of the consuming application's needs.
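The array indexing example can be made concrete with a short sketch. The two functions below stand for hypothetical consumers of the two language versions; the same text, the index `2`, conveys different information under each version.

```python
def element_v1(array: list, index: int):
    """Hypothetical version 1 of the language: array indexing is 1-based."""
    if index < 1:
        raise IndexError("version 1 indexes start at 1")
    return array[index - 1]


def element_v2(array: list, index: int):
    """Hypothetical version 2 of the language: array indexing is 0-based."""
    if index < 0:
        raise IndexError("version 2 indexes start at 0")
    return array[index]


data = ["a", "b", "c"]

# The same text is valid in both versions but selects a different element.
print(element_v1(data, 2))
print(element_v2(data, 2))
```

Neither consumer reports an error, which is what makes this kind of incompatibility hard to detect: both versions accept the text, but they disagree about the information it carries.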

Application-level incompatibility: Now consider two applications designed to render the same version of the HTML language. The same tags are supported, with the same layout semantics, etc. One of the applications, however, has a sub-optimal design. Its layout engine has overhead that grows geometrically with the number of layout elements. If you give it a table with 50 rows, it takes 3 seconds to run on some processor. If you give it a table with 5000 rows, it runs for 3 days. Question: is the second application "compatible" with the 5000 row input? In some ways yes, and in some no. It will eventually produce the correct output, but in practice a user would consider it incompatible. This illustrates that compatibility of applications ultimately has to be documented in terms meaningful to the applications. In this case, rendering time is an issue. I think we should not try in this finding to document specific levels of compatibility at the application level and we should especially not fall into the trap of trying to claim it's a Boolean compatible/incompatible relation; in the performance example, it's a matter of degree. So, the terminology needs to be specific to the application and its domain. I do think we can talk about some meta-mechanisms that work at the application level, such as mustUnderstand, but they should be in a section that's separate from the exposition of texts, information, and the degree to which information may be safely extracted from a given text when sender and receiver operate under differing specifications.

From Rhys: Clearly this is important, but I’m not sure we necessarily need to go to this level of completeness in a TAG finding. Feels like there is a book in this!

The current draft tries to take the approach that we will model application compatibility by defining a new language that is the flavor (in this case, of HTML) that a particular consumer will successfully process, but the point is that "success" is sometimes a fuzzy concept. Do we have two languages for this example, one for the documents that completely break application #2 and another for those that just make it run slowly? That seems to be what the finding is doing today, and I'm not convinced it's the right approach. My proposal would be that we just point out the distinction and say: "This first section of the finding for the most part restricts its analysis to the limited question of: what information can be reliably conveyed when a producer and a consumer operate using different versions of what purport to be the same or similar languages? The later sections explore some techniques that can be used by applications to negotiate means of safe interoperation when sender and receiver are written to differing versions of a language specification."

From Rhys: I like Noah’s suggested approach because this is really the problem we are setting out to solve for specific types of XML usage.

2 Conclusion

This Finding is intended to provide a terminology basis for further versioning findings.

3 References

Free Online Dictionary of Computing. (See http://wombat.doc.ic.ac.uk/foldoc/.)
Flexible XML Processing Profile. (See http://www.upnp.org/download/draft-goland-fxpp-01.txt.)
RFC 793, TCP (See http://www.ietf.org/rfc/rfc793.txt.)
RFC 1521, MIME. (See http://www.ietf.org/rfc/rfc1521.txt.)
HTML 2.0
RFC 1866, HTML 2.0. (See http://www.ietf.org/rfc/rfc1866.txt.)
WebDAV XMLIgnore post
Yaron Goland. XML Ignore proposed for WebDAV (See http://lists.w3.org/Archives/Public/w3c-dist-auth/1997AprJun/0190.html.)
RFC 2518, WebDAV (See http://www.ietf.org/rfc/rfc2518.txt.)
RFC 2616, HTTP (See http://www.ietf.org/rfc/rfc2616.txt.)
HTML 4.0
HTML 4.0. (See http://www.w3.org/TR/1998/REC-html40-19980424/.)
TBL Mandatory Extensions
Berners-Lee. Web Architecture: Mandatory extensions. (See http://www.w3.org/DesignIssues/Mandatory.html.)
TBL Extensible languages
Berners-Lee. Web Architecture: Extensible languages. (See http://www.w3.org/DesignIssues/Extensible.html.)
TBL Evolution
Berners-Lee. Web Architecture: Evolvability. (See http://www.w3.org/DesignIssues/Evolution.html.)
Web Architecture: Extensible Languages
Berners-Lee and Connolly, ed. Web Architecture: Extensible Languages World Wide Web Consortium, 1998. (See http://www.w3.org/TR/1998/NOTE-webarch-extlang-19980210.)
HTML Document types
Connolly, ed. HTML Document dialects World Wide Web Consortium, 1996. (See http://www.w3.org/MarkUp/WD-doctypes.)
SOAP 1.2
W3C Recommendation, SOAP 1.2 Part 1: Messaging Framework (See http://www.w3.org/TR/SOAP/.)
WSDL 1.1
W3C Note, WSDL 1.1 (See http://www.w3.org/TR/WSDL/.)
WS-Policy 1.2
W3C Note, WS-Policy 1.2 (See http://www.w3.org/Submissions/WS-Policy/.)
XML 1.0
W3C Recommendation, XML 1.0 (See http://www.w3.org/TR/REC-xml.)
W3C Working Draft, XML Inclusions (See http://www.w3.org/TR-Xinclude.)
XML Namespaces
W3C Recommendation, XML Namespaces (See http://www.w3.org/TR/REC-xml-names.)
XML Schema Part 2
W3C Recommendation, XML Schema, Part 2 (See http://www.w3.org/TR/xmlschema-2.)
XML Schema Wildcard Test Collection
XML Schema Wildcard Test collection (See http://www.w3.org/XML/2001/05/xmlschema-test-collection/result-ms-wildcards.htm.)
XFront Schema Best Practices
XFront Schema Best Practices (See http://www.xfront.com/BestPracticesHomepage.html.)
XML.com Schema Design Patterns
Dare Obasanjo. XML.com Schema design patterns (See http://www.xml.com/pub/a/2002/07/03/schema_design.html.)
Dave Orchard writings on Extensibility and Versioning
Dave Orchard writings on extensibility and versioning (See http://www.pacificspirit.com/Authoring/Compatibility.)

4 Acknowledgements

The author thanks Norm Walsh for many contributions as co-editor until 2005. The author also thanks the many reviewers who have contributed to the document, particularly David Bau, William Cox, Ed Dumbill, Chris Ferris, Yaron Goland, Rhys Lewis, Hal Lockhart, Mark Nottingham, Jeffrey Schlimmer, Cliff Schmidt, and Norman Walsh.

A Change Log (Non-Normative)

DBO 20070518: Incorporated Rhys' comments, added version identifier story to forwards compatible evolution, split part 1 into terminology and strategies documents.
DBO 20070704: Incorporated Dan, Stuart and Noah's comments, f2f minutes from http://www.w3.org/2001/tag/2007/05/30-minutes, many other updates including revisions to all diagrams and adding "bulls-eye" diagram