[Editorial Draft] Extending and Versioning Languages: Terminology

1 Introduction

The evolution of languages by adding, deleting, or changing syntax or semantics is called versioning. Making versioning work in practice is one of the most difficult problems in computing. Arguably, the Web rose dramatically in popularity because evolution and versioning were built into HTML and HTTP. Both systems provide explicit extensibility points and rules for understanding extensions that enable their decentralized extension and versioning.

This finding describes the terminology of languages and their versioning.

The terminology for describing languages, producers, consumers, information, constraints, syntax, evolvability, and so on follows. Let us consider an example. Two or more systems need to exchange name information. Names may not be the perfect choice of example because of internationalization issues, but they resonate strongly with a very large audience. The Name Language is created for this exchange.

[Definition: A producer is an agent that creates text.] Continuing our example, Fred is a producer of Name Language text. [Definition: An Act of Production is the creation of text.] A producer produces text with the intent of conveying information. When Fred does the actual creation of the text, that is an act of production.

[Definition: A consumer is an agent that consumes text.] We will use Barney and Wilma as consumers of text. [Definition: An Act of Consumption is the processing of text of a language.] Wilma and Barney consume the text separately from each other, each of these being an act of consumption. A consumer is impacted by the instance that it consumes. That is, it interprets that instance and bases future processing, in part, on the information that it believes was present in that instance. Text can be consumed many times, by many consumers, and have many different impacts.

[Definition: A Language consists of a set of texts, any syntactic constraints on the texts, a set of information, any semantic constraints on the information, and the mapping between texts and information.] [Definition: Text is a specific, discrete sequence of characters.] Given that there are constraints on a language, any particular text may or may not have membership in a language. Indeed, a particular string of characters may be a member of many languages, and there may be many different strings of characters that are members of a given language. The texts of a language are the units of exchange. Documents are texts of a language. The Name Language consists of a set of texts that use 3 terms, and it specifies a syntactic constraint: a name consists of a given and a family.

[Definition: A language has a set of constraints that apply to the set of strings in the language.] These constraints can be defined in machine-processable syntactic constraint languages such as XML Schema, in microformats, in human-readable textual descriptions such as HTML descriptions, or they may be embodied in software. Languages may or may not be defined by a schema in any particular schema language. The constraints on a language determine the strings that qualify for membership in the language. Vocabulary terms contribute to the set of strings, but they are not the only source of characters in the strings of a given language. The language strings may include characters outside of terms, such as punctuation. One reason for additional characters, such as whitespace and markup, is to distinguish or separate terms.

Example 1: Name examples.

<name>
  <given>Dave</given>
  <family>Orchard</family>
</name>

name="Dave Orchard" 

<span class="fn">Dave Orchard</span>

urn:namescheme:given:Dave:family:Orchard

The set of information in a language almost always has semantics. In the Name Language, given and family have the semantics of the given and family names of people. The language also has the binding from the items in the information set to the text set. Any potential act of interpretation, that is, any consumption or production, conveys information from text according to the language's binding. The language is designed for acts of interpretation, that being the purpose of languages. In our example, this mapping is obvious and trivial, but in many languages it is not. Two languages may have exactly the same strings but different meanings for them. In general, the intended meaning of a vocabulary term is scoped by the language in which the term is found. However, there is some expectation that terms drawn from a given vocabulary will have a consistent meaning across all languages in which they are used. Confusion often arises when terms have inconsistent meanings across languages. The Name terms might be used in other languages, but it is generally expected that they will still be "the same" in some meaningful sense.
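The parts of a language described above (a text set bounded by syntactic constraints, an information set, and a mapping between the two) can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: it models just the XML form of the Name Language from Example 1, it assumes that given must precede family, and the names NameInfo, interpret, and is_member are hypothetical and do not come from any specification.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class NameInfo(NamedTuple):
    """Item of the information set: a given and a family name (hypothetical type)."""
    given: str
    family: str


def interpret(text: str) -> Optional[NameInfo]:
    """Map a text to information, or return None if the text is not in the language.

    Assumed syntactic constraint: a <name> element whose children are
    exactly one <given> followed by one <family>.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return None  # not well-formed XML, so not a member of this language
    children = list(root)
    if root.tag != "name" or [c.tag for c in children] != ["given", "family"]:
        return None
    return NameInfo(given=children[0].text or "", family=children[1].text or "")


def is_member(text: str) -> bool:
    """Membership test: does the text satisfy the language's constraints?"""
    return interpret(text) is not None


# Fred's act of production creates a text; Barney's act of consumption
# maps that text back to information.
fred_text = "<name><given>Dave</given><family>Orchard</family></name>"
barney_info = interpret(fred_text)
```

Under these assumptions, the string name="Dave Orchard" from Example 1 is not a member of this particular language, although it may well be a member of another language that carries the same information.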

These terms and their relationships are shown below.

We say that Fred engages in an Act of Production that results in a Name Instance with respect to Name Language V1. The Name Instance is in the set of Name V1 Texts, that is the set of strings in the Name Language V1. The production of the Name Instance has the intent of conveying Information, which we call Information 1. This is shown below:

We say that Barney engages in an Act of Consumption of a Name Instance with respect to Name Language V1. The consumption of the Name Instance has the impact of conveying Information 1. This is shown below:

Versioning is an issue that affects almost all applications eventually. Whether the application is a processor styling documents in batch to produce PDF files, a set of Web services engaged in financial transactions, or an HTML browser, the language and its instances will likely change over time. The versioning policies for a language, particularly whether the language is mutable or immutable, should be specified by the language owner. Versioning is closely related to extensibility, as extensible languages may allow different versions of instances than those known by the language designer. Applications may receive versions of a language that they aren't expecting.

If a Name Language V2 exists, with its own set of strings and Information set, Wilma may consume the same Name Instance but with respect to Name Language V2, with the impact of conveying Information 2. Name Language V2 relates to V1 by relationship r2, which is forwards compatible when comparing language V1 to V2 instances, and backwards compatible when comparing language V2 to V1 instances. Similarly, Information 2 - as conveyed by Consumption 2 - relates to Information 1 - as conveyed by Consumption 1 - by relationship r1.

Extensibility is a property that enables evolvability of software. It is perhaps the biggest contributor to loose coupling in systems, as it enables the independent and potentially compatible evolution of languages. [Definition: A language is Extensible if its syntax allows information that is not defined in the current version of the language.] The Name Language is extensible if it can include terms that aren't defined in the language, like a new middle term.
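As a rough sketch of how one extended Name Instance can convey different information depending on the language version used to consume it, consider the Python fragment below. Several things here are assumptions rather than part of this finding: the rule that a consumer skips child terms it does not know, the choice of middle as the only V2 addition, the placeholder name values, and the helper name consume.

```python
import xml.etree.ElementTree as ET

V1_TERMS = {"given", "family"}
V2_TERMS = V1_TERMS | {"middle"}  # assumed V2 extension of the vocabulary


def consume(text: str, known_terms: set) -> dict:
    """Return the information obtained with respect to one language version.

    Children of <name> whose tags are not in known_terms are skipped rather
    than rejected (an assumed "ignore unknown terms" extensibility rule).
    """
    root = ET.fromstring(text)
    if root.tag != "name":
        raise ValueError("not a Name Language text")
    info = {c.tag: (c.text or "") for c in root if c.tag in known_terms}
    missing = V1_TERMS - set(info)
    if missing:
        raise ValueError(f"missing required terms: {sorted(missing)}")
    return info


# An instance that uses the extension; the values are placeholders.
instance = "<name><given>Ann</given><middle>Lee</middle><family>Example</family></name>"

information_1 = consume(instance, V1_TERMS)  # consumption with respect to V1
information_2 = consume(instance, V2_TERMS)  # consumption with respect to V2
```

In this sketch the information obtained under V1 is a subset of the information obtained under V2, which is one simple form the relationship between the two information sets could take.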

We will call this style of versioning the "liberal" style of versioning. The "liberal" style of versioning is codified in:

Good Practice

Use Least Partial Languages for "liberal" versioning: Consumers should use a flavor of a language that has the least amount of understanding.

The flavor of a language that implies the smallest amount of understanding will also be the most liberal and have the largest possible Accept set.

The "liberal" style of versioning has a drawback, however. It can lead to fragile software that is hard to evolve, because liberal behavior is difficult to code. In addition, it does not force producers to be correct in what they produce and can lead to a vicious cycle of complexity.

ednote

More information on pros/cons needed. Perhaps change best practices to positions, then make best practices for the different scenarios, i.e. Use Conservative Versioning for languages that are only processed by machines, Use Liberal Versioning for languages processed typically by humans or....

There is an opposite style of versioning that says the most effective way of evolving is to force producers to be correct by having strict consumers. We will call this the "conservative" style of versioning. The "conservative" style of versioning is codified in:

Good Practice

Use Only Full Languages for "conservative" versioning: Consumers should fully use and validate a language.

The greatest amount of understanding of a language will find the largest number of errors in producers.
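The difference between the two styles can be pictured as a difference in Accept sets. The Python sketch below is illustrative only: it assumes that the full Name Language requires exactly a given followed by a family, and that the "liberal" consumer needs nothing more than a family term. The function names and the sample texts are hypothetical.

```python
import xml.etree.ElementTree as ET


def conservative_accepts(text: str) -> bool:
    """Full-language consumer: the text must match the whole (assumed) definition."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    return root.tag == "name" and [c.tag for c in root] == ["given", "family"]


def liberal_accepts(text: str) -> bool:
    """Partial-language consumer: it only needs a <family> somewhere below the root."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    return root.find(".//family") is not None


candidates = [
    "<name><given>A</given><family>B</family></name>",                    # full language
    "<name><family>B</family></name>",                                    # given missing
    "<name><given>A</given><middle>M</middle><family>B</family></name>",  # unknown term
    "<name><given>A</given></name>",                                      # family missing
]

conservative_set = {t for t in candidates if conservative_accepts(t)}
liberal_set = {t for t in candidates if liberal_accepts(t)}
```

Among these sample texts the conservative consumer's Accept set is a proper subset of the liberal consumer's: the liberal consumer keeps working with the second and third texts, while the conservative consumer reports them as producer errors.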

Whether "liberal" or "conservative" versioning is in use by consumers, the advice to producers is always the same:

Good Practice

Produce only full languages: Producers should produce the complete version of a language. They should never produce partial flavors.

"Liberal" consumers may allow correct operation with producers that aren't fully compatible with a full language. "Conservative" consumers will be less tolerant of faults in producers.

EdNote: I think this is related to the principle of least power (http://www.w3.org/2001/tag/doc/leastPower). The lower the power of the language, the easier to have partial understanding? For example, XPath is "lower power" than Java processing the DOM.

Compatibility is defined between the producer and consumer of an individual text. Most messaging specifications, such as Web Services, utilise both inputs and outputs. Using our definitions of compatibility, a Web service that updates its output message language is considered to be a newer producer, because it is sending a newer version of the message. Conversely, updating the input message language makes a service a newer consumer because it is consuming a newer version of the message. All systems that include inputs and outputs must consider both when making changes and determining compatibility. For full compatibility, any output message changes must be forwards compatible. This means that older consumers can receive them successfully. Similarly, input message changes must be backwards compatible, so that they can be received successfully from older producers.
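The two checks described above can be sketched as follows. Messages are modelled as plain Python dictionaries, and all of the acceptance rules are assumptions made for illustration: the older consumer ignores terms it does not know, the newer input language treats middle as optional, and a hypothetical stricter variant makes middle required.

```python
REQUIRED = {"given", "family"}


def v1_consumer_accepts(message: dict) -> bool:
    """Older consumer: needs given and family, ignores anything else (assumed)."""
    return REQUIRED <= set(message)


def v2_consumer_accepts(message: dict) -> bool:
    """Newer consumer: same requirements, and "middle" is optional (assumed)."""
    return REQUIRED <= set(message) <= REQUIRED | {"middle"}


def strict_v2_consumer_accepts(message: dict) -> bool:
    """Hypothetical variant of the newer consumer that makes "middle" required."""
    return set(message) == REQUIRED | {"middle"}


v1_message = {"given": "Ann", "family": "Example"}                   # older producer
v2_message = {"given": "Ann", "middle": "Lee", "family": "Example"}  # newer producer

# Output change: the updated service is a newer producer, so its messages
# must still be accepted by older consumers (forwards compatibility).
output_change_ok = v1_consumer_accepts(v2_message)

# Input change: the updated service is a newer consumer, so it must still
# accept messages from older producers (backwards compatibility).
input_change_ok = v2_consumer_accepts(v1_message)

# Making the new term required breaks backwards compatibility of the input.
strict_input_change_ok = strict_v2_consumer_accepts(v1_message)
```

A service that changes both its input and its output languages would need both of the first two checks to succeed for every message it exchanges, not just for the two sample messages used here.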

1.1.3 Divergent Understanding and Compatibility

Our discussion so far has described a fairly straightforward evolution of a language, from a first version to a next version. However, extensibility and interoperability are usually inversely related. It is an axiom in computing that the lower the optionality (which includes extensibility), the higher the chance of interoperability. Each and every place where extensibility is allowed in a language is also a place where a lack of interoperability can arise. Interoperability problems can arise, for example, when producers and consumers do not agree on which version of a language is being used in a text.

Even though a language has a formal definition and extensibility model, there is no guarantee that software that processes it will implement it exactly. Differences in understanding between software agents are a significant source of divergence. A classic example of this is the so-called "tag soup" problem in HTML. Many of the applications commonly used to process HTML, particularly browsers, have an Accept Text Set that is larger than the formal definition of the HTML language. For example, many cases of interleaved opening and closing of elements in HTML are processed without generating an error. This ensures that the user experience, at least in the short term, is of higher quality. However, it causes long-term interoperability problems when the illegal texts are copied by mechanisms such as "view source". The reason is that the more undocumented strings there are in an Accept Text Set, the more difficult it is to achieve interoperability. The more liberal an agent is in accepting texts, that is, the more it enlarges its Accept Text Set beyond the definition of the language, the more difficult interoperability becomes, because not every agent will have the same Accept Text Set.

At the other extreme is XML. XML allows almost no extensibility in its constructs. Name characters, tag closures, attribute quoting, and allowed attribute values are all fixed. This has increased interoperability between implementations of XML. However, it has also made it very difficult to move to XML 1.1, because the lack of extensibility makes almost all changes incompatible. The XML language design was very specifically trying to avoid the HTML "tag soup" problem, and it has arguably done so, at the cost of an inability to version. These two extremes of extensibility design are the result of deliberate choices: the trade-off between extensibility, interoperability, and the Accept Set was planned in advance. Language designers should do the same with their languages.
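The contrast between the two extremes can be observed with two parsers from the Python standard library. The sketch below feeds the same interleaved markup to html.parser.HTMLParser and to xml.etree.ElementTree. Note the assumptions: HTMLParser is a lenient tokenizer, not a browser, so it only stands in for a consumer with a large Accept Text Set, and the TagLogger class and the sample string are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records start and end tags in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


soup = "<b><i>bold italic</b></i>"  # interleaved opening and closing of elements

# The lenient HTML tokenizer reports the tags as written and raises no error.
logger = TagLogger()
logger.feed(soup)
logger.close()

# The XML parser rejects the same text because it is not well-formed.
try:
    ET.fromstring(soup)
    xml_accepted = True
except ET.ParseError:
    xml_accepted = False
```

Every text that the lenient consumer accepts beyond the formal definition is a text that another, stricter agent may reject, which is the interoperability risk described above.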

ed-note

Which references to support this? TAG issue: http://www.w3.org/2001/tag/issues#TagSoupIntegration-54

Good Practice

Analyze Trade-offs for Language: Language designers should analyze the trade-offs between extensibility, interoperability, and actual language Accept Set.

1.1.4 Open or Closed Systems

The cost of changes that are not backward or forward compatible is often very high. All the software that uses the affected language must be updated to support the newer version. The magnitude of the associated cost is directly related to whether the system in question is open or closed.

[Definition: A closed system is one in which all of the producers and consumers are more-or-less tightly connected and under the control of a single organization.] Closed systems can often provide integrity constraints across the entire system. A traditional database is a good example of a closed system: all of the database schemas are known at once, all of the tables are known to conform to the appropriate schema, and all of the elements in each row are known to be valid for the schema to which the table conforms.

From a versioning perspective, it might be practical, in a closed system, to say that a new version of a particular language is being introduced into the system at a specific time. At that time, all of the data that conforms to the previous version of the language is migrated to conform to the new version as part of the upgrade process.

[Definition: An open system is one in which some producers and consumers are loosely connected or are not controlled by the same organization.] The Internet is a good example of an open system.

In an open system, it's simply not practical to handle language evolution with universal, simultaneous, atomic upgrades to all of the affected software components. Existing producers and consumers, who are outside the immediate control of the organization that is publishing an updated language, may continue to use the previous version for a considerable period of time.

Finally, it's important to remember that systems evolve over time and have different requirements at different stages in their life cycle. Often, early versions of systems will operate as if they are closed. During initial development, when the earliest versions of a language are under construction, it may be valuable to pursue a much more aggressive, draconian versioning strategy. Once a system is more widely deployed in production, it tends to behave more as an open system. There is likely to be an expectation of stability in the language it provides. Consequently, it may be necessary to proceed with more caution and to be prepared to provide forwards and backwards compatibility for changes.