W3C, XML and standardization

W3C

Bert Bos

München
19 September 2001
XML Days Europe

Bert Bos <bert@w3.org>
W3C/INRIA
Sophia Antipolis, France

W3C history

W3C was founded by Tim Berners-Lee with the help of CERN, where he worked and where he invented the Web, and of MIT (in particular LCS, the Laboratory for Computer Science). The goal was to create an international body with staff in both the US and Europe to further develop the Web, which was rapidly taking off in 1994. CERN quickly discovered that its other big project, the new particle accelerator ring, scheduled to be operational in 2003, would take up all its resources, and another European research institute, the French INRIA, took its place. INRIA's core business is computer science, whereas CERN's is particle physics.

Tim moved to the US with a few of his collaborators and started hiring a few more people. The Consortium officially opened its offices at MIT in October 1994. A few months later hiring began in France, and the official opening of the French host site took place in October 1995, with some talks and a press conference in Paris.

A year later, Keio University in Tokyo joined as the third host.

Since then, the Consortium has opened small offices in various other countries. Many of them were set up at the request of, and with the help of, the European Union. Their task is mostly related to communication: they translate texts into the local language, organize symposia, and serve as the first point of contact in their respective countries (or language regions).

There are currently 10 offices. The German one is at the Fraunhofer Institut für Medienkommunikation in Sankt Augustin.

In fact, the reason I am talking here is that the conference organizers contacted the German W3C Office, which in turn asked me. The various European W3C Offices together made sure that there is a speaker from W3C or from one of the Offices in every city where this conference series makes a stop.

The W3C staff, excluding the offices, currently consists of about 60 persons. Each of them is associated with one of the three hosts (I work at INRIA in France), although several are telecommuting, from places as far apart as Seattle, San Diego, Edinburgh, Venice, Melbourne and others.

W3C is a "Consortium" because it is a member organization. It has more than 500 members, many from the computer and telecommunications industry, but also from research, fincance, etc. The members pay a membership fee and contribute in the form of people that particpate in the various working groups. Apart from influence, they also gain (slightly) earlier access to the technologies that are being developed. Typically, new drafts are available to members a few months before they are published, but some working groups are quicker with publishing than others, and some have decided to work completely in public.

W3C standardization process

Core:

"Post-recommendation":

Also: demo software: Amaya, Jigsaw, etc.

The highest status a specification can get inside W3C is "W3C Recommendation." Originally it meant that the W3C members had reviewed it and agreed that it was worth implementing (and that the W3C Director had given his approval). Nowadays it means more than that, as I will explain below.

Just before a specification becomes Rec, during the review period, it is called a Proposed Recommendation (PR). And before that, a specification usually goes through various working drafts (WD).

Working Drafts are written by Working Groups (WGs) of domain experts. Most WGs are made up of W3C staff (usually 1 or 2) and delegates from the members, forming a group of 10 to 20 people, although some popular topics have recently led to WGs of close to 80 members. (It probably indicates that the topic is important, but whether such a group can work effectively is another question.)

The main products of W3C are its specifications, but they are by no means the only ones. W3C maintains a Web site with background information and many pointers to related information. W3C also writes press releases for new Recommendations, often in cooperation with its members, and of course, W3C is present at many conferences, to explain the various standards and especially how they relate to each other.

Test suites are a relatively recent addition to W3C's output. More about them below.

Working groups

Working groups are basically a way to prove that the world is small. They hold hour-long international teleconferences, sometimes two per week, sometimes every other week, but usually once per week; they travel to meetings all over the world, 4, 5, or even 6 times per year; and they maintain lists of hotels in various cities and organize dinners...

Working Groups are time-consuming. Apart from attending meetings (usually a one-hour phone conference per week and four physical meetings per year), the members also have to write the various working drafts (and often test suites as well).

To help them, many working groups have an associated Interest Group (IG), which is usually larger than the WG, and which is used for e-mail discussions in which many more people can take part.

Not all WGs have an IG, but all of them have a public mailing list, where discussions take place between the WG and the people that do not work for a W3C member company. The oldest such groups are www-html@w3.org and www-style@w3.org.

Many WGs also invite people from the public to join the WG directly. They are called "invited experts" and there are quite a few of them. Some are still students. The CSS WG, for example, has 15 members, including 4 invited experts, of which 2 are students. Especially the WGs in the WAI (Web Accessibility Initiative) domain have large proportions of invited experts.

Traditionally, the minutes, charters and internal documents of WGs are only visible to the W3C members, but increasingly WGs decide to be completely open, to make it easier for the public to see what they are doing and to allow earlier feedback. Of course, all the members of the WG have to agree that it is indeed better for them to have an open WG than to see drafts in advance of the general public.

WGs that work in public don't distinguish between a public mailing list and an Interest Group: in their case the public mailing list and the Interest Group are one and the same.

The early standards

PNG was the easiest standard we ever made. In fact, it was created by a group of people who met on Usenet before the Consortium started, and when W3C was formed, they asked it to review and publish the specification. PNG replaces GIF and is overall a much better format (compresses better, is more precise in color representation and doesn't have the arbitrary restrictions of GIF), but above all it is free of the patents that restrict the use of GIF. PNG was quickly implemented in all browsers and in many other graphics tools as well (except that alpha channels are still not handled correctly everywhere, unfortunately), but despite that it still remains less well-known than GIF.

CSS started on the HTML mailing list, well before W3C. It combined two proposals (one by Håkon Lie and one by myself). Both Håkon and I joined W3C, and a little later CSS got its own WG.

There never was an HTML 1.0. The documentation that Tim Berners-Lee had made for HTML was not very precise. People soon recognized that it was based on SGML, but that it needed some work to make it conform to SGML. HTML 2.0 was made by an IETF working group and was a real application of SGML, and when W3C was created, it took over the development of HTML, and created HTML 3.2, also an SGML application. However, in practice, almost nobody implemented HTML as SGML, as we'll see below.

Lessons learnt

The people that worked on PNG, HTML and CSS were all on the Internet before the Web started. They were what marketing people call "early adopters": interested in the technology, able to upgrade to the latest versions and knowledgeable about where to look for the error if something didn't work.

Nobody at that time cared much about bugs in software, because they would soon be discovered and fixed, and then everybody simply downloaded the new version. That was a matter of weeks, often even days. That the browsers at the time didn't implement HTML 2.0, and later CSS1, quite as they should was not too worrying, as long as every new version improved on the previous one.

But the Web started changing. People started connecting to the Internet because they wanted to be on the Web. They didn't know the old Internet culture. They were given a browser by their ISP and thought of it as the software equivalent of a TV set: an appliance that would last for 10 years and whose vendor could be blamed when it malfunctioned. The fact that most of the malfunctioning was actually the fault of errors in the Web pages they visited was lost on them.

By now the major browsers were made by commercial companies, first the Mosaic spin-offs, Netscape and Spyglass, soon joined by Microsoft. They did two things that the developers of the specifications hadn't foreseen: they started spending lots of resources on trying to accept incorrect Web pages and they started inventing proprietary extensions. The aim was to gain market share, the result was lack of interoperability and lots of problems for the writers of new specifications, because the HTML as implemented was much too fragile to support real improvements.

To stop the fragmentation of the Web, the HTML WG started working on a compromise that the two major browsers and the various makers of authoring tools could live with. This became HTML 3.2. It wasn't the direction W3C had hoped to take. The original idea was to make improvements to HTML, as exemplified by the HTML+ and HTML 3.0 proposals, much of which was the work of Dave Raggett. But the compromise worked, although many people were frustrated by the loss of time. HTML 4.0 took the right direction again, and with XHTML 2.0 the HTML WG is finally making real improvements to HTML.

CSS took a different route to counter the growing lack of interoperability and the danger of buggy browsers staying around far longer than any developer wanted. It developed a test suite. With hindsight, the test suite should have been there when CSS1 came out, but at the time we didn't know any better. The test suite is a tremendous help to the programmers. It helps to explain what the specification means, it shows that the browser is not at fault when some site won't display correctly, and it gives the programmers a weapon against the marketeers, when the programmers want to fix the bugs and the marketeers want them to do something else. Being able to claim 100% conformance to the W3C test suite is apparently desirable.

The old adage of the Internet is "be strict in what you produce, be forgiving in what you accept." The idea is that servers should do their best to serve data that adheres to the standard, while clients should accept obvious errors. This, in theory, leads to maximum interoperability.

On the Web, clients (browsers) were indeed very forgiving of errors, but the producers of data were not at all strict in what they generated. Many Web designers had no clue about what the specifications said, let alone of the quoted motto. They trusted the browser: if their pages worked in their browser, they assumed the pages were correct.

In this situation, many despaired of the Web's ability to ever become the "semantic Web" that Tim Berners-Lee had originally predicted. The semantic Web relied on extensibility, or in Tim's words, "evolvability": the first versions might not be perfect, but they were extensible. Improvements could be made without losing what was already there. But with all those pages that consisted of "tag soup" with no discernible structure, that only worked in two browsers, both of which were already of enormous size, how could HTML evolve into something better?

This is one of the main reasons why there is no "HTML 5," but instead an "XHTML 2." XHTML is a clean break with HTML. It is based on XML and no longer on SGML, and inherits the conformance requirements of XML, which are very strict, as we'll see below. (XHTML 1.0 is a transitional format, without any great interest, except that it helps people prepare their software for XML. XHTML 2.0 will be the real thing.)

But browser makers haven't shed their bad habits completely yet, and authoring tools (and the people that use them) haven't really understood how important it is to create conformant data. We already see many pages that claim to be XHTML (1.0), but aren't, so although XML probably improves the situation a bit, it seems it won't improve it as much as its developers had hoped...

Something else that has changed over the years is the level to which different W3C technologies interact. Originally, a WG didn't have to think much about the work of other WGs. It was pretty clear where HTTP ended and HTML began and where HTML ended and CSS began. Once their charters were established, WGs could work pretty much in isolation, concentrating on their own work.

W3C now has Coordination Groups, whose task is not to create specifications, but to identify dependencies between groups and help them solve them. Horizontal activities, such as Internationalization and Accessibility, constantly check on other WGs and suggest improvements. For several of the newer groups, such as XForms and XLink, it is not so clear where their area ends and what they have to leave to each other or to, e.g., HTML or CSS. The boundaries are established after the groups have started and have to be negotiated.

Some groups also find that they don't have the experts for the tasks they are asked to do. The DOM WG, e.g., used to take care of the CSS DOM, but now that CSS is growing, they don't have the necessary resources anymore, and recently the CSS DOM was taken from the DOM charter and added to the CSS one.

Another example: XSL, especially the Formatting Objects (FO) part, was created to cater to certain high-end formatting tasks (most notably printing) for which CSS is too limited. But that means that XSL necessarily also includes most of the features of CSS, and thus the XSL and CSS WGs have to be in constant correspondence, to make sure that they don't diverge on the common parts and that they don't misunderstand each other's new drafts.

Solutions

HTML had a very simple extension mechanism. First of all, it had a <!DOCTYPE> at the top, so new versions would always be recognizable by that, but to ease implementation, it also specified that implementations should ignore any tags they didn't recognize. "Ignore" meant that the parser should parse as if the unknown tag, from the opening bracket "<" to the closing one ">", simply wasn't there.
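
As an illustration, here is a minimal sketch in Python of that forgiving behaviour (a toy, not a real HTML parser: the set of "known" tags is a made-up subset chosen for the example):

    from html.parser import HTMLParser

    KNOWN = {"p", "em", "strong", "ul", "li"}   # toy subset, not the real HTML element list

    class ForgivingParser(HTMLParser):
        """Unknown tags disappear, but the text they enclose is kept."""
        def __init__(self):
            super().__init__()
            self.out = []
        def handle_starttag(self, tag, attrs):
            if tag in KNOWN:
                self.out.append("<%s>" % tag)
        def handle_endtag(self, tag):
            if tag in KNOWN:
                self.out.append("</%s>" % tag)
        def handle_data(self, data):
            self.out.append(data)

    p = ForgivingParser()
    p.feed("<p>A <blink>new</blink> feature</p>")
    print("".join(p.out))   # prints: <p>A new feature</p>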

However, until very recently (to be precise: Netscape 6, MacIE 5.5 and WinIE 6), browsers simply ignored the doctype completely, using one and the same parser for all versions of HTML, whether 2.0, 3.2, 4.0 or any of the nonstandard formats produced by some authoring tools. In that environment, the simple "ignore unknown tags" rule soon reached its limits.

The new trick is called "XML Namespaces." Namespaces look very ugly and when they occur in a document, the parser is almost forced to do something with them. Maybe this time parsers will indeed do something different with XHTML 1.0, 2.0, and any successor.
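
A small Python sketch, using the standard xml.etree.ElementTree module, shows why: the namespace URI becomes part of every element name that the parser reports, so the application can no longer pretend it is looking at "just HTML".

    import xml.etree.ElementTree as ET

    doc = """<html xmlns="http://www.w3.org/1999/xhtml">
      <body><p>Hello</p></body>
    </html>"""

    root = ET.fromstring(doc)
    print(root.tag)   # prints: {http://www.w3.org/1999/xhtml}html

    # Elements have to be asked for by their namespace-qualified name.
    for p in root.iter("{http://www.w3.org/1999/xhtml}p"):
        print(p.text)   # prints: Hello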

CSS used (and still uses) a different method for extensibility. It doesn't have a "magic number" at the top, like the doctype of HTML, because there aren't any versions of CSS. CSS has levels, with higher numbers indicating more features, but the different levels are supposed to coexist. Therefore CSS was designed with precise parsing rules that specify exactly what a parser should skip when it encounters an unknown statement. The trick is that it should not ignore just the unknown token, but the whole statement that the token is part of.
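
Here is a minimal sketch of that idea in Python, reduced to the declarations inside a single rule set (a real CSS parser also has to skip whole rules and at-rules and to balance braces and quotes, which this toy version does not do; the property names, including the unknown "frobnicate", are made up for the example):

    KNOWN_PROPERTIES = {"color", "margin", "font-size"}   # toy subset

    def filter_declarations(block):
        """Keep known declarations; drop an unknown declaration in its entirety."""
        kept = []
        for declaration in block.split(";"):
            prop, _, value = declaration.partition(":")
            if prop.strip().lower() in KNOWN_PROPERTIES:
                kept.append(prop.strip() + ": " + value.strip())
        return "; ".join(kept)

    print(filter_declarations("color: red; frobnicate: 3em 2s; margin: 1em"))
    # prints: color: red; margin: 1em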

Although the earliest implementations (MSIE 3, Netscape 4) had bugs even in these parsing rules, the rules work well, and they allow CSS to acquire new features without breaking old implementations.

But the extensibility rules were not meant to deal with bugs (because, as already said, the developers still thought buggy software wouldn't last long anyway...). So the CSS WG decided to create a test suite. It was the first test suite W3C created. That was in early 1998, and since then test suites have become almost a sine qua non of new specifications. There is an HTML test suite in the works. CSS level 3 will be delivered in several modules, each of which has its own test suite right from the start. SVG was published together with its test suite, as XSL will be, and so on for all other new specs.

CSS also quickly set up a validator, a program, also accessible as an on-line service, that checks CSS style sheets for errors. It goes further than the HTML validator, because it not only checks for syntax errors, but also tries to warn authors about possible semantic errors in their style rules.

XML, which started being developed when the problems of non-conformity were already well-known, took a very radical route to try and reverse the tide: XML puts the onus on the client, the receiver of the data. In a reversal of the old IETF rule, XML specifies that, in order to be conformant, an XML parser must report errors and it may not try to guess what the author might have meant. Well-formedness errors are classified as fatal: after such an error, a parser may continue searching for more errors, but it must stop passing on correctly parsed data.
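
The difference is easy to demonstrate with Python's standard XML parser; the missing end tag below is a well-formedness error, and the parser reports it instead of delivering a repaired document:

    import xml.etree.ElementTree as ET

    good = "<p>Hello, <em>world</em></p>"
    bad  = "<p>Hello, <em>world</p>"        # missing </em>: not well-formed

    print(ET.fromstring(good).tag)          # prints: p

    try:
        ET.fromstring(bad)
    except ET.ParseError as err:
        # A conforming parser must report the error rather than guess what
        # the author meant.
        print("fatal error:", err)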

Of course, in practice, XML-processing programs don't follow that rule completely, if only because checking whether something is an error could be quite costly, but XML parsers are in general much stricter than the old HTML parsers. Although XML is a lot more complex than HTML on paper (probably more complex than necessary, because the XML WG was large and for the most part had completely wrong ideas of the future use of XML), parsing real-world XML is easier than parsing real-world HTML in practice.

W3C has taken other measures to deal with the bugginess and lack of interoperability of Web software. One measure is to insert an extra step in the standardization process: the Candidate Recommendation (CR) phase. Until a year ago, a specification was made into a W3C Recommendation when there was consensus that the specification was ready and correct. Now such a specification becomes a CR first. A CR is also a standard, in the sense that W3C believes it is ready and correct and will not change anymore, but it is a conditional standard: it is on active trial for a certain minimum amount of time to see if implementations succeed in conforming to it and becoming interoperable. Only when that has happened will it become a W3C Recommendation. A report on the implementation is made at the start and at the end of the CR phase, and the report is made public.

If CR had existed in 1996, CSS1 would have become a CR instead of a Rec in that year, and it would have become a Rec only in 2000. Although, from a different point of view, if CR had existed in 1996, CSS1 would probably have reached 100% implementation earlier than 2000...

Another measure W3C has taken is the formation of a Quality Assurance (QA) activity. Like the Internationalization and Accessibility activities, QA is a horizontal activity. It is supposed to be involved in every spec from no matter which domain. It helps with and coordinates test suites, checks that the conformance requirements of different specs are compatible with each other, and collects and publishes information and tools that can help with conformance, such as validators.

How a spec gets started

The first W3C REC, PNG, was more or less a donation to W3C. PNG was created in response to problems with GIF (patent problems and limitations of the format), but W3C had no involvement in it, except to review it and publish it. HTML, of course, started with Tim Berners-Lee himself, was taken on by the IETF and later by W3C. CSS also started before W3C.

The W3C technologies that started inside W3C came about in a few different ways. Some were developed because members saw a need and asked W3C to handle the development. One example is PICS (now replaced by RDF and P3P), which was created, in record time, when the American members in particular saw a danger that the US government would create censorship laws unless there was some technology that would allow parents to protect their children by themselves.

XML was less urgent, but also came out of a request by W3C members to develop an interchange format for SGML documents on the Web. (That's not what XML is currently used for, but that was the reason the WG was first created.) Some people saw already when the first draft of XML came out that XML had the potential to become something much more interesting than an exchange format for SGML, but they remained a minority until after XML 1.0 was finished.

XLink, Namespaces, XPointer and others were spin-offs of the XML work. People already started thinking about them before XML was ready, because it was clear that XML would need something like them. (Namespaces is a somewhat strange case. In my opinion, XML didn't, and doesn't, need namespaces, but at least at the time many people thought it needed them urgently, and the spec was pushed through in haste.) But it was also clear that those specifications were sufficiently complex that they needed WGs of their own, which were subsequently created.

Some specifications start not just with a request by one or more members, but with a concrete (although often incomplete) proposal, in the form of a submission. Those submissions are published on the W3C Web site as (non-normative) Notes. Examples are SVG (which started with no fewer than three more or less simultaneous submissions), XSL and XML Protocols.

How the standardization process is made

... by trial and error

The process is adapted whenever the need arises. At first, there was a concept of "editorial review board", which became "working group." Other groups were added: interest group, coordination group, project (for software). Candidate Recommendations were added to be able to distinguish standards that have proven themselves in implementations from standards that are still being tested.

Working groups used to work on their own schedule, but for better coordination we recently added a "plenary meeting" once a year, to which all working groups are invited.

Some WGs forgot to send comments on working drafts of other WGs and were then surprised when a draft became a CR. So we added an extra stage: just before a WG asks for CR status for one of its drafts, it has to issue one more draft and send messages to all working groups in the form of a last call for comments. Thus the checklist of things to do before a draft is published is getting longer...

General principles

(See also my essay on design principles and Tim Berners-Lee's "Design issues")

These are just some of the many principles that W3C has learnt to apply over the years. There are others (internationalization, efficiency, stability, etc.), but there is only time here for a few examples.

Tim Berners-Lee coined the term "evolvability" for the principle that technologies on the Web should be able to improve without losing access to what already exists. One example is the URL system: a URL doesn't just contain the machine name and the path to the file, but also the protocol. HTTP isn't taken for granted, which allows FTP, gopher, WAIS, etc. to participate as well. And new protocols can be (and have been) added, e.g., HTTPS. Gopher's addressing system didn't allow such extensibility.
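
The structure is easy to see with Python's standard URL parser (the FTP address is just an illustrative, made-up example):

    from urllib.parse import urlparse

    for url in ("http://www.w3.org/", "ftp://ftp.example.org/pub/readme.txt"):
        parts = urlparse(url)
        # The protocol (scheme) is part of the address itself, so new schemes
        # can be added later without changing the addressing syntax.
        print(parts.scheme, parts.netloc, parts.path)
    # prints: http www.w3.org /
    #         ftp ftp.example.org /pub/readme.txt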

"Semantic Web" is something like "Artificial Intelligence": it is not a quality of a particular Web, not even a specific goal, but a direction in which to develop. It means that the goal is always to develop technologies that enable the computer to take over tasks from the user, but in contrast to Artificial Intelligence, which tries to teach the computer to understand information in human-readable form, the Semantic Web tries to enable the human to create information in machine-readable form.

A simple example is using HTML instead of PNG for text: on many people's screens, a text in PNG is just as readable as a text in HTML, but to the computer, the PNG is not text at all, and thus cannot be spell-checked, cannot be translated, cannot be re-flowed into columns, cannot be indexed in a search engine, etc. AI would try to interpose an OCR system, SemWeb would give the user a WYSIWYG HTML editor.

SemWeb often seems to make things harder for users. They have to mark a word as "emphasized" when all they want is to make it italic. But of course they get something back as well: they can use a style sheet, and the text can be read on systems that don't have italics (text terminals, speech synthesizers...). Smart editing tools can help a lot as well. And much of the text on the Web is not typed in, but generated from information that is stored in a database or some other form. There is no reason at all to generate italic text instead of more descriptive mark-up in such a case.

The overall goal is therefore the usability of the Web as a whole. It may be easier for an information provider to present information in any arbitrary order and with made-up mark-up, but for the reader who wants to compare offers from two different providers it will be much easier if his browser can automatically match up corresponding information and display it side by side.

"Evolvability"

As already mentioned, the built-in extensibility of HTML is very limited. There is a rule that allows parsers to ignore tags, but that doesn't really allow HTML to be extended, because it is tags, not elements, that are ignored, and because there is no way to provide alternatives for ignored extensions. (However, HTML does have an extensible embedding mechanism, in the form of the OBJECT element.)

XML (despite its name ☺) is not extensible at all. However, it does have a version number, so there is still a possibility to improve XML without having to change its name.

Many other technologies have more sophisticated extension mechanisms. XSL allows extension functions. CSS allows a certain level of fallbacks to be included in a style sheet, so old and new software can use the same style sheets. PNG has a way of registering new features by means of four-letter chunk codes. P3P has an EXTENSION element that even allows the author of the P3P policy to specify whether the extension is optional (allowing old software to ignore it) or mandatory (thus excluding old software from processing the file).
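
As an illustration of the PNG mechanism, here is a rough Python sketch of how a decoder walks through the chunks of a PNG file and tells, from the case of the first letter of the four-letter code, whether an unknown chunk may safely be skipped (the file name is only a placeholder):

    import struct

    PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

    def chunks(data):
        """Yield (type, payload) for each chunk in a PNG byte stream."""
        assert data[:8] == PNG_SIGNATURE
        pos = 8
        while pos < len(data):
            length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
            yield ctype.decode("ascii"), data[pos + 8:pos + 8 + length]
            pos += 12 + length                 # 4 length + 4 type + data + 4 CRC

    with open("image.png", "rb") as f:         # placeholder file name
        for ctype, payload in chunks(f.read()):
            # An uppercase first letter (IHDR, IDAT, ...) marks a critical chunk;
            # a lowercase one (tEXt, gAMA, ...) marks an ancillary chunk that a
            # decoder which doesn't understand it may simply skip.
            print(ctype, "critical" if ctype[0].isupper() else "ancillary", len(payload))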

The problem with all built-in extension mechanisms is that they have to try to predict the future: what kinds of extensions may be necessary in the future? An unforeseen new feature may not fit in the existing extensibility framework. Most XML-based formats are too new to say how well their extension mechanism functions in practice. The one from CSS seems to be powerful enough in practice to allow CSS to provide three levels of vastly different complexity in one file.

"Semantic Web"

Tim Berners-Lee talked about a "semantic Web" in his first writings about the Web, even before the Consortium was created. A "semantic Web" is a Web in which the software is smart enough and the data rich enough to allow automatic inferences.

Inferences can be made from single Web pages. That is what accessibility and device-independence depend on. If you look at the Web page of, say, a travel agent, you may find descriptions, photos and prices of holiday resorts. If you have a recent graphical browser and a reasonably powerful machine, you will be able to look at them and decide to buy or not. But if you have a less powerful machine (e.g., a cell phone), a less sharp screen (e.g., a TV) or some permanent or temporary physical handicap (e.g., blindness), things may not be so clear. But if there is sufficient semantics in the Web page, your browser may be able to improve the presentation: that photo of a swimming pool that you can't see can be described by your browser as "picture of a semi-circular open-air swimming pool"; the blinking animation that said "New!" can be replaced by the word "New!"; and the table with options and prices that was too wide for your screen can be split into narrower tables intelligently.

In nearly all cases, maybe with the exception of art, the particular presentation is only one of many possible presentations. The designer, of course, tries to make the presentation as easy to understand and as attractive as possible, but in the background there is the "essence": that something that the presentation is trying to convey. Making that essence available to the software is the goal of the Semantic Web.

Let's continue the travel agent example. What if you want to compare prices between two travel agents? In the old, paper days, you got their two brochures and with the help of a highlighter, a piece of paper and a pencil you made your own table. Now in the days of the computer, the brochures are on-line, the piece of paper is a spreadsheet program, and making a table is as easy as dragging and dropping the brochures onto the spreadsheet... or is it? Not quite. In fact, the computer doesn't understand a word of those brochures. It has no idea that "2001" in one sentence is a year and "1999" in another is a price.

Let's continue even further. Assume that the brochures were machine-readable. They could, e.g., have RDF statements in a common vocabulary that allowed the browser to extract places, dates, prices and other important things. Now you've got a table and you quickly see that many of the offerings do not interest you at all: you don't like rafting, so the trip down a canyon can be dropped; you have a dog, so the hotels that say "no pets allowed" can be skipped. Can't the computer apply those rules automatically?

The idea of the Semantic Web is that rules are themselves also data and thus you should be able to write a Web page that expresses your personal preferences for selecting holidays. This particular page will probably not be publicly accessible, but the technology is the same. You need a vocabulary for expressing the rules and encoding them in RDF.

We're starting to create vocabularies. The Ontology working group of W3C has just started and will create some basic vocabularies to express facts and rules. It's like building up a mathematical system. You need some basic operations like unions of sets, addition and subtraction, Boolean logic, etc., and then you can build on that to create something more complex.

The end

http://www.w3.org/Talks/2001/0919-XML-Munich/all