IBM Position Paper for
W3C Workshop on Quality Assurance

Written by David Marston for IBM Corporation


0. Introduction
1. Definition of Roles
2. How should W3C approach this area?
3. What are the current problems with conformance testing?
4. What QA approach should W3C promote?
5. What test components should W3C promote/create/endorse?
6. What education programs about QA should W3C conduct?
7. Should W3C certify test suites or even implementations?
8. What does IBM need from W3C?
A-1. Appendix 1: Experience in assessing XSLT conformance


IBM and its subsidiaries, Lotus and Tivoli, have been involved in nearly every W3C Working Group (WG) as well as many OASIS activities. IBM has over 1000 people working on XML and related W3C Recommendations. On our AlphaWorks website, you can see many free XML tools that push XML toward maturity. The commercial motivation is that IBM has many products that need to interoperate in ways facilitated by XML and by its acceptance as a standard. IBM has numerous products using technologies developed between the W3C and IETF or other bodies, such as Unicode, URI, HTTP, XML Digital Signature, and MPEG-7.

This paper concentrates on conformance testing and Recommendation clarity, but we believe that Quality for W3C also has dimensions related to simplicity and understandability (as well as internationalization and accessibility, which the W3C already oversees). We commend the WG members for consistent and precise use of terminology across specifications. We would like to see some metrics applied to every document before it can exit the Candidate Recommendation (CR) level, such as "a reasonable developer is able to read and understand this and generate conforming content in no more than 3 working days." The existence of such a metric will lead to attempts to measure it. We do not want a Web where only highly-trained and well-funded participants can extend or build on the W3C's work.

The author works full-time on conformance testing of XSLT processors, an activity which requires deep familiarity with the XSLT and XPath documents and full understanding of several other W3C Recommendations.

Definition of Roles

For purposes of this paper, we characterize the stakeholders as follows:
The W3C, promulgating a vision of interoperable communication;
Builders, building Web sites and applications layered on W3C-conformant software;
Vendors, who provide software implementing the W3C Recommendations;
Test Labs, professional software evaluators delivering information to Builders;
OASIS and possibly others, bringing Labs and Vendors together to agree on test suites.
End Users of the World Wide Web will benefit if all the above perform well in their roles, but this paper does not need to say anything about their stake.

How should W3C approach this area?

Strategically, we want W3C to make incremental improvements in Quality Assurance (QA) and to recognize officially the value of external review. As one might gather from the above definition of roles, we see role definition as a tool for planning the early steps. We say "incremental" to stress that there are several areas where today's approach has a good guiding principle but needs better execution. The Web community also needs W3C to improve architectural coordination among WGs and their respective Recommendations, a distinct strategic concern that can be harmonized with the quality concerns.

We are skeptical about establishing a separate W3C Activity for QA, in part because that approach signals to the WGs that quality comes from an external source. W3C Team members can assist the WGs in looking out for the software-testing issues through all steps of the Recommendation track. Similarly, the staffers can push for architectural coordination and for reliance upon meta-level Recommendations such as InfoSet as foundation documents. One goal for all the coordination work is to maintain progress; we all know how a proprietary format can fill the vacuum that results when a vendor-neutral body tries to "get it right" and thereby delays issuance of its specification.

Having said all that about the W3C organization, we should now emphasize that expanded involvement of outsiders will aid the QA cause. We definitely want review of CRs and even earlier documents with testability in mind. Such reviews within the W3C process should be supplemented by reviews that represent the Test Lab perspective, and OASIS may provide a conduit for delivering that point of view. The W3C should look for ways to expand the review by those who were not privy to the WG thinking, so that the quality of the Recommendation will be improved. One simple early step is to be consistent and simple in providing access to the public discussion lists that accompany each Recommendation.

The goal of making Recommendations stand on their own as implementable and testable specifications could drive additional documentation standards. That is, the verbiage of the Recommendations could be made more like statutory language, or there may be relevant ideas in the generic science of specification writing. As described in the Appendix below, an attempt to write test cases can readily expose gaps or conflicts in a specification document, just as an attempt to write the program would. The next level of refinement in documentation standards for the WGs is to recognize, and possibly impose, a classification scheme for the documents themselves. In addition to the abstract-foundation type of Recommendation (e.g., InfoSet), there are Recommendations that define a vocabulary or syntax (MathML, Namespaces), ones that define a protocol between two systems (XML Protocol), ones that specify behavior of a processing program (XSLT), ones that define an API (DOM), and maybe other classes such as a set of events. The testability needs will vary for each class.

What are the current problems with conformance testing?

Devising a "complete" test suite for a Recommendation can be difficult for substantive reasons, even when the document itself achieves the ideal of full testability. Currently, it can be hard to isolate testable provisions and to determine which interactions among provisions spawn a multiplicity of cases. The larger Recommendations, like XML Schema, can limit most reviewers to a superficial level of understanding. Certainly the WGs hope to issue a Recommendation that requires no errata. Errata will still be needed, but the combined attention from document reviewers, testability reviewers, and early implementers will result in a better document. To that end, the W3C may want to make an incremental upgrade in some exit criteria for late stages of the Recommendation track. After the final Recommendation, WGs should provide a rapid-response mechanism for valid questions of interpretation, and publish the decisions in a normative place, with errata to follow at their own deliberate pace.

Improved architectural coordination will also pay off if universal definitions and data models can be enhanced as a foundation on which multiple WGs can build their interlocking Recommendations. The notion of "subset" specifications, originally driven by small-profile hardware like PDAs, can help to make interlocks more limited. One WG ought to be able to register the policy that they will depend upon a particular level or subset of another Recommendation, which implies a process change so that the dependency is honored. (Example: many Recommendations depend upon Namespaces 1.0 and must not be surprised by unexpected interpretations or errata for that document.) Software tools can assist in documenting those dependencies more precisely. The W3C Team should create such tools and post appropriate reports or diagrams that the public can view. We also observe that the W3C needs a process or policy for deprecation of particular bits of syntax in a Recommendation, which can also propagate through a dependency chain.
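As an illustration of the kind of dependency-documenting tool suggested above, here is a minimal sketch in Python. It is not an existing W3C tool. The DEPENDS_ON table and the function name affected_by are hypothetical, and the dependencies listed are simplified examples rather than an authoritative registry. Given a Recommendation that receives an erratum or a deprecation, the sketch reports every Recommendation that depends on it, directly or through a chain.

```python
from collections import defaultdict, deque

# Hypothetical, simplified registry: each Recommendation lists the
# Recommendations it has registered a dependency on.
DEPENDS_ON = {
    "XSLT 1.0": ["XPath 1.0", "Namespaces 1.0"],
    "XPath 1.0": ["Namespaces 1.0"],
    "Namespaces 1.0": ["XML 1.0"],
    "XML 1.0": [],
}


def affected_by(changed, depends_on):
    """Return all Recommendations that depend, directly or transitively, on `changed`."""
    # Invert the registry: for each Recommendation, who depends on it?
    dependents = defaultdict(set)
    for spec, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(spec)

    # Breadth-first walk up the dependency chain.
    seen, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for spec in dependents[current]:
            if spec not in seen:
                seen.add(spec)
                queue.append(spec)
    return sorted(seen)


if __name__ == "__main__":
    print(affected_by("Namespaces 1.0", DEPENDS_ON))
```

A report of this kind could accompany each proposed erratum or deprecation, so that the affected WGs are notified before the change is published.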

The problem of documentation quality can be summarized by the question "Who knows what the WG meant when they wrote that?" Vendor participation in a WG helps ensure that the WG's product is implementable and meets a real need, but the Vendor member may forget to note interpretation details that should appear in the Recommendation. Test Labs will only be credible if they can assure their audience that they know what the WG meant, but that should not require WG membership. Of course, the work of a WG would be enhanced by having software testing experts on the panel, but all members should feel responsible for eliminating vagueness from the final product. A Lab that is doing its testing work properly can rely only on the published documents. We suggest a free-press model, where any Lab can enter the business of testing Vendor implementations against the W3C Recommendations (plus errata and published official interpretations) and reporting those results to their clients or the public. Allegations of nonconformance should be discussed among Builders, Vendors, and Labs, and any party can submit a question of interpretation to the appropriate email address.

What QA approach should W3C promote?

We advocate more responsibility for WGs regarding understandability and testability of their Recommendations. At the content level, this could mean an expectation for more examples, a more exacting delineation of cases, or sanctioning specific kinds of charts or diagrams for normative use. Issuance of a test suite by a WG seems ambitious, but is probably a worthy experiment. Since conformance testing has a perennial completeness issue, the expectation level of any such experiment should be limited. Indeed, conformance test suites prove nonconformance much more readily than they prove "full" conformance. The Activity Charter should indicate whether responsibility for a test suite rests with the WG developing the Recommendation or with another WG building the test suite, or whether the Activity will proactively work with OASIS or another body for external construction of a test suite.

Most Recommendations are essentially addressed to Vendors. The Builder may need to understand the document, but they can also turn to other instructional resources. The overall success of the interlocked XML Recommendations depends first upon Vendors seeing the benefits of W3C participation as contrasted to pushing proprietary formats, and second upon their correct understanding of the Recommendations they will implement. In some cases, the Vendor is also a Builder, if their software relies upon correct behavior of another Vendor's XML software module. There should be a defined process by which WGs resolve interpretation differences between Vendors that are WG members. In such cases, Vendor members (strict definition needed) must not vote on the resolution, but should participate in the discussion as advocates of a particular alternative.

Assuring quality of interoperability between modules implementing different Recommendations is a vastly larger challenge. Perhaps the Builders can help set the expectations. For documentation standards, there could be a requirement that each Recommendation describe its interoperability goals and where it fits in the W3C vision of integrated systems. The W3C Team could assist at that level, and perhaps define some examples of expected integration. Actual testing of an integration scenario would probably best be undertaken by Labs rather than WGs or W3C staffers. The W3C could endorse one or more Labs to do this, but should establish a special communication channel or liaison to each affected WG if they do so. The purpose of the special channel is to ensure that all questions of interpretation are handled expeditiously, and that the decision causes appropriate impact on all the WGs and their respective documents.

What test components should W3C promote/create/endorse?

QA specialists employed by W3C should serve as resources for defining appropriate testing materials. The test tools and example documents required vary by the classification into which a Recommendation falls. A testability review of a CR should expose boundary cases and error cases as well as identify the broadening effects of related Recommendations and non-W3C standards referenced as normative. If a WG wishes to issue a test suite with their Recommendation, they should be encouraged to do so. We hope that a typical instance of such a suite would contain mostly "atomic" cases, each testing a single testable sentence and referencing that sentence via XPointer. If the Recommendation specifies precise or named exceptions, common cases that would cause these exceptions should be included in the suite. The next level is what we call "molecular" tests, which test the interaction of multiple testable sentences. While the WG should account for all such interactions in their Recommendation, they probably won't have the resources to generate the associated test cases. This is the point where an OASIS Technical Committee (TC) can contribute.
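The distinction between atomic and molecular cases can be made mechanical if every test case records which testable sentences it cites. The following Python sketch shows one possible way to do that. The class ConformanceCase, the function sentences_without_atomic_case, and the pointer strings such as "rec#s1" are all hypothetical; the pointers stand in for XPointer references and are not real XPointer syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConformanceCase:
    name: str
    cites: tuple  # pointers to testable sentences (hypothetical identifiers)

    def __post_init__(self):
        # Every case must cite at least one testable sentence.
        if not self.cites:
            raise ValueError(f"{self.name} cites no testable sentence")

    @property
    def kind(self):
        # One cited sentence makes an atomic case; several make a molecular one.
        return "atomic" if len(self.cites) == 1 else "molecular"


def sentences_without_atomic_case(sentences, cases):
    """List the testable sentences that no atomic case exercises on its own."""
    covered = {case.cites[0] for case in cases if case.kind == "atomic"}
    return [s for s in sentences if s not in covered]


# Hypothetical catalog of testable sentences and submitted cases.
SENTENCES = ["rec#s1", "rec#s2", "rec#s3"]
CASES = [
    ConformanceCase("case-001", ("rec#s1",)),
    ConformanceCase("case-002", ("rec#s2",)),
    ConformanceCase("case-003", ("rec#s1", "rec#s3")),
]

if __name__ == "__main__":
    print(sentences_without_atomic_case(SENTENCES, CASES))
```

A report like this says nothing about completeness of the molecular cases, but it does show at a glance which sentences still lack even a single isolated test.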

Direct an OASIS TC to think about completeness of the test suite, and you will get a good review of the testability of the Recommendation from a group acting at an appropriate distance from the WG. The TC might have delegates from several Vendors, just as the WG does, but would be required to operate in a vendor-neutral fashion. The TC may issue a test harness or validator. If the testing approach of having an on-line validator is desired, that is best left to a Lab (NIST, for example). Lotus has some highly-evolved testing tools that they might make available. Note that persistent serialized data ("file formats") require permanent availability of validators for each version of the Recommendation.

What education programs about QA should W3C conduct?

We certainly want the W3C to explain the vision of a new Web (and Internet) that is built on interoperable XML tools. There are many publishers, traditional and on-line, who will help carry the message. In selling the idea of modular tools based on Recommendations that are modular, we all must explain how one module's reliance on the behavior of another drives the need for conformant behavior. Notice that the message about QA and conformance is complementary to another message about architectural coordination. The WGs may mainly hear about coordination, but the Builders want both the best functionality and minimal problems with "version skew" among the tools available to them.

The W3C role as a defender of accessibility and internationalization plays to a different audience than the QA work anticipated here, since the QA improvements mainly address the interests of Vendors rather than End Users of the Web. Vendors want to achieve good conformance so that Builders will be willing to use their modules, expecting interoperability with other tools.

When the situation is stable, W3C should be presenting its view of the roles of other parties. OASIS could be identified both as an arm's-length test-development group, and as a central player in the development of XML vocabularies for specific disciplines. W3C should define and explain a separate role for testing-and-reporting Labs, leaving OASIS to be a service to both Labs and Vendors, representing the interests of Builders.

Should W3C certify test suites or even implementations?

An official "seal of approval" from W3C could be worth a lot of money. Therefore, denial of such an approval could be the basis for a lawsuit. Since it's so hard to determine when a test suite is a complete test of every part of the Recommendation, it's hard to grant a certificate of "full" conformance. It's easy for the Vendor who gets denied to find evidence that the Recommendation admits multiple interpretations. For these and other reasons, we oppose such an undertaking by the W3C. The W3C and its working groups should be neither an adjudicator of a Vendor's conformance nor a witness in court.

At this stage of W3C's growth, the possible political pressures about certification of Vendor products would complicate any attempt to make Recommendations more clear and readily testable. Keep in mind that some Vendors respond to an adversarial relationship by aiming to isolate and lock in their customers. We should avoid pushing Vendors into an adversarial position with respect to the W3C.

The earlier reference to W3C endorsement of Test Labs should not extend to certifying Labs. There may be a need to contract out some work, or determine that a particular Lab is competent to perform some work, but certification implies that the Lab gives normative interpretations of the Recommendations they test. All focus should be on proper specification work coming from the WGs, retaining the WG obligation to reduce the need for interpretations and errata.

What does IBM need from W3C?

Issuance of Recommendations continues to be the primary activity we expect from the W3C. Recommendations are proliferating because of the W3C's overall design for a more powerful Web, and we expect W3C Team members to continue to explain these new developments to the public. As a consequence, individual WGs will have more concerns of interoperability and coordination. Improving quality will bring more pressures on the WGs, but we still need the WGs to move toward Recommendation status at a pace that earns respect from the commercial world.

We want more reliable and objective ways to determine conformance. The first charge to the WGs is for clear and testable wording in their documents. At least by the CR stage, the WG should be thinking about how the provisions will be tested, just as WGs think about how Vendors will implement the provisions in their software. We don't insist that either working implementations or a test suite should be exit criteria every time, though they are always desirable because their creation will assist in clarifying the documents. Test suites can be provided by OASIS TCs instead of the W3C. Further downstream, running of tests and reporting results should be the domain of the Labs, not of OASIS or the W3C.

Appendix 1: Experience in assessing XSLT conformance

The author is a member of the OASIS TC on XSLT/XPath Conformance Testing. Our committee found that while the two Recommendations contain many testable sentences, there are many other aspects that are specified in less obvious ways. Even for the testable sentences, there was no system for pointing at them, which we wanted test cases to do. At present, we plan to experiment with an ungainly XPath-style reference to text fragments.

We are collecting test cases from Vendors, including some who are not represented on the committee. A separate paper describes how those cases are consolidated into a test suite, and (more interesting) how they are filtered. One reason for filtration is when we can't tell for certain that a particular behavior is required by the Recommendation.

Another aspect of test selection involves approximately 47 instances in which the Recommendations only limit behavior to a set of alternatives. Fortunately, there are only two choices in most of those situations. Since the Recommendations did not collect them into lists, we had to take time to find them, debate their nature, and generate our own catalog of discretionary choices granted to the Vendor. We also plan to create a Vendor questionnaire, asking what choices the Vendor made on each of the 47 items. We hope that all Vendors will answer in a public way, so that Builders and Labs will all have access to the same information.
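To show how questionnaire answers could drive test selection, here is a minimal Python sketch. It is an illustration only, not the committee's actual harness. The item identifier "item-07", the choice labels, the case names, and the function applicable_cases are all hypothetical. Each case declares which discretionary choices it assumes, and a case is run only when the Vendor's stated choices match.

```python
def applicable_cases(cases, vendor_answers):
    """Keep the cases whose assumed discretionary choices match the Vendor's answers."""
    selected = []
    for case in cases:
        requires = case.get("requires", {})
        # A case with no discretionary assumptions applies to every processor.
        if all(vendor_answers.get(item) == choice
               for item, choice in requires.items()):
            selected.append(case["name"])
    return selected


# Hypothetical cases; "requires" maps a discretionary item to the choice assumed.
CASES = [
    {"name": "case-101", "requires": {}},
    {"name": "case-102", "requires": {"item-07": "signal-error"}},
    {"name": "case-103", "requires": {"item-07": "recover"}},
]

# Hypothetical questionnaire answers from one Vendor.
VENDOR_ANSWERS = {"item-07": "recover"}

if __name__ == "__main__":
    print(applicable_cases(CASES, VENDOR_ANSWERS))
```

In this sketch a Vendor who leaves an item unanswered is excluded from every case that depends on that item, which is one more reason to hope that all Vendors answer publicly.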

The discretionary items should not be confused with the numerous gray areas in the XSLT and XPath Recommendations. We have reported these to the WG and many (but not all) have been clarified in subsequent errata. The test harness and test-case annotation are designed to account for errata levels when testing conformance.
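One way a harness might account for errata levels is sketched below in Python. This is an assumption about the design, not a description of the committee's harness. We assume that errata levels are numbered with increasing integers and that a case annotation records the expected output that comes into force at each level. The names expected_output, CASE, and the file names are hypothetical.

```python
def expected_output(case, errata_level):
    """Return the expected result in force at the given errata level, or None."""
    # Consider only the annotations introduced at or before the requested level.
    in_force = [level for level in case["expected"] if level <= errata_level]
    if not in_force:
        return None
    # The most recent applicable annotation wins.
    return case["expected"][max(in_force)]


# Hypothetical case whose expected output changed when erratum level 2 was issued.
CASE = {
    "name": "case-200",
    "expected": {0: "out-original.xml", 2: "out-erratum.xml"},
}

if __name__ == "__main__":
    for level in (0, 1, 3):
        print(level, expected_output(CASE, level))
```

With annotations of this kind, the same suite can be run against a processor that claims conformance to the original Recommendation and against one that claims to incorporate later errata.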

Despite the existence of an XML Canonicalization document, there is additional work required to create a vendor-independent framework for comparing outputs of various XSLT processors. Test cases should come with an allegedly-correct output, which the TC members will review as they review each case.
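For XML outputs, canonicalization removes some of the differences that do not matter, such as attribute order and the spelling of empty elements. The Python sketch below uses the standard library function xml.etree.ElementTree.canonicalize (available since Python 3.8), which implements C14N 2.0 rather than the canonicalization document available when this paper was written. The function name same_xml and its ignore_whitespace parameter are our own illustration. The sketch covers only well-formed XML results; HTML and text outputs, which XSLT processors also produce, need separate handling, which is part of the additional work mentioned above.

```python
from xml.etree.ElementTree import canonicalize


def same_xml(actual, expected, ignore_whitespace=False):
    """Compare two XML documents (given as strings) after canonicalization."""
    # strip_text trims leading and trailing whitespace in text content;
    # whether that is acceptable for a given case is a policy decision.
    options = {"strip_text": ignore_whitespace}
    return (canonicalize(actual, **options)
            == canonicalize(expected, **options))


if __name__ == "__main__":
    # Attribute order, quote style, and empty-element form are normalized.
    print(same_xml('<a b="1" c="2"/>', "<a c='2' b='1'></a>"))
    # A difference in content is still reported.
    print(same_xml("<a>1</a>", "<a>2</a>"))
```

A reviewer would still need to decide, case by case, whether whitespace is significant in the allegedly-correct output before enabling the whitespace option.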

With all the above work, review of completeness of the consolidated test suite remains a distant activity. Like the Vendors, we need to study how the interactions of various XSLT instructions spawn multiple test cases. At this time, we can only assure the Test Labs that our test suite will give a processor a rigorous exercising, not necessarily a "complete" one.

Copyright 2001 International Business Machines Corporation. All rights reserved.