Copyright © 2008 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
Over the last two years and several workshops of the Semantic Web Services Challenge (SWSC), we, as a community, have discussed and experimented with the best way to evaluate technologies for the mediation, discovery, and composition of web services, and to understand the trade-offs among the various technical approaches. Below, the reader will find some basic principles and fundamental methodologies that are recommended. In addition, there is an extensive discussion of the evaluation issues leading to what we recommend as best practice.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of Final Incubator Group Reports is available. See also the W3C technical reports index at http://www.w3.org/TR/.
This document is the SWS Testbed Incubator's Final Report on best practices for a methodology for evaluating the efficacy of various techniques for mediation, discovery, and composition of web services, such techniques including software engineering approaches as well as semantic annotations.
These best practices are based upon two years of experience with five workshops and one year of discussion and meetings on this subject by the members of the SWS Testbed Incubator. These practices will continue to be refined in future workshops, but we would recommend that any workshops attempting such evaluation give these recommendations serious consideration.
The main website of the SWS Challenge should be consulted for much background information.
This document was developed by the W3C SWS Testbed Incubator Group.
Publication of this document by W3C as part of the W3C Incubator Activity indicates no endorsement of its content by W3C, nor that W3C has, is, or will be allocating any resources to the issues addressed by it. Participation in Incubator Groups and publication of Incubator Group Reports at the W3C site are benefits of W3C Membership.
Incubator Groups have as a goal to produce work that can be implemented on a Royalty Free basis, as defined in the W3C Patent Policy.
This document targets Web Service developers who deal with re-use of these services, who are interested in semantic annotations that may facilitate such re-use, and who therefore want to know how to evaluate the technologies that claim to do so.
This document is also available in a living non-normative format: a Wiki file.
This document aims at establishing normative best practices for evaluating technologies for mediating between, discovering, and composing web services with respect to their software engineering efficacy.
Discussion of this document is invited on the public mailing list public-xg-swsc@w3.org (public archives). Public comments should include "[SWSC-Methodology]" as subject prefix.
After reading this document, readers may also be interested in related issues as presented in the SWS Challenge Wiki. They are invited to join this initiative by registering at the wiki and participating in forthcoming workshops, either by submitting standalone papers or by having their technologies evaluated, or both.
The general SWSC approach is that:
The "Semantic" in the initiative title refers to the hypothesis that if more of the problem and solution semantics are made declarative and machine-readable, then programmer productivity for correct programs can be improved. SWSC workshop participants are expected, but not required, to formalize aspects of the SWSC scenarios in order to increase programmer productivity. Formulations should be shared and we hope that more successful ones will be re-used by other participants, which results in an informal measure of semantic success. Formalizations include annotations to more fully describe the web services in the scenarios as well as the problem domain.
As an informal evaluation of approaches, the more participants adopt parts of a technical approach, the more apparent it becomes that the approach is particularly general and valuable.
One of the important goals of the SWSC is to develop a common understanding of the various technologies fielded in the workshops. So far, the approaches range from conventional programming techniques with purely implicit semantics, to software engineering techniques for modeling the domain in order to more easily develop applications, to partial use of restricted logics, to full semantic annotation of the web services.
We hope and expect that the results of the SWSC experiments will be replicated by other initiatives and that the current one will become a reference implementation. Toward that end, we abstract here the SWSC methodology as we have developed it by the end of 2007.
Infrastructure:
Workshop Agendas:
Publication of Results:
Evaluation results are publicly posted on the initiative website and certified by the consensus of the workshop in which they were made.
Apart from these fundamentals, there are many issues about code evaluation that require discussion in order to understand our best practice recommendation.
We tried many variations of evaluation methodologies during 2006-2007 and changed our evaluation design in response to observed difficulties and inadequacies.
Initially, we attempted to do this by requiring that each participant submit their code after solving the first problem in each scenario. Only then, after this "code freeze", were they given the passwords to access the descriptions of the problem variations within a scenario. At the workshop, the frozen code and the final code were compared.
We found that this was very difficult to enforce and execute, and ultimately impractical.
In subsequent workshops, we tried giving everyone access to all of the problem variations. However, this obscured how difficult changes were, since everyone could write code for all of the variations from the start. It was also unfair to those who had participated in the original code freezes. This approach was not workable either.
In Workshop 5, we prepared a "surprise" problem. Participants were evaluated on their solution to this problem the following day. This seems feasible for some of the participants, but as a full-blown test, will require something like a code freeze for Workshop 6, if only a few days before the workshop. This will be the basis of the recommended methodology below.
We initially tried to rank the submissions in difficulty of moving from one problem level or sub-level to another by trying to determine whether code was changed that would necessitate a re-compilation and linking, or whether there was only a change to the declaration of objects upon which the code acted. Further, we wanted to distinguish between whether the current declarations had to be altered, or whether new declarations were simply added. We found that these distinctions could not be made objectively. For example, if someone is writing in Lisp, there is no objective difference between declarations and code. XML schemas and Java present similar though less extreme problems.
We have since resorted to a collective consensus on simply whether code or declarations have been changed as a measure of difficulty in moving from one level solution to another. This has been particularly challenging in approaches where solutions are synthesized by arranging software components in a graph with a GUI. One consideration has been whether changing the graph requires a re-compilation and linking, producing new code, or whether the graph is essentially a declarative input to an engine whose code never changes, only its behavior. This is a valid consideration, but the collective consensus evaluation has not been entirely satisfying to all of the participants.
Most recently, we decided just to put check marks in the evaluation table to indicate verification of a correct result, along with footnotes that annotate this result, by consensus. However, if the surprise problem methodology can be made to work, then we will have a plus associated with the problem variations to indicate that the workshop decided that the problem change was handled with minimal programmer effort.
This procedure again has unfortunate subjective aspects, which may vary from workshop to workshop. The participants must determine that the surprise variations were accomplished largely because of the technical method, rather than superior programming capabilities. However, this kind of software engineering evaluation seems to require such a judgment, and after two years we have not found any completely objective method.
There is one criterion that does help make the evaluation more objective and reduces the necessity of code inspection: the surprise problem variations are constructed so that they are backwards compatible with the ground problem. Thus we can require that any surprise problem variation solutions also work with the ground problem. One objection to this is that this is not representative of many real-world problems. However, it does allow us to evaluate objectively that the surprise variation solutions are an incremental improvement over the ground problem. If the participants can accomplish this in a limited time, then this is confirming evidence that the result is not just the result of a clever programmer who can solve any problem from scratch, though this possibility can never be ruled out.
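The backward-compatibility requirement can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: problems are modeled as lists of request/response pairs, a solution is any callable, and the names `passes`, `is_backward_compatible`, `GROUND`, and `VARIATION` are hypothetical. The messages are not taken from the SWSC scenarios.

```python
from typing import Callable, List, Tuple

# Hypothetical model: a problem is a list of (request, expected response) pairs,
# and a solution is any callable that maps a request to a response.
Request = str
Response = str
Problem = List[Tuple[Request, Response]]
Solution = Callable[[Request], Response]


def passes(solution: Solution, problem: Problem) -> bool:
    """Return True if the solution gives the expected response for every request."""
    return all(solution(request) == expected for request, expected in problem)


def is_backward_compatible(solution: Solution, ground: Problem, variation: Problem) -> bool:
    """A variation solution counts only if it still solves the ground problem."""
    return passes(solution, variation) and passes(solution, ground)


# Illustrative data only; not taken from the SWSC scenarios.
GROUND: Problem = [("order:1 item", "confirmed")]
VARIATION: Problem = [("order:2 items", "confirmed"), ("order:0 items", "rejected")]


def incremental_solution(request: Request) -> Response:
    """Handles the ground case and the variation cases with one rule."""
    count = int(request.split(":")[1].split()[0])
    return "confirmed" if count > 0 else "rejected"


def rewritten_solution(request: Request) -> Response:
    """Solves the variation but no longer handles the ground case."""
    table = {"order:2 items": "confirmed", "order:0 items": "rejected"}
    return table.get(request, "error")
```

In this sketch `incremental_solution` satisfies the requirement, while `rewritten_solution` solves the variation but fails the ground problem and so would not count as an incremental improvement.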
We tried out the beginning of the surprise methodology at Workshop 5 at Stanford. We had a new scenario and, for the participants who achieved it, we gave them some variations on the scenario to program overnight. Those who succeeded, by passing the correct messages to the infrastructure, got a "plus" by their check mark. This, along with code evaluation to assure participants of the validity of the code, was our essential software engineering evaluation, and most of the discussion here elaborates on that process to make it as good as possible.
We want to make sure that we are doing good science, so something like the repeatability requirements of SIGMOD is certainly relevant. Since we (the SWSC) determine the common problems, we don't need to repeat the code execution completely to verify it. However, we probably do need to ensure some sort of repeatability.
As a standard methodology, we could require that participants submit relevant portions of their code either as an appendix to their papers or, as before, as ftp submissions to a "cold locker". Then, n days before the workshop, they are given the passwords to access the problem variations. At the workshop, their success in solving the problem variations is verified and evaluated.
However, a strong argument against this procedure is that previous experience has indicated that participants find such a requirement extremely difficult to achieve and the workshop ends up "fudging" on the cold locker deadlines. Thus we currently recommend against this. It does mean that we have to depend more upon the honesty of the participants. So far, this seems viable. In the future, as the SWSC scales, we may have to go back to a cold locker approach.
Keeping the problem variations secret allows new participants to be evaluated on the same basis as previous ones. It also makes it harder for the public to see what the evaluations mean, as they cannot see the actual problem variation details. Nevertheless, this seems so far to be the best method given our experience. Requiring backward compatibility also reduces the chances of cheating by participants, who have been quite honest to date in any case.
We do want to deploy standards or specifications that are widely-used in industry in order to be relevant to industry.
We have started with three WSDL Web Services simulating a client trying to purchase goods using the RosettaNet protocol. Taking into account different versions of services and the mediation systems that have been implemented to test the system, we are at present operating around 20 different Web Services. The implementation of these web services is layered on top of standard middleware such as Axis2 and Tomcat.
The complexity of the messages used has revealed several bugs in the implementation of the Axis2 engine, which meant that major resources were spent on the underlying technologies rather than purely on the business problem.
In fact it turns out that a variety of skills is required to master such a testbed. First, in-depth knowledge of WSDL and XML Schema is needed to design proper service descriptions that make the most of the descriptive power of the standards. Most obviously, some knowledge of a web service engine (such as Axis2) and the underlying application server (such as Tomcat) is required, as well as a fair amount of database design and web application programming skills. It also turned out to be necessary to understand a good deal about the Internet Protocol and firewalls in order to help participants manage their invocations. And, last but not least, such an infrastructure requires monitoring facilities that guarantee a 24/7 live system, which is not the usual approach in a university or research environment.
Effectively it demonstrated that in spite of the fact that Web Services are an established technology, current tools are only able to hide a small degree of the underlying complexity. As soon as we reached some border case, understanding of underlying protocols and standards was essential.
We will consider adding REST services to the scenarios because they seem to be used increasingly, and because not as many middleware layers exist yet for this type of service. It remains to be seen whether this will be true in the future.
However, one of the issues that has caused the most trouble is the large RosettaNet messages. We maintain that this is an industry standard that has to be handled, and if it is causing a problem, then our SWSC has been successful in raising this problem. We will certainly consider less difficult scenarios, but the scenarios with these large messages remain an important part of the challenge and we recommend that other challenges also contain such problems.
In summary, the SWSC has already highlighted problems with the current industrial web services approaches. We hope this leads to more elegant solutions and simpler tools in the future, whether built by industry, pure academia, or some of the academia-based small companies currently participating in the SWSC.
Besides the technical challenge, we realized another important point. We decided not to formalize the problems using a logical formalism, but rather to describe them using natural language documentation. Having had to communicate with developers as well as participants, we conclude that having only text-based documentation as a common model is suboptimal. We realized that a fair amount of the solution to a problem is its formal description. In fact, had we had such descriptions from the start, we could have saved several iterations of discussion with developers.
However, it is not at all clear whether any one formalism for expressing the problem can be chosen that would be fair to all participants and perhaps would not promote a particular solution. For now, we recommend that the problem descriptions be in English, which also promotes new problem definitions.
Having effective means to share information between the organizers and the participants is another important aspect of a successful challenge. We started with a set of static web pages; however, it was soon clear that this was suboptimal. A Wiki that enables corrections and improvements to the documentation in a collaborative fashion turned out to be much more adequate. While this improved the efficiency of the discussions around the different problem sets, it turned out not to be enough to share descriptions of the solutions between participants.
Like the problems, the solutions come with a fair amount of complexity. In order for a team to participate, we required it to publish the declarative parts of the team's solution on the Semantic Web Challenge Portal. A Wiki did not provide sufficient means to share such complex structures, so in addition we created FTP accounts. However, this turned out to be suboptimal: while it made it possible to understand and verify a particular solution, the links between a solution's description in the papers submitted to the workshops, the related discussion on the Wiki, and the relevant parts of the solution's declarative description are too poorly integrated. We assume that this is one of the reasons why so far participants share only a very limited amount of their formalizations.
A best solution for this issue has yet to be found.
Another aspect of involving real Web services is the possibility of automatically verifying a solution by issuing a set of different messages and monitoring the subsequent message exchanges. This is a useful feature, since it makes the challenge more scalable with respect to the number of participants. Moreover, it allows teams to participate not only during workshops, but also at any other time, just by exposing their Web Services. Other people interested in the claims of a team can use the online portal to start a test set against a particular solution and verify its coverage.
Debugging support is very important. Already with six teams it was quite often necessary to examine the application server's log, be it to find a typo in the endpoint addresses used in a mediator implementation, or to identify an invalid message. Over time we added different views to the online portal that allow users to examine parts of the message exchange and in particular the status of the systems involved. It is clear that the users, at a minimum, must be provided with log access.
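As a rough illustration of automated verification by monitoring message exchanges, the following Python sketch checks that every expected message appears in an observed log in the expected order. The `Message` type, the `verify_exchange` function, the message names, and the choice to tolerate interleaved unrelated messages are all assumptions made for this example; they do not describe the actual SWSC infrastructure or its acceptance criteria.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    kind: str  # hypothetical message type label


def verify_exchange(observed: Iterable[Message], expected: List[Message]) -> bool:
    """Return True if every expected message occurs in the observed log, in order.

    Unrelated messages may be interleaved (an assumption of this sketch);
    only the relative order of the expected messages matters.
    """
    log = iter(observed)
    # `any` consumes the log up to the first match, so each later step
    # has to be found strictly after the previous one.
    return all(any(seen == step for seen in log) for step in expected)


# Hypothetical expected exchange for a purchase-order mediation scenario.
EXPECTED = [
    Message("client", "mediator", "PurchaseOrderRequest"),
    Message("mediator", "backend", "CreateOrder"),
    Message("mediator", "client", "PurchaseOrderConfirmation"),
]
```

A stricter checker could require an exact match of the whole log; the relaxed ordering here is only one possible design choice.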
We are evaluating the software engineering advantages of the technologies of the participants. We could measure the time participant teams take to make a change but this might focus more on programming skills than on technologies. The approach below emphasizes making minimal changes to a particular programming solution in a very limited time.
Participants can choose to attempt to solve any of the "ground" problems, the details of which are publicly available on the SWSC Wiki. If the solution is verified, the participating team gets a check-mark by this problem. Problems for which surprise variations are available are noted on the Wiki problem description.
All verification is performed by SWSC staff examining the message exchanges to see if they are correct.
A short time before each workshop, participants who request them may receive the details of the surprise variations. This length of time needs to be more precisely determined, but will typically be no longer than two days.
In order to get a plus mark in each problem variation, participant teams must be able to solve the variations at the time of the workshop.
We require that the solutions be reversible to the original problem. This is not only verified by testing at the workshop, but also by consensus inspection of the code. The general procedure will be:
This requirement on ground and surprise problem solutions lessens the importance of consensus code inspection. It also emphasizes the value of software that works from a high-level specification of the goals and constraints associated with the problem description. It is the job of whoever specifies a scenario variation to ensure that problem reversibility is indeed possible.
Any participating team may submit new problems and variations. However, their evaluations will receive only check marks rather than plus marks, indicating that they were the authors of the problems.
We want not only to roughly evaluate the SE aspect of solutions, but also to evolve an understanding of the various technologies and to encourage re-use of them, building toward "best practices" wherever possible. This is where the challenge so far has been weakest. We recommend mandatory restrictions on participation, as follows.
Participants must furnish documentation on their solutions. When we discuss "solution", above, we often refer to the code and/or problem representation. The documentation of a solution should not be all of the technology required for the solution but it should be beyond a simple paper, and have the following qualities:
Such documentation should be furnished prior to the workshop and should be discussed at the workshop with consensus on its efficacy. Participants who do not furnish documentation considered useful by the consensus will receive a minus mark, indicating such, in their evaluation.
When such documentation is re-used by another team (other than the one that originally provided it), the providing team will receive a plus mark for each re-use by another team. This is a strong indication that this technology is generally valuable and may be a "best practice".
The above represent our best practice recommendations after two years of workshops and one year of incubator discussion. These recommendations are not unproblematic and several participants see issues with them, most of which are documented above. As the SWSC workshops continue, there may be a further revision of these recommendations.
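The marking rules described in this report (check marks for verified ground problems, plus marks for surprise variations, check marks only for problem authors, and minus or plus marks for documentation) can be summarized in a small Python sketch. The `TeamEvaluation` class, its field names, and the mark strings are hypothetical. In particular, how a documentation minus mark interacts with re-use plus marks is not specified in this report, so the sketch simply assumes the minus takes precedence.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class TeamEvaluation:
    team: str
    verified: Set[str] = field(default_factory=set)           # ground problems verified
    variations_solved: Set[str] = field(default_factory=set)  # surprise variations solved
    authored: Set[str] = field(default_factory=set)           # problems this team submitted
    documentation_useful: bool = True                         # workshop consensus
    documentation_reuses: int = 0                             # re-uses by other teams

    def problem_mark(self, problem: str) -> str:
        """Check mark for a verified problem, with a plus for a solved variation."""
        if problem not in self.verified:
            return ""
        # Authors of a problem receive only check marks, never plus marks.
        earns_plus = problem in self.variations_solved and problem not in self.authored
        return "check+" if earns_plus else "check"

    def documentation_mark(self) -> str:
        """Minus if the documentation was not judged useful, else one plus per re-use."""
        if not self.documentation_useful:
            return "-"  # assumption: the minus overrides any re-use pluses
        return "+" * self.documentation_reuses
```

For example, a team that verified a mediation problem and solved its surprise variation would get `check+` for that problem, while the team that authored the problem would get `check` at most.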
Holger Lausen and Michal Zaremba at the University of Innsbruck performed most of the design and implementation of the SWS Challenge that has led to this result. Professor Amit Sheth, Director of the Knoesis Center at Wright State University, supported this incubator activity. This project was supported in a major way by the Logic Group at Stanford University led by Professor Michael Genesereth and by the Semantic Technology Institute Innsbruck directed by Professor Dieter Fensel.
All of the members of the SWS Testbed Incubator reviewed this report. In addition to those mentioned above, these included: