Arch/Test Cases - W3C RIF-WG Wiki

This is a work-in-progress page for the issue of RIF test cases.

[The idea of such pages is that they should pin down what the problem is (Framing), explore what RIF might or might not do about it (Discussion), then when some clear picture starts to emerge we can draft text which might go in the final document (Draft text).]

Framing

Test cases can play a number of possibly distinct roles within RIF. We need to identify those roles, decide what to say about them in the architecture document, decide how such test cases should be represented and maintained and (eventually) actually develop some.

First, there is a role for test cases in order to guide the development of the RIF architecture and various RIF dialects (design test cases). For example "lots of relevant languages can express rules such as X how would X be expressed in this RIF dialect?". This does not feel like an issue for the Architecture document, though perhaps is an issue we should be more actively addressing.

Second, there is a role for test cases as part of the specification for a RIF Dialect (RIF Dialect test cases). These would enable a RIF implementer to better understand the spec and to test their implementation.

Will we provide such test cases at all?

Are they intended as a conformance suite (pass all the tests and you conform) or as an informative guideline and implementer's aid?

If RIF Dialects have profiles what role do the test cases have in delineating such profiles?

There are many sorts of RIF tools, should RIF define the space of such tools and have different test cases for different tool types or can the test cases be independent of the application model?

Discussion

Design test cases

Several people have pointed out that our UCR document lacks grounded example rules in specific target rule languages. Whilst this was a deliberate choice for that document, a library of specific rules examples could be a useful resource in helping us develop the RIF Dialects, the associated libraries of builtin operations and perhaps the RIFRAF classification scheme. The most concrete example of this sort of usage to date has been the append example used in several email threads to unpick the question can production rules reasonably implement the sort of recursion required for RIF Core.

However, this does not seem like an issue to cover in the architecture document and so won't be discussed further here.

RIF Dialect test cases

It seems to me that we do want to have concrete test cases to accompany each dialect in order the help implementers develop and test their implementation. The better test coverage we have the more likely tools are to interoperate correctly.

Conformance suite or informative?

We've have not yet agreed what it means to conform to or implement RIF or a given RIF dialect so it is hard to be precise about the role of test cases yet.

Looking at similar W3C working groups then in RDF there is no notion of an RDF processor or a conformance statement for such a thing. There are a set of (normative) test cases but these document the resolution for each issue addressed by the working group rather than claim comprehensive coverage. For example, the parser tests do not cover every grammar rule.

In OWL then the notion of document conformance is defined along with two categories of tool (document syntax checker and document consistency checker). The conformance statements are given in terms of the semantics. The test suite is fairly comprehensive and a useful development aid but whilst the test suite itself is normative the use of these for testing an OWL processor is only informative - i.e. again there is no claim of comprehensiveness of the suite.

The RIF charter requires us to deliver test cases which reflect issue resolution and which aid in conformance evaluation which seems very much in the spirit of the above examples. Any notion of processor or document conformance should be defined independently of test cases. We should provide a suite of test cases which cover any important issues or corner cases that come up during the design work as an aid in testing but should make no claims of any form of completeness. It would not be a conformance test suite.

Categories of RIF Dialect test cases

There will be a variety of RIF tools with different notions of what needs to be tested. For example:

a conforming translator which can translate rulesets in some concrete rule language (L) into a specific RIF Dialect (D) and back again, it might only deal with some subset of the RIF dialect (but it should handle that subset correctly)
a RIF implementation, a conforming translator which can translate any ruleset within the RIF dialect (D) into its native form (L), and back again, preserving its semantics
a rule editor which should be able to emit syntactically correct RIF rules but need have no execution component

Now our test cases should be independent of L. We don't want to get into the business of defining test cases for specific target languages. So what form could our test cases take?

Firstly, note that testing syntactic correctness of the output of an L -> D translation does not require tests cases from RIF. We simply provide the grammar (XML Schema or whatever turns out to be normative) for D and an implementer can check that their translated output parses correctly.

So it seems to me that there are two categories of test case that we could potentially provide:

Implementation test cases: these would each comprise a ruleset in RIF Dialect D, a set of test data and an expected set of results. The format of the results will be dialect dependent; for Horn and LP dialects it will be a set of entailments (a query and expected set of query results); for production rule dialects it may be a specification of the final working memory; for reaction rules it may be some partial order of action firings; for integrity rules it may be a pass/fail flag.

RIF implementations could use such rules to check they cover D. Conforming translators could take a subset of the test cases that fall within the space covered by their rule language and verify they can execute that subset. The set of such tests that a partial implementation passes would provide an (informal) profile for what that implementation covers.

GaryHallmark:

It might be instructive to actually try to write a simple test case for RIF Core as it is now. Since we don't know or care what the actual target rule system is, we have to treat the translator and the target rule system together as a black box (BB).

We would like to feed a RIF document containing rules and data into the BB and get some result out.

Issue 1: we have no way to get any result out of the BB.

Solution 1a: add a print builtin to RIF Core, OR

Solution 1b: add queries to RIF Core and standardize the format of the printing of the results

Issue 2: including data in the RIF document is awkward with our current syntax.

Solution: add ground uniterms and ground frames to rulesets

Using Solution 1a and Solution 2, a very simple test document (using human presentation syntax here) would be

testData(1, "fact one")
testData(2, "fact two")
testResult("fact two")
Forall ?X ?Y(&print("pass 1") :- And(testData(2, ?X) testResult(?X)))
Forall (&print("pass 2") :- Exists(?X ?Y And(testData(2, ?X) testResult(?Y) ?X=?Y)))

The correct "result" is a file containing the following lines in some order:

pass 1
pass 2

AdrianPaschk: We might define a set of pre-defined implementation test case functions (and hence allow specific implementations of these test case functions). A test case for a RIF rule program, is implemented as a set of test data (facts) and tests implemented in terms of test case functions.

Here a set of functions which might be useful to test an interchanged RIF rule set:

testQuery(Literal)                                     test the literal (= rule head or fact)
testNotQuery(Literal)                                negatively test the literal with default negation
testNegQuery(Literal)                                 negatively test the literal with explicit negation
testNumberOfResults(Literal, Number)          test number of results derived for the literal = stated value
testNumberOfResults(Literal, Var, Number)   test number of results for the variable in the literal
testNumberOfResultsMore(Literal,Number)     test number of results for the literal > given value
testNumberOfResultsLess(Literal,Number)     test number of results for the literal < given value
testNumberOfResultsMore(Literal,Var,Number) test number of results for the variable in the literal > given value
testNumberOfResultsLess(Literal,Var,Number) test number of results for the variable in the literal < given value
testResult(QueryLiteral,ResultLiteral)              test if the second literal is an answer of the query literal
testResults(Literal,Var,[<BindingList>])           test if the list of binding results for the variable in the literal can be derived
testResultsOrder(Literal,Var,[<BindingList>])   test if the list of ordered binding results for the variable in the literal can be derived
testQueryTime(Literal, MaxTime)                     test if the literal can be derived in less than the stated time in milliseconds
testNotQueryTime(Literal, MaxTime)                 test if the literal can be derived negatively by default in less than the stated time in milliseconds
testNegQueryTime(Literal, MaxTime)                test if the literal can be derived strongly negative in less than the stated time in milliseconds
getQueryTime(Literal, Time)            get the query time for the literal
getNotQueryTime(Literal,Time)                         get the default negated query time for the literal
getNegQueryTime(Literal,Time)                         get the explicitly negated query time for the literal

Examples:

testQuery(p(X))         test success of query p(X)?
testNegQuery(p(X))         test success of query neg(p(X))? (explicitly negated p(X))
testNumberOfResults(p(X),X,10))       test if ten bindings for the variable X of the query p(X)? can be derived
testResult(p(X),p(a))        test if p(a) is an answer to the query p(X)?
testResults(p(X),X,[a,b,c])        test if a, b and c are result bindings for the variable X of the query p(X)?
testResultsOrder(p(X),X,[a,b,c])       test if a, b and c are answer bindings in exactly this order for the variable X of the query p(X)?
testResultsOrder(p(X),X,[a,a,a,b,c,c])       note the order and number of answers
testQueryTime(p(X),1000)          test if the query p(X)? can be derived in less than 1 second
getQueryTime(p(X),T)            get the time to answer the query p(X); output is the variable T
testQueryTime(testResults(p(X),X,[a,b]))          nested test

Concrete Example:

RIF Rule Program:
                 q(?X) :- p(?X)
Test Data:
                 p(1^^xsd:Integer)
                 p(2^^xsd:Integer)
                 p(3^^xsd:Integer)

Test Cases:
                 testQuery(q(1^^xsd:Integer))
                 testResult(q(?X),q(3^^xsd:Integer))
                 testQueryTime(q(?X),1000^^xsd:Long)
                 testResults(q(?X),?X,[1,2,3])

The implementation of the test case functions might be also defined in terms of a RIF dialect (under the assumption of a certain level of expressiveness), e.g. testQuery(?Query) :- External(Builtin(Eval(?Query))), where "Eval" is a built-in to evaluate a query.

Round trip test cases: these would each comprise simply a rule in RIF Dialect D. The set would be designed to illustrate the range of constructs in D and would be accompanied by a specification of structural rule equivalence. The details of the latter will depend on the outcome of the round-tripping discussions but we might expect rules to round-trip up to certain renamings (e.g. of local rule variables), addition of semantically-neutral metadata (e.g. additional authorship or written-by-tool-X annotations) and reduction to a normal form.

Such tests would serve two purposes. Firstly they we be a checklist for translator implementers, giving them examples of well formed D rules which they should be able to translate. Secondly, for any tool which aims to round trip RIF one can take the rules, perform the D -> L -> D round trip and check syntactic equivalence of the result according to the specified algorithm.

It is not clear to me at this stage what we will say about round tripping and whether such test cases make sense. It's also not clear whether all dialects would have a sufficient notion of connonical/normal form to enable us to specify the structural equivalence algorithm usefully. It may be that the only useful way to define equivalence of test rules is going to be in terms of observational equivalence over test data as covered by the implementation test cases.

Proposal for RIF-BLD test suite

For the purpose of developing a set of test cases for RIF-BLD that illustrate issue resolution and aid in conformance evaluation (per the deliverables section of the charter) along with associated documentation, RIF could follow the general example of the RDF tests and OWL tests.

Overview of RDF and OWL tests

The test cases are designed to be machine-processable and to reflect issue resolution, illustrate language features and generally provide enough language coverage to aid in conformance evaluation. But they do not constitute a complete conformance test suite since they do not completely cover the language. The style of test case is like a specification rather than runnable tests: the task of providing input to the rule engine in some system-dependent manner and checking that the results are correct is left to the tester.

The test case repository is organized into subdirectories, each containing tests devoted to one issue or one feature of the language. Each directory contains the files that participate in the tests in a normative format, and also in additional convenience formats. Each directory also contains a manifest file, written in RDF/XML, for each test in the directory. The manifest file is a machine-readable file containing information such as input file(s), result file, text description of the test case, status, author, etc. The RDF WG defined an RDF Schema for these files, and the OWL WG used the same schema augmented with some new properties and types, which are defined in an OWL ontology. Test cases are categorized according to what aspect of the language (e.g. syntactic validity, entailment) they test.

For accumulating test cases, anyone could submit a test case with status of "proposed," the WG would then vote on the case and the status would go to "approved" or "rejected," and might later be changed to "obsoleted." OWL also has an "extracredit" status. The OWL WG had a web-based test editor which could be optionally used for creating test cases.

RIF test case structure

If we were to adopt the above methodology, we would have to decide on the particulars in (at least) the following areas:

Test case categories

Entailment	The given conclusion must be entailed by the given ruleset
Non-entailment	The given conclusion must not be entailed by the given ruleset
Import entailment tests	The premises document imports other documents that affect the entailment
XS to PS mapping	Do we want to have test cases for the stop of translation from RIF XML to presentation syntax?
Round tripping	Test that the translation from RIF to a language L and back results in equivalent RIF documents
Round tripping	As mentioned in the preceding discussion section, we could provide a specification or tool to determine equivalence of two RIF documents, or a RIF implementation could perform D --> L --> D --> L and check that the two L documents get the same results on the test suite
Consistency	The given document must be found to be consistent as defined by the RIF-BLD semantics
Inconsistency	The given document must be found to be inconsistent as defined by the RIF-BLD semantics
Syntactic validity	for things that cannot be checked by schema validation
	RDF has test categories called PositiveParser and NegativeParser. PositiveParser tests specify a graph, which should be produced from the given input document. NegativeParser tests should result in a report that the input document is in error
	OWL does not have specific test categories or cases for syntactic validity. They do define the notion of an OWL Syntax checker and say that a syntax checker takes a document as input and returns one word (Lite, DL, Full or Other) indicating the level of the document

Manifest file schema

What extensions to the RDF test schema are needed, and do we want to use any of the OWL extensions? How will levels (dialects/extensions) be associated with test cases and results?
- in RDF a test:entailmentRules element in the manifest file specifies whether simple, RDF or RDFS-entailment applies.
- in OWL the manifest file indicates a level for each test and each document in each test. Documents can have language levels Lite, DL or Full. Semantic tests can have one or two language levels which are used to indicate one of three semantic schemes. Also, there is an element (otest:feature) to indicate which OWL feature the test pertains to.

Test case file format

Normative versions in RIF XML, with copies in the presentation syntax also provided.

Test case query format

As AxelPolleres suggested in F2F9, we could use SPARQL and RIF condition language to represent queries in test cases.

SELECT A_LIST_X_OF_FREE_VARIABLES
FROM ruleset
WHERE { RIF_CONDITION_WITH_FREE_VARIABLES_IN_X }

Note the RIF condition can be written in either RIF presentation syntax or XML syntax. The results of test queries should be represented in either RIF presentation syntax or XML syntax.

Repository

We need to set up a test web site on http://www.w3.org. Tests can then be added to the directories using CVS access to the W3C CVS server. See the OWL test repository

RIF-BLD test cases

Sources for test cases

The issues list, to illustrate issue resolution
- candidate issues: 8, 24, 25, 26, 28, 30, 31, 33, 34, 36, 42, 43, 44, 45, 46 ?
Use cases
Corner cases, to highlight notable language features, limitations, areas where people think there is ambiguity.
Basic coverage of the language in order to aid in conformance evaluation

Proposed test cases

ill-typed-literal
- Manifest001.rdf premises001.hrif premises001.nt conclusions001.hrif
membership
- mem_Manifest001.rdf mem_premises001.hrif mem_nonconclusions001.hrif
polymorphic-symbols
- poly_Manifest001.rdf poly_incorrect-syntax001.hrif
UseCase-1
- uc1_Manifest001.rdf uc1_correct-syntax001.hrif

Draft text

TBD.