TCS

From RIF
Revision as of 15:10, 21 August 2008 by AdrianPaschke (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

__NUMBEREDHEADINGS__


Document title:
RIF Test Cases Discussion Page (Second Edition)



[The idea of such pages is that they should pin down what the problem is (Framing), explore what RIF might or might not do about it (Discussion), then when some clear picture starts to emerge we can draft text which might go in the final document (Draft text).]

1 Framing

Test cases can play a number of possibly distinct roles within RIF. We need to identify those roles, decide what to say about them in the architecture document, decide how such test cases should be represented and maintained and (eventually) actually develop some.

First, there is a role for test cases in order to guide the development of the RIF architecture and various RIF dialects (design test cases). For example "lots of relevant languages can express rules such as X how would X be expressed in this RIF dialect?". This does not feel like an issue for the Architecture document, though perhaps is an issue we should be more actively addressing.

Second, there is a role for test cases as part of the specification for a RIF Dialect (RIF Dialect test cases). These would enable a RIF implementer to better understand the spec and to test their implementation.

Will we provide such test cases at all?

Are they intended as a conformance suite (pass all the tests and you conform) or as an informative guideline and implementer's aid?

If RIF Dialects have profiles what role do the test cases have in delineating such profiles?

There are many sorts of RIF tools, should RIF define the space of such tools and have different test cases for different tool types or can the test cases be independent of the application model?

2 Discussion

2.1 Design test cases

Several people have pointed out that our UCR document lacks grounded example rules in specific target rule languages. Whilst this was a deliberate choice for that document, a library of specific rules examples could be a useful resource in helping us develop the RIF Dialects, the associated libraries of builtin operations and perhaps the RIFRAF classification scheme. The most concrete example of this sort of usage to date has been the ``append`` example used in several email threads to unpick the question can production rules reasonably implement the sort of recursion required for RIF Core.

However, this does not seem like an issue to cover in the architecture document and so won't be discussed further here.

2.2 RIF Dialect test cases

It seems to me that we do want to have concrete test cases to accompany each dialect in order the help implementers develop and test their implementation. The better test coverage we have the more likely tools are to interoperate correctly.

2.3 Conformance suite or informative?

We've have not yet agreed what it means to conform to or implement RIF or a given RIF dialect so it is hard to be precise about the role of test cases yet.

Looking at similar W3C working groups then in RDF there is no notion of an RDF processor or a conformance statement for such a thing. There are a set of (normative) test cases but these document the resolution for each issue addressed by the working group rather than claim comprehensive coverage. For example, the parser tests do not cover every grammar rule.

In OWL then the notion of document conformance is defined along with two categories of tool (document syntax checker and document consistency checker). The conformance statements are given in terms of the semantics. The test suite is fairly comprehensive and a useful development aid but whilst the test suite itself is normative the use of these for testing an OWL processor is only informative - i.e. again there is no claim of comprehensiveness of the suite.

The RIF charter requires us to deliver test cases which reflect issue resolution and which aid in conformance evaluation which seems very much in the spirit of the above examples. Any notion of processor or document conformance should be defined independently of test cases. We should provide a suite of test cases which cover any important issues or corner cases that come up during the design work as an aid in testing but should make no claims of any form of completeness. It would not be a conformance test suite.

2.4 Categories of RIF Dialect test cases

There will be a variety of RIF tools with different notions of what needs to be tested. For example:

* a conforming translator which can translate rulesets in some concrete rule language (L) into a specific RIF Dialect (D) and back again, it might only deal with some subset of the RIF dialect (but it should handle that subset correctly)
* a RIF implementation, a conforming translator which can translate any ruleset within the RIF dialect (D) into its native form (L), and back again, preserving its semantics
* a rule editor which should be able to emit syntactically correct RIF rules but need have no execution component

Now our test cases should be independent of L. We don't want to get into the business of defining test cases for specific target languages. So what form could our test cases take?

Firstly, note that testing syntactic correctness of the output of an L -> D translation does not require tests cases from RIF. We simply provide the grammar (XML Schema or whatever turns out to be normative) for D and an implementer can check that their translated output parses correctly.

So it seems to me that there are two categories of test case that we could potentially provide:

Implementation test cases: these would each comprise a ruleset in RIF Dialect D, a set of test data and an expected set of results. The format of the results will be dialect dependent; for Horn and LP dialects it will be a set of entailments (a query and expected set of query results); for production rule dialects it may be a specification of the final working memory; for reaction rules it may be some partial order of action firings; for integrity rules it may be a pass/fail flag.

RIF implementations could use such rules to check they cover D. Conforming translators could take a subset of the test cases that fall within the space covered by their rule language and verify they can execute that subset. The set of such tests that a partial implementation passes would provide an (informal) profile for what that implementation covers.

GaryHallmark:

It might be instructive to actually try to write a simple test case for RIF Core as it is now. Since we don't know or care what the actual target rule system is, we have to treat the translator and the target rule system together as a black box (BB).

We would like to feed a RIF document containing rules and data into the BB and get some result out.

Issue 1: we have no way to get any result out of the BB.

Solution 1a: add a print builtin to RIF Core, OR

Solution 1b: add queries to RIF Core and standardize the format of the printing of the results

Issue 2: including data in the RIF document is awkward with our current syntax.

Solution: add ground uniterms and ground frames to rulesets

Using Solution 1a and Solution 2, a very simple test document (using human presentation syntax here) would be


testData(1, "fact one")
testData(2, "fact two")
testResult("fact two")
Forall ?X ?Y(&print("pass 1") :- And(testData(2, ?X) testResult(?X)))
Forall (&print("pass 2") :- Exists(?X ?Y And(testData(2, ?X) testResult(?Y) ?X=?Y)))


The correct "result" is a file containing the following lines in some order:

pass 1
pass 2


AdrianPaschk: We might define a set of pre-defined implementation test case functions (and hence allow specific implementations of these test case functions). A test case for a RIF rule program, is implemented as a set of test data (facts) and tests implemented in terms of test case functions.

Here a set of functions which might be useful to test an interchanged RIF rule set:


testQuery(Literal)                                     test the literal (= rule head or fact)
testNotQuery(Literal)                                negatively test the literal with default negation
testNegQuery(Literal)                                 negatively test the literal with explicit negation
testNumberOfResults(Literal, Number)          test number of results derived for the literal = stated value
testNumberOfResults(Literal, Var, Number)   test number of results for the variable in the literal
testNumberOfResultsMore(Literal,Number)     test number of results for the literal > given value
testNumberOfResultsLess(Literal,Number)     test number of results for the literal < given value
testNumberOfResultsMore(Literal,Var,Number) test number of results for the variable in the literal > given value
testNumberOfResultsLess(Literal,Var,Number) test number of results for the variable in the literal < given value
testResult(QueryLiteral,ResultLiteral)              test if the second literal is an answer of the query literal
testResults(Literal,Var,[<BindingList>])           test if the list of binding results for the variable in the literal can be derived
testResultsOrder(Literal,Var,[<BindingList>])   test if the list of ordered binding results for the variable in the literal can be  derived
testQueryTime(Literal, MaxTime)                     test if the literal can be derived in less than the stated time in milliseconds
testNotQueryTime(Literal, MaxTime)                 test if the literal can be derived negatively by default in less than the stated time in milliseconds
testNegQueryTime(Literal, MaxTime)                test if the literal can be derived strongly negative in less than the stated time in milliseconds
getQueryTime(Literal, Time)            get the query time for the literal
getNotQueryTime(Literal,Time)                         get the default negated query time for the literal
getNegQueryTime(Literal,Time)                         get the explicitly negated query time for the literal 

Examples:


testQuery(p(X))         test success of query p(X)?
testNegQuery(p(X))         test success of query neg(p(X))? (explicitly negated p(X))
testNumberOfResults(p(X),X,10))       test if ten bindings for the variable X of the query p(X)? can be derived
testResult(p(X),p(a))        test if p(a) is an answer to the query p(X)?
testResults(p(X),X,[a,b,c])        test if a, b and c are result bindings for the variable X of the query p(X)?
testResultsOrder(p(X),X,[a,b,c])       test if a, b and c are answer bindings in exactly this order for the variable X of the query  p(X)?
testResultsOrder(p(X),X,[a,a,a,b,c,c])       note the order and number of answers
testQueryTime(p(X),1000)          test if the query p(X)? can be derived in less than 1 second
getQueryTime(p(X),T)            get the time to answer the query p(X); output is the variable T
testQueryTime(testResults(p(X),X,[a,b]))          nested test 


Concrete Example:


RIF Rule Program:
                q(?X) :- p(?X)
Test Data:
                p(1^^xsd:Integer)
                p(2^^xsd:Integer)
                p(3^^xsd:Integer)
Test Cases:
                testQuery(q(1^^xsd:Integer))
                testResult(q(?X),q(3^^xsd:Integer))
                testQueryTime(q(?X),1000^^xsd:Long)
                testResults(q(?X),?X,[1,2,3])


The implementation of the test case functions might be also defined in terms of a RIF dialect (under the assumption of a certain level of expressiveness), e.g. testQuery(?Query) :- External(Builtin(Eval(?Query))), where "Eval" is a built-in to evaluate a query.


Round trip test cases: these would each comprise simply a rule in RIF Dialect D. The set would be designed to illustrate the range of constructs in D and would be accompanied by a specification of structural rule equivalence. The details of the latter will depend on the outcome of the round-tripping discussions but we might expect rules to round-trip up to certain renamings (e.g. of local rule variables), addition of semantically-neutral metadata (e.g. additional authorship or written-by-tool-X annotations) and reduction to a normal form.

Such tests would serve two purposes. Firstly they we be a checklist for translator implementers, giving them examples of well formed D rules which they should be able to translate. Secondly, for any tool which aims to round trip RIF one can take the rules, perform the D -> L -> D round trip and check syntactic equivalence of the result according to the specified algorithm.

It is not clear to me at this stage what we will say about round tripping and whether such test cases make sense. It's also not clear whether all dialects would have a sufficient notion of connonical/normal form to enable us to specify the structural equivalence algorithm usefully. It may be that the only useful way to define equivalence of test rules is going to be in terms of observational equivalence over test data as covered by the implementation test cases.

2.4.1 Verification Tests

Roughly, verification can be described as: Are we building the rule program right? That is, it ensures the technical correctness of a RIF rule set with respect to syntactic correctness, correct handling of RIF rule sets and errors and anomalies in rule sets.

  • Errors represent problems which directly effect the operations of a rule program.
  • Anomalies are considered as symptoms of genuine errors, i.e. they man not necessarily represent problems in themselves.

The simplest source of errors are typographical syntax mistakes which can be solved by a verifying parser. More complex problems arise due to round-tripping problems between different execution environments or in case of large rule sets which easily might be incomplete or contain contradictions. A distinction between structurally flawed or logically flawed rule sets can be made with structural design tests for redundancy, relevance and reachability, and semantic inference / execution tests for errors and anomalies such as consistency, soundness, and completeness.


2.4.1.1 Syntactic Tests

The tests in this category check for syntactic validity that cannot be checked by schema validation.


2.4.1.2 Round-tripping Tests

(D = rif dialect, R = rule language, round-tripping = D--> L --> D) In the absence of a way to determine the equivalence of two RIF documents, a RIF implementation could perform D--> L --> D --> L and check that the two L documents get the same results on the test suite. We could also check that metadata survived the D-->L-->D translation.


2.4.1.3 Structural Design Tests

This category of tests is concerned with the structural design of rule sets. Typical examples of errors and anomalies which are detected by design test cases are:

  • Unused rules (and facts), which are never used in any rule/query derivation (backward reasoning, e.g. in logic programming) or which are unreachable or have dead-ends (forward reasoning, e.g. in production rule sets).
  • Redundant rules such as identical rules or identical rule chains, e.g. q :- p and q :- p.
  • Subsumed rules, a special case of redundant rules, where two rules have the same rule head but one rule contains more prerequisites (conditions) in the body, e.g. r :- p, q and r :- p.
  • Unrelevant rules like tautologies such as p :- p.


2.4.1.4 Semantic Inference / Execution Tests

This category of tests is concerned with the execution and inferences which can be made from a rule set. Typical examples of errors and anomalies which are detected by inference/execution test cases are:

  • Inconsistency: Conflicting conclusions can be made from a rule set. Special cases of inconsistent rules are:
    • Self-contradicting rules and rule chains, such as r :- p, q, ¬p or ¬p :- p, which can never succeed.
    • Contradicting rules and rule chains, e.g. s :- p, q and ¬s :- p, q
  • Correctness, Completeness and Soundness: No valid input information fails to produce the intended output conclusions, i.e. completeness relates to gaps (incomplete knowledge) in a rule set. Soundness checks that the intended outputs indeed follow from the valid input to a rule set. Examples are:
    • Loops / Circularity in rules or rule chains, e.g. q :- p, q
    • Exception to a rule, a rule R does not apply in a particular situation. This situation can be described by a fact EXCEPTION, e.g. B :- A1,..,AN,™ ¬EXCEPTION
    • There are gaps in the rule set, i.e. there is no result for certain queries or input facts. A default rule is introduced to address this problem, e.g. B :- A1,..,AN and B:-

2.4.2 Validation Tests

Validation is concerned with the logical/functional correctness of a rule set in a particular application context. It might be roughly described as: Are we building the right rule program (in a particular application context)?

2.4.2.1 Entailment Tests
2.4.2.1.1 Positive Entailment Tests

The given conclusions must be entailed by the given ruleset.

2.4.2.1.2 Negative Entailment Tests

The given conclusions must not be entailed by the given ruleset.

2.4.2.2 Conclusion Tests

Test the set of conclusion, e.g. number of conclusion [equal|less|more,...] than predefined size, test if variable answers [equal|are contained in] the predfined list of (ordered|unordered) variable bindings, etc.

2.4.3 Integrity Tests

Integrity tests are context specific inference/execution tests which formulate consistency (or inconsistency) criteria to validate the consistency of a rule set. For instance, in case of a evolving rule set, where new knowledge is added or imported at runtime, they might be used as integrity constraints. Hence, integrity testing here relates to the question: Are we keeping the rule program right?

2.4.3.1 Update Tests

New rules or facts are added to the rule set that affect the entailment of conclusions, which might violate one or more integrity constraint.


2.4.3.2 Import Tests

The premises document imports other documents that affect the entailment of conclusions, which might violate one or more integrity constraint.


2.4.4 Benchmark Tests

As opposed to functional validation, benchmark tests are concerned with the dynamic validation of the performance of a rule set. Benchmark test cases measure the runtime performance for different outcomes of a rule set given a certain test fact base as input. They can be used e.g. to experimentally evaluate the runtime performance of an interchanged rule set in an execution environment (e.g. a rule engine) and test if it provides adequate performance to run the rule program.

3 Proposal for RIF-BLD test suite

For the purpose of developing a set of test cases for RIF-BLD that illustrate issue resolution and aid in conformance evaluation (per the deliverables section of the charter) along with associated documentation, RIF could follow the general example of the RDF tests and OWL tests.

3.1 Overview of RDF and OWL tests

The test cases are designed to be machine-processable and to reflect issue resolution, illustrate language features and generally provide enough language coverage to aid in conformance evaluation. But they do not constitute a complete conformance test suite since they do not completely cover the language. The style of test case is like a specification rather than runnable tests: the task of providing input to the rule engine in some system-dependent manner and checking that the results are correct is left to the tester.

The test case repository is organized into subdirectories, each containing tests devoted to one issue or one feature of the language. Each directory contains the files that participate in the tests in a normative format, and also in additional convenience formats. Each directory also contains a manifest file, written in RDF/XML, for each test in the directory. The manifest file is a machine-readable file containing information such as input file(s), result file, text description of the test case, status, author, etc. The RDF WG defined an RDF Schema for these files, and the OWL WG used the same schema augmented with some new properties and types, which are defined in an OWL ontology. Test cases are categorized according to what aspect of the language (e.g. syntactic validity, entailment) they test.

For accumulating test cases, anyone could submit a test case with status of "proposed," the WG would then vote on the case and the status would go to "approved" or "rejected," and might later be changed to "obsoleted." OWL also has an "extracredit" status. The OWL WG had a web-based test editor which could be optionally used for creating test cases.


3.2 RIF test case structure

If we were to adopt the above methodology, we would have to decide on the particulars in (at least) the following areas:

3.2.1 Test case categories

 || Entailment || The given conclusion must be entailed by the given ruleset ||
 || Non-entailment || The given conclusion must not be entailed by the given ruleset ||
 || Import entailment tests || The premises document imports other documents that affect the entailment ||
 || XS to PS mapping || Do we want to have test cases for the stop of translation from RIF XML to presentation syntax? ||
 ||<(|2> Round tripping || Test that the translation from RIF to a language L and back results in equivalent RIF documents ||
 ||    As mentioned in the preceding discussion section, we could provide a specification or tool to determine equivalence of two RIF documents, or a RIF implementation could perform D --> L --> D --> L and check that the two L documents get the same results on the test suite ||
 || Consistency || The given document must be found to be consistent as defined by the RIF-BLD semantics ||
 || Inconsistency || The given document must be found to be inconsistent as defined by the RIF-BLD semantics ||
 ||<(|3> Syntactic validity || for things that cannot be checked by schema validation ||
 ||    RDF has test categories called PositiveParser and NegativeParser. PositiveParser tests specify a graph, which should be produced from the given input document. NegativeParser tests should result in a report that the input document is in error ||
 ||    OWL does not have specific test categories or cases for syntactic validity. They do define the notion of an OWL Syntax checker and say that a syntax checker takes a document as input and returns one word (Lite, DL, Full or Other) indicating the level of the document ||

3.2.2 Manifest file schema

 What extensions to the RDF test schema are needed, and do we want to use any of the OWL extensions?
 How will levels (dialects/extensions) be associated with test cases and results? 
   * in RDF a test:entailmentRules element in the manifest file specifies whether simple, RDF or RDFS-entailment applies.
   * in OWL the manifest file indicates a level for each test and each document in each test. Documents can have language levels Lite, DL or Full. Semantic tests can have one or two language levels which are used to indicate one of three semantic schemes. Also, there is an element (otest:feature) to indicate which OWL feature the test pertains to.


3.2.3 Test case file format

 Normative versions in RIF XML,  with copies in the presentation syntax also provided.

3.2.4 Test case query format

As AxelPolleres suggested in F2F9, we could use SPARQL and RIF condition language to represent queries in test cases.

{{{ SELECT A_LIST_X_OF_FREE_VARIABLES FROM ruleset WHERE { RIF_CONDITION_WITH_FREE_VARIABLES_IN_X } }}}

Note the RIF condition can be written in either RIF presentation syntax or XML syntax. The results of test queries should be represented in either RIF presentation syntax or XML syntax.

3.2.5 Repository

 We need to set up a test web site on http://www.w3.org. Tests can then be added to the directories using CVS access to the W3C CVS server. See the OWL test repository


3.3 RIF test case semantics

In analogy to test cases in agile software engineering, we consider a RIF test case T to consist of two parts: a set of input assertions I that sets up a test environment, and an expected outcome O.

  • The input assertion set I ⊆ L consists of formulas from L, typically it is the fact base of the test.
  • The expected outcome O of the test is a formula O ∈ L plus a label.
  • The label can be either + or . If the label is + then the test case T is called positive, if the label is then the test is called negative.

Like rules, test cases are constraints on the set of possible models and therefore describe an approximation of the intended model(s), i.e. the models which you expect from your rule set R.

Let Mod be the function that associates sets of formulas with sets of models, and let Σ be a model selection function. We define a compatibility relation  |= T between Σ and T as follows:

M0 |=T (I,O,+) iff ∀ m ∈ M0 : m ∈ Σ(Mod(I),R) ⇒ m ∈ Mod(O)

M0 |=T (I,O,-) iff ∃ m ∈ M0 : m ∉ Σ(Mod(I),R) ⇒ m ∉ Mod(O)

Validation against a test case checks whether (Mod(I),R) is compatible with the test case.

Note, model is here used in a very abstract sense as a set of worlds, following Tarski’s tradition. This includes, e.g. true-false mappings as models for classical propositional logic (CPL), the models used in predicate logic, and Kripke style models for modal logic. Model selection functions have been comprehensively studied in the area of non-monotonic reasoning. There can be different model selection functions that are compatible with the same test case. Checking compatibility ensures that the selection function meets the constraints defined by the test case. Executing a test case means to check the following conditions: O ∈ CR(I) for positive test cases (I, O, +), and O ∉ CR(I) for negative test cases (I, O,−), respectively. The inference operator CR(I) is the deductive closure of I.

A set of tests is called a test suite. A set of models is compatible with a test suite iff it is compatible with all tests in the test suite.

3.4 RIF-BLD test cases

3.4.1 Sources for test cases

  * The issues list,  to illustrate issue resolution
    * candidate issues: 8, 24, 25, 26, 28, 30, 31, 33, 34, 36, 42, 43, 44, 45, 46 ?

  * Use cases
  * Corner cases, to highlight notable language features, limitations, areas where people think there is ambiguity.
  * Basic coverage of the language in order to aid in conformance evaluation

3.4.2 Proposed test cases

  * ill-typed-literal
       attachment:Manifest001.rdf
       attachment:premises001.hrif
       attachment:premises001.nt
       attachment:conclusions001.hrif
  * membership
       attachment:mem_Manifest001.rdf
       attachment:mem_premises001.hrif
       attachment:mem_nonconclusions001.hrif
  * polymorphic-symbols
       attachment:poly_Manifest001.rdf
       attachment:poly_incorrect-syntax001.hrif    
  * UseCase-1
       attachment:uc1_Manifest001.rdf
       attachment:uc1_correct-syntax001.hrif

3.5 Other questions

 Do we need to address data types in the RIF entailment tests?  
  * RDF has a category called "Datatype-aware entailment tests"  for tests that require datatype support. This type of test requires "three pieces of machinery" to
      1. determine if a literal is in valid lexical form, for a supported datatype
      2. determine whether two lexically valid literals denote the same value.
      3. operation to support checking for range clashes
      4. for two supported data types, determine if one is a subclass of another
  * OWL says some tests require that certain datatypes are, or are not,  supported in the datatype map, and that these are indicated with the test. (though I can't find an example of this).

4 Draft text

TBD.