Re: Proposal: severity axis on test result

Ah, no. I am not suggesting EARL have confidence levels at all. I am
suggesting EARL have qualitative statements about a pass or failure -
such as "meets requirements", "exceeds requirements", "fails solely
because some other requirement fails", "fails in some cases", "fails
in all cases".
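
(For concreteness, here is a sketch of that enumeration - in Python
rather than RDF, and with identifiers I have just made up, not proposed
EARL terms:)

    from enum import Enum

    class Outcome(Enum):
        # Invented names for the qualitative statements above.
        MEETS_REQUIREMENTS = "meets requirements"
        EXCEEDS_REQUIREMENTS = "exceeds requirements"
        FAILS_ON_DEPENDENCY = "fails solely because another requirement fails"
        FAILS_IN_SOME_CASES = "fails in some cases"
        FAILS_IN_ALL_CASES = "fails in all cases"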

If you want to map confidence (trust) ratings, you should do so directly
between the trust vocabularies assigned to the different testers.
(Fundamentally, I don't see enough agreement on how to rank confidence
for us to believe that we can usefully standardise anything there.)
Maybe we should look for a generic trust-ranking vocabulary (GPG has
one, I think) and explain how users can rank test results according to
a single vocabulary - by specifying their own mapping, or by adopting
some tester's assertions about a mapping - in order to support those
use cases.
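
(A sketch of what I mean, in Python - the testers, vocabularies and
mappings are all invented for illustration:)

    # Each tester reports confidence in their own vocabulary. The user
    # (or a tester whose mapping assertions the user trusts) supplies a
    # mapping onto one shared vocabulary, and ranking happens there.
    SHARED = ("low", "medium", "high")

    mappings = {
        "alice": {"unsure": "low", "confident": "high"},
        "bob":   {"1": "high", "2": "medium"},
    }

    def rank(tester, value):
        """Position of a tester's confidence value on the shared scale."""
        return SHARED.index(mappings[tester][value])

    assert rank("bob", "1") > rank("alice", "unsure")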

Chaals

On Wed, 3 Jul 2002, Ian Hickson wrote:

>Charles McCathieNevile wrote:
> > [numeric value vs enumerated value for confidence and severities]
> >
> > You will have an internal mapping of numeric results that is
> > specific to your needs, and the next person will too, but theirs
> > will be slightly different. If you map both of those to the same
> > EARL scheme by numeric value, you lose the ability to compare
> > between them without doing complicated searching based on who said
> > what and then doing the mapping.
>
>I'm not convinced you have that ability now.
>
>If Alice has two confidence levels ('confident' and 'unsure') and Bob
>has two confidence levels ('1' and '2'), and Alice maps 'confident' to
>EARL's 'high' confidence and 'unsure' to EARL's 'low' confidence,
>while Bob maps '1' to 'high' and '2' to 'medium', then your data is
>just as useless as if they had mapped their values to four different
>numbers.
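
(CMN: to restate the example as code - a sketch using exactly the
mappings Ian describes:)

    # Alice and Bob each chose a private mapping onto the enumerated
    # levels, without agreeing on a convention beforehand.
    alice = {"confident": "high", "unsure": "low"}
    bob   = {"1": "high", "2": "medium"}

    # Once merged, only the EARL level survives, so Alice's "unsure" and
    # Bob's "2" look different ("low" vs "medium") even if they meant the
    # same thing; recovering comparability means finding out who said
    # what and inverting each private mapping.
    merged = [("alice", alice["unsure"]), ("bob", bob["2"])]
    # -> [("alice", "low"), ("bob", "medium")]
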
>
>I don't think EARL having enumerated types will make comparing
>merged results from people who did not agree on a convention
>beforehand any easier.
>
>Similarly, if David has a numeric confidence scale, when he maps it to
>EARL's three levels, he's going to actually lose detail.
>
>I think it is important that you not _lose_ data when exporting to
>EARL. IMHO, you should be able to round-trip data from a system, to
>EARL, and back again, without losing information. (The opposite is not
>true; there is no guarantee that arbitrary EARL can be round-tripped
>through an arbitrary EARL-aware system, because that depends entirely
>on that system's capabilities.)
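
(CMN: a sketch of the round-trip property being asked for, assuming -
purely for illustration - that David's scale runs 1 to 5; carrying the
raw value alongside the EARL level is invented here, not part of any
proposal:)

    # Mapping five points down to three enumerated levels loses detail,
    # so the export also carries the original value; importing then
    # gives back exactly what went in.
    def to_earl(confidence):
        level = {1: "low", 2: "low", 3: "medium",
                 4: "high", 5: "high"}[confidence]
        return {"earl:confidence": level, "tool:raw": confidence}

    def from_earl(assertion):
        return assertion["tool:raw"]

    assert all(from_earl(to_earl(c)) == c for c in range(1, 6))
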
>
>
> > you don't get essentially meaningless EARL data like "this is 66%
> > conformant to a checkpoint that may or may not have a definition of
> > what that means"
>
>"this is 'low' conformant to a checkpoint that may or may not have a
>definition of what that means"
>
>I don't see how enumerated types in any way solve that problem.
>
>
> >>> For some use cases your Yb - pass with unrelated errors - would
> >>> count as a pass, and for some cases it would score as a fail. So
> >>> we would need to know what they are.
> >
> > We need to know what the use case is, and what the errors are, to
> > categorise whether a result means "only failed because of
> > dependencies" or "passed the essential bit".
>
>Oh, right. Ok. This test:
>
>    http://www.hixie.ch/tests/adhoc/css/inheritance/content/002.xml
>
>...would get a Y if it read:
>
>    (Control: PASSED)
>    Test 1 has: PASSED
>    Test 2 has: PASSED
>
>However, if it said that but in addition some of the lines overlapped
>each other, then the test would have passed with unrelated errors, and
>would get a Yb (for "Yes, with bugs") (PASS, Confidence: 100%,
>Severity: 90%).
>
>However, if it read
>
>    (Control: PASSED)
>    Test 1 has:
>    Test 2 has:
>
>...as it does in recent Mozilla builds, then the test would get a B
>(for "Buggy") (FAIL, Confidence: 100%, Severity: 50%).
>
>
>Does that answer your question?
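
(CMN: Ian's grades for this test, restated as data - a sketch; the
figures are the ones quoted above, and none were given for a clean Y:)

    grades = {
        "Y":  ("PASS", None, None),   # both lines render; no figures quoted
        "Yb": ("PASS", 1.00, 0.90),   # passes, but unrelated lines overlap
        "B":  ("FAIL", 1.00, 0.50),   # lines missing, as in recent Mozilla
    }
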
>
>
> > But maybe this isn't necessary if we just say "use a more atomic
> > test". I don't know if that works in the real world or not.
>
>Atomic testing is not the only kind of testing required. It is
>absolutely imperative that browsers be tested with very complicated
>tests, as many bugs only become apparent when two or more features
>interact. If EARL is to be useful in this scenario, it has to deal
>with non-atomic tests.
>
>For example, the Box Acid Test:
>
>    http://www.w3.org/Style/CSS/Test/CSS1/current/test5526c.htm
>
>If this test looks like the screenshot, then in my nomenclature the UA
>gets a Y ("Yes") for this test. If it looks like the screenshot but
>the radio buttons don't work (e.g. they think they are part of the
>same group), then it gets a Yb ("Yes, with minor unrelated bugs"). If the
>page is mostly correct but the <body> isn't sized correctly, it gets a
>B ("Buggy") and if the page doesn't look right, it gets a D ("Destroys
>this feature", since if the page looks wrong it means the UA doesn't
>support the block box model, a critical part of CSS1).
>
>This is not an atomic test; it's one of the most complicated CSS1
>tests in the W3C CSS1 test suite. And it is far from the most
>complicated CSS test around.
>
>
> >> You don't need "Can't Tell" as that should just be "Pass" or "Fail"
> >> with "Confidence: 0%". (Whether you pick "pass" or "fail" depends
> >> on which is the "default" -- e.g. in a test where supporting the
> >> feature correctly and not even trying to support the feature are
> >> indistinguishable, you would use "Pass", while in a test where
> >> trying to support the feature but doing so _incorrectly_ is
> >> indistinguishable from not supporting the feature at all you would
> >> use "Fail".)
> > Not sure that I agree here. Having a pass and a fail be equivalent
> > under circumstances that are outside the RDF model we are using
> > seems likely to lead to problems in processing.
>
>They aren't equivalent. Pass with 0% confidence is a good result (it
>means the UA could in fact have completely passed the test), whereas
>Fail with 0% confidence is a bad result (the test definitely failed,
>it's just unclear whether it was because of lack of support or because
>of a bug).
>
>
> > If there is a cannotTell then I think it should be a single defined
> > property. More to the point, what is the use case for it in the
> > first place?
>
>This test:
>
>    http://www.hixie.ch/tests/evil/css/import/extra/linklanguage.html
>
>...looks exactly the same in UAs that support CSS completely and in
>those that have zero CSS support. Therefore you can't tell whether or
>not it passed, only whether it failed.
>
>So if the test fails, it gets a B (FAIL, Confidence: 100%, Severity:
>50%), but if it passes then it gets an M (for "Maybe"; in EARL terms
>that would be a PASS, Confidence: 0%, Severity: 100%).
>
>If you were looking for whether the UA passed a checkpoint, which had
>that test as a prerequisite, then you would say the checkpoint had
>passed if that test was marked PASS, even with Confidence: 0%.
>
>On the other hand, if you had a test with FAIL, Confidence: 0%, then
>you would know that the checkpoint had not been passed, you just
>wouldn't know for sure whether it had failed because of intentional
>lack of support or whether it had failed because of a bug.
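
(CMN: a sketch of that checkpoint logic - the data structure is
invented, the rule is the one Ian just described:)

    # A checkpoint passes only if every prerequisite test is marked
    # PASS. Confidence is irrelevant to that decision: a PASS at 0%
    # confidence (the M / "Maybe" result) still counts, while a FAIL at
    # any confidence means the checkpoint was not passed - the 0% merely
    # leaves the cause (lack of support vs a bug) unknown.
    def checkpoint_passed(results):
        return all(result == "PASS" for result, confidence in results)

    assert checkpoint_passed([("PASS", 0.0)])      # the M case above
    assert not checkpoint_passed([("FAIL", 0.0)])
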
>
>
> >> "Not Applicable" is important for tests that neither pass nor fail, such as a
> >> test for making sure all images have alternate text, when applied to a document
> >> with no images, or a test to make sure that 'red' is rendered differently from
> >> 'green', on a monochromatic device.
> >
> > (CMN: I agree with your conclusion, but testing whether 'green' and 'red' are
> > rendered differently on a monochromatic device is in fact important...)
>
>...or a test for whether the 'color' property is correctly parsed, in
>a UA that doesn't support CSS, then. :-)
>
>

-- 
Charles McCathieNevile    http://www.w3.org/People/Charles  phone: +61 409 134 136
W3C Web Accessibility Initiative     http://www.w3.org/WAI  fax: +33 4 92 38 78 22
Location: 21 Mitchell street FOOTSCRAY Vic 3011, Australia
(or W3C INRIA, Route des Lucioles, BP 93, 06902 Sophia Antipolis Cedex, France)

Received on Sunday, 14 July 2002 22:55:10 UTC