Markup Validator II Design Issues

On the road to Holistic Validation

Code name: Advanced Conformance Observation Report Notation (Acorn 1.0)

(or Universal Conformance ... -- "UniCORN -- as rare as Valid Web pages"?)

This is a semi-structured discussion of design issues for the next generation of the W3C Markup Validator. The idea is to collect as many problems as possible to resolve them as early as possible in the design process. If you can think of additional problems, please add new sections to this document. If you have comments, please send them to public-qa-dev@w3.org or www-validator@w3.org.

Observator Framework

Markup Validator II will act as a high-level web interface that controls a processing chain of what are currently called observators. An observator is a software module that can process markup, metadata, and other data and make specific observations about both the processed items and their processing. For example, OpenSP, or more specifically SGML::Parser::OpenSP, is an observator that processes markup and makes observations about whether specific markup meets specific constraints. A typical (instance of an) observation is

there is no attribute "HEIGHT"

Observation Model

The basic idea of the observation model is to provide a unified interface for observations like violations of syntax constraints, violations of validity constraints, or whether a specific resource can't be processed for reasons such as an unsupported character encoding or an unsupported MIME type. With such a model in place, the Markup Validator II would be able to apply generic processing for these observations, basically, the serialization of the observations into various output formats like XHTML or a custom XML format, and probably drawing some conclusions from the observations. This allows for easy integration (and rapid external development and maintenance) of observator modules and (equally important) for complete serialization of Validator output.

For the Markup Validator II the primary representation of an observation is a Perl object. Existing observators require wrappers to map their observation model into such Perl objects; new observators should be designed to use the Validator's observation module directly. To enable such development and usage, a Perl module needs to be developed that meets the relevant requirements. This section summarizes design issues and requirements for such a module.

The first requirement for the observation module is that it is able to provide a sufficiently rich representation of the various observations the SGML::Parser::OpenSP module can make. This is documented in the SGML::Parser::OpenSP::Tools documentation:

# this is always present
primary_message =>
{
  Number       => 141,       # only if $p->show_error_numbers(1)
  Module       => 554521624, # only if $p->show_error_numbers(1)
  ColumnNumber => 9,
  LineNumber   => 12,
  Severity     => 'E',
  Text         => 'ID "a" already defined'
},

# only some messages have an aux_message
aux_message =>
{
  ColumnNumber => 9,
  LineNumber   => 11,
  Text         => 'ID "a" first defined here'
},

# iff $p->show_open_elements(1) and there are open elements
open_elements => 'html body[1] (p[1])',

# iff $p->show_open_entities(1) and there are open entities
# other than the document, but the document will be reported
# if the error is in some other entity
open_entities => [
{
  ColumnNumber => 55,
  FileName     => 'example.xhtml',
  EntityName   => 'html',
  LineNumber   => 2
}, ... ],

Taken together with positioning information as documented in the SGML::Parser::OpenSP documentation:

LineNumber   => ..., # line number
ColumnNumber => ..., # column number
ByteOffset   => ..., # number of preceding bytes
EntityOffset => ..., # number of preceding bit combinations
EntityName   => ..., # name of the external entity
FileName     => ..., # name of the file

and http://validator.w3.org/config/verbosemsg.cfg

<msg 141>
  original = "ID %1 already defined"
  verbose <<.EOF.
  <div class="ve mid-141">
    <p>
      An "id" is a unique identifier. Each time this attribute is used in a document
      it must have a different value. If you are using this attribute as a hock for
      style sheets it may be more appropriate to use classes (which group elements)
      than id (which are used to identify exactly one element).
    </p>
  </div>
.EOF.
</msg>

The first thing to note here is that the observation instance data comes from multiple places. The long description, for example, is maintained externally and is later merged with the observation reported by SGML::Parser::OpenSP, using the observation identifier 141. This composition is a crucial aspect of the design, as discussed below. Another thing to note is that one observation can have multiple messages: the auxiliary message 'ID "a" first defined here' is not considered a separate observation but rather additional data for the primary message.

The composition of a "final" observation can be thought of as a process with multiple stages. For the specific observation here, the first step is indicated by

original = "ID %1 already defined"

OpenSP internally uses this format string to insert the id value into the error message. With the current implementation in the Markup Validator, adding the long description is the last step of processing which is performed by the check script. This has some impact on the design of the observation module.
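The staged composition described above can be sketched roughly as follows. This is an illustration in Python rather than the Perl the Validator uses. The function names `format_original` and `compose`, the `LongDescription` key, and the in-memory `VERBOSE` table standing in for verbosemsg.cfg are all assumptions made for the example, and so is the idea that `%1` is replaced by the already quoted id value.

```python
import re

# Hypothetical stand-in for verbosemsg.cfg, keyed by message number.
VERBOSE = {
    141: {
        "original": "ID %1 already defined",
        "verbose": 'An "id" is a unique identifier. ...',
    },
}

def format_original(template, args):
    """First stage: substitute %1, %2, ... with positional arguments."""
    return re.sub(r"%(\d+)", lambda m: str(args[int(m.group(1)) - 1]), template)

def compose(observation, verbose_table):
    """Last stage: merge externally maintained text into a copy of the observation."""
    merged = dict(observation)
    extra = verbose_table.get(observation["Number"])
    if extra is not None:
        merged["LongDescription"] = extra["verbose"]
    return merged

reported = {
    "Number": 141,
    "LineNumber": 12,
    "ColumnNumber": 9,
    "Severity": "E",
    "Text": format_original(VERBOSE[141]["original"], ['"a"']),
}
final = compose(reported, VERBOSE)
```

Keeping the merge in a separate step leaves the observation as reported by the observator untouched, which matters for the question of who maintains which data.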

Which data is maintained by each observator module?
Which data is maintained by the Validator?
...

...

maintenance
localization
...

Descriptor Use Cases and Requirements

Input Sources

The current validator supports multiple input sources: file upload, textarea, and retrieval of remote resources. An observation is naturally bound to the input retrieved through these sources (or their metadata) and should thus be identified in the observation instance.

There must be a descriptor to identify the network location of the input document

Unfortunately not all input has a network location; an uploaded file, for example, does not. It would to some extent be possible to pretend it does, e.g. by using the data: URL scheme, but the result would be much too large for this purpose. It would also be complicated: because the observation might consider metadata, the resource would be a message/http resource, which would need to be constructed first as the message is not directly available to the script when used with CGI.

There is thus an open issue of how to refer to such content. The current implementation attempts to use a file name for uploaded files and the textarea validation has a "File: upload://Form Submission" signature...

There should be a descriptor to indicate file upload, textarea, etc. input sources

The input might also be a fragment of a document. For example, the style element and style attribute in XHTML usually contain style sheets, and an integrated CSS Validator might be passed the content of such an element or attribute. The question for such a case would be how this interacts with a more precise location (see the section on highlighting), and what is considered the input source and what the observation location.

Highlighting errors in the source code

In order to help authors locate errors in their documents, the Validator currently shows a short fragment of the source code surrounding the code relevant to the error and highlights a single character in the excerpt.

There must be a descriptor to identify a specific character in the source code of the input document.

This approach works well for a number of observation types; it works best for identifying offending characters, e.g.

value of attribute "ID" invalid: "_" cannot start a name

  ... id="_ctl2_devProductsButtonA" class="devProd
-- -- -- -- --^

but not so well for many other observations, for example,

document type does not allow element "LINK" here

  ...sdn-online/shared/css/homepage4.css' />
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --^

The descriptor should be extensible to allow for more sophisticated highlighting
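The single-character highlighting shown in the excerpts above could be produced along these lines. This is a rough Python sketch, not the Validator's actual truncation code. The function name `excerpt_with_marker`, the `width` parameter, and the convention of a 1-based line with a 0-based column offset are assumptions for the example.

```python
def excerpt_with_marker(source, line, column, width=40):
    """Return (excerpt, marker) for a 1-based line and a 0-based column."""
    text = source.splitlines()[line - 1]
    # Center the window on the offending character where possible.
    start = max(0, column - width // 2)
    end = min(len(text), start + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    # The marker line puts a caret under the character at `column`.
    marker = " " * (len(prefix) + column - start) + "^"
    return prefix + text[start:end] + suffix, marker

excerpt, marker = excerpt_with_marker('<html>\n<p id="_x">text</p>', 2, 7)
```

A range-based or element-based highlight would need more than one position, which is one reason for the extensibility requirement above.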

It is also possible that an observator is not actually able to report observations in terms of the source code but is rather limited to e.g. an object model of the document. For example, certain observations could be implemented by an XSLT transformation for which the document representation would be based on the XPath data model. Such an observator might report locations as XPath expressions or as XPointers, for example

/*/*[2]/@leftmargin

could locate an offending leftmargin attribute on the body element in an XHTML document.

The descriptor should be extensible to allow for different location addressing schemes
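One way to keep the location descriptor open to several addressing schemes is to give each scheme its own type, as in the following Python sketch. The class names `LineColumn` and `XPathLocation` and the `describe` function are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class LineColumn:
    """Location in terms of the source code."""
    line: int
    column: int

@dataclass(frozen=True)
class XPathLocation:
    """Location in terms of the XPath data model."""
    expression: str

Location = Union[LineColumn, XPathLocation]

def describe(location: Location) -> str:
    """Render a location for a report; unknown schemes are rejected loudly."""
    if isinstance(location, LineColumn):
        return f"line {location.line}, column {location.column}"
    if isinstance(location, XPathLocation):
        return f"XPath {location.expression}"
    raise TypeError(f"unsupported location scheme: {type(location).__name__}")
```

A consumer that cannot map an XPath location back to the source could still list the observation, just without a highlighted excerpt.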

...

More advanced message metadata

The Validator does not offer much message metadata at the moment: there is a short description for most observations and a long description for some of them, and that's it. The long description in theory allows for unlimited metadata, as it permits free-form markup, but that is not very accessible to machines and works pretty much only for formats that allow for inline XHTML content in some way. Some metadata for new features would be valuable to have in a more accessible form, for example a reference to the part of a specification on which the observation is based.

The descriptor or observation structure should be extensible to allow for additional metadata

Global Responsibilities

In order to create the final report, the modules need to interact in some way to compose observations, merge instance data, draw conclusions from the observations and so on. The model needs to resolve which component is responsible for which tasks. A simple conclusion that the current Validator draws is whether a specific document is "valid". It infers this information from the absence of "errors". This requires the notion of an "error". OpenSP does not really provide this notion. Even though there is currently a 1:1 relationship between OpenSP observations and errors the Validator would consider here, OpenSP can be configured to report other observations, e.g. whether the document uses certain shorttag features as in

<title/.../

It would, however, report such observations as errors even though such constructs do not really render the document invalid. A similar example would be

<!DOCTYPE x [
  <!ELEMENT x EMPTY>
  <!ATTLIST x foo:bar CDATA #IMPLIED>
]><x foo:bar="baz"/>

The XML document is "valid" but it is not "namespace-well-formed". Another example is

<!DOCTYPE x [
  <!ELEMENT x EMPTY>
  <!ATTLIST x y ID #IMPLIED>
]><x y="foo:bar"/>

The document is well-formed, valid, namespace-well-formed, but it is not namespace-valid. Or take

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Virtual Library</title>
  </head>
  <body>
    <p/>
  </body>
</html>

The document is

well-formed XML
valid XML
namespace-well-formed XML
namespace-valid XML
strictly conforming XHTML 1.0
...

but it does not comply with the "rules" in Appendix C of XHTML 1.0. The question is: which information about the class/type/severity/etc. is a property of the observation, how is it encoded, and how does the Validator interact with it? What features are desired? Should the Validator be able to tell whether a document is namespace-well-formed independently of whether it is well-formed and/or valid? To continue the overly long list of examples, let's consider a few more observations:

a broken link
a misssplled word
an outdated Technical Report reference
a stylistic flaw ("working group" vs "Working Group")

We lack terms for documents with or without such problems, there are no terms such as

strictly well-linking XHTML document
all-included-style-sheets-are-valid-css3 XHTML document
...

A related question is how the results would be presented in the XHTML interface. It could be a hierarchy like

Well-formedness errors:
  * wf-error1
  * wf-error2

DTD-Validity errors:
  * dtd-val-error1
  * dtd-val-error2

Link Check
  * broken-link1
  * broken-link2

or it could be a flat list

  * dtd-val-error1
  * wf-error1
  * wf-error2
  * dtd-val-error2
  * broken-link1
  * broken-link2

We currently have the flat list style...
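If each observation carried a category, both presentations could be derived from the same data. The following Python sketch assumes such a category exists; the category names and the functions `flat` and `grouped` are made up for the example.

```python
# (category, identifier) pairs in the order they were reported.
observations = [
    ("dtd-validity", "dtd-val-error1"),
    ("well-formedness", "wf-error1"),
    ("well-formedness", "wf-error2"),
    ("dtd-validity", "dtd-val-error2"),
    ("link-check", "broken-link1"),
    ("link-check", "broken-link2"),
]

def flat(obs):
    """Flat list: keep the reporting order, drop the category."""
    return [ident for _category, ident in obs]

def grouped(obs, order):
    """Hierarchy: one list per category, categories in the given order."""
    groups = {category: [] for category in order}
    for category, ident in obs:
        groups.setdefault(category, []).append(ident)
    return groups

hierarchy = grouped(observations, ["well-formedness", "dtd-validity", "link-check"])
```

The choice between the two then becomes a serialization decision rather than something each observator has to know about.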

XML Serialization

...

The Validator should also be able to read documents in this format and re-serialize them in supported output formats. Consider external services that provide functionality similar to that of the Validator but that are not (yet) integrated with the Validator, such as the Experimental XHTML 1.0 Appendix C Validator. For loose integration such tools could generate output that conforms to the serialization format and re-use the Validator as viewer for their results. The Validator could also use this internally to dispatch certain validation requests to other tools that would be difficult to integrate through other means.

Related to the latter would be a definition of calling conventions for other tools and configuration of means already supported. Imagine that all tools relevant to the publication of a Technical Report (checking the markup, the style sheets, links, outdated normative references, spelling checks, stylistic checks, pubrules checks, ...) provide output compatible with the observation serialization. You would probably like to configure these tools in one common place and just ask the Validator to aggregate all results. Rather than using an interface that would depend on insanely long URIs or POST requests, you might like to set up a toolchain configuration document and request

http://validator.w3.org/check?uri=...using=http://example.org/toolchain

The toolchain document would list all the tools with their configuration and calling conventions (e.g., if a tool supports checking by URI, where to place the URI of the input document). A similar situation would arise if the Validator supported e.g. RelaxNG validation. It is unlikely that documents include references to the relevant RelaxNG schemas and it is unlikely that the Validator knows about all schemas users might want to use with the Validator. Especially when using markup from multiple namespaces, it might be handy for some Validator users to have a way to select schemas more comfortably than by typing the schema locations into some input form whenever they want to validate.
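A client could build the short request shown earlier as in the Python sketch below. The `uri` and `using` parameter names are taken from that example request; the function name `check_url` and the document address are placeholders, and nothing here implies that the Validator actually supports a `using` parameter.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def check_url(document_uri, toolchain_uri, base="http://validator.w3.org/check"):
    """Build a check request that points at a toolchain document."""
    query = urlencode({"uri": document_uri, "using": toolchain_uri})
    return f"{base}?{query}"

url = check_url("http://example.org/doc.html", "http://example.org/toolchain")

# The receiving side can recover both values from the query string.
params = parse_qs(urlsplit(url).query)
```

The request stays the same length no matter how many tools the toolchain document configures.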

@@

Identifiers

Each observation should have an identifier that clearly identifies the observation, e.g. to enable retrieval of additional metadata like long descriptions of the observation. The question here is how these identifiers should be managed and what syntax should be used. The identifier does not necessarily act as the only means of identification; it could just contribute to it. An observation could be identified by "observation 141 of the SGML::Parser::OpenSP observator"; in this example the identifier (that is, the descriptor in the observation) would be just 141. This would probably require encoding the observator in the observation as well, depending on how groups of observations are managed. This relates to how other global responsibilities are managed (what is an "error", when/how/whether to say a document is "well-formed XML 1.0", etc.).

Using a single value rather than a tuple that includes the observator would help with matching the identifier. Suppose someone models the observation in RDF and uses Notation3 to add some metadata; rather than using something like

[ a :Observation;
  :observator "SGML::Parser::OpenSP";
  :identifier "141";
] :longDescription "..." .

# with `cwm  --closure=e --think=...` and

{
  ?x a :Observation.
  ?y a :Observation.
  ?x :observator ?o.
  ?y :observator ?o.
  ?x :identifier ?i.
  ?y :identifier ?i.
}
=>
{ ?x = ?y . } .

it would be possible to say

@prefix osp: <...>.
osp:id141 :longDescription "..." .

but see the Parameters discussion: some things relate to the instance of the observation rather than to the observation in general, so it would probably rather look like

@prefix osp: <...>.
[ :identifier osp:id141;
  :longDescription "...";
] .

Human-readable text and localization

Another aspect relevant for responsibility management is human-readable text. Most observations have associated human-readable text.

Where does the text come from?
- Maintained with the module?
  - What if different module users desire different text?
  - What if different instances of an observation should have different text?
As above for the text translated into different languages.

Parameters and their attributes

Let's consider a popular instance of an observation:

element "embed" undefined

This is an instance of the "element undefined" observation with a parameter "embed" that is the name of an element type...

...

You might, for example, want to add a long description for the observation based on certain parameters. If you are using Notation3, you might do something like

@prefix osp: <...>.
{
  ?x a :Observation.
  ?x :identifier osp:id999.
  ?x :param [ :localName "embed" ].
} =>
{
  ?x :longDescription "... use <object> ...".
}.

{
  ?x a :Observation.
  ?x :identifier osp:id999.
  ?x :param [ :localName "nobr" ].
} =>
{
  ?x :longDescription "... use white-space:nowrap ...".
}.

This example raises another issue. If there is more than one possible long description for an observation (e.g., a generic long description for all "element undefined" observations and more specific ones for some known proprietary elements), it needs to be determined which long description to use for a specific instance of an observation, as it would be undesirable to have multiple long descriptions for the same observation instance. There thus needs to be a specificity or inheritance mechanism at some point.
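One possible specificity mechanism, sketched in Python under the assumption that "more specific" simply means "constrains more parameters", picks the matching rule with the most constraints. The rule table, the generic description text, and the function name `long_description` are invented for the example.

```python
RULES = [
    # (required parameter values, long description); {} matches every instance.
    ({}, "generic: the element is not defined in the document type"),
    ({"localName": "embed"}, "... use <object> ..."),
    ({"localName": "nobr"}, "... use white-space:nowrap ..."),
]

def long_description(params, rules=RULES):
    """Return the description of the most specific rule matching `params`."""
    matching = [
        (len(required), text)
        for required, text in rules
        if all(params.get(key) == value for key, value in required.items())
    ]
    if not matching:
        return None
    # More constraints means more specific; ties go to the earlier rule.
    return max(matching, key=lambda pair: pair[0])[1]
```

With this table, an "element undefined" instance for `embed` gets the `<object>` advice, while an instance for any other element falls back to the generic text, so each instance ends up with exactly one long description.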

...

Different use cases require different information passed as parameters. If the application has "required attribute '%s' not specified" and wants to generate "required attribute 'type' not specified", it would be sufficient to have only the local name of the attribute as parameter. If, however, the application wants to link to the specification for the element for which the 'type' attribute is required, additional information would be required, like the name of the element that lacks the attribute.

A common information item that would be passed for elements and attributes along with the local name would be the namespace URI (if applicable). Consider the "element '%s' undefined" observation: it might be passed a parameter that is the element passed to the start_element PerlSAX handler, which looks like

{
  Name         => ...,
  Attributes   => ...,
  NamespaceURI => ...,
  Prefix       => ...,
  LocalName    => ...,
}

If this hash reference is passed as parameter, the application can no longer do something to the effect of

  my $text = sprintf "element '%s' undefined", @parameters;

It rather needs to have some idea about the structure of the parameters so that it could do something to the effect of

  my $par1 = shift @parameters;
  my $text = sprintf "element '%s' undefined", $par1->{Name};
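An alternative to hard-coding knowledge of the parameter structure in the application is to let the message prototype itself name the fields it needs. The Python sketch below uses `str.format` field access for this; the prototype syntax, the function name `format_message`, and the sample element are assumptions for illustration and not a proposal for the Perl module's syntax.

```python
def format_message(prototype, parameters):
    """Fill a prototype whose placeholders select fields of structured parameters."""
    return prototype.format(*parameters)

# A structured parameter shaped like the PerlSAX element shown above.
element = {
    "Name": "svg:rect",
    "Attributes": {},
    "NamespaceURI": "http://www.w3.org/2000/svg",
    "Prefix": "svg",
    "LocalName": "rect",
}

short_text = format_message("element '{0[Name]}' undefined", [element])
detailed_text = format_message(
    "element '{0[LocalName]}' in namespace {0[NamespaceURI]} undefined", [element]
)
```

The same parameter then serves both the short message and a more detailed one, without the calling code changing.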

...

Element
- namespace uri
- local name
- attributes?
- ...
Attribute
- namespace uri
- local name
- value
- ...
...

Processing Observations

The model should ideally scale to be usable from within HTML Tidy. Tidy makes "observations" about modifications it makes in order to "tidy" the document or to fix errors; typical observations would be

line 326 column 9 - Warning: trimming empty <p>

and

line 1 column 106 - Warning: <style> inserting "type" attribute
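A wrapper for Tidy might map such message lines onto the same fields used for OpenSP messages. The Python sketch below only handles lines of the exact shape shown above; the pattern, the function name `parse_tidy_message`, and the reuse of the `LineNumber`, `ColumnNumber`, `Severity`, and `Text` keys are assumptions, not a description of Tidy's full output format.

```python
import re

# Matches lines such as: line 326 column 9 - Warning: trimming empty <p>
TIDY_LINE = re.compile(
    r"^line (?P<line>\d+) column (?P<column>\d+) - (?P<severity>\w+): (?P<text>.*)$"
)

def parse_tidy_message(message):
    """Return an observation-like dict, or None if the line does not match."""
    match = TIDY_LINE.match(message)
    if match is None:
        return None
    return {
        "LineNumber": int(match["line"]),
        "ColumnNumber": int(match["column"]),
        "Severity": match["severity"],
        "Text": match["text"],
    }

observation = parse_tidy_message("line 326 column 9 - Warning: trimming empty <p>")
```

Parsing the human-readable text loses the distinction between the message and its parameters (here the `<p>` element), which is exactly the information the parameter discussion above wants to keep.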

...

Processing Model

...

CPAN modules that will likely be used

HTML::Encoding (encoding detection)
SGML::Parser::OpenSP (HTML/XML Validation)
HTML::Doctype (doctype detection and rewrite)
XML::SAX::ExpatXS (xml-wf / xmlns-wf / base for AppC checks)
Encode (transcoding)
I18N::Charset (dealing with charset alias)
...

More on outsourcing

some superset of truncate_line() to Text::?
Transcoding along with detection through some HTML::ToUTF8 / XML::ToUTF8 module?
Inline HTML code through HTML::Template?
- done; http://dev.w3.org/cvsweb/validator/share/templates/
If we do some more sophisticated auth proxy code, HTTP::AuthProxy? Or CGI::ProxyAuth? Or something?
Config file for badges along with appropriate HTML code?
- done; http://dev.w3.org/cvsweb/validator/htdocs/config/types.conf
Is there a module that does check_utf8()?
- There is Encode::is_utf8 but it is not quite what we want, see
  - http://www.nntp.perl.org/group/perl.unicode/2707
How to handle multiple output formats?
...

Missing functionality

Volunteers needed to re-implement onsgmls output based functionality as SAX modules, see

http://lists.w3.org/Archives/Public/public-qa-dev/2004Sep/0026.html (thread)

@@

(this is outdated...)

...

An observation is an identifiable statement about a specific subject; a basic model for an observation would thus be

Observation
  Identifier
  Subject
    URI
  Location
    Line
    Column
  Arguments

An example could come from an XHTML::AppendixC observator for markup like

<img src="src" alt="alt"></img>

...

  Identifier => unminimized-empty-element
  Subject    =>
    URI      => http://www.example.org/
  Location   =>
    Line     => 23
    Column   => 42
  Arguments  => { Element => { LocalName => 'img' ... } }, ...

Such an observation is then merged with additional metadata; for example, there could be a

  unminimized-empty-element
    MessagePrototype => "foo <%s> bar baz"

which could be merged to

  Identifier => unminimized-empty-element
  Subject    =>
    URI      => http://www.example.org/
  ...
  FormattedMessage => "foo <img> bar baz"

and there could be

  unminimized-empty-element
    LongDescription => "Use the minimized tag syntax for empty elements,
                        e.g. <br />, as the alternative syntax <br></br>
                        allowed by XML gives uncertain results in many
                        existing user agents."
    SpecReference   => http://www.w3.org/TR/xhtml1/#C_2
  ...

-- ---

(hijacking this... need to find a better location...)

@@@

Introduction

@@ Why it is exceedingly important to have validators, etc. @@

Adding support for new formats

Basic Requirements

There is little difference between a Validator and ordinary implementations of a format; in order to properly support the format, the applicable conformance requirements must be clearly understood and implemented in a consistent manner. The Validator processes a document basically as follows:

Determine all constraints
Check for all constraints
Present the result
- all constraints met
- one or more constraints not met
- unknown

Determine all constraints

The first item requires that the document in some way indicates the applicable constraints. In XML for example this is done using the XML declaration, <?xml version="1.0"?> (or no XML declaration) for XML 1.0, <?xml version="1.1"?> for XML 1.1.
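For the XML case, determining the applicable version could look roughly like the Python sketch below. It works on already decoded text, ignores byte order marks and encoding detection, and treats a malformed declaration like a missing one. These simplifications, as well as the name `xml_version`, are assumptions for the example; this is not a conforming XML declaration parser.

```python
import re

# An XML declaration must appear at the very start of the document.
XML_DECL_VERSION = re.compile(r"""^<\?xml\s+version\s*=\s*(["'])(1\.[0-9]+)\1""")

def xml_version(document):
    """Return the declared XML version, or "1.0" when there is no XML declaration."""
    match = XML_DECL_VERSION.match(document)
    return match.group(2) if match else "1.0"
```

For example, `xml_version('<?xml version="1.1"?><x/>')` yields `"1.1"`, while a document without a declaration is treated as XML 1.0.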

This requires that documents are written such that this information can be derived from the document. If documents do not provide sufficient information, the Validator would need other means to determine the constraints, e.g. by asking the user to specify them.

The latter is typically difficult to implement; it is thus recommended that specifications provide means that guarantee that such information can be derived from all conforming documents.

Check for all constraints

Once the applicable constraints have been determined, these are checked for. This requires an algorithm that translates all possible inputs into a result (conforming: yes/no/unknown) along with a rationale for the result (e.g., a list of errors). This is the most difficult part, and requires that conformance requirements for documents and (thus) Validators are well-defined.

Specifications must provide this information in an easily accessible manner. It is common that conformance requirements are in part spelled out in formal languages such as EBNF grammars and schema documents. To reduce the effort needed to support a new format in the Validator, it is reasonable to base support on such formal languages, as support for this partial validation is often already available.

It is however also common to define constraints that are not expressed in the formal language (e.g., because using the formal language is not possible or because it is too difficult to use it accordingly). For such requirements additional implementation effort is necessary to fully meet the implementation requirements for Validators.

It is thus important that these additional requirements are easy to derive from specifications. For example, specifications can provide a list with all requirements that validation using the formal language would not check for. Some existing specifications clearly identify all errors, e.g. XSLT 2.0.

Present the result

It is important that the users of the Validator fully understand the result of the validation process. There are three states. If not all constraints have been met, the result is negative. If the Validator could not find a constraint that has not been met, the result could be positive, but that is not always the case.

In this case the result depends on whether the Validator fully understood the input; the format might allow some open-ended extensibility that is not fully supported by the Validator.

For example, the format might allow the inclusion of URI references and require that the URI references are constructed according to the URI specification and the requirements of the URI scheme; if the document then uses a scheme that the Validator does not fully support, it cannot tell whether the document is actually conforming. This must then be clearly indicated in the result.

As discussed later in this document, there might not be a 1:1 relationship between the validation result and actual conformance; in this case it must be clear to the user how to interpret the result, what is checked for and what is not.
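The three result states can be made explicit in code, as in this Python sketch. The names `Result` and `overall_result` and the two input lists are assumptions for the example; the state descriptions are the ones used in the processing outline earlier in this section.

```python
from enum import Enum

class Result(Enum):
    PASSED = "all constraints met"
    FAILED = "one or more constraints not met"
    UNKNOWN = "unknown"

def overall_result(violations, unsupported_features):
    """Derive the overall state from what was found and what could not be checked."""
    if violations:
        # A single known violation is enough for a negative result.
        return Result.FAILED
    if unsupported_features:
        # Nothing found, but parts of the input were not fully understood.
        return Result.UNKNOWN
    return Result.PASSED
```

A document that uses a URI scheme the Validator cannot check, and that shows no other problems, would thus come out as `Result.UNKNOWN` rather than as a pass.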

Conclusion

Specifications ideally treat Validators as first-class implementations and define in the specification the conformance requirements for Validators. This should include making it easy for Validator developers to derive conformance requirements that are not checked for using a formal language (like a schema, if any) and allowing users to clearly understand the requirements for and limitations of conforming validators.

Proper implementation of such requirements depends on proper conformance requirements in the first place. For all protocol elements of a format it must be clear how to determine whether it is conforming or not. This might sound obvious, but it is common that specifications do not get all the details right.

For example, consider the contentScriptType attribute in SVG 1.1 and consider these examples:

  contentScriptType = 'application'
  contentScriptType = 'application/ecmascript'
  contentScriptType = 'application/ecmascript;version=example'
  contentScriptType = 'application/ecmascript;foo=bar'
  contentScriptType = 'application/x-ecmascript'
  contentScriptType = 'application/perlscript'
  contentScriptType = 'Hello World'
  contentScriptType = '€'

Refer to the application/ecmascript draft for the type registration. Take five minutes to determine for each example whether the Validator is licensed to generate an error or a warning.

@@ need to come up with a better example here... or maybe better no example and better prose...

Design Goals

@@ Validator MUST NOT declare legal content invalid
@@ Validator SHOULD NOT declare illegal content valid
@@ Validator MUST be reliable (no non-experimental support for a format if the support is known to be incomplete)
@@ ...

Considering Validator Limitations

Validators are limited in their abilities to fully determine whether a document meets all requirements of the relevant specifications under all possible circumstances. For example, Validators are typically static user agents; for dynamic content they can consequently only validate a certain state of the document. They are also limited for all protocol elements that allow for extensibility.

A common case is a document format that allows the inclusion of URI references. It must be clear from a specification what Validators are required to check for. For URI references, for example, it must be clear whether URIs that do not conform to the generic URI syntax or to the specific scheme syntax must be detected (i.e., whether illegal use of URI references renders a document non-conforming) and if so, which schemes must be supported.

If illegal use of URI references renders a document non-conforming (and Validators are consequently required to detect such use), the Validator would need to note, for URI schemes that it cannot fully check, that the document uses features the Validator does not support and that it consequently cannot make a definitive statement about the compliance of the document.

Proper definition of Validator conformance requirements enables users to clearly understand the results of the validator and allows Validator developers to focus on the task of providing support for a new format without spending a lot of effort on reading meaning into requirements in specifications and their normative references.

Care must be taken in defining conformance requirements; there are many things that are better avoided. For example, requirements that depend on unstable external information are difficult to implement. Specification authors might want to encourage specific practice such as only using registered media types and charset names, but if a validator is required to check for such constraints it would need to be updated daily to avoid declaring legal use invalid.

Other constraints might be impossible to validate. For example, if an attribute value is required to be a "globally unique identifier", the validator would need to be omniscient to tell whether the requirement has been met. Specification authors should check for each protocol element whether an algorithm can be devised that returns a boolean value indicating whether the requirements have been met.

Outreach Material

bla bla ... branding ... outreach ... community ... positive statements ... terminology ... "Valid XHTML 1.1" undefined ... ValidatorOK != Fully Conforming ... bla bla

Validator Testing

Support for new formats must be thoroughly and automatically tested; in particular, all custom code (as opposed to external code such as schema validation libraries) must be tested. Working Groups should provide test suites and work with the Validator Team to integrate existing test suites into the Validator workflow.

The Validator in particular needs test cases that violate constraints; Working Groups should thus ensure that test suites include proper error handling tests. The Validator Team should work with Working Groups to contribute new tests back to the "official" test suite. This requires clear documentation of test suite guidelines that must be available to the Validator Team (i.e., publicly available).

Support Maintenance

@@ Timely responses to requests for clarifications
@@ Changes in schemas, etc. must be communicated to the Validator Team
@@ No informal agreements; changes, clarifications, etc. must be normative before they can be included in the Validator
@@ ...
