Revising the parse mode detection code

Hi,

  to "fix" e.g. <http://www.w3.org/Bugs/Public/show_bug.cgi?id=14>
we need to revise how `check` determines how to process text/html
resources. The question is how exactly it should do that, which includes
whether #14 is actually a bug that should be fixed. The HTML Working
Group has been asked a number of times how to "sniff" XHTML documents
and has refrained from commenting.

For browsers they made it clear that those should not sniff for XHTML
but rather ignore both the XHTML and HTML specifications and process
text/html as tag soup. Well, Steven actually said "documents served as
text/html should be treated as HTML and not as XHTML" but that would
break most documents or cause undefined behavior due to shorttags and
stuff. So what he meant was tag soup.

Since we cannot do that and they are unlikely to provide input on this
matter, we need to come up with a proper algorithm on our own. So what
should that look like? Using SGML::Parser::OpenSP we can do something
like

  package Handler;
  use strict;
  use warnings;
  
  sub new { bless {}, shift }
  
  sub start_dtd
  {
      my $self = shift;
      my $doct = shift;
      
      # ignore specified document type declarations without
      # public or system identifier and implied document type
      # declarations (which have just a GeneratedSystemId key)
      return unless exists $doct->{ExternalId}{PublicId} or
                    exists $doct->{ExternalId}{SystemId};
      
      my $puid = $doct->{ExternalId}{PublicId};
  
      # no public identifier means HTML
      die "HTML" unless defined $puid;
  
      # split public identifier at //
      my @comp = split(/\/\//, $puid);
      
      # malformed public identifiers mean HTML
      die "HTML" unless @comp > 2;
      
      # we might want something different than \s and \S here
      # but it is not clear to me what exactly we should expect
      die "HTML" unless $comp[2] =~ /^DTD\s+(\S+)/;
      
      # the first token of the public text description must include
      # the string "XHTML", see XHTML M12N section 3.1, and see also
      # http://w3.org/mid/41584c61.156809450@smtp.bjoern.hoehrmann.de
      die "HTML" unless $1 =~ /XHTML/;
      
      # otherwise consider this document XHTML
      die "XHTML"
  }
  
  sub start_element
  {
      my $self = shift;
      my $elem = shift;
      
      # no xmlns attribute means HTML
      die "HTML" unless exists $elem->{Attributes}{XMLNS};
      
      my $xmlns = $elem->{Attributes}{XMLNS};
  
      # this should use the corresponding helper function to deal
      # with some potential edge cases but it is not in CVS yet
      die "HTML" unless $xmlns->{Defaulted} eq "specified";
  
      # see above
      die "HTML" unless "http://www.w3.org/1999/xhtml" eq
        join '', map { $_->{Data} } @{$xmlns->{CdataChunks}};
        
      die "XHTML"
  }

Instead of dying, the real handler would call egp->halt() and return
HTML/XHTML through other means. This assumes that our sgml.soc is passed
as the catalog. If we remove the "DOCTYPE html ..." entry from sgml.soc
(we can and should do that if we implement doctype defaulting through
doctype rewriting, which we can and should do), this will not read any
document type definition and should thus be reasonably fast. In prose,
we will process a document using the HTML 4.01 SGML declaration unless,
when processed using the HTML 4.01 document type declaration by default,
either

  * the document has a document type declaration with a public
    identifier that, when split at //, has a third component which
    matches /^DTD\s+(\S+)/ and for which $1 matches /XHTML/, or

  * the document has no public or system identifier but an <html>
    root element with an explicitly *specified* xmlns attribute
    whose value is "http://www.w3.org/1999/xhtml"
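
To make the first rule easier to test without OpenSP, the public
identifier check can be pulled out into a plain function. This is only a
sketch: the name mode_for_public_id is made up for illustration and is
not part of `check`, and it returns the mode instead of dying.

```perl
use strict;
use warnings;

# Hypothetical helper (not in `check`): classify a public identifier
# with the same rule the start_dtd handler uses.
sub mode_for_public_id {
    my ($puid) = @_;

    # no public identifier means HTML
    return "HTML" unless defined $puid;

    # split public identifier at //
    my @comp = split m{//}, $puid;

    # malformed public identifiers mean HTML
    return "HTML" unless @comp > 2;

    # third component must look like "DTD <token> ..."
    return "HTML" unless $comp[2] =~ /^DTD\s+(\S+)/;

    # the first token after "DTD" must contain the string "XHTML"
    return $1 =~ /XHTML/ ? "XHTML" : "HTML";
}

print mode_for_public_id("-//W3C//DTD XHTML 1.0 Strict//EN"), "\n"; # XHTML
print mode_for_public_id("-//W3C//DTD HTML 4.01//EN"), "\n";        # HTML
print mode_for_public_id("-//W3C//DTD SVG 1.1//EN"), "\n";          # HTML
```

The handler could call such a function and pass the result on, which
would also let the test cases exercise the rule directly.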

Now there are many possible variations on this rather simple algorithm
that would give "better" results, for example also accepting an xmlns
attribute value of "http://www.w3c.org/1999/xhtml", but we need to draw
a line somewhere.
For example, due to default type configurations on some web servers, a
document starting with

  <?xml version="1.0" standalone="no"?>
  <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" 
    "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
  <svg width="300px" height="100px" version="1.1"
       xmlns="http://www.w3.org/2000/svg">
  ...

might be delivered as text/html, and at first sight it might make sense
to treat this document as XML. But I do not really think the parse mode
detection code is the proper place to suggest fixing the MIME type of
the document; higher-level content handlers would be a better place. For
example, if we determined an HTML parse mode and the root element is not
"html", we could stop further validation and just tell the user to fix
the document and/or MIME type.
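
A rough sketch of what such a higher-level check could look like follows.
The function name next_step and the return values "validate" and
"advise" are invented for illustration; nothing like this exists in
`check` yet.

```perl
use strict;
use warnings;

# Hypothetical helper: decide what a higher-level content handler does
# once the parse mode and the root element name are known.
sub next_step {
    my ($mode, $root) = @_;

    # XHTML parse mode: carry on with validation as usual
    return "validate" if $mode eq "XHTML";

    # element names are case-insensitive under the HTML SGML
    # declaration, so compare without regard to case
    return "validate" if defined $root and lc($root) eq "html";

    # HTML parse mode but a foreign root element such as "svg":
    # stop and tell the user to fix the document and/or MIME type
    return "advise";
}

print next_step("HTML", "SVG"), "\n";   # advise
print next_step("HTML", "HTML"), "\n";  # validate
```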

http://lists.w3.org/Archives/Public/www-archive/2004Sep/0007.html has a
number of test cases (70) to test whatever algorithm we come up with in
a `make test` fashion. I already have some more test cases locally, and
I would thus like to maintain them in CVS somewhere.

I would like to know whether there are any good reasons to use a
different algorithm to determine the parse mode, whether everyone is
okay with using SGML::Parser::OpenSP to do that, where I could maintain
the tests in CVS, and where code such as the fragment above should go at
this point (CVS repository, module names, etc.)

regards.

Received on Sunday, 5 September 2004 19:10:21 UTC