<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>5808</bug_id>
          
          <creation_ts>2008-06-26 12:52:55 +0000</creation_ts>
          <short_desc>Define a way to coerce HTML5 parser output to an XML 1.0 4th ed. + Namespaces 1.0 infoset</short_desc>
          <delta_ts>2010-10-04 14:47:03 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>HTML WG</product>
          <component>pre-LC1 HTML5 spec (editor: Ian Hickson)</component>
          <version>unspecified</version>
          <rep_platform>All</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>CLOSED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P3</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Henri Sivonen">hsivonen</reporter>
          <assigned_to name="Ian &apos;Hixie&apos; Hickson">ian</assigned_to>
          <cc>mike</cc>
    
    <cc>public-html-admin</cc>
    
    <cc>public-html-wg-issue-tracking</cc>
    
    <cc>t.broyer</cc>
    
    <cc>zcorpan</cc>
          
          <qa_contact name="HTML WG Bugzilla archive list">public-html-bugzilla</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>20914</commentid>
    <comment_count>0</comment_count>
    <who name="Henri Sivonen">hsivonen</who>
    <bug_when>2008-06-26 12:52:55 +0000</bug_when>
    <thetext>There&apos;s now a canned answer for anyone who argues that XHTML works better with the &apos;XML toolchain&apos; than HTML5: &quot;Just put an HTML5 parser at the start of your XML pipeline.&quot;

There&apos;s a slight problem though: The HTML5 parser algorithm can output a document tree that is not an XML 1.0 4th ed. + Namespaces 1.0 infoset. This poses a problem if a processing pipeline serializes to XML and expects a later stage to reparse using a conforming XML 1.0 4th ed. + Namespaces 1.0 parser or if a component in the pipeline (e.g. the XOM library) performs early checks.

Therefore, every HTML5 parser writer who wishes to provide a full-featured general-purpose HTML5 parser needs to come up with a coercion from an HTML5 DOM onto an XML 1.0 4th ed. + Namespaces 1.0 Infoset.

I suggest documenting a mapping.

Here&apos;s a list of problems with proposed solutions:

 * The document mode isn&apos;t part of the infoset: Optionally communicate as out-of-infoset-band data. Instruct apps to use the standards mode when not communicated.
 * The form pointer isn&apos;t part of the infoset: Make communicating the form pointer optional. Allow communicating it as out-of-infoset-band data. When the form element is not an ancestor of the form control, allow a UUID id attribute to be generated on the form element and a form attribute to be generated on the form control.
 * Some XML APIs treat the doctype as syntactic sugar: Make representing the document type information item optional.
 * Attributes with the local name &quot;xmlns&quot; or a local name starting with &quot;xmlns:&quot; are not permitted attribute information items: Drop on the floor.
 * Namespace declarations are not attribute information items: Drop on the floor. (Optionally synthesize namespace information items for XLink and SVG or MathML on &lt;svg&gt; and &lt;math&gt; nodes, respectively, and XHTML namespace information items on HTML elements (including root) that do not have an HTML element as the parent.)
 * Form feed is not an XML character (either literally or as a character reference expansion): turn into a space.
 * The input stream contains a literal non-XML character other than form feed: turn into a REPLACEMENT CHARACTER.
 * A comment contains &quot;--&quot;: Replace with &quot;- -&quot;.
 * A name is not an NCName: Use the original name on the tree builder stack for matching, but use an escaped name in the output. The escaping function must escape each non-NCName to a unique NCName, and the result must have at least one uppercase ASCII character but must not match any known SVG camelCase name.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>20949</commentid>
    <comment_count>1</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2008-06-26 22:57:56 +0000</bug_when>
    <thetext>Yeah, I guess we&apos;ll add a section about this in the parser section somewhere. It&apos;ll free us up a bit and allow us to diverge more from XML, instead of vainly trying to keep the two in sync all the time.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>20951</commentid>
    <comment_count>2</comment_count>
    <who name="Henri Sivonen">hsivonen</who>
    <bug_when>2008-06-27 05:55:29 +0000</bug_when>
    <thetext>(In reply to comment #1)
&gt; It&apos;ll free us up a bit and allow us to diverge more from XML, instead of vainly
&gt; trying to keep the two in sync all the time.

I think we should not add any new cases where the HTML5 parsing algorithm can produce XML-incompatible parse trees. Maintaining alternative code paths for all the things I mentioned is annoying enough as is.
</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>21287</commentid>
    <comment_count>3</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2008-07-23 02:04:16 +0000</bug_when>
    <thetext>Done, but I didn&apos;t always follow your suggestions. In particular, I made bad names and attributes just get mutated so that bad characters turn into &quot;_&quot; characters, with clashes being dealt with by dropping attributes, instead of suggesting using a mapping function.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>21289</commentid>
    <comment_count>4</comment_count>
    <who name="Henri Sivonen">hsivonen</who>
    <bug_when>2008-07-23 06:39:07 +0000</bug_when>
    <thetext>(In reply to comment #3)
&gt; Done, 

Thanks.

&gt; but I didn&apos;t always follow your suggestions. 

&quot;Construct the DOM as if appropriate namespace declarations were in scope.&quot;, &quot;Construct the DOM as if these were default namespace declarations.&quot; and &quot;Construct the DOM as if these were namespace prefix declarations.&quot; are vague compared to saying that it is permissible to a) drop NS declarations and b) synthesize NS declarations.

&gt; In particular, I made bad
&gt; names and attributes just get mutated so that bad characters turn into &quot;_&quot;
&gt; characters, with clashes being dealt with by dropping attributes, instead of
&gt; suggesting using a mapping function.

I&apos;d much prefer having a mapping function that can&apos;t cause clashes. That way, I don&apos;t need to deal with attribute name clashes. Also, when the mapping function cannot cause element name clashes, implementations that don&apos;t maintain a separate stack comparison name and an app-exposed name for elements would automatically be protected. (The Validator.nu parser maintains a stack comparison name and an exposed name separately already to allow pointer compares instead of case-insensitive string compares when an SVG camelCase name is the exposed name on the stack.)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>21291</commentid>
    <comment_count>5</comment_count>
    <who name="Thomas Broyer">t.broyer</who>
    <bug_when>2008-07-23 08:33:48 +0000</bug_when>
    <thetext>(In reply to comment #4)
&gt; 
&gt; &gt; In particular, I made bad
&gt; &gt; names and attributes just get mutated so that bad characters turn into &quot;_&quot;
&gt; &gt; characters, with clashes being dealt with by dropping attributes, instead of
&gt; &gt; suggesting using a mapping function.
&gt; 
&gt; I&apos;d much prefer having a mapping function that can&apos;t cause clashes. That way, I
&gt; don&apos;t need to deal with attribute name clashes.

Why not use an ISO 9075-like encoding? (i.e. replace \uXXXX with _xXXXX_)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>21293</commentid>
    <comment_count>6</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2008-07-23 08:42:11 +0000</bug_when>
    <thetext>Fixed. I ended up going with a U12345 scheme. (For various reasons we need a capital letter, and U seems to be the most obvious letter.)</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>