Re: SVG in HTML5 draft proposal

On Jun 19, 2008, at 2:28 AM, Anthony Grasso wrote:

> Hi Team,
>
> As per my action item ACTION-2060 [1] I've converted the Power Point  
> slides to a HTML spec-ish document (attached to this email).
>
> Additionally I've added the other ideas we have been throwing around  
> to the document.
>
> We should probably commit this to CVS somewhere before we do any  
> further edits.
>
> This closes my action item.

I'm confused about the introductory material's constant reference to  
"content handlers", which apparently is supposed to mean plugins.  
Modern browser engines handle SVG natively and would presumably do so  
in HTML just as in XML. So I do not see how these rules could apply to  
WebKit for example.

Regards,
Maciej


>
>
> Thanks,
>
> Anthony
>
>
> [1] http://www.w3.org/Graphics/SVG/WG/track/actions/2060
> SVG and HTML
>
> SVG (Scalable vector graphics) is a powerful concept of describing  
> graphical elements and animations, based on XML. SVG was originally  
> designed as a presentation only framework (like scalable documents)  
> and slowly becoming more and more UI centric.
>
> Things like user interactivity and animations are making SVG a good  
> candidate for integration in User Interface framework and WEBContent  
> Handler such as HTML in browsers, utilizing its power of flexibility  
> and scalability.
>
> In many cases SVG is used as a stand-alone SVG content, but very  
> often, there is another markup language used as host of SVG. Example  
> of such host markup languages would be HTML, XUL, etc. To correctly  
> integrate with (or embed in) other markup languages, multiple things  
> need to be considered.
>
> This document addresses some of the issues that need to be taken  
> into consideration when creating a hybrid design. It also proposes a  
> potential, high-level solutions for some of those problems.
>
> HTML/SVG Markup Integration Considerations
>
> Interleaved Parsing of HTML/SVG
> JavaScript context sharing
> JavaScript node access
> CSS integration
> SVG Widgets
> Focus Navigation
> Unified API (IDL)
> Interleaved Parsing of HTML/SVG
>
> As the reader is probably already aware, HTML and SVG are both  
> markup languages. SVG follows the strict XML rules and HTML allows  
> for a looser syntax, like tags not been closed, etc.
>
> To accommodate the complexity of integrating both, SGML & XML type  
> languages, SVG WG proposes a use of a cascaded parser. Cascading  
> parser allows recursive nested invocations between SGML and XML  
> Content Handler.
>
> Since XML already defines how XML languages from different  
> namespaces interact with each-other, it is valid to have one XML  
> language embedded in another.
>
> When embedding HTML inside SVG, the HTML markup must be well formed.
>
> Consider adding detail about what happens when HTML is encountered  
> inside the SVG. Parser switch? parser switching back? and at what  
> element does it resume? SVG may need to specify other rules for  
> embedding HTML in SVG.
> Example 1: SVG embedded in HTML
>
> <head>
>   <title>HTML Body of SVG</title>
> </head>
> <body>
>   <form>
>     <svg width="480" height="640" id="svg1"
>       xmlns="http://www.w3.org/2000/svg" xml:space="preserve"  
> version="1.2" baseProfile="tiny" >
>       <g fill-opacity="0.7" stroke="black" stroke-width="0.2" >
>         <circle fill="red" cx="100" cy="100" r="100"/>
>       </g>
>     </svg>
>   </form>
> </body>
>
> Eample 2: HTML embedded in SVG
>
> <form>
>   <svg width="480" height="640" id="svg3" viewBox="-160 -120 480  
> 640" stroke-miterlimit="2"
>     xmlns="http://www.w3.org/2000/svg" xml:space="preserve"  
> version="1.2" baseProfile="tiny">
>     <g fill-opacity="0.7" stroke="black" stroke-width="0.2">
>       <table border="1">
>         <caption><em>A test table</em></caption>
>         <tr>
>           <th rowspan="2"></th><th colspan="2">Average</th>
>           <th rowspan="2">Red<br/>eyes</th>
>         </tr>
>         <tr>
>           <th>height</th><th>weight</th>
>         </tr>
>         <tr>
>           <th>Males</th><td>1.9</td><td>0.003</td><td>40%</td>
>         </tr>
>         <tr>
>           <th>Females</th><td>1.7</td><td>0.002</td><td>43%</td>
>         </tr>
>       </table>
>       <circle fill="green" cx="100" cy="100" r="100"/>
>     </g>
>   </svg>
> </form>
>
> Cascading Parser Structure
>
> Cascading parser is combination of parsers, cascaded together to  
> process content of a different type – HTML and XML for example. The  
> individual parsers have their own rules, specific to the way the  
> markup language is structured. All of the individual parsers need to  
> also follow common rules of interactions with the other parsers.
>
>
> Content Handler
>
> A Content Handler is a logical component, which understands and  
> processes particular content type.
>
> Historically, the Content Handler are called Content plug-ins. The  
> interface is built for most plug-ins to only allow for content  
> inclusion by reference, i.e. no real content integration. For  
> example, displaying SVG from a plug-in in a WEB browser window, does  
> not make it integrated SVG Content Handler of HTML. Only if elements  
> of SVG can be interleaved with HTML content, then it can be  
> considered integrated. For example, it should be possible for an SVG  
> image and path elements to be displayed as elements of HTML table.  
> Another example is SVG text on path can be overplayed on top of HTML  
> Image object. From user visual perspective the content behaves as a  
> single markup language.
>
> Content Processing
>
> There are multiple steps a content processor needs to go through  
> before a markup stream becomes a visual representation of the input  
> content. The processing steps used are heavily based on the design  
> of the Content Handler. A legacy Content Handlernormally goes  
> through more steps than the contemporary SAX based parser  
> processors. For example many legacy Content Handler would tokenize  
> the stream and then start identifying and creating the elements  
> representing the tag elements in different steps.
>
> Additionally, if the Content Handler supports progressive parsing,  
> it may process the content in a different way to that of a non- 
> progressive parser.
>
> Regardless of the type of the parser the Content Handler is using,  
> the elements need to be identified from the markup stream, created  
> and displayed.
>
> Content Processing Steps
>
> These are the major content processing steps:
>
> Tokenizing
> Element Identification
> Element Creation
> DOM constructing and/or element addition
> Layout Re-flow
> Element rendering
> Based on the implementation and the type of Content Handler, some of  
> the steps (listed above) can be combined in a single operation and  
> some of these can be broken down into more steps.
>
> Note that Element Identification is an important process in the  
> cascading parsing schema.
>
> Tokenizing
>
> The purpose of the Tokenizing step is to break the stream down to  
> tokens that can be interpreted as elements and element attributes at  
> the element identification and construction stage.
>
> Consider rewording the above sentence to say: "... interpreted as  
> elements, element attributes, tags and properties ..."
> HTML and XML have a very similar way of tokenizing. However, the  
> tokeniser must maintain information about token casing.
>
> Element Identification
>
> Element identification is a very important step in the cascading  
> parser schema.
>
> Cascading parser maintains tables of:
>
> Element tags – element constructor routines pairs (regular parsers  
> also have such table);
> Reference to Content Handler, which maintains "Element tag - Content  
> Handler" pairs table;
> Namespace mapping table (for XML based parsers);
> The correspondence between a tagname and a handler is not  
> necessarily one-to-one. For example: "font", "video", "a". The  
> context in which the tag is encountered is also responsible for  
> deciding which handler to use.
> When a HTML parser encounters an unrecognized element during the  
> element identification step, it does the following:
>
> Terminates the current element or attribute tag – For example it  
> terminates P, LI, TR, TH, etc. elements.
>
> This may not be entirely accurate. There are many quirks to how  
> broken content is repaired. This is suggesting that svg elements are  
> recognized and parsed anywhere inside HTML. It should be pointed out  
> that XML namespaces apply too. However, this does not solve the  
> element context problem. The DOM and ecma script contexts must be  
> the same, otherwise I see no gain compared to inclusion-by-reference  
> (i.e. <object data="some.svg">).
> Makes a query of the Unknown Element tag to the Content Handlers for  
> a Content Handler, which can handle the unrecognized element.
>
> If a suitable Content Handler is not found, it obeys the HTML rules  
> for undefined element(s).
>
> If a Content Handler that can handle the element is found, the HTML  
> parser passes the control of the file or token stream to the Content  
> Handler. Based on the implementation. The HTML parser may also need  
> to pass JS context and DOM branch reference to the identified  
> Content Handler.
>
> The newly identified Content Provider, starts the processing of  
> (SVG) elements until it encounters unrecognized element error or a  
> namespace, which it can not handle. The SVG Content Handler invokes  
> the initial Content Handler in an attempt to identify the  
> unrecognized element with another Content Handler.
>
> What is "an unrecognized element"? This may not be clear enough.  
> Consider for example <fie:foo xmlns:fie="http://www.foo.com/fooml"/>.
> If there is a Content Handler that can handle the unrecognized  
> element, the file or token stream is passed the new Content Handler.
>
> If there is no Content Handler found, then the control of the stream  
> is returned back to the initial Content Handler of the stream. The  
> beginning of the element token that was not recognized is returned.
>
> The HTML parser would need to continue processing the rest of the  
> HTML elements until it encounters another unrecognized element.
>
> If the embedded Content Handler returns with error and the content  
> state is not recognized, the HTML parser will thread the stream from  
> this point on as malformed element and should try to identify the  
> first valid HTML element or a valid token and recover the normal  
> parsing flow.
>
> In the example 1, the HTML parser is going to find an element called  
> SVG (or SVG:SVG) and going to query for a content provider which can  
> handle SVG element. If there is such a provider, then the control of  
> file or token stream is passed to the SVG content provider along  
> with optional JS and DOM branch references.
>
>
> How are malformed errors in SVG handled? As per the SVG spec OR  
> using other rules passed to SVG content provider?
>
> Why should parsing continue outside of the element where XML parsing  
> started? When that element is closed control should be given back to  
> the parent parser. Note that allowed characters in tag names and in  
> content may differ between HTML and XML.
>
> Is it assuming that elements up until the point of error were  
> inserted? What about the element that had the error? What if it was  
> a miss-nesting error? Should the currently open elements up to the  
> point where XML parsing began on the stack be closed when this  
> happens? This section needs to be covering more.
> JavaScript Context Sharing
>
> In cascaded parsing, when one Content Handler passes the JavaScript  
> context to next level below, it is up to the new Content Handler to  
> continue using the existing parent's JavaScript context or create a  
> new one.
>
> Additionally, the parent Content Handler may decide not to pass a  
> valid JavaScript context to the embedded Content Handler. In this  
> case the embedded Content Handler is forced to create a new  
> JavaScript context, if needed.
>
> If an unified DOM3 interface is exposed from the embedded Content  
> Handler, it would be possible for the parents JavaScript context to  
> traverse the child nodes of the embedded content. If this is not  
> supported, the parent's JavaScript content should be allowed to at  
> least see the top-level content node (SVG element for example and  
> its attributes in the SVG content provider case). Please note that  
> attributes exposed in the last case can vary from the standard SVG  
> attributes. These attributes would be in relation to the parent  
> (from SVG also referred as User Agent SVG control attributes).
>
> The scripting context must be the same when using a "content  
> handler" in the same document. The parser should always put the  
> nodes in the same document. If you want separate contexts and  
> separate documents then it's already possible with inclusion-by- 
> reference (CDR in the CDF specs).
>
> "Please note that attributes exposed in the last case can vary from  
> the standard SVG attributes." - Not clear what this means. This  
> section should be fully rewritten or removed.
> CSS Parsing Integration
>
> Usually a separate parser than the host one is used for CSS rules.
>
> "Usually a separate parser than the host one is used for CSS rules."  
> - Not clear to me what that is supposed to mean.
> A particular implementation may decide to pass the CSS processing  
> rules to the child Content Handler. However, it is better to pass  
> this information during re-flow and rendering phase of the content  
> processing.
>
> CSS must apply to all nodes in the same document as defined in the  
> CSS and SVG specs. A may is not good enough. Rewrite suggestion for  
> the last part of the section: "Style information that is applied to  
> a particular node in the document is usually passed to that node  
> during layout and rendering of the document." Not sure it's relevant  
> to include it though.
>
> Need to describe exactly how this is better - more efficient  
> performance wise?
>
> Note a CSS parser is a non-HTML parser inside HTML, so it's somewhat  
> similar to introducing XML parser into HTML.
> SVG Widgets
>
> For an SVG parser to correctly process the content, all the content  
> information, including namespace definition must be included in the  
> document. This would be OK, when sealing with SVG content which is  
> standalone as part of HTML frame or HTML object (this being the case  
> of example 1).
>
> If the SVG content is embedded in HTML elements such as lists or  
> tables, then the content becomes very convoluted with the SVG  
> namespace information repeated over and over again (see example 3).
>
> Example 3
>
> <form>
>   <table border="1"  align="center">
>     <caption><em>A test table with SVG elements</em></caption>
>     <tr>
>       <svg id="svgRoot" version="1.2" baseProfile="tiny"  
> viewBox="-160 -120 480 640"
>         width="480" height="640" stroke-miterlimit="2"  
> zoomAndPan="enable" xml:space="preserve"
>         xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink 
> " xmlns:xe="http://www.w3.org/2001/xml-events">
>         <g fill-opacity="0.7" stroke="black" stroke-width="0.2">
>           <circle fill="red" cx="100" cy="100" r="100"/>
>         </g>
>       </svg>
>     </tr>
>     <tr>
>       <svg id="svgRoot" version="1.2" baseProfile="tiny"  
> viewBox="-160 -120 480 640"
>         width="480" height="640" stroke-miterlimit="2"  
> zoomAndPan="enable" xml:space="preserve"
>         xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink 
> " xmlns:xe="http://www.w3.org/2001/xml-events">
>         <g fill-opacity="0.7" stroke="black" stroke-width="0.2">
>           <circle fill="green" cx="100" cy="100" r="100"/>
>         </g>
>       </svg>
>     </tr>
>     <tr>
>       <svg id="svgRoot" version="1.2" baseProfile="tiny"  
> viewBox="-160 -120 480 640"
>         width="480" height="640" stroke-miterlimit="2"  
> zoomAndPan="enable" xml:space="preserve"
>         xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink 
> " xmlns:xe="http://www.w3.org/2001/xml-events">
>         <g fill-opacity="0.7" stroke="black" stroke-width="0.2">
>           <circle fill="blue" cx="100" cy="100" r="100"/>
>         </g>
>       </svg>
>     </tr>
>   </table>
> </form>
>
> If the SVG content is embedded in HTML elements such as lists or  
> tables, then the content becomes very convoluted with the SVG  
> namespace information repeated over and over again (see example 3).
>
> From example 3 it is obvious that the same document definition  
> information is repeated for every SVG element in the HTML table. If  
> the table had individual cells, the content would look even more  
> complex and bulky.
>
> The SVG widget concept is addressing the issue of repeated SVG  
> document information by packing common document information at one  
> place and leaving in the embedded content the information only  
> related to the SVG elements or fragment (group of elements).
>
> SVG Widget Structure
>
> Example 4: SVG Widget Structure - svgDoc
>
> <form>
>   <svgDoc id="svgRoot" version="1.2" baseProfile="tiny"  
> viewBox="-160 -120 480 640"
>     width="480" height="640" stroke-miterlimit="2"  
> zoomAndPan="enable" xml:space="preserve"
>     xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink 
> " xmlns:xe="http://www.w3.org/2001/xml-events">
>     <defs>
>       <style type="text/css">
>         <![CDATA[
>           circle { fill:red; stroke: blue; stroke-width:3 }
>         ]]>
>       </style>
>     </defs>
>   </svgDoc>
>   ...
> </form>
>
> Example 5: Reference SVG Widget (svgDoc) using svgWidget
>
> <table border="1"  align="center">
>   <caption> <em>A test table with SVG elements</em></caption>
>   <tr>
>   <svgWidget docId="myDoc1">
>       <g fill-opacity="0.7" stroke="black" stroke-width="0.2">
>         <circle fill="red" cx="100" cy="100" r="100"/>
>       </g>
>     </svgWidget>
>   </tr>
>   <tr>
>   <svgWidget docId="myDoc1">
>       <g fill-opacity="0.7" stroke="black" stroke-width="0.2">
>         <circle fill="gree" cx="100" cy="100" r="100"/>
>       </g>
>     </svgWidget>
>   </tr>
>   <tr>
>   <svgWidget docId="myDoc1">
>       <g fill-opacity="0.7" stroke="black" stroke-width="0.2">
>         <circle fill="blue" cx="100" cy="100" r="100"/>
>       </g>
>     </svgWidget>
>   </tr>
> </table>
>
>
> CSS does not have to be nested inside SVG to have the rules apply to  
> SVG nodes in the same document. This seems like a way of doing a  
> hardcoded (or very limited) document transformation, which is  
> already possible with XBL or XSLT. However, allowing (or requiring)  
> XML namespaces in HTML would solve the same problem, if coupled with  
> a statement that any <svg> element is recognized as an <svg> element  
> in the svg namespace even it doesn't specify xmlns="http://www.w3.org/2000/svg 
> ". While it's nice to not have to do alot of extra typing it would  
> break the ability to copy-paste svg source into existing svg editors.
> SVG Widget Advantage
>
> Makes the content lightweight
> SVG fragments have a common place to control document attributes,  
> namespace definition, CSS
>
> "SVG fragments have a common place to control document attributes,  
> namespace definition, CSS" - CSS can be stored in many different  
> places, but it is not clear why SVG would need anything special for  
> that. If the document is HTML, then a style element in the head  
> section is a good place. External stylesheets is another place.
> SVG widgets may have common or separate JavaScript context
>
> See previous comment about this.
> Common JavaScript routines can be defined at the document level,  
> accessible from all the widgets, belonging to that particular document
>
> Widgets size and position can be controlled from HTML layout  
> container (Tables, Lists, etc.)
>
> This can already be controlled in existing browsers, you can set the  
> width and height properties from CSS on the <svg> element.
>
> High level view of the changes to HTML5
>
> Make tokeniser case-preserving
> Handle xml namespaces
> Remove the hardcoded lists of case-fixups for svg
> Special parsing mode for SVG/MathML
> Summary of changes to HTML5 specification
>
> Make tokeniser case-preserving
>
> Remove hard coded table of case fixes for svg/mathml elements and  
> attributes (if still in the spec).
>
> Merge the "U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN  
> CAPITAL LETTER Z" and "U+0061 LATIN SMALL LETTER A through to U+007A  
> LATIN SMALL LETTER Z" cases, using the definition for the latter, in:
>
> Tag open state: If the content model flag is set to the PCDATA state
> Close tag open state
> Drop the "U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN  
> CAPITAL LETTER Z" case in:
>
> Tag name state
> Before attribute name state
> Attribute name state
> After attribute name state
> Change: When the steps below require the UA to insert an HTML  
> element for a token, the UA must first create an element for the  
> token in the HTML namespace, and then append this node to the  
> current node, and push it onto the stack of open elements so that it  
> is the new current node.
>
> To:
>
> When the steps below require the UA to insert an HTML element for a  
> token, the token and all attribute tokens it contains are first  
> normalized to lowercase [mapping A..Z to a..z]. If there are  
> attribute tokens with the same name it is a parse error, discard all  
> attribute tokens that are duplicates and the value that is  
> associated with each such token (if any), keep the first occurrence  
> of an attribute token whose name is duplicated. Then the UA must  
> create an element for the normalized token in the HTML namespace,  
> and then append this node to the current node, and push it onto the  
> stack of open elements so that it is the new current node.
>
> Remove the paragraph: When the user agent leaves the attribute name  
> state (and before emitting the tag token, if appropriate), the  
> complete attribute's name must be compared to the other attributes  
> on the same token; if there is already an attribute on the token  
> with the exact same name, then this is a parse error and the new  
> attribute must be dropped, along with the value that gets associated  
> with it (if any).
>
> Cases in the "in foreign content" insertion mode need to be case- 
> sensitive
>
> Handle xml namespaces
>
> Change the definition of "adjust foreign attributes"
>
> Special parsing mode for SVG/MathML
>
> Add the parse mode to "The in body insertion mode"
>
> Sections of specification with changes
>
> 9.2.4 Tokenisation
>
> Tag open state
> The behavior of this state depends on the content model flag.
>
> If the content model flag is set to the RCDATA or CDATA states
> Consume the next input character. If it is a U+002F SOLIDUS (/)  
> character, switch to the close tag open state. Otherwise, emit a U 
> +003C LESS-THAN SIGN character token and reconsume the current input  
> character in the data state.
>
> If the content model flag is set to the PCDATA state
> Consume the next input character:
>
> U+0021 EXCLAMATION MARK (!)
> Switch to the markup declaration open state.
> U+002F SOLIDUS (/)
> Switch to the close tag open state.
> U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER  
> Z and U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL  
> LETTER Z
> Create a new start tag token, set its tag name to the input  
> character, then switch to the tag name state. (Don't emit the token  
> yet; further details will be filled in before it is emitted.)
> U+003E GREATER-THAN SIGN (>)
> Parse error. Emit a U+003C LESS-THAN SIGN character token and a U 
> +003E GREATER-THAN SIGN character token. Switch to the data state.
> U+003F QUESTION MARK (?)
> Parse error. Switch to the bogus comment state.
> Anything else
> Parse error. Emit a U+003C LESS-THAN SIGN character token and  
> reconsume the current input character in the data state.
> Close tag open state
> If the content model flag is set to the RCDATA or CDATA states but  
> no start tag token has ever been emitted by this instance of the  
> tokeniser ( fragment case), or, if the content model flag is set to  
> the RCDATA or CDATA states and the next few characters do not match  
> the tag name of the last start tag token emitted (case  
> insensitively), or if they do but they are not immediately followed  
> by one of the following characters:
>
> U+0009 CHARACTER TABULATION
> U+000A LINE FEED (LF)
> U+000C FORM FEED (FF)
> U+0020 SPACE
> U+003E GREATER-THAN SIGN (>)
> U+002F SOLIDUS (/)
> EOF
> ...then emit a U+003C LESS-THAN SIGN character token, a U+002F  
> SOLIDUS character token, and switch to the data state to process the  
> next input character.
>
> Otherwise, if the content model flag is set to the PCDATA state, or  
> if the next few characters do match that tag name, consume the next  
> input character:
>
> U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER  
> Z and U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL  
> LETTER Z
> Create a new end tag token, set its tag name to the input character,  
> then switch to the tag name state. (Don't emit the token yet;  
> further details will be filled in before it is emitted.)
> U+003E GREATER-THAN SIGN (>)
> Parse error. Switch to the data state.
> EOF
> Parse error. Emit a U+003C LESS-THAN SIGN character token and a U 
> +002F SOLIDUS character token. Reconsume the EOF character in the  
> data state.
> Anything else
> Parse error. Switch to the bogus comment state.
> Tag name state
> Consume the next input character:
>
> U+0009 CHARACTER TABULATION
> U+000A LINE FEED (LF)
> U+000C FORM FEED (FF)
> U+0020 SPACE
> Switch to the before attribute name state.
> U+003E GREATER-THAN SIGN (>)
> Emit the current tag token. Switch to the datastate.
> EOF
> Parse error. Emit the current tag token. Reconsume the EOF character  
> in the data state.
> U+002F SOLIDUS (/)
> Switch to the self-closing start tag state.
> Anything else
> Append the current input character to the current tag token's tag  
> name. Stay in the tag name state.
> Before attribute name state
> Consume the next input character:
>
> U+0009 CHARACTER TABULATION
> U+000A LINE FEED (LF)
> U+000C FORM FEED (FF)
> U+0020 SPACE
> Stay in the before attribute name state.
> U+003E GREATER-THAN SIGN (>)
> Emit the current tag token. Switch to the data state.
> U+002F SOLIDUS (/)
> Switch to the self-closing start tag state.
> U+0022 QUOTATION MARK (")
> U+0027 APOSTROPHE (')
> U+003D EQUALS SIGN (=)
> Parse error. Treat it as per the "anything else" entry below.
> EOF
> Parse error. Emit the current tag token. Reconsume the EOF character  
> in the data state.
> Anything else
> Start a new attribute in the current tag token. Set that attribute's  
> name to the current input character, and its value to the empty  
> string. Switch to the attribute name state.
> Attribute name state
> Consume the next input character:
>
> U+0009 CHARACTER TABULATION
> U+000A LINE FEED (LF)
> U+000C FORM FEED (FF)
> U+0020 SPACE
> Switch to the after attribute name state.
> U+003D EQUALS SIGN (=)
> Switch to the before attribute value state.
> U+003E GREATER-THAN SIGN (>)
> Emit the current tag token. Switch to the data state.
> U+002F SOLIDUS (/)
> Switch to the self-closing start tag state.
> U+0022 QUOTATION MARK (")
> U+0027 APOSTROPHE (')
> Parse error. Treat it as per the "anything else" entry below.
> EOF
> Parse error. Emit the current tag token. Reconsume the EOF character  
> in the data state.
> Anything else
> Append the current input character to the current attribute's name.  
> Stay in the attribute name state.
> After attribute name state
> Consume the next input character:
>
> U+0009 CHARACTER TABULATION
> U+000A LINE FEED (LF)
> U+000C FORM FEED (FF)
> U+0020 SPACE
> Stay in the after attribute name state.
> U+003D EQUALS SIGN (=)
> Switch to the before attribute value state.
> U+003E GREATER-THAN SIGN (>)
> Emit the current tag token. Switch to the data state.
> U+002F SOLIDUS (/)
> Switch to the self-closing start tag state.
> EOF
> Parse error. Emit the current tag token. Reconsume the EOF character  
> in the data state.
> Anything else
> Start a new attribute in the current tag token. Set that attribute's  
> name to the current input character, and its value to the empty  
> string. Switch to the attribute name state.
> 9.2.5.1. Creating and inserting elements
>
> When the steps below require the UA to create an element for a token  
> in a particular namespace, the UA must create a node implementing  
> the interface appropriate for the element type corresponding to the  
> tag name of the token in the given namespace (as given in the  
> specification that defines that element, e.g. for an a element in  
> the HTML namespace, this specification defines it to be the  
> HTMLAnchorElement interface), with the tag name being the name of  
> that element, with the node being in the given namespace, and with  
> the attributes on the node being those given in the given token.
>
> The interface appropriate for an element in the HTML namespace that  
> is not defined in this specification is HTMLElement. The interface  
> appropriate for an element in another namespace that is not defined  
> by that namespace's specification is Element.
>
> When the steps below require the UA to insert an HTML element for a  
> token, the token and all attribute tokens it contains are first  
> normalized to lowercase [mapping A..Z to a..z].
>
> If there are attribute tokens with the same name it is a parse  
> error, discard all attribute tokens that are duplicates and the  
> value that is associated with each such token (if any), keep the  
> first occurrence of an attribute token whose name is duplicated. The  
> UA must then create an element for the normalised token in the HTML  
> namespace. The newly created node must be appended to the current  
> node and push it onto the stack of open elements so that it is the  
> new current node.
>
> The steps below may also require that the UA insert an HTML element  
> in a particular place, in which case the UA must follow the same  
> steps except that it must insert or append the new node in the  
> location specified instead of appending it to the current node.  
> (This happens in particular during the parsing of tables with  
> invalid content.)
>
> When the steps below require the UA to insert a foreign element for  
> a token, the UA must first create an element for the token in the  
> given namespace, and then append this node to the current node, and  
> push it onto the stack of open elements so that it is the newcurrent  
> node. If the newly created element has an xmlns attribute in the  
> XMLNS namespace whose value is not exactly the same as the element's  
> namespace, that is a parse error.
>
> When the steps below require the user agent to adjust foreign  
> attributes for a token, then for each attribute on the token the  
> following stemps must be applied:
>
> If the attribute starts with xml: and has one or more characters  
> following, let the attribute be a namespaced attribute with the  
> prefix xml, local name being the characters after the colon, and the  
> namespace being the XML namespace, and abort these steps.
> If the attribute starts with xmlns: and has one or more characters  
> following, let the attribute be a namespaced attribute with the  
> prefix xmlns, local name being the characters after the colon, and  
> the namespace being the XMLNS namespace, and abort these steps.
> If the attribute is xmlns, let the attribute a namespaced attribute  
> with no prefix, the local name xmlns, and the namespace being the  
> XMLNS namespace.
> If the attribute starts with one or more characters other than the  
> colon that are followed by a colon that is followed by one or more  
> characters, and there is an attribute on the token or on an element  
> in the stack of open elements that is in the XMLNS namespace and has  
> the local name being the same as the characters before the colon on  
> the token, let the attribute be a namespaced attribute with the  
> prefix being the characters before the colon, the local name being  
> the characters after the colon, and the namespace being the value of  
> the matched XMLNS attribute. If there are multiple matching XMLNS  
> attributes, the one on the token takes precedence over the ones on  
> the stack [of open elements], and the bottommost node on the stack  
> takes precedence over the next-to-bottommost node on the stack and  
> so on.
> 9.2.5.10. The " in body" insertion mode
>
> When the insertion mode is " in body", tokens must be handled as  
> follows:
>
> ...
> A start tag whose case-sensitive tag name is "math" that has a case- 
> sensitive attribute "xmlns" with the value "http://www.w3.org/Math/1998/MathML 
> ":
> A start tag whose case-sensitive tag name is "svg" that has a case- 
> sensitive attribute "xmlns" with the value "http://www.w3.org/2000/ 
> svg":
> A start tag whose case-sensitive tag name is "<str>:math" that has a  
> case-sensitive attribute "xmlns:<str>" with the value "http://www.w3.org/Math/1998/MathML 
> ":
> A start tag whose case-sensitive tag name is "<str>:svg" that has a  
> case-sensitive attribute "xmlns:<str>" with the value "http://www.w3.org/2000/svg 
> ":
> Save the tokeniser content model flag to old-state.
>
> Switch the tokeniser's content model flag to the CDATA state.
>
> The CDATA state isn't good enough since it transforms characters  
> (and this mode can probably be exited too easily, which is not  
> desirable either). A new CDATA-like state that passes data through  
> unmodified may be required.
> Create a new XML parser. Set the encoding to the character encoding  
> used by the HTML parser.
>
> Feed the XML parser the string corresponding to the start tag of the  
> element along with all its attributes.
>
> Let the XML parser parse and insert the foreign element.
>
> Added a link to foreign element insertion. Is this ok or should it  
> be removed?
> Then continue to feed character tokens to the XML parser until it:
>
> returns with a fatal error [XML10]
> closes the current node (the element where XML parsing began), with  
> no errors
> asks for another parser to handle an unknown tagname
> This is loosely based on the cascading parser ideas, but needs more  
> work.
> If the XML parser returns a fatal error:
>
> close all open elements on the stack up to the current node (the  
> element where XML parsing began)
> Destroy the XML parser.
> Reset the tokeniser content model flag to the old-state.
> Reset the insertion mode appropriately.
> If the XML parser returns with success, then destroy the XML parser.  
> Then reset the tokeniser content model flag to the old-state and  
> reset the insertion mode appropriately.
>
> If the XML parser encounters a tagname which is unknown:
>
> Save #state# on a stack: Push the tokeniser state, the insertion  
> mode and the XML parser.
> Reset the tokeniser content model to the old-state. Reset the  
> insertion mode appropriately.
> Hand over the token to the HTML5 parser.
> When HTML5 parser encounters an tagname that is unknown hand the  
> token over to be handled by the parent parser (pop #state# ...).
> Note that broken tags will get closed after an XML error, so that  
> something will be visible (but will look broken). This still  
> "suffers from reparsing issues" and "is too draconian".
> 9.2.5.19. The " in foreign content" insertion mode
>
> When the insertion mode is " in foreign content", tokens must be  
> handled as follows:
>
> A character token
> Insert the token's character into the current node.
>
> A comment token
> Append a Comment node to the current node with the data attribute  
> set to the data given in the comment token.
>
> A DOCTYPE token
> Parse error. Ignore the token.
>
> A start tag whose tag name is neither "mglyph" nor "malignmark", if  
> the current node is an mi element in the MathML namespace.
> A start tag whose tag name is neither "mglyph" nor "malignmark", if  
> the current node is an mo element in the MathML namespace.
> A start tag whose tag name is neither "mglyph" nor "malignmark", if  
> the current node is an mn element in the MathML namespace.
> A start tag whose tag name is neither "mglyph" nor "malignmark", if  
> the current node is an ms element in the MathML namespace.
> A start tag whose tag name is neither "mglyph" nor "malignmark", if  
> the current node is an mtext element in the MathML namespace.
> A start tag whose tag name is "svg", if the current node is an  
> annotation-xml element in the MathML namespace.
> A start tag, if the current node is an element in the HTML namespace.
> An end tag
> Process the token using the rules for the secondary insertion mode.
>
> The secondary insertion mode must be case-sensitve for these cases.
>
> If, after doing so, the insertion mode is still " in foreign  
> content", but there is no element in scope that has a namespace  
> other than the HTML namespace, switch the insertion mode to the  
> secondary insertion mode.
>
> <cascading-parser-structure.png>

Received on Thursday, 19 June 2008 09:46:18 UTC