This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 9751 - [FO11] fn:parse on non-xml input
Summary: [FO11] fn:parse on non-xml input
Status: CLOSED WORKSFORME
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Functions and Operators 3.0
Version: Working drafts
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: Michael Kay
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
 
Reported: 2010-05-18 00:24 UTC by David Carlisle
Modified: 2010-06-08 19:01 UTC

Description David Carlisle 2010-05-18 00:24:55 UTC
fn:parse says ...

This function takes as input an XML document represented as a string,
...The precise process used to construct the XDM instance is implementation-defined. In particular, it is implementation-defined whether DTD validation is invoked, and it is implementation-defined whether an XML 1.0 or XML 1.1 parser is used.

I assume that it would be conformant to supply the XPath engine with a parser that accepted some syntax other than XML and returned an XDM tree (or, equivalently, returned SAX events from which an XDM tree can be built).
HTML5 is a topical example. Practically speaking, if an HTML5 parser exposing SAX events is used, the system probably can't tell the difference anyway. Accepting HTML is arguably allowed by the "implementation-defined" wording quoted above, except that the error condition

A dynamic error [err:FODC0006] is raised if the content of $arg is not a well-formed and namespace-well-formed XML document.

has to be read with care if it is not to be interpreted as saying that a system must raise an error if presented with HTML markup. Presumably (that is, I presume, but I'm unsure whether I should) the intention is that FODC0006 is raised if an XDM tree cannot be constructed by the "implementation-defined process" when supplied the string $arg.
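
To make the "strict" reading of the error condition concrete, here is a minimal Python sketch. It is an illustration only: `DynamicError`, `parse_strict`, and `error_code` are hypothetical names, and Python's `xml.etree.ElementTree` merely stands in for whatever XML parser an XPath processor would use. Under this reading, HTML-style markup that is not well-formed XML yields FODC0006.

```python
import xml.etree.ElementTree as ET


class DynamicError(Exception):
    """Hypothetical stand-in for an XPath dynamic error with an error code."""

    def __init__(self, code, message):
        super().__init__(f"[err:{code}] {message}")
        self.code = code


def parse_strict(arg):
    """Parse arg as XML; report FODC0006 if the parser rejects it."""
    try:
        return ET.fromstring(arg)
    except ET.ParseError as exc:
        # Assumption: every parser failure maps to the single code FODC0006.
        raise DynamicError("FODC0006", str(exc)) from exc


def error_code(arg):
    """Return the error code raised for arg, or None if it parses."""
    try:
        parse_strict(arg)
    except DynamicError as err:
        return err.code
    return None


well_formed = "<p>one<br/>two</p>"
html_style = "<p>one<br>two</p>"  # unclosed <br>, acceptable in HTML only
```

Calling `error_code(well_formed)` gives `None`, while `error_code(html_style)` gives `"FODC0006"`, which is the behaviour the quoted error condition appears to require.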
Comment 1 Michael Kay 2010-06-01 15:30:43 UTC
Agreed that we want to leave open the possibility of parsing non-XML documents,
but preferably in a way that still sets expectations that the normal/default
behaviour is to parse XML and return a well-defined error if the input isn't
well-formed. Editor to propose text to implement this.
Comment 2 Michael Kay 2010-06-02 21:59:43 UTC
Reading the specification again, and thinking about how one might "soften" the wording, I'm inclined to recommend keeping it as it is, despite what was said at the telcon. If the user wants a function that does something different from this, for example a function that parses HTML, then I think that should be a different function. Otherwise we seem to end up with a function that takes a string as input and produces a document node as output, using a process that is completely implementation-dependent with no guarantee of interoperability.

I think this case is different from doc() where the function accesses an external resource. In this case the input and output are entirely within the domain of the XSLT/XQuery processor, and I see no reason to make the behaviour implementation-defined except to the limited extent that XML parsing is intrinsically implementation defined.

Of course, if products are designed to allow users a choice of XML parser, then it might be possible for users to subvert the behaviour of this function to do something different from what it says in the specification. But I don't think that's something that the spec should encourage or endorse.
Comment 3 David Carlisle 2010-06-02 22:28:45 UTC
(In reply to comment #2)
> Reading the specification again, and thinking about how one might "soften" the
> wording, I'm inclined to recommend keeping it as it is.

I have some sympathy with this (that the behaviour be specified), although I could probably make use of a more lenient parsing regime if it were allowed. One of the main use cases I'd see for fn:parse is parsing inline fragments of markup (often CDATA-quoted). If the markup is actually XML, as required by the "strict" interpretation of fn:parse(), then arguably the function is just helping with a poor design style anyway: the markup would have been better unquoted and parsed as part of the original source. However, a common case is quoted HTML fragments (Atom/RSS feeds are often like this, for example), and there it's harder to say the source is using a bad style, since it would not be well-formed if unquoted.
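
The feed case can be illustrated with a small Python sketch. The feed content and the `is_well_formed` helper are invented for illustration, and `xml.etree.ElementTree` stands in for an XML parser: the feed parses because the HTML is CDATA-quoted, but neither the extracted fragment nor the unquoted feed is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Invented sample feed: the description holds CDATA-quoted HTML.
FEED = """<rss version="2.0"><channel><item>
  <title>Example</title>
  <description><![CDATA[<p>First line<br>second line</p>]]></description>
</item></channel></rss>"""


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The feed itself is fine, because the HTML is hidden inside CDATA.
feed = ET.fromstring(FEED)
fragment = feed.findtext("channel/item/description")

# Removing the CDATA quoting exposes the unclosed <br> to the XML parser.
unquoted_feed = FEED.replace("<![CDATA[", "").replace("]]>", "")
```

Here `is_well_formed(FEED)` is true, while `is_well_formed(fragment)` and `is_well_formed(unquoted_feed)` are both false, so a strictly XML-only fn:parse could not be applied to the extracted description.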

Perhaps (as with serialisation) HTML should be seen as an important (and W3C-defined) special case, and a standard (perhaps optionally supported) hook be provided: perhaps a two-argument form where the second argument is a QName naming a method (cf. the methods specified in xsl:output), so xml, html, or a system-defined name like saxon:gedcom. But maybe this is going too far, since a standard function with an argument naming a system-defined behaviour is really just an extension function in disguise. A standard way of parsing inline HTML fragments would be nice though....
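
As a rough sketch of what such a two-argument form might look like, here is a Python illustration. Everything in it is an assumption made for the example: the names `parse`, `METHODS`, and `_TreeHTMLParser`, the use of plain strings rather than QNames for method names, the synthetic `fragment` wrapper element, and the short list of void elements. The HTML branch uses Python's `html.parser`, which is lenient but is not a conforming HTML5 parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumption: a small subset of HTML void elements, enough for the example.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class _TreeHTMLParser(HTMLParser):
    """Rough HTML-to-tree builder; not a conforming HTML5 parser."""

    def __init__(self):
        super().__init__()
        self._builder = ET.TreeBuilder()
        self._open_tags = []
        # Synthetic wrapper so fragments with several top-level nodes work.
        self._builder.start("fragment", {})

    def handle_starttag(self, tag, attrs):
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID:
            self._builder.end(tag)  # void elements never stay open
        else:
            self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open_tags:
            return  # stray end tag: ignore it
        # Close any elements left open inside the one being closed.
        while self._open_tags:
            top = self._open_tags.pop()
            self._builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        self._builder.data(data)

    def finish(self):
        self.close()  # flush any buffered text
        while self._open_tags:
            self._builder.end(self._open_tags.pop())
        self._builder.end("fragment")
        return self._builder.close()


def _parse_xml(text):
    return ET.fromstring(text)


def _parse_html(text):
    parser = _TreeHTMLParser()
    parser.feed(text)
    return parser.finish()


# A system-defined name such as saxon:gedcom would be one more entry here.
METHODS = {"xml": _parse_xml, "html": _parse_html}


def parse(text, method="xml"):
    """Parse text with the named method; unknown methods are rejected."""
    if method not in METHODS:
        raise ValueError(f"unsupported parse method: {method!r}")
    return METHODS[method](text)
```

With this sketch, `parse("<p>one<br>two</p>", "html")` returns a tree, whereas the same string with the default `"xml"` method raises `ET.ParseError`. The registry also makes the objection visible: any entry beyond the standard names behaves like an extension function selected by an argument.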


If the intention is to mandate XML input, perhaps the spec should say so more forcefully. As I said in the original comment, my initial assumption (until I got to the error description) was that, since the details of XDM construction were implementation-defined, you _could_ push this to accepting non-XML. Reading it again, though, perhaps that was always a slightly "optimistic" reading, given that the paragraph starts with "This function takes as input an XML document represented as a string".

In any case, from the formal procedural point of view, I'm happy that the issue has had WG discussion and am happy to leave it to the WGs to resolve appropriately. This bug may therefore be closed (or kept open if the discussion is ongoing) with no objection from me, as original submitter.
Comment 4 Michael Kay 2010-06-08 16:45:55 UTC
Agreed in today's telcon to rename the function fn:parse-xml and make no substantive change to the specification. Marking as resolved. David, if you accept this resolution please mark as closed.
Comment 5 David Carlisle 2010-06-08 19:01:20 UTC
Not the most adventurous solution, but OK, thanks for considering, closing....