Re: Does ixml have to match the whole input? from Steven Pemberton on 2021-12-31 (public-ixml@w3.org from December 2021)

From: Steven Pemberton <steven.pemberton@cwi.nl>
Date: Fri, 31 Dec 2021 13:42:47 +0000
To: "Norm Tovey-Walsh" <norm@saxonica.com>, ixml <public-ixml@w3.org>
Message-Id: <1640956627621.2613951528.1776532855@cwi.nl>

On Friday 31 December 2021 11:16:49 (+01:00), Norm Tovey-Walsh wrote:

> Hello,
>
> I feel like I saw mention of this recently, but can’t now put my hands
> on the message where I saw it. Apologies for my failure to get this
> message into the correct thread.
>
> Consider this test from Steven:
>
> a: "a", spaces, b.
> b: spaces, "b".
> spaces: " "*.
>
> And the sample input file for that test:
>
> a   b
>
> For clarity:
>
> $ od -a tests/ambig3.inp
> 0000000 a sp sp sp b nl
> 0000006
>
> I assert that the input does not match the grammar because there’s no
> parse that allows the trailing newline character.

Correct. The section on conformance contains this constraint:

   In the normal case, when the input has a determinate length (either known in advance or signaled by some end-of-stream signal), the processor must by default parse the input in its entirety against the grammar and return either a parse tree or a failure document. Processors may provide user options for other behaviors (such as parsing the largest, or smallest, prefix of the input that is described by the grammar). Processors may also support invocation with input streams of indeterminate length.

This was what I was referring to in my recent mail ('Change in live version of ixml processor', https://lists.w3.org/Archives/Public/public-ixml/2021Dec/0097):

This is a possible future discussion point:

 If a parse succeeds without using all the available input, should that be reported as a parse error, or as an ixml:state="incomplete" (or something similar)?

meaning that a parse had been found for the root symbol, but there were trailing characters after the parse.
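
To make the distinction concrete, here is a minimal sketch in Python rather than ixml. It assumes that the regular expression `a *b` describes the same language as the test grammar (an "a", any number of spaces, a "b"). The function name `classify` and its three result labels are hypothetical and are not part of any ixml processor.

```python
import re

# Assumption: this regex describes the same language as the test grammar
#   a: "a", spaces, b.   b: spaces, "b".   spaces: " "*.
GRAMMAR = re.compile(r"a *b")

def classify(text: str) -> str:
    """Return 'complete', 'incomplete' (a prefix matched, input left over), or 'failure'."""
    if GRAMMAR.fullmatch(text):   # the whole input is described by the grammar
        return "complete"
    if GRAMMAR.match(text):       # only a prefix is described by the grammar
        return "incomplete"
    return "failure"

print(classify("a   b"))    # complete
print(classify("a   b\n"))  # incomplete: the trailing newline is left over
print(classify("x"))        # failure
```

Under the default behaviour quoted from the conformance section, the second case has to be reported as a failure; the "incomplete" label only illustrates the possible alternative raised in the discussion point.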

But that mail was also pointing out that my processor used to do the wrong thing, and that I have now fixed it. Some of the tests need to be updated accordingly, including the one mentioned above. (And I will be uploading the correct version today; in fact I did it after writing that sentence.)

> We could say that it matches, with a trailing newline left over, but I’d
> rather not. If we do, it’ll just introduce more variation in what the
> processor has to consume and produce. If trailing whitespace is allowed,
> why not leading whitespace? Why not both? Exactly one, or arbitrary
> amounts? What if I want a grammar that *does* match leading and/or
> trailing whitespace, etc. etc. etc.

It is important to note that "whitespace" is not a processing concept in ixml parsing. There are only characters. How those characters are interpreted is up to the ixml author.
But it is easy to allow an optional trailing linefeed in the grammar:

   root: ...stuff..., -lf?.
   lf: -#a.
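
The effect of such a rule can be sketched with a Python regular expression standing in for the grammar. This is an illustration only: `a *b` is an assumed stand-in for "...stuff...", `\n?` plays the role of `-lf?` (`#a` is the linefeed character), and the hypothetical function `parse` returns just the matched content, so the linefeed is dropped from the result in roughly the way the `-` mark drops it from the serialization.

```python
import re

# Assumption: 'a *b' stands in for "...stuff..."; '\n?' plays the role of '-lf?'.
WITH_OPTIONAL_LF = re.compile(r"(a *b)\n?")

def parse(text: str):
    """Return the matched content without the optional linefeed, or None on failure."""
    m = WITH_OPTIONAL_LF.fullmatch(text)  # the whole input must still be consumed
    return m.group(1) if m else None

print(parse("a   b"))      # a   b
print(parse("a   b\n"))    # a   b
print(parse("a   b\n\n"))  # None: only one trailing linefeed is allowed
```

The whole input is still matched in its entirety; the grammar author has simply decided that one trailing linefeed is part of the language.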

> The grammar could be updated to accept trailing newlines, or the user
> could strip them off before attempting to parse. Either of those seems
> preferable to saying that arbitrary left over characters at the ends are
> ok.
>
> With respect to the test suite, I’d be happy to say that all inputs
> should have either all or exactly one trailing newline stripped off
> before attempting to parse. Or not. A decent editor should allow you to
> control whether or not a trailing newline occurs, it’s just a little
> tedious to manage the distinction.

The tests should be correct wrt the spec. No trailing extra characters unless deliberate.
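
For test inputs that already carry an unwanted trailing newline, the two stripping policies Norm mentions (all trailing newlines, or exactly one) could be applied before parsing. A small sketch in Python, with hypothetical helper names:

```python
def strip_one_newline(text: str) -> str:
    """Remove exactly one trailing linefeed, if there is one."""
    return text[:-1] if text.endswith("\n") else text

def strip_all_newlines(text: str) -> str:
    """Remove every trailing linefeed (and nothing else, e.g. not spaces)."""
    return text.rstrip("\n")

sample = "a   b\n\n"
print(repr(strip_one_newline(sample)))   # 'a   b\n'
print(repr(strip_all_newlines(sample)))  # 'a   b'
```

Either policy changes the characters the grammar sees, so it would need to be applied consistently across the test suite.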

Steven

>
> Be seeing you,
> norm
>
> --
> Norm Tovey-Walsh
> Saxonica
>

Received on Friday, 31 December 2021 13:43:09 UTC