This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 1383 - [XQuery] some editorial comments on A.2 Lexical structure
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: XQuery 1.0
Version: Last Call drafts
Hardware: All All
Importance: P2 minor
Target Milestone: ---
Assignee: Scott Boag
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2005-05-11 07:35 UTC by Michael Dyck
Modified: 2007-02-25 23:53 UTC

Description Michael Dyck 2005-05-11 07:35:44 UTC
A.2 Lexical structure

[See a later comment for suggested alternate wording.]

Thanks for excising the state machine!

"and [XML Names]are"
    Insert space before "are".

"When patterns are simple string matches, the strings are embedded directly into
the EBNF. In other cases, named terminals are used."
    Delete. It doesn't say anything that isn't already said better in the EBNF
    notation section. Plus it isn't connected to anything else in the section.

"that together may help disambiguate the individual symbols."
    Ditto my comments re this sentence in A.1.

"When tokenizing, the longest possible match that is valid in the current
context is preferred ."
    Delete space before period.

    What constitutes "the current context"? What constitutes "valid"?  Longest
    match of what?  Given that tokenization is up to the implementor, it seems
    that the effect of this sentence would vary between implementations, which
    is probably not what you want.

    Luckily, I think this rule can be deleted. The rules about required
    whitespace (to prevent two adjacent terminals from being mis-recognized as
    one) should (if fixed) handle anything that the "longest possible match"
    would have.
Comment 1 Scott Boag 2005-07-09 05:28:04 UTC
(In reply to comment #0)
> A.2 Lexical structure
> 
> [See a later comment for suggested alternate wording.]
> 
> Thanks for excising the state machine!

You're welcome.  You were right that it needed to be excised.

> 
> "and [XML Names]are"
>     Insert space before "are".

Done.

> 
> "When patterns are simple string matches, the strings are embedded directly into
> the EBNF. In other cases, named terminals are used."
>     Delete. It doesn't say anything that isn't already said better in the EBNF
>     notation section. Plus it isn't connected to anything else in the section.

Deleted.

> 
> "that together may help disambiguate the individual symbols."
>     Ditto my comments re this sentence in A.1.

Fixed.

> 
> "When tokenizing, the longest possible match that is valid in the current
> context is preferred ."
>     Delete space before period.

Fixed.

> 
>     What constitutes "the current context"? What constitutes "valid"?  Longest
>     match of what?  Given that tokenization is up to the implementor, it seems
>     that the effect of this sentence would vary between implementations, which
>     is probably not what you want.
> 
>     Luckily, I think this rule can be deleted. The rules about required
>     whitespace (to prevent two adjacent terminals from being mis-recognized as
>     one) should (if fixed) handle anything that the "longest possible match"
>     would have.

I'm not inclined to delete this, at least at this time.  I think the rule is
clear enough, and longstanding.  

This should be discussed at the Seattle F2F, especially in light of
other changes or non-changes.
Comment 2 Scott Boag 2005-07-10 02:35:28 UTC
(In reply to comment #1)
> >     Luckily, I think this rule can be deleted. The rules about required
> >     whitespace (to prevent two adjacent terminals from being mis-recognized as
> >     one) should (if fixed) handle anything that the "longest possible match"
> >     would have.
> 
> I'm not inclined to delete this, at least at this time.  I think the rule is
> clear enough, and longstanding.  
> 
> This should be discussed at the Seattle F2F, especially in light of
> other changes or non-changes.

One simple example of where I think the longest token rule is still needed is
that of ">" and ">>".
Comment 3 Michael Dyck 2005-07-10 02:56:25 UTC
> One simple example of where I think the longest token rule is still needed is
> that of ">" and ">>".

Is there a context for which '>>' and '>' '>' are both valid continuations?
Comment 4 Scott Boag 2005-07-19 20:19:24 UTC
(In reply to comment #3)
> Is there a context for which '>>' and '>' '>' are both valid continuations?

Not if you don't consider non-legal sentences.  Another case is
"descendant-or-self" vs. "descendant", which can occur in the same context.  To
my parser-oriented mind, you need to decide whether descendant-or-self::foo is
"descendant" followed by some other characters or "descendant-or-self", so
you keep searching for the longest token that matches.  On the other hand,
you're saying that, in terms of the spec, if keyword delimitation is clear (which
I think it is), there's only one choice: "descendant-or-self", which is either
legal or not.  If "descendant-or-self::foo" could be interpreted as
"descendant - or-self::foo" (i.e. a subtraction operation), then perhaps we would
need a longest token rule.

In summary, after thinking about it more, I can't justify having the longest
token rule, especially when the spec prescribes no specific tokenization.

So, I'm leaning toward deleting this rule.  I'm interested in whether
any other WG members can justify it.
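
For illustration only, the "keep searching for the longest token that matches" behaviour described in this comment can be sketched as follows. This is a minimal sketch under stated assumptions, not the spec's algorithm and not taken from any WG member's parser: the terminal set is hypothetical and heavily reduced, the NAME pattern is an ASCII-only approximation of NCName, and the names TERMINALS and tokenize are invented for the example.

```python
import re

# Hypothetical, heavily reduced terminal set (assumption: the real XQuery
# grammar has many more terminals). NAME is an ASCII-only stand-in for NCName.
TERMINALS = [
    ("AXIS", re.compile(r"descendant-or-self|descendant")),
    ("COLONCOLON", re.compile(r"::")),
    ("MINUS", re.compile(r"-")),
    ("NAME", re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")),
    ("WS", re.compile(r"\s+")),
]


def tokenize(text):
    """Split text into (kind, lexeme) pairs, taking the longest match each time.

    On a tie in length, the terminal listed first in TERMINALS wins.
    Whitespace is matched but not reported.
    """
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for kind, pattern in TERMINALS:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (kind, m)
        if best is None:
            raise ValueError(f"no terminal matches at offset {pos}")
        kind, m = best
        if kind != "WS":
            tokens.append((kind, m.group()))
        pos = m.end()
    return tokens


print(tokenize("descendant-or-self::foo"))
# [('AXIS', 'descendant-or-self'), ('COLONCOLON', '::'), ('NAME', 'foo')]
```

With these assumed terminals the tokenizer never stops at "descendant" when "descendant-or-self" is available, which is the effect the longest token rule is meant to guarantee. Whether the whitespace rules alone already force that result is the question under discussion.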

Comment 5 Michael Rys 2005-07-19 20:27:41 UTC
Couldn't descendant - or - self::foo be interpreted as two subtractions?

Best regards
Michael
Comment 6 Michael Dyck 2005-07-20 00:31:50 UTC
(In reply to comment #5)
> Couldn't descendant - or - self::foo be interpreted as two subtractions?

Sure, but that's the same as the 'a-b' vs 'a - b' example, unless I'm missing
something.
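
For readers unfamiliar with the 'a-b' vs 'a - b' example: because a hyphen is a legal name character, the unspaced form reads as a single name, while the spaced form reads as a subtraction. A minimal sketch, assuming an ASCII-only approximation of NCName and a hypothetical helper called lex:

```python
import re

# Assumption: ASCII-only approximation of an XML NCName, plus a bare minus sign.
# The name alternative comes first, so a hyphen inside a name is consumed by it.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*|-")


def lex(text):
    """Return the lexemes found in text; whitespace between them is skipped."""
    return TOKEN.findall(text)


print(lex("a-b"))    # ['a-b']
print(lex("a - b"))  # ['a', '-', 'b']
```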
Comment 7 Michael Rys 2005-07-20 00:36:41 UTC
It is related, but in one case it is an axis name and in the other a
user-defined element name.

I still prefer the longest token rule to deal with these cases.
Comment 8 C. M. Sperberg-McQueen 2005-07-20 16:53:22 UTC
While I sympathize in general with the idea of deleting rules 
like the longest-token rule from grammars when they are redundant, 
in this particular case I am inclined to keep the rule.  
There are several reasons:

1. I am not absolutely sure whether it's actually redundant in
this case; I haven't proven that it's not, but I haven't seen
anything that looks like a proof that it is.

2. Considering cases like "<<" vs. "<" + "<" (or similarly
the two-character tokens vs. the two single-character tokens
for "(:", ":)", "/>", ">>", "{{", "}}", "..", "::", ":=",
">=", "?>", "//"), if I have a choice of getting the right
answer by knowing exactly where I am in the grammar or by
following a longest-token rule, it seems clear to me that
the longest-token rule is a lot simpler to understand and
a lot simpler to use in practice.

If we could get rid of the qualification about being valid
in the current context, I'd be even happier, but I don't
see how to eliminate that without more complications.
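
As an illustration of what "the longest possible match that is valid in the current context" could mean in practice, here is a minimal sketch. It assumes the caller supplies the set of terminals allowed at the current point in the grammar; the function name longest_valid_match and both context sets are hypothetical and are not taken from the spec.

```python
def longest_valid_match(text, pos, allowed):
    """Return the longest literal in `allowed` that occurs in text at pos, or None."""
    candidates = [lit for lit in allowed if text.startswith(lit, pos)]
    return max(candidates, key=len, default=None)


# Hypothetical contexts: one where comparison operators may appear, and one
# where only the end of a start tag may appear.
OPERATOR_CONTEXT = {"<", "<<", ">", ">>", ">=", "/", "//"}
TAG_END_CONTEXT = {">", "/>"}

print(longest_valid_match("a >> b", 2, OPERATOR_CONTEXT))  # >>
print(longest_valid_match(">>", 0, TAG_END_CONTEXT))       # >
```

Dropping the context qualification would amount to always passing the full terminal set as `allowed`, in which case the second call would return ">>" instead.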

Comment 9 Scott Boag 2005-07-22 20:14:44 UTC
A joint meeting of the Query and XSLT working groups considered this comment on 
July 20, 2005.  

The WGs agreed to resolve this issue as per my previous note and C. M.
Sperberg-McQueen's comment regarding the longest token rule.

If you do not agree with this resolution, please add a comment explaining why.
If you wish to appeal the WG's decision to the Director, then change the Status
of the record to Reopened. If we do not hear from you in the next two weeks, we
will assume you agree with the WG decision.
Comment 10 Jim Melton 2007-02-25 23:53:11 UTC
Closing bug because commenter has not objected to the resolution posted and more than two weeks have passed.