This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 5727 - [XPath] Syntax ambiguities with leading "/"
Summary: [XPath] Syntax ambiguities with leading "/"
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: XPath 2.0
Version: Recommendation
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: Jonathan Robie
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2008-06-01 22:01 UTC by Michael Kay
Modified: 2009-03-24 16:37 UTC

Description Michael Kay 2008-06-01 22:01:35 UTC
The constraint "leading-lone-slash" in Appendix A.1.2 discusses the situation where a leading "/" is followed by a "keyword" (meaning, presumably, an NCName), or by "*", and indicates that these should always be treated as NameTests, to avoid the need for lookahead.

I would interpret this as meaning that the construct

/unordered{x}

is an error, even though there is only one way this construct could match the grammar. However, the reference parser accepts this construct (Saxon does not).

Gunther Rademacher, in an email on the saxon-help list [1], has pointed out another problem not discussed in this grammar note, which arises when "/" is followed by "<". The following expressions are both legal according to the XQuery grammar, though only (1) is legal XPath:

(1) /<a/b

(2) /<a/>

The reference parser for XQuery accepts (2) but not (1), while the reference parser for XPath accepts (1) but not (2). Saxon accepts (1) but not (2) whether parsing XPath or XQuery.

I think it's important that every valid XPath expression should be legal in XQuery, so I think that we should either (a) require parsers to accept both these constructs (which involves lookahead), or (b) require them to accept only (1). This can be achieved by a note at the end of constraint leading-lone-slash to the effect:

"Similarly, a '<' character that follows a leading '/' is always interpreted as an operator, and not as the start of a direct constructor."

(Both expressions are useless in practice, and I can't conceive of variations that make them useful. But I think an argument based on utility would also favour (1) as a marginally more sensible thing to support.)

[1] http://markmail.org/message/lu3c5632os6cvg7l
Comment 1 Michael Kay 2008-06-01 22:12:31 UTC
And it can involve more than one token of lookahead:

/<a div="3"/>

vs

/<a div 3
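
A rough way to see the lookahead problem in the two expressions above is to tokenize both and compare. The following is a hedged sketch in Python; the `tokens` and `common_prefix` helpers and the token regex are illustrative assumptions, far cruder than the context-sensitive lexing a real XQuery parser performs.

```python
import re
from itertools import takewhile

# Deliberately naive tokenizer (an assumption for illustration only):
# names, double-quoted strings, integers, or any single non-space character.
TOKEN = re.compile(r'\s*([A-Za-z_][\w.-]*|"[^"]*"|\d+|\S)')

def tokens(text):
    """Split an expression into a flat list of token strings."""
    return TOKEN.findall(text)

def common_prefix(a, b):
    """Return the leading tokens that two token lists share."""
    return [x for x, _ in takewhile(lambda pair: pair[0] == pair[1], zip(a, b))]

constructor = tokens('/<a div="3"/>')   # "<" starts a direct constructor
comparison = tokens('/<a div 3')        # "<" is a comparison operator
shared = common_prefix(constructor, comparison)
print(shared)
```

Under this naive tokenization the two inputs agree on `/`, `<`, `a`, `div` and first differ at the following token (`=` versus `3`), so a parser sitting at the `<` cannot choose between the two readings by looking at only one more token.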
Comment 2 Michael Kay 2008-06-02 08:27:44 UTC
In the existing rule:

To reduce the need for lookahead, therefore, if the token immediately following a slash is "*" or a keyword, then the slash must be the beginning, but not the entirety, of a PathExpr (and the following token must be a NameTest, not an operator).

I suspect the first consequence is correct: "the slash must be the beginning, but not the entirety, of a PathExpr"

but the second is wrong: "the following token must be a NameTest, not an operator". The following construct might also be an axis name (/child::x), a function name (/base-uri()), a node test (/element(x)), an ordered or unordered expression (/unordered{node()}), a computed node constructor (/text{"a"}), etc. We should delete the phrase in brackets, and explain that by keyword we mean any token having the lexical form of an NCName.
Comment 3 Michael Dyck 2008-06-03 01:42:33 UTC
(In reply to comment #2)
> by keyword we mean any token having the lexical form of an NCName.

I'm not disagreeing with the meat of what you're saying, just with that particular meaning for "keyword".

Although the spec doesn't define "keyword", it generally uses it in the sense of a token that matches a quoted word (/"[a-z-]+"/) in the A.1 EBNF. So, e.g.
  for
and
  union
are keywords, but
  foo
and
  Union
aren't (though they are NCNames).

Where this section talks about (a slash followed by) a keyword, I think it means a token whose spelling matches that of some keyword (but the parser doesn't yet know [before applying this constraint] whether it's actually a keyword or an NCName). That is, the constraint is talking about a smaller set of cases than you think it is. (If a slash is followed by a word-like token that *doesn't* have the spelling of any keyword, then presumably it must be an NCName, and there isn't a parsing conflict.)
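
The distinction drawn here can be sketched in a few lines of Python. This is only an illustration: `KEYWORD_SPELLINGS` is a small hand-picked subset (the real set would be every quoted word in the A.1 EBNF), `NCNAME_LIKE` is an ASCII-only approximation of NCName, and `classify` is a hypothetical helper, not anything defined by the spec.

```python
import re

# Illustrative subset only (assumption): the full set is every quoted
# lowercase word appearing in the A.1 EBNF.
KEYWORD_SPELLINGS = {"for", "union", "is", "intersect", "div", "else", "instance"}

# ASCII-only approximation of the NCName production (assumption).
NCNAME_LIKE = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*\Z")

def classify(token):
    """Say whether a token merely looks like a name or is spelled like a keyword."""
    if not NCNAME_LIKE.match(token):
        return "not a name"
    if token in KEYWORD_SPELLINGS:
        # Could be a keyword or an NCName; the parser cannot tell yet.
        return "keyword spelling"
    # Word-like but matches no keyword, so it can only be an NCName.
    return "plain NCName"

for tok in ["for", "union", "foo", "Union", "*"]:
    print(tok, "->", classify(tok))
```

Only the "keyword spelling" case is the one the constraint needs to address on this reading; a "plain NCName" after a slash raises no parsing conflict.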
Comment 4 Michael Dyck 2008-06-03 02:11:20 UTC
(In reply to comment #0)
>
> Gunther Rademacher, in an email on the saxon-help list [1], has pointed out
> another problem not discussed in this grammar note, which arises when "/" is
> followed by "<". The following expressions are both legal according to the
> XQuery grammar, though only (1) is legal XPath:
> 
> (1) /<a/b
> 
> (2) /<a/>

This is closely related to a case I pointed out about 7 years ago:
http://lists.w3.org/Archives/Public/www-xml-query-comments/2001Jul/0021.html
It was a bona fide grammatical ambiguity, which was resolved by disallowing cascading RelationalExprs (now ComparisonExprs). I pointed out that that would still leave conflicts requiring 3 symbols of lookahead to resolve, but apparently that was deemed acceptable.

> I think it's important that every valid XPath expression should be legal in
> XQuery, so I think that we should either (a) require parsers to accept both
> these constructs (which involves lookahead), or (b) require them to accept
> only (1). This can be achieved by a note at the end of constraint
> leading-lone-slash to the effect:
> 
> "Similarly, a '<' character that follows a leading '/' is always interpreted
> as an operator, and not as the start of a direct constructor."

I think "Conversely" would be clearer than "Similarly", since the proposed resolution is the opposite of that for "*" and keyword-like tokens.

(Mind you, I think I prefer alternative a.)
Comment 5 Jonathan Robie 2008-06-24 09:13:35 UTC
The following NOTE, which occurs in both the XPath and the XQuery specs, seems to cover the original intent of the Working Group:

<quote>
Note:

The "/" character can be used either as a complete path expression or as the beginning of a longer path expression such as "/*". Also, "*" is both the multiply operator and a wildcard in path expressions. This can cause parsing difficulties when "/" appears on the left hand side of "*". This is resolved using the leading-lone-slash constraint. For example, "/*" and "/ *" are valid path expressions containing wildcards, but "/*5" and "/ * 5" raise syntax errors. Parentheses must be used when "/" is used on the left hand side of an operator, as in "(/) * 5". Similarly, "4 + / * 5" raises a syntax error, but "4 + (/) * 5" is a valid expression. The expression "4 + /" is also valid, because / does not occur on the left hand side of the operator.</quote>

As I understand it, the clearest statement we have on the subject is this:

<quote>Parentheses must be used when "/" is used on the left hand side of an operator, as in "(/) * 5".</quote>

Unfortunately, it does not clearly say that it is a syntax error to use "/" on the left hand side of an operator, and it does not clearly say how to handle operators like "-" or "+", which may be unary or binary.

And of course, this is in a Note, and notes aren't normative in our spec. If an implementor faithfully ignored the Note, their implementation would not be compatible with an implementation that took this statement into consideration.

Users are unlikely to write expressions that use / on the left hand side of an operator. I do not know whether implementations have used the Note to guide their implementation.

So what are our options?

1. We could state normatively that "/" may not occur on the left hand side of an operator. We would have to clarify the behavior with respect to unary operators, e.g. whether /-5 is interpreted as /(-5). We would have to tell implementors who ignored the Notes, as we told them they could, to change their implementation.

2. We could remove the normative-sounding statement from the above Note in order to avoid laying a trap for implementors who take our Notes too seriously. We would then have to ask implementors who read the Note and implemented as though it meant something to change their implementations to ignore the Note entirely.

3. We could say it's implementation-defined whether implementors took our Note seriously. This would institutionalize the bug forever, allowing implementors to avoid changing anything.

Personally, I strongly prefer (1), I hate (3), and I can live with (2).

Jonathan
Comment 6 Michael Dyck 2008-06-24 22:31:01 UTC
> it does not clearly say how to handle
> operators like "-" or "+", which may be unary or binary.
> ...
> We would have to clarify the behavior with respect to unary
> operators, e.g. whether /-5 is interpreted as /(-5).

Note that there is no valid parse of /-5 in which the hyphen is a unary minus. (If you try to make -5 a UnaryExpr, you find that a UnaryExpr can't be a StepExpr.) So unary operators don't enter into it.

There's an EBNF-valid parse of /-5 as an ArithmeticExpr, so the only question (for that case) is whether we allow it or not. (That is, whether we decree that it violates some extra-grammatical constraint.) The Note in 3.2 suggests that it's a parse error, but the leading-lone-slash constraint doesn't disallow it (and has no reason to do so, since it poses no parsing difficulty).
Comment 7 Michael Kay 2008-06-24 23:44:29 UTC
A couple of points.

Firstly, there's no definition of the term "operator". I don't know if you regard "else" as an operator, but this ambiguity affects constructs like 

if ($doclevel) then / else /*

Certainly, any normative rules about this situation should not assume the term "operator" is well defined, any more than the word "keyword" (as used in the normative constraint).

Secondly, I don't think there's a good reason to ban operators after "/" in cases where they aren't ambiguous. The cases known to be ambiguous were operators written as a name ("union", "is", "intersect" are the most likely to appear in practice), and "*". To this list we must now add "<".

The sentence from the non-normative Note cited in comment #5 is quoted out of context. Most of the note is concerned with "/" followed by "*". Also, it's clearly written as a paraphrase of the normative constraint and not intended to replace or extend the constraint. I don't think it can be read as saying that implementations must not allow otherwise unambiguous constructs such as 

(/ = $a)

(which I have seen quite often; the author almost certainly intended ((/) is $a), but that's not the point).

We need to fix the normative constraint to clarify what is meant by a "keyword" (for example, to make it clear whether /f(x) or /unordered{x} are allowed or disallowed), and to say something about "/" followed by "<".

I have some sympathy with the argument that we should allow /<a/>, and disallow /<a, on the grounds that it's more consistent to always take the non-operator interpretation. The only reservation about this is that "/<a" was legal and unambiguous in XPath 1.0, and is legal and unambiguous in XPath 2.0 - it is only XQuery that introduces the ambiguity. However, I think it's extremely unlikely that any real code will be affected.
Comment 8 John Snelson 2008-08-04 13:42:29 UTC
The leading-lone-slash grammar can be found here:

http://www.w3.org/TR/xquery/#parse-note-leading-lone-slash

Having looked over the XQuery grammar, I agree with Michael Kay that the only token that we need to add to this rule is "<". However I think that we can still firm up the grammar constraint to cover all eventualities. I propose the following new wording for the constraint:

<new>
A single slash may appear either as a complete path expression or as the first part of a path expression in which it is followed by a RelativePathExpr.
After a single slash, there are several tokens which have an ambiguous interpretation according to the grammar: the "*" token and keywords like "union" could indicate either an operator or a NameTest, and the "<" token could indicate a ComparisonExpr or the start of a DirectConstructor.
For example, without lookahead the first part of the expression "/ * 5" is easily taken to be a complete expression, "/ *", which has a very different interpretation (the child nodes of "/").

Therefore to reduce the need for lookahead, if the token immediately following a slash can form the start of a RelativePathExpr, then the slash must be the beginning of a PathExpr, not the entirety of it.

A single slash may be used as the left-hand argument of an operator by parenthesizing it: "(/) * 5". The expression "5 * /", on the other hand, is legal without parentheses.
</new>
Comment 9 Michael Kay 2008-08-04 14:41:59 UTC
(1) The proposal in comment #8 does not address the problem that if these rules are adopted,

/ < 5

now parses in XPath but not in XQuery. Perhaps that's acceptable.

(2) What about

/ instance of document-node(schema-element(x))

Are we saying this must be reported as a syntax error? I might be reluctant to implement that because I don't want to invalidate existing stylesheets - it's a perfectly reasonable thing to write, although I agree it's arguably illegal already (depends on what you think "keyword" means - I allow it because my parser recognizes "instance of" as a single token, which cannot appear at the start of a RelativePathExpr).

Michael Kay
Comment 10 John Snelson 2008-08-04 15:01:55 UTC
(In reply to comment #9)
> (1) The proposal in comment #8 does not address the problem that if these rules
> are adopted,
> 
> / < 5
> 
> now parses in XPath but not in XQuery. Perhaps that's acceptable.

I think that's acceptable. It's not the only case of an XPath expression that isn't valid in XQuery, although I agree we should keep these occurrences to a minimum.

> (2) What about
> 
> / instance of document-node(schema-element(x))
> 
> Are we saying this must be reported as a syntax error? I might be reluctant to
> implement that because I don't want to invalidate existing stylesheets - it's a
> perfectly reasonable thing to write, although I agree it's arguably illegal
> already (depends on what you think "keyword" means - I allow it because my
> parser recognizes "instance of" as a single token, which cannot appear at the
> start of a RelativePathExpr).

I believe this expression was already disallowed by the intent of the previous rule. Although "instance of" was a combined token in earlier versions of the grammar, it is not one in the Recommendation.
Comment 11 Michael Dyck 2008-08-04 21:38:46 UTC
(In reply to comment #8)
> 
> <new>
> A single slash may appear either as a complete path expression or as the
> first part of a path expression in which it is followed by a
> RelativePathExpr.  After a single slash, there are several tokens which
> have an ambiguous interpretation according to the grammar:

This uses "ambiguous" with not quite its technical meaning, which I don't
think we should do when we're talking about grammars and parsers.
I suggest changing "After ... grammar:" to:
    In some cases, the next token after the slash is insufficient to
    allow a parser to distinguish these two possibilities:

> the "*" token and keywords like "union" could indicate either an operator
> or a NameTest, and the "<" token could indicate a ComparisonExpr or the
> start of a DirectConstructor.

I think it would be clearer to change "indicate" to "be" and
"ComparisonExpr" to "operator". (Why say "ComparisonExpr" for "<" and not
"MultiplicativeExpr" for "*"?) And maybe toss in another "either":
    the "*" token and keywords like "union" could be either an operator or
    a NameTest, and the "<" token could be either an operator or the start
    of a DirectConstructor.
    
(Actually, a keyword token in this context could be something other than an
operator, e.g. the example from comment #7:
    if ($doclevel) then / else /*
But since (a) we don't formally define "operator", and (b) we're about to
disallow that parse anyway, the fix is probably not worth it.)
    
---

So, just checking: Under this rule, these queries are all syntax errors:
    /*5
    /<a
    /<5
    /</b
    /<a div 3
    if ($doclevel) then / else /*
    / is $a
    / instance of document-node(schema-element(x))
    let $doc := / return $doc/*

these are PathExprs:
    /*
    /<a/>
    /<a div="3"/>
    /unordered{x}
    /f(x)

and these are other kinds of Exprs:
    /-5
    /=$a
    5*/

Right?

(It might be worth putting such examples in the doc.)
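
The first half of the proposed rule (deciding, from the token after a slash alone, whether the slash is forced to begin a longer path) can be sketched as follows. This is a hedged Python illustration, not spec text: the set of tokens that can start a RelativePathExpr is one reading of the grammar, the regexes are ASCII approximations, and `can_start_relative_path` and `slash_role` are hypothetical names. It does not decide whether the whole expression then parses; for example "/*5" is forced into the path reading and only fails later.

```python
import re

# ASCII approximations (assumptions), not the spec's lexical rules.
NAME = re.compile(r"[A-Za-z_][\w.-]*\Z")   # NCName-like: axis, function, keyword, NameTest
LITERAL = re.compile(r"""\d|\.\d|["']""")  # first characters of a numeric or string literal

# Punctuation that, on this reading of the grammar, can begin a step.
STEP_START_PUNCTUATION = {"*", "@", "..", ".", "$", "("}

def can_start_relative_path(token, xquery=True):
    """Could this token be the first token of a RelativePathExpr?"""
    if token is None:                      # nothing follows the slash
        return False
    if NAME.match(token) or LITERAL.match(token):
        return True
    if token in STEP_START_PUNCTUATION:
        return True
    # "<" opens a direct constructor only in XQuery.
    return xquery and token == "<"

def slash_role(next_token, xquery=True):
    """Apply the proposed constraint to the token following a slash."""
    if can_start_relative_path(next_token, xquery):
        return "start of longer path"
    return "lone slash allowed"

for tok in ["*", "<", "is", "else", "-", "=", None]:
    print(repr(tok), "->", slash_role(tok))
```

With `xquery=False` the `<` case falls into "lone slash allowed", which corresponds to the observation elsewhere in this thread that "/ < 5" stays legal in XPath while becoming an error in XQuery.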
Comment 12 John Snelson 2008-09-03 10:07:31 UTC
I agree with Michael Dyck's changes. I'll repost the proposed text here for convenience:

<new>
A single slash may appear either as a complete path expression or as the first
part of a path expression in which it is followed by a RelativePathExpr. In some cases, the next token after the slash is insufficient to allow a parser to distinguish these two possibilities: the "*" token and keywords like
"union" could be either an operator or a NameTest, and the "<" token
could be either an operator or the start of a DirectConstructor.
For example, without lookahead the first part of the expression "/ * 5" is
easily taken to be a complete expression, "/ *", which has a very different
interpretation (the child nodes of "/").

Therefore to reduce the need for lookahead, if the token immediately following
a slash can form the start of a RelativePathExpr, then the slash must be the
beginning of a PathExpr, not the entirety of it.

A single slash may be used as the left-hand argument of an operator by
parenthesizing it: "(/) * 5". The expression "5 * /", on the other hand, is
legal without parentheses.
</new>
Comment 13 Michael Kay 2008-09-03 10:41:34 UTC
The text in comment #12 is fine for XQuery, but the phrase

<phrase>
and the "<" token could be either an operator or the start of a DirectConstructor
</phrase>

needs to be omitted for XPath.

This solution leaves the minor problem that (/ < a) is legal in XPath but not in XQuery. Fortunately it's a very improbable expression.
Comment 14 John Snelson 2008-10-29 21:43:56 UTC
The working group has approved the solution in comment #12. If you agree with this resolution, please mark this bug as closed.
Comment 15 Michael Kay 2008-10-30 13:21:05 UTC
I think the decision recorded in comment #14 was by the XQuery Working Group; before closing the bug, it needs to be endorsed by the XSL WG. I am therefore marking it as REOPENED.

It would also be useful to know exactly what the XQuery WG is recommending should be the resolution for XPath. As the decision is recorded, it seems to be advocating including the text 'the "<" token could be either an operator or the start of a DirectConstructor' in the XPath specification, which does not seem appropriate.
Comment 16 Michael Kay 2009-02-05 17:46:50 UTC
The XSL WG felt that it would be inappropriate to introduce a change/erratum to XPath that is needed only because there is a problem in XQuery. Therefore the syntax 

/ < 5

should continue to be allowed in XPath, despite the fact that it will no longer be allowed in XQuery.
Comment 17 Jim Melton 2009-02-13 23:09:16 UTC
I hope that comment 16 means that we can mark this bug FIXED as an XQuery-only fix.  If that's incorrect, please say why.
Comment 18 Michael Kay 2009-02-13 23:35:09 UTC
No, the ambiguity in the current wording applies to XPath as well as to XQuery. I was left with an action to propose how the XPath text should be fixed; comment #16 was merely recording a decision that this solution should have the property that the XPath 1.0 and 2.0 expression ( / < 12 ) should continue to be legal, even though it will no longer be legal in XQuery.