Re: Determining whether '<' is a beginning of IRI or 'less than' operator [OK?]

Jiri Dokulil wrote:
> 
> I am not sure how should scanner for SPARQL determine whether '<'
> character it encountered is beginning of an IRI or a comparison
> operator.
> 
> Consider these queries:
> 
> SELECT * WHERE { ?a ?b ?c, ?d . FILTER(?a<?b && ?c>?d) }
> SELECT * WHERE { ?a ?b ?c, ?d . FILTER(?a<?b&&?c>?d) }
> 
> Yacker validator results look troubling to me:
> http://www.w3.org/2005/01/yacker/uploads/SPARQL?markup=html&lang=perl&text=SELECT+*+WHERE+%7B+%3Fa+%3Fb+%3Fc%2C+%3Fd+.+FILTER%28%3Fa%3C%3Fb+%26%26+%3Fc%3E%3Fd%29+%7D&action=validate+text 
> 
> http://www.w3.org/2005/01/yacker/uploads/SPARQL?markup=html&lang=perl&text=SELECT+*+WHERE+%7B+%3Fa+%3Fb+%3Fc%2C+%3Fd+.+FILTER%28%3Fa%3C%3Fb%26%26%3Fc%3E%3Fd%29+%7D%0D%0A&action=validate+text 
> 
> 
> The first query validates, the other does not.
> My guess is that the validator uses some flex-like scanner, that
> prefers the longest tokens. In the first case "<?b && ?c>" can't be
> parsed as IRI because of the spaces, so the scanner falls back and
> 'less than' rule is picked.
> On the other hand, "<?b&&?c>" is a valid (according to the grammar)
> IRI. But 'variable iri variable' is not a valid FILTER condition and
> the parser rejects the query.
> 
> The problem is more obvious for scanners with one character
> look-ahead, because they are completely unable to distinguish these
> two cases.
> They also have the same problem with () and [] tokens (NIL and ANON
> terminals) but that can easily be solved by going from LL(1) to LL(2).
> 
> Jiri Dokulil

Because the characters < and > are overloaded for IRIs and for comparison 
operators, there is a potential ambiguity.  The SPARQL grammar handles IRIs in 
two steps: a general grammar rule that is simple and covers any IRI scheme, 
which then relies on further validation by an IRI parser.

For the http: scheme, <?b> is a valid IRI, as is <?b&&1>. ? and & are legal in 
an HTTP URL.

For example:


BASE     <http://example/page>
PREFIX : <http://example/ns#>

ASK { <?b> :p <?b&&1> }



BASE    <http://example/page>
PREFIX  : <http://example/ns#>

ASK
WHERE
  { <http://example/page?b>
              :p  <http://example/page?b&&1> .
  }

<?b> is a relative URL, resolved against the base <http://example/page>, 
which gives <http://example/page?b>.
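
As a rough illustration of that resolution step, here is a minimal Python 
sketch using the standard library's urljoin. The helper name resolve is made 
up for this example, and urljoin is only a stand-in for whatever IRI resolver 
a SPARQL processor actually uses.

```python
from urllib.parse import urljoin

def resolve(base, reference):
    """Resolve a relative reference against a base, as for <?b> above.

    Hypothetical helper: a real SPARQL processor has its own IRI resolver.
    """
    return urljoin(base, reference)

base = "http://example/page"
print(resolve(base, "?b"))     # http://example/page?b
print(resolve(base, "?b&&1"))  # http://example/page?b&&1
```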

The rule "longest token wins" resolves the tokenizing problem (and is common 
practice in lexers because it also means 123 is a single number, not three 
separate one-digit numbers), although it moves the problem to the grammar.
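
To make the effect concrete, here is a small Python sketch of a 
longest-match tokenizer run over the two FILTER expressions from the original 
message. It is an illustration only: the token names, the tokenize function 
and the cut-down set of terminals are assumptions for this example, and the 
IRI pattern is a simplified reading of the grammar's IRI rule, not the 
validator's actual scanner.

```python
import re

# Assumed, simplified terminals; the real SPARQL grammar has many more.
TOKEN_PATTERNS = [
    ("IRI_REF", re.compile(r'<[^<>"{}|^`\\\x00-\x20]*>')),
    ("VAR",     re.compile(r'\?[A-Za-z_][A-Za-z0-9_]*')),
    ("AND",     re.compile(r'&&')),
    ("LT",      re.compile(r'<')),
    ("GT",      re.compile(r'>')),
]

def tokenize(text):
    """Split text into (name, lexeme) pairs, always taking the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # white space only separates terminals
            continue
        best = None
        for name, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)
            # Keep the longest match; on a tie the earlier pattern wins.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens

# With spaces, "<?b && ?c>" cannot be an IRI, so '<' is a comparison.
print([name for name, _ in tokenize("?a<?b && ?c>?d")])
# ['VAR', 'LT', 'VAR', 'AND', 'VAR', 'GT', 'VAR']

# Without spaces, "<?b&&?c>" is swallowed as one IRI token.
print([name for name, _ in tokenize("?a<?b&&?c>?d")])
# ['VAR', 'IRI_REF', 'VAR']
```

The second token stream (variable, IRI, variable) is what the grammar then 
rejects as a FILTER condition.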

It could be disambiguated, but that needs more than simple changes to the 
lexer.  It needs a context-sensitive lexer (a < comparison and an IRI can't 
occur in the same place in a valid expression; after ?a, a < must be a 
comparison in a legal expression).  The WG has chosen to cover the wider range 
of parser toolkits rather than choose the more complicated context-sensitive 
approach.
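
For illustration only, a context-sensitive lexer along those lines might look 
like the Python sketch below. It assumes the input is a FILTER expression 
only (in a triple pattern an IRI can legitimately follow a variable), and the 
names tokenize_expression and after_operand, as well as the reduced operator 
set, are made up for this example. It is not what the WG specified.

```python
import re

# Assumed, simplified terminals for expressions only.
IRI_REF = re.compile(r'<[^<>"{}|^`\\\x00-\x20]*>')
VAR = re.compile(r'\?[A-Za-z_][A-Za-z0-9_]*')
OPERATORS = ["&&", "<", ">"]

def tokenize_expression(text):
    """Tokenize an expression, using the previous token to decide what '<' is."""
    tokens = []
    pos = 0
    after_operand = False  # True right after a variable or an IRI
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = VAR.match(text, pos)
        if m:
            tokens.append(("VAR", m.group()))
            pos = m.end()
            after_operand = True
            continue
        # An IRI is only tried where an operand is expected.
        if not after_operand:
            m = IRI_REF.match(text, pos)
            if m:
                tokens.append(("IRI_REF", m.group()))
                pos = m.end()
                after_operand = True
                continue
        for op in OPERATORS:
            if text.startswith(op, pos):
                tokens.append(("OP", op))
                pos += len(op)
                after_operand = False
                break
        else:
            raise ValueError(f"unexpected character at position {pos}")
    return tokens

# After ?a an operand is not expected, so '<' is read as a comparison.
print([lexeme for _, lexeme in tokenize_expression("?a<?b&&?c>?d")])
# ['?a', '<', '?b', '&&', '?c', '>', '?d']
```

The cost is that the lexer now has to track parser state, which not every 
parser toolkit supports.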

I'll look at adding an editorial note that highlights this better. It does 
already say:

http://www.w3.org/TR/rdf-sparql-query/#whitespace
"""
White space (production WS) is used to separate two terminals which would 
otherwise be (mis-)recognized as one terminal.
"""
which covers this case.

I hope that this message addresses your comment. If it does, please let us 
know - if you put [CLOSED] in the subject line, it will help the scripts that 
manage this list.

	Andy

Received on Friday, 18 August 2006 18:34:12 UTC