1373 – [XQuery] some editorial comments on A.1 EBNF (productions)

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	XQuery 1.0
Version:	Last Call drafts
Hardware:	All All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Scott Boag
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Whiteboard:	grammar

Reported:	2005-05-11 07:24 UTC by Michael Dyck
Modified:	2005-09-29 10:58 UTC
CC List:	0 users

Description Michael Dyck 2005-05-11 07:24:43 UTC

A.1 EBNF

(sectioning)
    I think that the EBNF productions and the explanation of the EBNF notation
    should each be split into a separate section.

----------------------------------------
preamble

"with the following minor differences."
    You've removed the phrase "except that grammar symbols always have initial
    capital letters" even though it's still true, still different from the
    notation used in the XML spec, and still unexplained.
    [Leftover from qt-2004Feb0317-01]

"a grouping of terminals that together may help disambiguate the individual
symbols."
    This (along with its repeat in A.2) is another (but I hope the last) misuse
    of the word "disambiguate" in its technical sense. Instead, you might say
    "... may help a parser differentiate various constructs", or just "... may
    help a parser do its job".
    (And similarly for the repeat of this sentence in A.2.)

    You should make clear that angle-bracket-groups have no definitional
    significance. That is, their presence in the EBNF has no effect on the set
    of syntactically legal queries defined by the grammar. (Assuming that's
    true. If not, then you've got some explaining to do.)

"To help readability, this "< ... >" notation is absent in the EBNF in the main
body of this document. This appendix is the normative version of the EBNF."
    You could say the same of comments on grammar productions.

"Comments on grammar productions"
    Note that the XML spec's grammar has production comments, so it's not their
    *presence* here that's different, but rather their normative power.

"clarification for parsing rules"
    Some grammar notes are not mere clarification, they actually affect the set
    of legal queries.

Pulling some of this together, how about restructuring the preamble into
something like this:

    The following grammar .... differences:

        o All named symbols have a name that begins with an uppercase letter
          (unlike the XML spec, where some names began with lowercase letters
          to draw a distinction ...)

        o It adds a notation for referring to productions in external specs.

        o (...angle-bracket groups...)

        o Production comments are normative.

    These features are described in more detail in [the Notation section].

    To increase readability, the EBNF in the main body of this document omits
    some of these notational features. This appendix is the normative
    version of the EBNF.

----------------------------------------
productions

"[66]  PragmaContents"
"[146] Digits"
"[155] CommentContents"
    These should be marked "ws: explicit".

[96] DirAttributeValue    ::=
         ('"' (EscapeQuot | QuotAttrValueContent)* '"')
       | ("'" (EscapeApos | AposAttrValueContent)* "'")
[97] QuotAttrValueContent ::=
         QuotAttrContentChar | CommonContent
[98] AposAttrValueContent ::=
         AposAttrContentChar | CommonContent

    I think these would be clearer if you didn't split each of the choices over
    two productions. Instead, how about:

        [96] DirAttributeValue    ::=
                 ('"' QuotAttrValueContent* '"')
               | ("'" AposAttrValueContent* "'")
        [97] QuotAttrValueContent ::=
                 EscapeQuot | QuotAttrContentChar | CommonContent
        [98] AposAttrValueContent ::=
                 EscapeApos | AposAttrContentChar | CommonContent

[142] StringLiteral
    Change ('"' '"') to EscapeQuot.
    Change ("'" "'") to EscapeApos.

(*ContentChar)
    If you factor out the overlap of ElementContentChar, QuotAttrContentChar,
    and AposAttrContentChar, and push it over to CommonContent, I think the
    result is simpler. That is, change this:

        [97] QuotAttrValueContent ::=
                EscapeQuot | QuotAttrContentChar | CommonContent
        [98] AposAttrValueContent ::=
                EscapeApos | AposAttrContentChar | CommonContent
        [99] DirElemContent ::= DirectConstructor | CDataSection
                | ElementContentChar | CommonContent

        [100] CommonContent ::= ...

        [151] ElementContentChar ::= Char - [{}<&]
        [152] QuotAttrContentChar ::= Char - ["{}<&]
        [153] AposAttrContentChar ::= Char - ['{}<&]

    to this:

        [97] QuotAttrValueContent ::= EscapeQuot | "'" | CommonContent
        [98] AposAttrValueContent ::= EscapeApos | '"' | CommonContent
        [99] DirElemContent ::= DirectConstructor | CDataSection
                | '"' | "'" | CommonContent

        [100] CommonContent ::= [^"'{}<&] | ...

        [151] ElementContentChar   delete
        [152] QuotAttrContentChar  delete
        [153] AposAttrContentChar  delete

    Just a thought.

(explicits)
    I wonder if it would help the reader if the "ws: explicit" productions (and
    the intervening ones that don't care whether they're "ws: explicit" or not)
    were put together in a group. Specifically:
        [65-66]
        [79]
        [93-106]
        [138-159]
    Then, instead of "productions marked with 'ws: explicit'", you might say
    "productions in the [whatever] group".

Comment 1 Scott Boag 2005-05-17 15:12:29 UTC

> You've removed the phrase "except that grammar symbols always have initial
> capital letters" even though it's still true, still different from the
> notation used in the XML spec, and still unexplained.

MSM and I should have a conversation about this.  I'm curious as to why, in the
XML spec, there is:
[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
vs.
[24] VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')

Section 6 states "Symbols are written with an initial capital letter if they are
the start symbol of a regular language, otherwise with an initial lowercase
letter."  But it seems like a fuzzy line.

I would like to be as consistent with the XML spec as possible.  How much
trouble would it cause to change capitalization on some symbol names?

> Note that the XML spec's grammar has production comments, 
> so it's not their
> *presence* here that's different, but rather their normative power.

I think they're normative in the XML spec too, though they're not used to help
define the grammar itself.  In any case, the paragraph explaining the comments
was not meant to be comparative with the XML spec.

Comment 2 Scott Boag 2005-05-17 15:13:15 UTC

(Verbatim repost of comment 1.)

Comment 3 C. M. Sperberg-McQueen 2005-06-02 00:29:25 UTC

Scott Boag writes:

    I'm curious as to why, in the XML spec, there is:

      [22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?

    vs.

      [24] VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"'
           VersionNum '"')

    Section 6 states "Symbols are written with an initial capital
    letter if they are the start symbol of a regular language,
    otherwise with an initial lowercase letter."  But it seems like a
    fuzzy line.

The XML WG may have made errors in drawing the line, but whether a
particular language over the alphabet of Unicode characters is regular
or not, in the technical sense, should not be fuzzy.  The language
defined by a non-terminal in a regular right-part grammar is regular
if and only if non-terminals on the right-hand side can be replaced
(iteratively) until there is nothing there but terminal symbols (in
this case, Unicode characters or expressions like [a-zA-Z]).  This, in
turn, is the case if there is no recursion in the grammar rules (no
left-hand symbol turns up directly or indirectly in its own right-hand
side).

If all the symbols in a right-hand side are known to denote regular
languages, then the symbol on the left-hand side also denotes a
regular language; if any symbol on the right denotes a non-regular
language, then the language of the left-hand side symbol is
non-regular.
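
The no-recursion test described above can be sketched in Python. This is only an illustration, not part of the XML spec: the RULES table is a hypothetical, heavily abbreviated dependency map (each non-terminal mapped to the non-terminals that appear on its right-hand side), and the function names are invented here. Absence of recursion is a sufficient condition for regularity, and that is all the sketch checks.

```python
def reaches_itself(rules, start):
    """True if `start` turns up, directly or indirectly, in its own right-hand side."""
    stack = list(rules.get(start, ()))
    seen = set()
    while stack:
        sym = stack.pop()
        if sym == start:
            return True
        if sym in seen:
            continue
        seen.add(sym)
        stack.extend(rules.get(sym, ()))
    return False


def reachable(rules, start):
    """All non-terminals reachable from `start`, including `start` itself."""
    seen = set()
    stack = [start]
    while stack:
        sym = stack.pop()
        if sym in seen:
            continue
        seen.add(sym)
        stack.extend(rules.get(sym, ()))
    return seen


def is_recursion_free(rules, start):
    """True if no symbol reachable from `start` is recursive."""
    return not any(reaches_itself(rules, s) for s in reachable(rules, start))


# Hypothetical, abbreviated dependency map modelled on the XML 1.0 rules
# mentioned in this comment; many right-hand-side symbols are omitted.
RULES = {
    "VersionInfo": {"S", "Eq", "VersionNum"},
    "Eq": {"S"},
    "S": set(),
    "VersionNum": set(),
    "Name": set(),
    "doctypedecl": {"S", "Name", "intSubset"},
    "intSubset": {"markupdecl"},
    "markupdecl": {"elementdecl"},
    "elementdecl": {"S", "Name", "contentspec"},
    "contentspec": {"children"},
    "children": {"choice", "seq"},
    "choice": {"S", "cp"},
    "seq": {"S", "cp"},
    "cp": {"Name", "choice", "seq"},
}
```

With this table, is_recursion_free(RULES, "VersionInfo") holds, while is_recursion_free(RULES, "doctypedecl") does not, because doctypedecl reaches choice, seq and cp, which refer back to one another.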

Consider the examples above. The language defined by using the XML 1.0
grammar with 'doctypedecl' as start symbol (I'll just call this 'the
language of doctypedecl' or 'the language denoted by doctypedecl' in
what follows) is not regular, because a doctypedecl can contain an
internal subset (intSubset), which can contain element declarations
(via markupdecl and elementdecl), which can contain content models for
elements with element content (via contentspec and children).  Such
content models are not regular, because they require that opening and
closing parentheses match; there is indirect recursion in both choice,
and seq, through cp.  (Content models for mixed content are regular
because they can't have nested groups.)

So 'doctypedecl' is spelled with an initial lowercase letter.

'VersionInfo', by contrast, has an initial uppercase because it
denotes a regular language:  it can be written


[24] VersionInfo ::= (#x20 | #x9 | #xD | #xA)+ 'version' 
                     (#x20 | #x9 | #xD | #xA)+? '=' 
                     (#x20 | #x9 | #xD | #xA)+?  
                     ("'" '1.0' "'" | '"' '1.0' '"')
 
which has no non-terminals on the right-hand side.
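
Because the expanded rule contains only terminals, it can be transcribed directly into a conventional regular expression. The following Python sketch is one such transcription, offered as an illustration only; the names S, OPT_S, QUOTED and VERSION_INFO are chosen here and do not come from the XML spec.

```python
import re

# S ::= (#x20 | #x9 | #xD | #xA)+
S = r"[\x20\x09\x0D\x0A]+"
# The optional whitespace on either side of '=' (written "+?" in the EBNF above)
OPT_S = "(?:" + S + ")?"
# ("'" '1.0' "'" | '"' '1.0' '"')
QUOTED = r"""(?:'1\.0'|"1\.0")"""

VERSION_INFO = re.compile(S + "version" + OPT_S + "=" + OPT_S + QUOTED)

# Leading whitespace is required; whitespace around '=' is optional;
# the opening and closing quotes must be of the same kind.
print(bool(VERSION_INFO.fullmatch(" version='1.0'")))
print(bool(VERSION_INFO.fullmatch('\n version = "1.0"')))
print(bool(VERSION_INFO.fullmatch("version='1.0'")))
```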

That may not be the 'why' you had in mind, though.  

The distinction between regular and non-regular non-terminals was the
result of a compromise.  Someone (I'll leave the protagonists
anonymous) proposed that it would be easier to see how to write an XML
parser if we distinguished the lexical level and the grammar level
explicitly, so that interested parties could see at a glance where one
might plausibly draw the line between a lexer and a parser.  Even if
an implementor later decided to move that line, it would be convenient
to have an initial suggestion.  Someone else objected that different
implementors might choose to draw the lexer/parser line in different
places, and that trying to prescribe it, or even making a specific
suggestion, was a waste of time.  In the end, we agreed to distinguish
regular from non-regular sublanguages, on the theory that conventional
lexers typically recognize only regular languages.  The initial
capital letter effectively says "If you want to, you can conveniently
treat this non-terminal as a terminal symbol recognized by the lexer";
perhaps even more important, the initial lowercase letter says "If you
were thinking of treating this as a terminal symbol, using a
conventional lexer design, then forget it".

I gather that when XPath 1.0 was done, the XSL WG had no one who
thought that this was a worthwhile way to help implementors or
readers.  Myself, I find it helpful but not essential.

Comment 4 Scott Boag 2005-07-07 16:47:12 UTC

I agree that the change in capitalization should occur.  However, it will have
to be done during a period of spec freeze, as it will affect many of the documents.

I think the other comments in this list are pretty much editorial (I agree with
most of them) and I will apply them at the editor's discretion.

Comment 5 Michael Dyck 2005-07-07 19:27:16 UTC

Re changing capitalization:
Note that, with the XQuery grammar, there's a subtlety in determining whether a
symbol derives a regular language. For example, consider ModuleDecl:
    ModuleDecl ::= "module" "namespace" NCName "=" URILiteral Separator
All the symbols on the RHS derive regular languages, and this rule merely
concatenates them, so it would appear to derive a regular language also.
However, because implicit whitespace is allowed/required between those symbols,
and because that can include comments, and because comments nest, the language
derived by ModuleDecl is actually non-regular.
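
To illustrate why nesting comments take the language out of the regular class: recognizing them needs a counter that can grow without bound, which a finite automaton does not have. A minimal Python sketch, assuming the XQuery comment delimiters "(:" and ":)" and ignoring string literals and every other construct (the function name is invented here):

```python
def comments_balanced(text):
    """Return True if every "(:" in `text` is closed by a matching ":)".

    Simplified sketch: only comment delimiters are examined; string
    literals and all other XQuery syntax are ignored.
    """
    depth = 0  # current comment nesting depth; unbounded in general
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "(:":
            depth += 1
            i += 2
        elif pair == ":)" and depth > 0:
            depth -= 1
            i += 2
        else:
            i += 1
    return depth == 0


print(comments_balanced('module (: a (: nested :) note :) namespace'))
print(comments_balanced('module (: a (: nested :) namespace'))
```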

Comment 6 Scott Boag 2005-07-08 05:18:02 UTC

Editorial change notes:

> Instead, you might say
>    "... may help a parser differentiate various constructs"

Done

>  I think that the EBNF productions and the explanation of the EBNF notation
>    should each be split into a separate section.

Notation section moved to subsection following the EBNF.

> Pulling some of this together, how about restructuring the preamble into
> something like this:

done.

> I think these would be clearer if you didn't split each of the choices

I don't think the improvement warrants a change at this point.

> [142] StringLiteral
>     Change ('"' '"') to EscapeQuot.
>     Change ("'" "'") to EscapeApos.

Done.

>    If you factor out the overlap of ElementContentChar, QuotAttrContentChar,
>    and AposAttrContentChar, and push it over to CommonContent, I think the
>    result is simpler.

I don't think the improvement warrants a change at this point.

>   I wonder if it would help the reader if the "ws: explicit" productions (and
>   the intervening ones that don't care whether they're "ws: explicit" or not)
>   were put together in a group.

I don't think the improvement warrants a change at this point.

Comment 7 Scott Boag 2005-07-10 02:27:37 UTC

(In reply to comment #5)
> derived by ModuleDecl is actually non-regular.

Based on this, which is absolutely true, and having thought about it a bit more,
I withdraw my support for the proposal that the capitalization be changed.  I
don't think it would be an improvement.

Comment 8 Michael Dyck 2005-07-10 02:39:45 UTC

> I withdraw my support for the proposal that the capitalization be changed.

I'll just point out that I didn't propose that the capitalization be changed, 
merely that the difference from the XML spec be acknowledged and explained.

Comment 9 Scott Boag 2005-07-22 19:34:13 UTC

A joint meeting of the Query and XSLT working groups considered this comment on 
July 20, 2005.  

The WGs agreed to resolve these editorial issues as listed in my previous comment.

If you do not agree with this resolution, please add a comment explaining why.
If you wish to appeal the WG's decision to the Director, then change the Status
of the record to Reopened. If we do not hear from you in the next two weeks, we
will assume you agree with the WG decision.