1617 – how are comments really parsed?

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED WORKSFORME

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	XQuery/XPath Tokenizer
Version:	Last Call drafts
Hardware:	PC Windows 2000

Importance:	P2 normal
Target Milestone:	---
Assignee:	Scott Boag
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2005-07-15 00:50 UTC by Fred Zemke
Modified:	2005-10-06 16:08 UTC
CC List:	0 users

Description Fred Zemke 2005-07-15 00:50:15 UTC

How to parse comments is still not clear, and is open to
conflicting interpretations.  The question is whether comments
count as whitespace within a so-called "long token".

The two possible interpretations are:

1. No, comments are not permitted in long tokens.
Evidence:  "Building a tokenizer for XPath and XQuery" (4 Apr 2005)
section 1.2.1 "Token granularity" second para through the end
says:

A scanner might take one of two approaches for assigning token
units to the character stream:
. . .
[Definition: Long tokens. Using this approach, declare namespace
would be considered a single token.] In this case, a parser that
has a look-ahead of only one token can be implemented.

This passage is not definitive because there is no definition
of "token", but some readers might reasonably think that it
means that the lexer does not expect to encounter a comment
between "declare" and "namespace".  That impression is
corroborated by section 2.1.1 "XQuery lexical states", first
table "the DEFAULT state", second row, which lists the pattern
<"declare" "namespace">.  This impression is reinforced by the
fact that the tables do contain explicit provision for
supporting some comments, for example, the row later in the
DEFAULT table that contains the pattern "(:", and many other
state tables.  Thus one can argue that if the intent was to
permit comments in long tokens, there would be explicit support
for them in these tables.  Since the tables clearly do not support
comments in long tokens, there is no need for an implementation
to support them either.

2. The opposite opinion is that comments are permitted within
long tokens.  Evidence:  XQuery language spec (4 April 2005),
section 2.6 "Comments", says
"A comment may be used anywhere ignorable whitespace
is allowed."  The hot link for "ignorable whitespace" takes
you to A.2.2 "Whitespace rules", which says:

[Definition: Unless otherwise specified (see A.2.2.2 Explicit
Whitespace Handling), Ignorable whitespace may occur
between terminals, and is not significant to the parse tree.
For readability, whitespace may be used in most expressions
even though not explicitly notated in the EBNF. All allowable
whitespace that is not explicitly specified in the EBNF is ignorable
whitespace, and converse, this term does not apply to whitespace
that is explicitly specified. ]  ... Comments may also act as
"whitespace" to prevent two adjacent terminals from being
recognized as one.

The hot link for "terminal" takes you to the following definition:

[Definition: A terminal is a single unit of the grammar that can not
be further subdivided, and is specified in the EBNF by a character
or characters in quotes, or a regular expression.]

The relevant EBNF for my running example is

[10] NamespaceDecl ::= <"declare" "namespace"> ...

By the definition of terminal, "declare" and "namespace" are two
terminals.  EBNF [10] is not marked as "explicit whitespace",
therefore comments are permitted between "declare" and
"namespace".

Why this is important: this is a serious usability issue.
If users cannot put comments between terminals in long tokens,
they will need to be careful where they put comments in their
XQuery expressions.  They will probably need a reference card,
since there is such an extensive list of long tokens.  In
addition, they will be prevented from placing comments in some
of the most natural places.

Aggravating the situation, some vendors will permit comments
in long tokens, even if XQuery does not.  This will lead their
users to write nonportable XQuery expressions, which will cause
syntax errors when supposedly debugged applications are migrated,
or simply deployed into a heterogeneous environment.

Proposed solution: comments are permitted within long tokens.
The definition of "long token" in section 1.2.1 should be 
enhanced with a statement that comments are permitted between
the "subtoken"s of a long token, such as "declare" and "namespace".
In addition, the lexical state tables in section 2.1.1 should be
enhanced to handle comments in long tokens.  An idea
for doing this is to define a pattern for ignorable whitespace,
in the same fashion that the tables presume a pattern called
QName.  Let us call this pattern IW.  Given such a pattern, then
the actual long token is <"declare" IW "namespace">.  

Note that if we have a pattern for ignorable whitespace, then 
the current rows for (: and (# do not belong in the tables,
since comments and pragmas are now handled by the IW pattern.

Since IW is not a regular expression, owing to the ability to
nest comments, the specification should also give the reader
guidance on how to recognize IW.  IW can be recognized by a 
stack machine, so the current set of rules for handling (:
and (# could be placed in an entirely new set of tables,
which describe only IW.  Note that this idea implies that the 
complete lexer is running a low-level stack automaton to
detect IW, and then a high-level stack automaton as described 
in section 2.1.1.  The rules for IW should be in a separate 
section from 2.1.1 to make clear that they form a preliminary
stage to the lexer, before the final stage.
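
The nesting argument above can be made concrete with a minimal sketch
in Python.  It assumes IW consists only of XML whitespace and nested
(: ... :) comments; pragmas are left out for brevity.  The name
skip_iw is hypothetical and does not come from any specification, and
a plain depth counter stands in for the stack.

```python
XML_WS = " \t\r\n"  # the four XML whitespace characters


def skip_iw(text: str, pos: int = 0) -> int:
    """Return the index just past any whitespace and (: ... :) comments at pos.

    Comments may nest, so a depth counter is kept (a stack that only ever
    holds one kind of symbol). Raises ValueError if a comment never closes.
    Assumption: pragmas "(# ... #)" are not handled in this sketch.
    """
    depth = 0
    n = len(text)
    while pos < n:
        if text.startswith("(:", pos):
            depth += 1          # entering a (possibly nested) comment
            pos += 2
        elif depth > 0 and text.startswith(":)", pos):
            depth -= 1          # leaving one nesting level
            pos += 2
        elif depth > 0 or text[pos] in XML_WS:
            pos += 1            # comment body or plain whitespace
        else:
            break               # first significant character
    if depth:
        raise ValueError("unterminated comment")
    return pos
```

For example, skip_iw("declare (: x :) namespace", 7) returns the index
at which "namespace" begins, which is what a rule of the form
<"declare" IW "namespace"> would need.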

Alternatively, if the two-stack design is not agreeable, 
then a single stack can be used, at the cost of a lot more
states.  For example, the pattern <"declare" "(:"> needs to
enter a state that looks for the matching :) after which
it can pop and continue looking for the word to come after
"declare".  If it does not find an appropriate word, then
it can rewind the scan and decide that "declare" was not
a keyword after all.  You need a separate state for every
juncture at which a comment might appear, so that you can keep
track of how much of a long token has already been recognized.
Personally, I think the number of states would be prohibitive.
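
A simplified Python sketch of the single-pass idea with rewind.  This
is not the table-driven automaton itself: the loop index stands in for
the per-juncture states, and a depth counter stands in for the stack.
The names match_long_token and _skip_separators are hypothetical, and
the name-character test is only an approximation of NCName.

```python
XML_WS = " \t\r\n"


def _skip_separators(text, pos):
    """Skip whitespace and nested (: ... :) comments; None if one never closes."""
    depth = 0
    while pos < len(text):
        if text.startswith("(:", pos):
            depth, pos = depth + 1, pos + 2
        elif depth and text.startswith(":)", pos):
            depth, pos = depth - 1, pos + 2
        elif depth or text[pos] in XML_WS:
            pos += 1
        else:
            break
    return pos if depth == 0 else None


def _is_name_char(ch):
    # Rough approximation of an NCName character (an assumption of this sketch).
    return ch.isalnum() or ch in "_-."


def match_long_token(text, start, words):
    """Return the end index of the long token, or None to signal a rewind.

    The loop index i records how much of the long token has been seen,
    which is the bookkeeping the per-juncture states would otherwise do.
    """
    pos = start
    for i, word in enumerate(words):
        if i > 0:
            pos = _skip_separators(text, pos)
            if pos is None:
                return None          # unterminated comment: rewind
        if not text.startswith(word, pos):
            return None              # expected word missing: rewind
        end = pos + len(word)
        if end < len(text) and _is_name_char(text[end]):
            return None              # word is only a prefix of a longer name
        pos = end
    return pos
```

A caller that gets None back leaves its scan position at start and
treats "declare" as an ordinary name, which is the rewind described
above.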

Comment 1 Scott Boag 2005-07-19 21:09:42 UTC

(In reply to comment #0)

I agree with your analysis.

Certainly the intent and specific decision of the working groups is that
comments be allowed in so-called long tokens.

I don't think the two-pass approach you suggested, if I understand it, works
very well, because you have to be aware of the context to recognize a comment...
for instance, the comment could occur in string or element content.  So you
would have to do at least a partial parse to remove the comments.
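
The context problem can be illustrated with a small Python sketch.
The function strip_comments and its string_aware flag are hypothetical
names used only for this example.  With string_aware=False it behaves
like a context-free first pass; with string_aware=True it copies
quoted string literals through untouched, but it still knows nothing
about element content, so it is not a complete answer either.

```python
def strip_comments(query: str, string_aware: bool) -> str:
    """Remove (: ... :) spans, optionally leaving string literals untouched."""
    out, depth, i = [], 0, 0
    while i < len(query):
        ch = query[i]
        if string_aware and depth == 0 and ch in "\"'":
            # Copy the literal verbatim up to its closing quote.
            close = query.find(ch, i + 1)
            end = len(query) if close == -1 else close + 1
            out.append(query[i:end])
            i = end
        elif query.startswith("(:", i):
            depth += 1
            i += 2
        elif depth and query.startswith(":)", i):
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(ch)   # keep only text outside comments
            i += 1
    return "".join(out)


query = 'let $s := "(: not a comment :)" return $s'
print(strip_comments(query, string_aware=False))  # let $s := "" return $s
print(strip_comments(query, string_aware=True))   # query is printed unchanged
```

The context-free pass silently changes the value of the string
literal, which is the failure described above.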

My current thinking is that we don't use the term "long token" at all, and 
specify <"aa" "bb"> to mean look-ahead, i.e. you only recognize "aa" if followed
by "bb".
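
One way to read the look-ahead idea, as a Python sketch under stated
assumptions: only the first word is consumed, and it is classified as
a keyword only if the next significant text is one of its expected
followers.  The LOOKAHEAD table, classify_word and the helper names are
hypothetical, and the follower list is illustrative rather than taken
from the grammar.

```python
XML_WS = " \t\r\n"

# Illustrative look-ahead table: first word -> words that may follow it.
LOOKAHEAD = {"declare": ("namespace", "variable", "function")}


def _skip_separators(text, pos):
    """Skip whitespace and nested (: ... :) comments; None if one never closes."""
    depth = 0
    while pos < len(text):
        if text.startswith("(:", pos):
            depth, pos = depth + 1, pos + 2
        elif depth and text.startswith(":)", pos):
            depth, pos = depth - 1, pos + 2
        elif depth or text[pos] in XML_WS:
            pos += 1
        else:
            break
    return pos if depth == 0 else None


def _is_word_at(text, pos, word):
    """True if `word` starts at pos and is not a prefix of a longer name."""
    end = pos + len(word)
    if not text.startswith(word, pos):
        return False
    # Approximate name-character test (an assumption of this sketch).
    return end == len(text) or not (text[end].isalnum() or text[end] in "_-.")


def classify_word(text, pos, word):
    """Return (kind, end) where kind is "KEYWORD" or "NAME".

    Only `word` is consumed; the follower is merely peeked at, so the
    scanner resumes right after `word` in either case.
    """
    end = pos + len(word)
    nxt = _skip_separators(text, end)
    followers = LOOKAHEAD.get(word, ())
    if nxt is not None and nxt > end and any(
        _is_word_at(text, nxt, f) for f in followers
    ):
        return "KEYWORD", end
    return "NAME", end
```

With this reading, classify_word("declare (: c :) namespace p = 'u';",
0, "declare") yields ("KEYWORD", 7), while the same call on
"declare + 1" yields ("NAME", 7).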

I plan to be doing a lot of work on this in the next three weeks, so I'll follow
up this issue more after that, with a more concrete proposal.

-scott

Comment 2 C. M. Sperberg-McQueen 2005-07-20 17:45:06 UTC

Thanks for the comment.  The XML Query and XSL Working Groups
discussed this issue during this morning's meeting.

We agree that the correct interpretation here is that comments
and other token separators are indeed allowed within the
so-called 'long tokens' of the grammar.  (Some members of the
WGs suggest that there is really no uncertainty as to the 
answer, if only because all the evidence on one side comes from
a normative document, and all the evidence on the other 
side is from a document marked non-normative.  But we agree
that the non-normative document can usefully be made clearer
on this question.)

Since the interpretation agreed upon requires no changes to the
language documents, we are closing this issue without any
instructions to the editors to change the normative documents.
(Also with the expectation that the next revision of the
document about tokenization will be clearer on this topic.)

Please let us know if you agree with this resolution of your 
issue, by adding a comment to the issue record and changing 
the Status of the issue to Closed. Or, if you do not agree 
with this resolution, please add a comment explaining why. 
If you wish to appeal the WG's decision to the Director, 
then also change the Status of the record to Reopened. If 
you wish to record your dissent, but do not wish to appeal 
the decision to the Director, then change the Status of the 
record to Closed. If we do not hear from you in the next 
two weeks, we will assume you agree with the WG decision.