This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
How to parse comments is still not clear, and is open to conflicting interpretations. The question is whether comments count as whitespace within a so-called "long token". The two possible interpretations are:

1. No, comments are not permitted in long tokens.

Evidence: "Building a tokenizer for XPath and XQuery" (4 Apr 2005), section 1.2.1 "Token granularity", second paragraph through the end, says:

A scanner might take one of two approaches for assigning token units to the character stream: . . . [Definition: Long tokens. Using this approach, declare namespace would be considered a single token.] In this case, a parser that has a look-ahead of only one token can be implemented.

This passage is not definitive because there is no definition of "token", but some readers might reasonably think that it means that the lexer does not expect to encounter a comment between "declare" and "namespace". That impression is corroborated by section 2.1.1 "XQuery lexical states", first table "the DEFAULT state", second row, which lists the pattern <"declare" "namespace">. It is reinforced by the fact that the tables do contain explicit provision for supporting some comments, for example, the row later in the DEFAULT table that contains the pattern "(:", and similar rows in many other state tables. Thus one can argue that if the intent was to permit comments in long tokens, there would be explicit support for them in these tables. Since the tables clearly do not support comments in long tokens, there is no need for an implementation to support them either.

2. The opposite opinion is that comments are permitted within long tokens.

Evidence: XQuery language spec (4 April 2005), section 2.6 "Comments", says "A comment may be used anywhere ignorable whitespace is allowed."
The hot link for "ignorable whitespace" takes you to A.2.2 "Whitespace rules", which says:

[Definition: Unless otherwise specified (see A.2.2.2 Explicit Whitespace Handling), Ignorable whitespace may occur between terminals, and is not significant to the parse tree. For readability, whitespace may be used in most expressions even though not explicitly notated in the EBNF. All allowable whitespace that is not explicitly specified in the EBNF is ignorable whitespace, and converse, this term does not apply to whitespace that is explicitly specified. ] ... Comments may also act as "whitespace" to prevent two adjacent terminals from being recognized as one.

The hot link for "terminal" takes you to the following definition:

[Definition: A terminal is a single unit of the grammar that can not be further subdivided, and is specified in the EBNF by a character or characters in quotes, or a regular expression.]

The relevant EBNF for my running example is

[10] NamespaceDecl ::= <"declare" "namespace"> ...

By the definition of terminal, "declare" and "namespace" are two terminals. EBNF [10] is not marked as "explicit whitespace", therefore comments are permitted between "declare" and "namespace".

Why this is important: this is a serious usability issue. If users cannot put comments between terminals in long tokens, they will need to be careful where they put comments in their XQuery expressions. They will probably need a reference card, since there is such an extensive list of long tokens. In addition, they will be prevented from placing comments in some of the most natural places. Aggravating the situation, some vendors will permit comments in long tokens, even if XQuery does not. This will lead their users to write nonportable XQuery expressions, which will cause syntax errors when supposedly debugged applications are migrated, or simply deployed into a heterogeneous environment.

Proposed solution: comments are permitted within long tokens.
The definition of "long token" in section 1.2.1 should be enhanced with a statement that comments are permitted between the "subtokens" of a long token, such as "declare" and "namespace". In addition, the lexical state tables in section 2.1.1 should be enhanced to handle comments in long tokens.

An idea for doing this is to define a pattern for ignorable whitespace, in the same fashion that the tables presume a pattern called QName. Let us call this pattern IW. Given such a pattern, the actual long token is <"declare" IW "namespace">. Note that if we have a pattern for ignorable whitespace, then the current rows for (: and (# do not belong in the tables, since comments and pragmas are now handled by the IW pattern.

Since IW is not a regular expression, owing to the ability to nest comments, the specification should also give the reader guidance on how to recognize IW. IW can be recognized by a stack machine, so the current set of rules for handling (: and (# could be placed in an entirely new set of tables, which describe only IW. Note that this idea implies that the complete lexer is running a low-level stack automaton to detect IW, and then a high-level stack automaton as described in section 2.1.1. The rules for IW should be in a separate section from 2.1.1 to make clear that they form a preliminary stage of the lexer, before the final stage.

Alternatively, if the two-stack design is not agreeable, then a single stack can be used, at the cost of a lot more states. For example, the pattern <"declare" "(:"> needs to enter a state that looks for the matching :) after which it can pop and continue looking for the word to come after "declare". If it does not find an appropriate word, then it can rewind the scan and decide that "declare" was not a keyword after all. You need a separate state for every juncture at which a comment might appear, so that you can keep track of how much of a long token has already been recognized.
Personally, I think the number of states would be prohibitive.
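To make the IW idea in this report concrete, here is a minimal Python sketch. It is not taken from either specification. The names skip_iw and match_long_token are hypothetical, the name-character test is simplified, and pragmas "(#" are left out. It assumes that IW is any run of whitespace and nested (: ... :) comments, and that a failed match lets the caller rewind and treat "declare" as an ordinary name.

```python
def skip_iw(text, pos=0):
    """Return the index just past any whitespace and nested (: ... :) comments.

    Hypothetical helper. A depth counter plays the role of the stack,
    because comments are the only construct that nests here.
    """
    depth = 0
    while pos < len(text):
        if text.startswith("(:", pos):
            depth += 1
            pos += 2
        elif depth and text.startswith(":)", pos):
            depth -= 1
            pos += 2
        elif depth or text[pos].isspace():
            pos += 1
        else:
            break
    if depth:
        raise ValueError("unterminated comment")
    return pos


def match_long_token(text, pos, words):
    """Match the words in order, allowing IW between them.

    Returns the end index on success, or None so that the caller can
    rewind to pos and treat the first word as an ordinary name.
    """
    for i, word in enumerate(words):
        if i > 0:
            pos = skip_iw(text, pos)
        end = pos + len(word)
        if not text.startswith(word, pos):
            return None
        # Reject a longer name such as "declared" (simplified name-character test).
        if end < len(text) and (text[end].isalnum() or text[end] in "_-."):
            return None
        pos = end
    return pos
```

With these definitions, match_long_token('declare (: why :) namespace x = "u";', 0, ["declare", "namespace"]) returns the index just past "namespace", while the same call on "declare function" returns None.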
(In reply to comment #0)

I agree with your analysis. Certainly the intent and specific decision of the working groups is that comments be allowed in so-called long tokens.

I don't think the two-pass approach you suggested, if I understand it, works very well, because you have to be aware of the context to recognize a comment. For instance, the apparent comment could occur in a string or in element content. So you would have to do at least a partial parse to remove the comments.

My current thinking is that we don't use the term "long token" at all, and specify <"aa" "bb"> to mean look-ahead, i.e. you only recognize "aa" if followed by "bb".

I plan to be doing a lot of work on this in the next three weeks, so I'll follow up on this issue after that, with a more concrete proposal.

-scott
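The two points in this reply, context-aware comment recognition and look-ahead in place of long tokens, can be sketched together in Python. This is only an illustration under simplifying assumptions, not the working groups' design. The TOKEN patterns, tokens and is_keyword_pair are hypothetical names, and only names, double-quoted strings and the symbols "=" and ";" are handled.

```python
import re

# Hypothetical, heavily simplified token patterns.
TOKEN = re.compile(r'''
    (?P<string>"(?:[^"]|"")*")      # string literal; "(:" inside it is plain text
  | (?P<name>[A-Za-z_][\w.-]*)      # keywords and names share one pattern
  | (?P<sym>[=;])
''', re.VERBOSE)


def tokens(text):
    """Tokenize in one pass, skipping whitespace and nested comments
    only where a token could start, never inside a string literal."""
    out, pos, depth = [], 0, 0
    while pos < len(text):
        if text.startswith("(:", pos):
            depth += 1
            pos += 2
        elif depth and text.startswith(":)", pos):
            depth -= 1
            pos += 2
        elif depth or text[pos].isspace():
            pos += 1
        else:
            m = TOKEN.match(text, pos)
            if m is None:
                raise ValueError(f"unexpected character at {pos}")
            out.append((m.lastgroup, m.group()))
            pos = m.end()
    if depth:
        raise ValueError("unterminated comment")
    return out


def is_keyword_pair(toks, i, first, second):
    """Look-ahead: treat toks[i] as the keyword `first` only if the
    next token is `second`; otherwise it stays an ordinary name."""
    return (i + 1 < len(toks)
            and toks[i] == ("name", first)
            and toks[i + 1] == ("name", second))
```

Because comments are dropped before the look-ahead test runs, 'declare (: note :) namespace x = "a (: not a comment :) b";' yields "declare" followed directly by "namespace", and the string literal keeps its "(:" text intact.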
Thanks for the comment. The XML Query and XSL Working Groups discussed this issue during this morning's meeting.

We agree that the correct interpretation here is that comments and other token separators are indeed allowed within the so-called 'long tokens' of the grammar. (Some members of the WGs suggest that there is really no uncertainty as to the answer, if only because all the evidence on one side comes from a normative document, and all the evidence on the other side is from a document marked non-normative. But we agree that the non-normative document can usefully be made clearer on this question.)

Since the interpretation agreed upon requires no changes to the language documents, we are closing this issue without any instructions to the editors to change the normative documents. (Also with the expectation that the next revision of the document about tokenization will be clearer on this topic.)

Please let us know if you agree with this resolution of your issue, by adding a comment to the issue record and changing the Status of the issue to Closed. Or, if you do not agree with this resolution, please add a comment explaining why. If you wish to appeal the WG's decision to the Director, then also change the Status of the record to Reopened. If you wish to record your dissent, but do not wish to appeal the decision to the Director, then change the Status of the record to Closed. If we do not hear from you in the next two weeks, we will assume you agree with the WG decision.