W3C

Building a Tokenizer for XPath or XQuery

W3C Working Draft 4 April 2005

This version:
http://www.w3.org/TR/2005/WD-xquery-xpath-parsing-20050404/
Latest version:
http://www.w3.org/TR/xquery-xpath-parsing/
Editor:
Scott Boag (XSL WG), IBM Research <scott_boag@us.ibm.com>

Abstract

This document describes possible strategies for tokenizing the [XML Path Language (XPath) 2.0] and [XQuery 1.0: An XML Query Language] languages, and is provided as a helpful guide to those who are designing an implementation for these languages, and as background material for the normative EBNF found in the language specifications. In the future this document may be expanded to cover more general parsing strategies.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is a public W3C Working Draft for review by W3C Members and other interested parties. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

Building a Tokenizer for XPath or XQuery has been defined through the efforts of a joint task force of the XML Query Working Group and the XSL Working Group (both part of the XML Activity). It is designed to be read in conjunction with the following documents: [XML Path Language (XPath) 2.0] and [XQuery 1.0: An XML Query Language].

This is the first public Working Draft of these guidelines. Public comments on this document are invited. Comments should be entered into the last-call issue tracking system for this specification (instructions can be found at http://www.w3.org/XML/2005/04/qt-bugzilla). If access to that system is not feasible, you may send your comments to the W3C XPath/XQuery mailing list, public-qt-comments@w3.org (archived at http://lists.w3.org/Archives/Public/public-qt-comments/), with “[Parsing]” at the beginning of the subject field.

As of this publication, the Working Group expects to eventually publish this document as a Working Group Note. It is not expected to become a W3C Recommendation, and therefore it has no associated W3C Patent Policy licensing obligations. If this expectation changes, the Working Group will have an opportunity to fulfill the associated patent policy requirements with respect to a future draft.

Table of Contents

1 Introduction
    1.1 Structure of Parsers
    1.2 Lexical Scanning Strategies
        1.2.1 Token Granularity
    1.3 Parser Strategies
2 Lexical Processing
    2.1 Lexical States
        2.1.1 XQuery Lexical States
        2.1.2 XPath Lexical States

Appendices

A XPath/XQuery Grammar
    A.1 EBNF
        A.1.1 XPath EBNF
        A.1.2 XQuery EBNF
B References
    B.1 Main References
C Glossary (Non-Normative)

End Notes


1 Introduction

The XQuery language is, for the most part, a super-set of the smaller XPath language [1]. XPath is designed to be a simple addressing and expression language, while XQuery includes prologue declarations, function, variable, and extension declarations, XML element construction, and advanced iteration constructs. These languages do not contain reserved keywords, which is to say data references in the form of QNames are designed to never overlap with keywords in the language. In order to achieve the combination of expression language, directory-style addressing, prologue declarations, type matching, and XML element construction, the meaning of the tokens are extremely context dependent. For instance, the string div may be an element reference, or may be a division operator. Likewise, the * might be a wildcard element reference, may be a multiplication operator, or may be a kleene operator.

As illustration, the following is a very unlikely, but legal, XQuery example:

declare namespace namespace = "http://example.com"; 
declare union <union>for gibberish {
   for $for in for return <for>***div div</for>
}</union>, 
if(if) then then else else- +-++-**-* instance 
of element(*)* * * **---++div- div -div

1.1 Structure of Parsers

XPath and XQuery parsers may be either top-down or bottom-up. Traditionally, the parsing process is decomposed into subcomponents:

  1. [Definition: Lexical scanner, which assigns symbols known as tokens to units in the input, in essence recognizing the basic words and punctuation of the language. Typically, the lexer uses regular expressions as a specification for how to do this tokenization. The rules that govern the lexical structure of the language, sometimes known as microsyntax, should be simple, regular, and as context-free as possible.]

  2. [Definition: Parser, which arranges the tokens into a hierarchical structure. Typically, the parser uses a BNF or variant description (EBNF in the case of XPath/XQuery) as a specification for this parsing.]

Since the tokenization of XPath/XQuery is so context dependent, this breakdown may not be as simple as with other programming languages. Furthermore, the granularity of tokenization is an issue.

1.2 Lexical Scanning Strategies

We can speculate on an enumeration of some strategies for implementing a lexical scanner as follows:

  1. [Definition: Scan-while-parse scanner. Using this strategy, the parser would evaluate the character stream into tokens, while at the same time branching to the respective hierarchical context. Thus the words can be discovered in context. Complex parsers are typically not built in this fashion, but, given the issue of context sensitivity, this may be a reasonable approach. This would probably be implemented as a hand-constructed recursive-descent parser. ]

  2. [Definition: Fuzzy token scanner. Using this strategy, a typical simple lexical scanner is implemented, but the meaning of the tokens are often left somewhat ambiguous. Thus, the word div might be assigned the symbol QNameOrKeyword. The words can be disambiguated during the context sensitive parse process. This approach would probably be a hand-constructed, Top-down parser.]

  3. [Definition: Two-pass scanner. This is the same as the Fuzzy token scanner, but the ambiguous words can be disambiguated by a second scan, using some rules (probably expressed as a context-free grammar) that reflect the basic structure of the EBNF (call this the three-pass approach). The three-pass approach could conceivably be implemented via a parser constructor program, with the addition of a custom disambiguator in between the lexer and parser. This method could be used for either Top-down or Bottom-up techniques.]

  4. [Definition: State-driven scanner. This approach recognizes sets of words only in certain states, each word or punctuation possibly transitioning to a next state. In other words, it implements lexical recognition as transition tables for a push-down automata.]

1.2.1 Token Granularity

In XPath/XQuery, the meaning of a word may often be determined by the words and punctuation ahead of it. Thus, the word if is only the "if" keyword if it is followed by a left-parenthesis (ignoring whitespace). Likewise, the word declare may be a QName, or may be part of a NamespaceDecl in the prolog, if followed by a namespace word. Inversely, namespace is determined to be a keyword verses a QName by it's appearance after declare. In the XQuery and XPath specifications, in the EBNF appendix, the notation "< ... >" is used to indicate a grouping of terminals that together may help disambiguate the individual symbols.

A scanner might take one of two approaches for assigning token units to the character stream:

  • [Definition: Short tokens. Using this approach, declare namespace would be considered two tokens, the exact symbols assigned based on lexical look-ahead (and look-behind).] In this case, a parser that has at least a look-ahead of two tokens is required (in particular, a two-token look-ahead should be sufficient for a LALR parser).

  • [Definition: Long tokens. Using this approach, declare namespace would be considered a single token.] In this case, a parser that has a look-ahead of only one token can be implemented.

1.3 Parser Strategies

The general strategies for implementing the parser are as follows:

  1. Recursive-descent parser. For our purposes, this term means a hand-constructed, recursive descent parser.

  2. [Definition: LL Parser (Left to right, Leftmost derivation, hence LL). ] The XPath/XQuery EBNF is expressed as an LL(1) grammar, so an LL parser can use the EBNF from the specifications fairly directly.

  3. [Definition: LALR Parser (Look Ahead Left to right, Rightmost derivation, hence LALR). ] The XPath/XQuery EBNF is expressed as an LL(1) grammar, so an LALR parser must use what is essentially an inverted form of the original EBNF.

2 Lexical Processing

2.1 Lexical States

Many lexical scanner generators will allow specifications of tokens that are recognized based on some condition. It is possible to condition the XQuery and XPath token recognition in a lexical scanner in terms of a series of states and transitions between those states.

[Definition: A lexical state is a condition where a defined set of patterns are recognized.] The tables below define the complete lexical rules for [XPath/XQuery]. Each two column table corresponds to a lexical state and lists in the first column the tokens that are recognized when in that state. Any token not listed should result in a syntax error when in that state. When a given token is recognized in the given state, the transition to the next state and/or a lexical action is given in the right column. [Definition: A lexical action is an action that occurs as an side-effect of a pattern recognition.] Following are a list of lexical actions used in the tables:

pushState()

The current state, before the transition, is pushed onto an abstract stack.

pushState(STATE)

A STATE parameter is pushed onto an abstract stack.

popState()

The most recently pushed state is removed from the top of the stack, and becomes the current state.

input_stream.backup(n)

This is a special action that backs up the character stream by n characters. See the description of the OCCURRENCEINDICATOR state for more information.

(maintain state)

This simply means that the current state is maintained with no modification.

The lexical states have, in many cases, close connection to the parser productions. However, just because a token is recognized in a certain lexical state, does not mean it will be legal in the current EBNF production.

The XQuery EBNF has some places where whitespace has explicit specification, which is to say, whitespace is not ignored in these states: START_TAG, END_TAG, ELEMENT_CONTENT, XML_COMMENT, PROCESSING_INSTRUCTION, PROCESSING_INSTRUCTION_CONTENT, CDATA_SECTION, QUOT_ATTRIBUTE_CONTENT, and APOS_ATTRIBUTE_CONTENT.

Editorial note  
The following tables were hand-constructed and have not, at the time of this writing, been exhaustively verified against all possible paths that may be legal in the XQuery and XPath EBNFs, which is to say, it is possible they contain bugs.

2.1.1 XQuery Lexical States

The DEFAULT State

This state is for patterns that occur at the beginning of an expression or subexpression.

Pattern Transition To State
DecimalLiteral, "..", ".", DoubleLiteral, IntegerLiteral, NotNumber, <NCName ":" "*">, QName, ")", <"*" ":" NCName>, "*", StringLiteral, <"declare" "construction">, <"declare" "default" "order">
OPERATOR
<"declare" "default" "collation">, <"declare" "namespace">, <"module" "namespace">, <"declare" "base-uri">
NAMESPACEDECL
<"declare" "default" "element">, <"declare" "default" "function">, <"import" "schema">, <"import" "module">, <"declare" "copy-namespaces">
NAMESPACEKEYWORD
"$", <"for" "$">, <"let" "$">, <"some" "$">, <"every" "$">
VARNAME
<"declare" "variable" "$">
VARNAME
<")" "as">
ITEMTYPE
<"element" "(">, <"attribute" "(">, <"schema-element" "(">, <"schema-attribute" "(">, <"comment" "(">, <"text" "(">, <"node" "(">, <"document-node" "(">
pushState(OPERATOR)
KINDTEST
<"processing-instruction" "(">
pushState(OPERATOR)
KINDTESTFORPI
"<!--"
pushState(OPERATOR)
XML_COMMENT
"<?"
pushState(OPERATOR)
PROCESSING_INSTRUCTION
"<![CDATA["
pushState(OPERATOR)
CDATA_SECTION
"<"
pushState(OPERATOR)
START_TAG
<"declare" "boundary-space">
XMLSPACE_DECL
"}"
popState()
<"validate" "{">, <"validate" ValidationMode>
pushState(OPERATOR)
DEFAULT
<"typeswitch" "(">
DEFAULT
<"element" "{">, <"attribute" "{">
pushState(OPERATOR)
DEFAULT
<"attribute" QName "{">, <"element" QName "{">, <"document" "{">, <"text" "{">, <"processing-instruction" "{">, <"processing-instruction" NCName "{">, <"comment" "{">
pushState(OPERATOR)
<"declare" "function">
DEFAULT
"(:"
pushState()
EXPR_COMMENT
"{", <"ordered" "{">, <"unordered" "{">
pushState(OPERATOR)
DEFAULT
<"declare" "ordering">
DECLAREORDERING
<"xquery" "version">
XQUERYVERSION
"(#"
PRAGMA
<"declare" "option">
OPTION
<"at" URILiteral>
NAMESPACEDECL
";", ",", "(", <QName "(">, <"if" "(">, "-", "+", "//", "/", <"ancestor-or-self" "::">, <"ancestor" "::">, <"attribute" "::">, <"child" "::">, <"descendant-or-self" "::">, <"descendant" "::">, <"following-sibling" "::">, <"following" "::">, <"parent" "::">, <"preceding-sibling" "::">, <"preceding" "::">, <"self" "::">, "@"
DEFAULT

The DECLAREORDERING State

Special state to recognize declare ordering specific keywords.

Pattern Transition To State
"ordered", "unordered"
DEFAULT

The OPERATOR State

This state is for patterns that are defined for operators.

Pattern Transition To State
"{"
pushState(OPERATOR)
DEFAULT
";", "then", "else", "external", "and", "at", ":=", ",", "div", "=", "except", "eq", "ge", "gt", "le", "lt", "ne", ">=", ">>", ">", "idiv", "intersect", "in", "is", "[", "<=", "<<", "<", "-", "mod", "*", "!=", <"order" "by">, <"stable" "order" "by">, "or", "+", "return", "satisfies", "//", "/", "to", "union", "|", "where", ("preserve" | "strip")
DEFAULT
<"castable" "as">, <"cast" "as">
SINGLETYPE
<"instance" "of">, <"treat" "as">, "case", "as", <")" "as">
ITEMTYPE
"}"
popState()
"$", <"for" "$">, <"let" "$">
VARNAME
"(:"
pushState()
EXPR_COMMENT
")", "?", <"empty" "greatest">, <"empty" "least">, "ascending", "descending", "default", "]"
OPERATOR
"collation"
URITOOPERATOR
StringLiteral, NotOperatorKeyword
(maintain state)

The XQUERYVERSION State

This state is for recognition of XQuery version specific keywords.

Pattern Transition To State
";"
DEFAULT
StringLiteral, "encoding"
XQUERYVERSION

The NAMESPACEDECL State

This state occurs inside of a namespace declaration, and is needed to recognize a NCName that is to be used as the prefix, as opposed to allowing a QName to occur. (Otherwise, the difference between NCName and QName are ambiguous.)

Pattern Transition To State
URILiteral, ",", <"at" URILiteral>
NAMESPACEDECL
"(:"
pushState()
EXPR_COMMENT
"=", NCName
NAMESPACEDECL
";"
DEFAULT

The URITOOPERATOR State

This state is to recognize a URILiteral that transits to the OPERATOR state.

Pattern Transition To State
URILiteral
OPERATOR

The NAMESPACEKEYWORD State

This state occurs at places where the keyword "namespace" is expected, which would otherwise be ambiguous compared to a QName. QNames can not occur in this state.

Pattern Transition To State
URILiteral
NAMESPACEDECL
"inherit", "no-inherit"
DEFAULT
"namespace"
NAMESPACEDECL
"(:"
pushState()
EXPR_COMMENT
<"default" "element">, "preserve", "no-preserve", ","
NAMESPACEKEYWORD

The XMLSPACE_DECL State

This state occurs at places where the keywords "preserve" and "strip" is expected to support "declare xmlspace". QNames can not occur in this state.

Pattern Transition To State
"preserve", "strip"
DEFAULT
"(:"
pushState()
EXPR_COMMENT

The SINGLETYPE State

This state distinguishes tokens that can occur only inside the SingleType production.

Pattern Transition To State
QName
OPERATOR
"(:"
pushState()
EXPR_COMMENT

The ITEMTYPE State

This state distinguishes tokens that can occur only inside the ItemType production.

Pattern Transition To State
"$"
VARNAME
<"void" "(" ")">, QName
OPERATOR
"(:"
pushState()
EXPR_COMMENT
<"element" "(">, <"attribute" "(">, <"schema-element" "(">, <"schema-attribute" "(">, <"comment" "(">, <"text" "(">, <"node" "(">, <"document-node" "(">
pushState(OCCURRENCEINDICATOR)
KINDTEST
<"processing-instruction" "(">
pushState(OCCURRENCEINDICATOR)
KINDTESTFORPI
QName, <"item" "(" ")">
OCCURRENCEINDICATOR
"(#"
PRAGMA
";", "then", "else"
DEFAULT
<"at" URILiteral>
NAMESPACEDECL
"external", "and", "at", ":=", ",", "div", "=", "except", "eq", "ge", "gt", "le", "lt", "ne", ">=", ">>", ">", "idiv", "intersect", "in", "is", "[", "(", "<=", "<<", "<", "-", "mod", "!=", <"order" "by">, <"stable" "order" "by">, "or", "return", "satisfies", "to", "union", "|", "where"
DEFAULT
<"castable" "as">, <"cast" "as">
SINGLETYPE
<"instance" "of">, <"treat" "as">, "case", "as", <")" "as">
ITEMTYPE

The KINDTEST State

This state is for the psuedo-parameters for the KindTest productions.

Pattern Transition To State
"{"
pushState(OPERATOR)
DEFAULT
")"
popState()
"*", QName
CLOSEKINDTEST
<"element" "(">, <"schema-element" "(">
pushState(KINDTEST)
KINDTEST
"(:"
pushState()
EXPR_COMMENT

The KINDTESTFORPI State

This state is similar to KINDTEST, but recognizes NCNames instead of QNames.

Pattern Transition To State
")"
popState()
"(:"
pushState()
EXPR_COMMENT
NCName, StringLiteral
KINDTESTFORPI

The CLOSEKINDTEST State

This state is expecting to close a KINDTEST sequence.

Pattern Transition To State
")"
popState()
","
KINDTEST
"{"
pushState(OPERATOR)
DEFAULT
"?"
CLOSEKINDTEST
"(:"
pushState()
EXPR_COMMENT

The OCCURRENCEINDICATOR State

This special state is needed to distinguish occurrence indicators that appear in the SequenceType production. For instance, compare "foo instance of baz*" to "baz*foo". In the first case, the "*" is interpreted as an occurrence indicator, and in the second case, it must be interpreted as a multiplication operator. But, when in the OCCURRENCEINDICATOR state, if anything else other than "?", "*", and "+", those symbols must be interpreted in the OPERATOR state. For instance, this would occur with the expression "foo instance of baz and $x", with the operator "and". This backing up of the lexical characters in order to reset the state, is symbolized by the notation "input_stream.backup(1)". NotOccurrenceIndicator is a special symbol for any character that is not an occurrence indicator.

Pattern Transition To State
[NotOccurrenceIndicator]
input_stream.backup(1)
OPERATOR
"?", "*", "+"
OPERATOR
"(:"
pushState()
EXPR_COMMENT

The OPTION State

This state is entered in the prolog for an option declaration, and recognizes a QName that transits to a DEFAULT state rather than a OPERATOR state.

Pattern Transition To State
QName
DEFAULT

The PRAGMA State

This state is entered in a a pragma expression, and recognizes a QName that transits to a PRAGMACONTENTS state rather than a OPERATOR state.

Pattern Transition To State
QName
PRAGMACONTENTS

The PRAGMACONTENTS State

This state recognizes characters in pragma content, and transits out of this state when a '#)' pattern is recognized.

Pattern Transition To State
"#)"
OPERATOR
S, Char
(maintain state)

The VARNAME State

This state differentiates variable names from qualified names. This allows only the pattern of a QName to be recognized when otherwise ambiguities could occur.

Pattern Transition To State
VarName
OPERATOR
"(:"
pushState()
EXPR_COMMENT

The START_TAG State

This state allows attributes in the native XML syntax, and marks the beginning of an element construction. Element constructors also push the current state, popping it at the conclusion of an end tag. In the START_TAG state, the string ">" is recognized as a token which is associated with the transition to the original state.

Pattern Transition To State
"/>"
popState()
">"
ELEMENT_CONTENT
'"'
QUOT_ATTRIBUTE_CONTENT
"'"
APOS_ATTRIBUTE_CONTENT
"="
START_TAG
S, QName
(maintain state)

The ELEMENT_CONTENT State

This state allows XML-like content, without these characters being misinterpreted as expressions. The character "{" marks a transition to the DEFAULT state, i.e. the start of an embedded expression, and the "}" character pops back to the ELEMENT_CONTENT state. To allow curly braces to be used as character content, a double left or right curly brace is interpreted as a single curly brace character. The string "</" is interpreted as the beginning of an end tag, which is associated with a transition to the END_TAG state.

Pattern Transition To State
"</"
END_TAG
"{"
pushState()
DEFAULT
"<!--"
pushState()
XML_COMMENT
"<?"
pushState()
PROCESSING_INSTRUCTION
"<![CDATA["
pushState()
CDATA_SECTION
"<"
pushState()
START_TAG
ElementContentChar
ELEMENT_CONTENT
PredefinedEntityRef, CharRef, "{{", "}}"
(maintain state)

The END_TAG State

When the end tag is terminated, the state is popped to the state that was pushed at the start of the corresponding start tag.

Pattern Transition To State
">"
popState()
S, QName
(maintain state)

The XML_COMMENT State

The "<--" token marks the beginning of an XML Comment, and the "-->" token marks the end. This allows no special interpretation of other characters in this state.

Pattern Transition To State
"-->"
popState()
(Char - '-'), <'-' (Char - '-')>
XML_COMMENT

The EXPR_COMMENT State

The "(:" token marks the beginning of an expression Comment, and the ":)" token marks the end. This allows no special interpretation of other characters in this state.

Pattern Transition To State
":)"
popState()
"(:"
pushState()
EXPR_COMMENT
Char
(maintain state)

The PROCESSING_INSTRUCTION State

In this state, only patterns that are legal in a processing instruction name are recognized.

Pattern Transition To State
S
PROCESSING_INSTRUCTION_CONTENT
"?>"
popState()
PITarget
PROCESSING_INSTRUCTION

The PROCESSING_INSTRUCTION_CONTENT State

In this state, only characters are that are legal in processing instruction content are recognized.

Pattern Transition To State
"?>"
popState()
Char
PROCESSING_INSTRUCTION_CONTENT

The CDATA_SECTION State

In this state, only lexemes that are legal in a CDATA section are recognized.

Pattern Transition To State
"]]>"
popState()
Char
CDATA_SECTION

The QUOT_ATTRIBUTE_CONTENT State

This state allows content legal for attributes. The character "{" marks a transition to the DEFAULT state, i.e. the start of an embedded expression, and the "}" character pops back to the original state. To allow curly braces to be used as character content, a double left or right curly brace is interpreted as a single curly brace character. This state is the same as APOS_ATTRIBUTE_CONTENT, except that apostrophes are allowed without escaping, and an unescaped quote marks the end of the state.

Pattern Transition To State
'"'
START_TAG
"{"
pushState()
DEFAULT
EscapeQuot, QuotAttrContentChar
QUOT_ATTRIBUTE_CONTENT
PredefinedEntityRef, CharRef, "{{", "}}"
(maintain state)

The APOS_ATTRIBUTE_CONTENT State

This state is the same as QUOT_ATTRIBUTE_CONTENT, except that quotes are allowed, and an unescaped apostrophe marks the end of the state.

Pattern Transition To State
"'"
START_TAG
"{"
pushState()
DEFAULT
EscapeApos, AposAttrContentChar
APOS_ATTRIBUTE_CONTENT
PredefinedEntityRef, CharRef, "{{", "}}"
(maintain state)

2.1.2 XPath Lexical States

The DEFAULT State

This state is for patterns that occur at the beginning of an expression or subexpression.

Pattern Transition To State
DecimalLiteral, "..", ".", DoubleLiteral, IntegerLiteral, NotNumber, <NCName ":" "*">, QName, ")", <"*" ":" NCName>, "*", StringLiteral
OPERATOR
"$", <"for" "$">, <"some" "$">, <"every" "$">
VARNAME
<"element" "(">, <"attribute" "(">, <"schema-element" "(">, <"schema-attribute" "(">, <"comment" "(">, <"text" "(">, <"node" "(">, <"document-node" "(">
pushState(OPERATOR)
KINDTEST
<"processing-instruction" "(">
pushState(OPERATOR)
KINDTESTFORPI
"(:"
pushState()
EXPR_COMMENT
",", "(", <QName "(">, <"if" "(">, "-", "+", "//", "/", <"ancestor-or-self" "::">, <"ancestor" "::">, <"attribute" "::">, <"child" "::">, <"descendant-or-self" "::">, <"descendant" "::">, <"following-sibling" "::">, <"following" "::">, <"namespace" "::">, <"parent" "::">, <"preceding-sibling" "::">, <"preceding" "::">, <"self" "::">, "@"
DEFAULT

The OPERATOR State

This state is for patterns that are defined for operators.

Pattern Transition To State
"then", "else", "and", ",", "div", "=", "except", "eq", "ge", "gt", "le", "lt", "ne", ">=", ">>", ">", "idiv", "intersect", "in", "is", "[", "<=", "<<", "<", "-", "mod", "*", "!=", "or", "+", "return", "satisfies", "//", "/", "to", "union", "|"
DEFAULT
<"castable" "as">, <"cast" "as">
SINGLETYPE
<"instance" "of">, <"treat" "as">
ITEMTYPE
"$", <"for" "$">
VARNAME
"(:"
pushState()
EXPR_COMMENT
")", "?", "]"
OPERATOR
StringLiteral, NotOperatorKeyword
(maintain state)

The SINGLETYPE State

This state distinguishes tokens that can occur only inside the SingleType production.

Pattern Transition To State
QName
OPERATOR
"(:"
pushState()
EXPR_COMMENT

The ITEMTYPE State

This state distinguishes tokens that can occur only inside the ItemType production.

Pattern Transition To State
<"void" "(" ")">, QName
OPERATOR
"(:"
pushState()
EXPR_COMMENT
<"element" "(">, <"attribute" "(">, <"schema-element" "(">, <"schema-attribute" "(">, <"comment" "(">, <"text" "(">, <"node" "(">, <"document-node" "(">
pushState(OCCURRENCEINDICATOR)
KINDTEST
<"processing-instruction" "(">
pushState(OCCURRENCEINDICATOR)
KINDTESTFORPI
QName, <"item" "(" ")">
OCCURRENCEINDICATOR
"then", "else"
DEFAULT
"and", ",", "div", "=", "except", "eq", "ge", "gt", "le", "lt", "ne", ">=", ">>", ">", "idiv", "intersect", "in", "is", "[", "(", "<=", "<<", "<", "-", "mod", "!=", "or", "return", "satisfies", "to", "union", "|"
DEFAULT
<"castable" "as">, <"cast" "as">
SINGLETYPE
<"instance" "of">, <"treat" "as">
ITEMTYPE

The KINDTEST State

This state is for the psuedo-parameters for the KindTest productions.

Pattern Transition To State
")"
popState()
"*", QName
CLOSEKINDTEST
<"element" "(">, <"schema-element" "(">
pushState(KINDTEST)
KINDTEST
"(:"
pushState()
EXPR_COMMENT

The KINDTESTFORPI State

This state is similar to KINDTEST, but recognizes NCNames instead of QNames.

Pattern Transition To State
")"
popState()
"(:"
pushState()
EXPR_COMMENT
NCName, StringLiteral
KINDTESTFORPI

The CLOSEKINDTEST State

This state is expecting to close a KINDTEST sequence.

Pattern Transition To State
")"
popState()
","
KINDTEST
"?"
CLOSEKINDTEST
"(:"
pushState()
EXPR_COMMENT

The OCCURRENCEINDICATOR State

This special state is needed to distinguish occurrence indicators that appear in the SequenceType production. For instance, compare "foo instance of baz*" to "baz*foo". In the first case, the "*" is interpreted as an occurrence indicator, and in the second case, it must be interpreted as a multiplication operator. But, when in the OCCURRENCEINDICATOR state, if anything else other than "?", "*", and "+", those symbols must be interpreted in the OPERATOR state. For instance, this would occur with the expression "foo instance of baz and $x", with the operator "and". This backing up of the lexical characters in order to reset the state, is symbolized by the notation "input_stream.backup(1)". NotOccurrenceIndicator is a special symbol for any character that is not an occurrence indicator.

Pattern Transition To State
[NotOccurrenceIndicator]
input_stream.backup(1)
OPERATOR
"?", "*", "+"
OPERATOR
"(:"
pushState()
EXPR_COMMENT

The VARNAME State

This state differentiates variable names from qualified names. This allows only the pattern of a QName to be recognized when otherwise ambiguities could occur.

Pattern Transition To State
VarName
OPERATOR
"(:"
pushState()
EXPR_COMMENT

The EXPR_COMMENT State

The "(:" token marks the beginning of an expression Comment, and the ":)" token marks the end. This allows no special interpretation of other characters in this state.

Pattern Transition To State
":)"
popState()
"(:"
pushState()
EXPR_COMMENT
Char
(maintain state)

A XPath/XQuery Grammar

A.1 EBNF

For convenience, the basic EBNF of the XPath and XQuery grammars are reproduced here. For further details please consult the respective specifications.

The following grammars use the same simple Extended Backus-Naur Form (EBNF) notation as [XML 1.0] with the following minor differences. The notation "< ... >" is used to indicate a grouping of terminals that together may help disambiguate the individual symbols. To help readability, this "< ... >" notation is absent in the EBNF in the main body of this document. For further details concerning the EBNF notation used, see Section A.1 EBNFXQ in [XQuery 1.0: An XML Query Language] or Section A.1 EBNFXP in [XML Path Language (XPath) 2.0].

Comments on grammar productions are between '/*' and '*/' symbols. A 'gn:' prefix means a 'Grammar Note', and is meant as a clarification for parsing rules, and are explained in Section A.1.1 Grammar NotesXQ in [XQuery 1.0: An XML Query Language] or Section A.1.1 Grammar NotesXP in [XML Path Language (XPath) 2.0]. A 'ws:' prefix explains the whitespace rules for the production, the details of which are explained in Section A.2.2 Whitespace RulesXQ in [XQuery 1.0: An XML Query Language] or Section A.2.2 Whitespace RulesXP in [XML Path Language (XPath) 2.0].

A.1.1 XPath EBNF

[1]    XPath    ::=    Expr
[2]    Expr    ::=    ExprSingle ("," ExprSingle)*
[3]    ExprSingle    ::=    ForExpr
| QuantifiedExpr
| IfExpr
| OrExpr
[4]    ForExpr    ::=    SimpleForClause "return" ExprSingle
[5]    SimpleForClause    ::=    <"for" "$"> VarName "in" ExprSingle ("," "$" VarName "in" ExprSingle)*
[6]    QuantifiedExpr    ::=    (<"some" "$"> | <"every" "$">) VarName "in" ExprSingle ("," "$" VarName "in" ExprSingle)* "satisfies" ExprSingle
[7]    IfExpr    ::=    <"if" "("> Expr ")" "then" ExprSingle "else" ExprSingle
[8]    OrExpr    ::=    AndExpr ( "or" AndExpr )*
[9]    AndExpr    ::=    ComparisonExpr ( "and" ComparisonExpr )*
[10]    ComparisonExpr    ::=    RangeExpr ( (ValueComp
| GeneralComp
| NodeComp) RangeExpr )?
[11]    RangeExpr    ::=    AdditiveExpr ( "to" AdditiveExpr )?
[12]    AdditiveExpr    ::=    MultiplicativeExpr ( ("+" | "-") MultiplicativeExpr )*
[13]    MultiplicativeExpr    ::=    UnionExpr ( ("*" | "div" | "idiv" | "mod") UnionExpr )*
[14]    UnionExpr    ::=    IntersectExceptExpr ( ("union" | "|") IntersectExceptExpr )*
[15]    IntersectExceptExpr    ::=    InstanceofExpr ( ("intersect" | "except") InstanceofExpr )*
[16]    InstanceofExpr    ::=    TreatExpr ( <"instance" "of"> SequenceType )?
[17]    TreatExpr    ::=    CastableExpr ( <"treat" "as"> SequenceType )?
[18]    CastableExpr    ::=    CastExpr ( <"castable" "as"> SingleType )?
[19]    CastExpr    ::=    UnaryExpr ( <"cast" "as"> SingleType )?
[20]    UnaryExpr    ::=    ("-" | "+")* ValueExpr
[21]    ValueExpr    ::=    PathExpr
[22]    GeneralComp    ::=    "=" | "!=" | "<" | "<=" | ">" | ">=" /* gn: ltXP */
[23]    ValueComp    ::=    "eq" | "ne" | "lt" | "le" | "gt" | "ge"
[24]    NodeComp    ::=    "is" | "<<" | ">>"
[25]    PathExpr    ::=    ("/" RelativePathExpr?)
| ("//" RelativePathExpr)
| RelativePathExpr
/* gn: leading-lone-slashXP */
[26]    RelativePathExpr    ::=    StepExpr (("/" | "//") StepExpr)*
[27]    StepExpr    ::=    AxisStep | FilterExpr
[28]    AxisStep    ::=    (ForwardStep | ReverseStep) PredicateList
[29]    ForwardStep    ::=    (ForwardAxis NodeTest) | AbbrevForwardStep
[30]    ForwardAxis    ::=    <"child" "::">
| <"descendant" "::">
| <"attribute" "::">
| <"self" "::">
| <"descendant-or-self" "::">
| <"following-sibling" "::">
| <"following" "::">
| <"namespace" "::">
[31]    AbbrevForwardStep    ::=    "@"? NodeTest
[32]    ReverseStep    ::=    (ReverseAxis NodeTest) | AbbrevReverseStep
[33]    ReverseAxis    ::=    <"parent" "::">
| <"ancestor" "::">
| <"preceding-sibling" "::">
| <"preceding" "::">
| <"ancestor-or-self" "::">
[34]    AbbrevReverseStep    ::=    ".."
[35]    NodeTest    ::=    KindTest | NameTest
[36]    NameTest    ::=    QName | Wildcard
[37]    Wildcard    ::=    "*"
| <NCName ":" "*">
| <"*" ":" NCName>
/* ws: explicitXP */
[38]    FilterExpr    ::=    PrimaryExpr PredicateList
[39]    PredicateList    ::=    Predicate*
[40]    Predicate    ::=    "[" Expr "]"
[41]    PrimaryExpr    ::=    Literal | VarRef | ParenthesizedExpr | ContextItemExpr | FunctionCall
[42]    Literal    ::=    NumericLiteral | StringLiteral
[43]    NumericLiteral    ::=    IntegerLiteral | DecimalLiteral | DoubleLiteral
[44]    VarRef    ::=    "$" VarName
[45]    ParenthesizedExpr    ::=    "(" Expr? ")"
[46]    ContextItemExpr    ::=    "."
[47]    FunctionCall    ::=    <QName "("> (ExprSingle ("," ExprSingle)*)? ")" /* gn: parensXP */
/* gn: reserved-function-namesXP */
[48]    SingleType    ::=    AtomicType "?"?
[49]    SequenceType    ::=    (ItemType OccurrenceIndicator?)
| <"void" "(" ")">
[50]    OccurrenceIndicator    ::=    "?" | "*" | "+" /* gn: occurrence-indicatorsXP */
[51]    ItemType    ::=    AtomicType | KindTest | <"item" "(" ")">
[52]    AtomicType    ::=    QName
[53]    KindTest    ::=    DocumentTest
| ElementTest
| AttributeTest
| SchemaElementTest
| SchemaAttributeTest
| PITest
| CommentTest
| TextTest
| AnyKindTest
[54]    AnyKindTest    ::=    <"node" "("> ")"
[55]    DocumentTest    ::=    <"document-node" "("> (ElementTest | SchemaElementTest)? ")"
[56]    TextTest    ::=    <"text" "("> ")"
[57]    CommentTest    ::=    <"comment" "("> ")"
[58]    PITest    ::=    <"processing-instruction" "("> (NCName | StringLiteral)? ")"
[59]    AttributeTest    ::=    <"attribute" "("> (AttribNameOrWildcard ("," TypeName)?)? ")"
[60]    AttribNameOrWildcard    ::=    AttributeName | "*"
[61]    SchemaAttributeTest    ::=    <"schema-attribute" "("> AttributeDeclaration ")"
[62]    AttributeDeclaration    ::=    AttributeName
[63]    ElementTest    ::=    <"element" "("> (ElementNameOrWildcard ("," TypeName "?"?)?)? ")"
[64]    ElementNameOrWildcard    ::=    ElementName | "*"
[65]    SchemaElementTest    ::=    <"schema-element" "("> ElementDeclaration ")"
[66]    ElementDeclaration    ::=    ElementName
[67]    AttributeName    ::=    QName
[68]    ElementName    ::=    QName
[69]    TypeName    ::=    QName
[70]    IntegerLiteral    ::=    Digits
[71]    DecimalLiteral    ::=    ("." Digits) | (Digits "." [0-9]*) /* ws: explicitXP */
[72]    DoubleLiteral    ::=    (("." Digits) | (Digits ("." [0-9]*)?)) [eE] [+-]? Digits /* ws: explicitXP */
[73]    StringLiteral    ::=    ('"' (('"' '"') | [^"])* '"') | ("'" (("'" "'") | [^'])* "'") /* ws: explicitXP */
[74]    VarName    ::=    QName
[75]    Digits    ::=    [0-9]+
[76]    Comment    ::=    "(:" (CommentContents | Comment)* ":)" /* ws: explicitXP */
/* gn: commentsXP */
[77]    CommentContents    ::=    (Char+ - (Char* ':)' Char*))
[78]    QName    ::=    [http://www.w3.org/TR/REC-xml-names/#NT-QName]Names /* gn: xml-versionXP */
[79]    NCName    ::=    [http://www.w3.org/TR/REC-xml-names/#NT-NCName]Names /* gn: xml-versionXP */
[80]    Char    ::=    [http://www.w3.org/TR/REC-xml#NT-Char]XML /* gn: xml-versionXP */

A.1.2 XQuery EBNF

[1]    Module    ::=    VersionDecl? (MainModule | LibraryModule)
[2]    VersionDecl    ::=    <"xquery" "version"> StringLiteral ("encoding" StringLiteral)? Separator
[3]    MainModule    ::=    Prolog QueryBody
[4]    LibraryModule    ::=    ModuleDecl Prolog
[5]    ModuleDecl    ::=    <"module" "namespace"> NCName "=" URILiteral Separator
[6]    Prolog    ::=    ((Setter | Import | NamespaceDecl | DefaultNamespaceDecl) Separator)* ((VarDecl | FunctionDecl | OptionDecl) Separator)*
[7]    Setter    ::=    BoundarySpaceDecl | DefaultCollationDecl | BaseURIDecl | ConstructionDecl | OrderingModeDecl | EmptyOrderDecl | CopyNamespacesDecl
[8]    Import    ::=    SchemaImport | ModuleImport
[9]    Separator    ::=    ";"
[10]    NamespaceDecl    ::=    <"declare" "namespace"> NCName "=" URILiteral
[11]    BoundarySpaceDecl    ::=    <"declare" "boundary-space"> ("preserve" | "strip")
[12]    DefaultNamespaceDecl    ::=    (<"declare" "default" "element"> | <"declare" "default" "function">) "namespace" URILiteral
[13]    OptionDecl    ::=    <"declare" "option"> QName StringLiteral
[14]    OrderingModeDecl    ::=    <"declare" "ordering"> ("ordered" | "unordered")
[15]    EmptyOrderDecl    ::=    <"declare" "default" "order"> (<"empty" "greatest"> | <"empty" "least">)
[16]    CopyNamespacesDecl    ::=    <"declare" "copy-namespaces"> PreserveMode "," InheritMode
[17]    PreserveMode    ::=    "preserve" | "no-preserve"
[18]    InheritMode    ::=    "inherit" | "no-inherit"
[19]    DefaultCollationDecl    ::=    <"declare" "default" "collation"> URILiteral
[20]    BaseURIDecl    ::=    <"declare" "base-uri"> URILiteral
[21]    SchemaImport    ::=    <"import" "schema"> SchemaPrefix? URILiteral (<"at" URILiteral> ("," URILiteral)*)?
[22]    SchemaPrefix    ::=    ("namespace" NCName "=") | (<"default" "element"> "namespace")
[23]    ModuleImport    ::=    <"import" "module"> ("namespace" NCName "=")? URILiteral (<"at" URILiteral> ("," URILiteral)*)?
[24]    VarDecl    ::=    <"declare" "variable" "$"> VarName TypeDeclaration? ((":=" ExprSingle) | "external")
[25]    ConstructionDecl    ::=    <"declare" "construction"> ("preserve" | "strip")
[26]    FunctionDecl    ::=    <"declare" "function"> <QName "("> ParamList? (")" | (<")" "as"> SequenceType)) (EnclosedExpr | "external")
[27]    ParamList    ::=    Param ("," Param)*
[28]    Param    ::=    "$" VarName TypeDeclaration?
[29]    EnclosedExpr    ::=    "{" Expr "}"
[30]    QueryBody    ::=    Expr
[31]    Expr    ::=    ExprSingle ("," ExprSingle)*
[32]    ExprSingle    ::=    FLWORExpr
| QuantifiedExpr
| TypeswitchExpr
| IfExpr
| OrExpr
[33]    FLWORExpr    ::=    (ForClause | LetClause)+ WhereClause? OrderByClause? "return" ExprSingle
[34]    ForClause    ::=    <"for" "$"> VarName TypeDeclaration? PositionalVar? "in" ExprSingle ("," "$" VarName TypeDeclaration? PositionalVar? "in" ExprSingle)*
[35]    PositionalVar    ::=    "at" "$" VarName
[36]    LetClause    ::=    <"let" "$"> VarName TypeDeclaration? ":=" ExprSingle ("," "$" VarName TypeDeclaration? ":=" ExprSingle)*
[37]    WhereClause    ::=    "where" ExprSingle
[38]    OrderByClause    ::=    (<"order" "by"> | <"stable" "order" "by">) OrderSpecList
[39]    OrderSpecList    ::=    OrderSpec ("," OrderSpec)*
[40]    OrderSpec    ::=    ExprSingle OrderModifier
[41]    OrderModifier    ::=    ("ascending" | "descending")? (<"empty" "greatest"> | <"empty" "least">)? ("collation" URILiteral)?
[42]    QuantifiedExpr    ::=    (<"some" "$"> | <"every" "$">) VarName TypeDeclaration? "in" ExprSingle ("," "$" VarName TypeDeclaration? "in" ExprSingle)* "satisfies" ExprSingle
[43]    TypeswitchExpr    ::=    <"typeswitch" "("> Expr ")" CaseClause+ "default" ("$" VarName)? "return" ExprSingle
[44]    CaseClause    ::=    "case" ("$" VarName "as")? SequenceType "return" ExprSingle
[45]    IfExpr    ::=    <"if" "("> Expr ")" "then" ExprSingle "else" ExprSingle
[46]    OrExpr    ::=    AndExpr ( "or" AndExpr )*
[47]    AndExpr    ::=    ComparisonExpr ( "and" ComparisonExpr )*
[48]    ComparisonExpr    ::=    RangeExpr ( (ValueComp
| GeneralComp
| NodeComp) RangeExpr )?
[49]    RangeExpr    ::=    AdditiveExpr ( "to" AdditiveExpr )?
[50]    AdditiveExpr    ::=    MultiplicativeExpr ( ("+" | "-") MultiplicativeExpr )*
[51]    MultiplicativeExpr    ::=    UnionExpr ( ("*" | "div" | "idiv" | "mod") UnionExpr )*
[52]    UnionExpr    ::=    IntersectExceptExpr ( ("union" | "|") IntersectExceptExpr )*
[53]    IntersectExceptExpr    ::=    InstanceofExpr ( ("intersect" | "except") InstanceofExpr )*
[54]    InstanceofExpr    ::=    TreatExpr ( <"instance" "of"> SequenceType )?
[55]    TreatExpr    ::=    CastableExpr ( <"treat" "as"> SequenceType )?
[56]    CastableExpr    ::=    CastExpr ( <"castable" "as"> SingleType )?
[57]    CastExpr    ::=    UnaryExpr ( <"cast" "as"> SingleType )?
[58]    UnaryExpr    ::=    ("-" | "+")* ValueExpr
[59]    ValueExpr    ::=    ValidateExpr | PathExpr | ExtensionExpr
[60]    GeneralComp    ::=    "=" | "!=" | "<" | "<=" | ">" | ">=" /* gn: ltXQ */
[61]    ValueComp    ::=    "eq" | "ne" | "lt" | "le" | "gt" | "ge"
[62]    NodeComp    ::=    "is" | "<<" | ">>"
[63]    ValidateExpr    ::=    (<"validate" "{"> | (<"validate" ValidationMode> "{")) Expr "}"
[64]    ExtensionExpr    ::=    Pragma+ "{" Expr? "}"
[65]    Pragma    ::=    "(#" S? QName PragmaContents "#)" /* ws: explicitXQ */
[66]    PragmaContents    ::=    (Char* - (Char* '#)' Char*))
[67]    PathExpr    ::=    ("/" RelativePathExpr?)
| ("//" RelativePathExpr)
| RelativePathExpr
/* gn: leading-lone-slashXQ */
[68]    RelativePathExpr    ::=    StepExpr (("/" | "//") StepExpr)*
[69]    StepExpr    ::=    AxisStep | FilterExpr
[70]    AxisStep    ::=    (ForwardStep | ReverseStep) PredicateList
[71]    ForwardStep    ::=    (ForwardAxis NodeTest) | AbbrevForwardStep
[72]    ForwardAxis    ::=    <"child" "::">
| <"descendant" "::">
| <"attribute" "::">
| <"self" "::">
| <"descendant-or-self" "::">
| <"following-sibling" "::">
| <"following" "::">
[73]    AbbrevForwardStep    ::=    "@"? NodeTest
[74]    ReverseStep    ::=    (ReverseAxis NodeTest) | AbbrevReverseStep
[75]    ReverseAxis    ::=    <"parent" "::">
| <"ancestor" "::">
| <"preceding-sibling" "::">
| <"preceding" "::">
| <"ancestor-or-self" "::">
[76]    AbbrevReverseStep    ::=    ".."
[77]    NodeTest    ::=    KindTest | NameTest
[78]    NameTest    ::=    QName | Wildcard
[79]    Wildcard    ::=    "*"
| <NCName ":" "*">
| <"*" ":" NCName>
/* ws: explicitXQ */
[80]    FilterExpr    ::=    PrimaryExpr PredicateList
[81]    PredicateList    ::=    Predicate*
[82]    Predicate    ::=    "[" Expr "]"
[83]    PrimaryExpr    ::=    Literal | VarRef | ParenthesizedExpr | ContextItemExpr | FunctionCall | Constructor | OrderedExpr | UnorderedExpr
[84]    Literal    ::=    NumericLiteral | StringLiteral
[85]    NumericLiteral    ::=    IntegerLiteral | DecimalLiteral | DoubleLiteral
[86]    VarRef    ::=    "$" VarName
[87]    ParenthesizedExpr    ::=    "(" Expr? ")"
[88]    ContextItemExpr    ::=    "."
[89]    OrderedExpr    ::=    <"ordered" "{"> Expr "}"
[90]    UnorderedExpr    ::=    <"unordered" "{"> Expr "}"
[91]    FunctionCall    ::=    <QName "("> (ExprSingle ("," ExprSingle)*)? ")" /* gn: parensXQ */
/* gn: reserved-function-namesXQ */
[92]    Constructor    ::=    DirectConstructor
| ComputedConstructor
[93]    DirectConstructor    ::=    DirElemConstructor
| DirCommentConstructor
| DirPIConstructor
[94]    DirElemConstructor    ::=    "<" QName DirAttributeList ("/>" | (">" DirElemContent* "</" QName S? ">")) /* ws: explicitXQ */
/* gn: ltXQ */
[95]    DirAttributeList    ::=    (S (QName S? "=" S? DirAttributeValue)?)* /* ws: explicitXQ */
[96]    DirAttributeValue    ::=    ('"' (EscapeQuot | QuotAttrValueContent)* '"')
| ("'" (EscapeApos | AposAttrValueContent)* "'")
/* ws: explicitXQ */
[97]    QuotAttrValueContent    ::=    QuotAttrContentChar
| CommonContent
[98]    AposAttrValueContent    ::=    AposAttrContentChar
| CommonContent
[99]    DirElemContent    ::=    DirectConstructor
| ElementContentChar
| CDataSection
| CommonContent
[100]    CommonContent    ::=    PredefinedEntityRef | CharRef | "{{" | "}}" | EnclosedExpr
[101]    DirCommentConstructor    ::=    "<!--" DirCommentContents "-->" /* ws: explicitXQ */
[102]    DirCommentContents    ::=    ((Char - '-') | <'-' (Char - '-')>)* /* ws: explicitXQ */
[103]    DirPIConstructor    ::=    "<?" PITarget (S DirPIContents)? "?>" /* ws: explicitXQ */
[104]    DirPIContents    ::=    (Char* - (Char* '?>' Char*)) /* ws: explicitXQ */
[105]    CDataSection    ::=    "<![CDATA[" CDataSectionContents "]]>" /* ws: explicitXQ */
[106]    CDataSectionContents    ::=    (Char* - (Char* ']]>' Char*)) /* ws: explicitXQ */
[107]    ComputedConstructor    ::=    CompDocConstructor
| CompElemConstructor
| CompAttrConstructor
| CompTextConstructor
| CompCommentConstructor
| CompPIConstructor
[108]    CompDocConstructor    ::=    <"document" "{"> Expr "}"
[109]    CompElemConstructor    ::=    (<"element" QName "{"> | (<"element" "{"> Expr "}" "{")) ContentExpr? "}"
[110]    ContentExpr    ::=    Expr
[111]    CompAttrConstructor    ::=    (<"attribute" QName "{"> | (<"attribute" "{"> Expr "}" "{")) Expr? "}"
[112]    CompTextConstructor    ::=    <"text" "{"> Expr "}"
[113]    CompCommentConstructor    ::=    <"comment" "{"> Expr "}"
[114]    CompPIConstructor    ::=    (<"processing-instruction" NCName "{"> | (<"processing-instruction" "{"> Expr "}" "{")) Expr? "}"
[115]    SingleType    ::=    AtomicType "?"?
[116]    TypeDeclaration    ::=    "as" SequenceType
[117]    SequenceType    ::=    (ItemType OccurrenceIndicator?)
| <"void" "(" ")">
[118]    OccurrenceIndicator    ::=    "?" | "*" | "+" /* gn: occurrence-indicatorsXQ */
[119]    ItemType    ::=    AtomicType | KindTest | <"item" "(" ")">
[120]    AtomicType    ::=    QName
[121]    KindTest    ::=    DocumentTest
| ElementTest
| AttributeTest
| SchemaElementTest
| SchemaAttributeTest
| PITest
| CommentTest
| TextTest
| AnyKindTest
[122]    AnyKindTest    ::=    <"node" "("> ")"
[123]    DocumentTest    ::=    <"document-node" "("> (ElementTest | SchemaElementTest)? ")"
[124]    TextTest    ::=    <"text" "("> ")"
[125]    CommentTest    ::=    <"comment" "("> ")"
[126]    PITest    ::=    <"processing-instruction" "("> (NCName | StringLiteral)? ")"
[127]    AttributeTest    ::=    <"attribute" "("> (AttribNameOrWildcard ("," TypeName)?)? ")"
[128]    AttribNameOrWildcard    ::=    AttributeName | "*"
[129]    SchemaAttributeTest    ::=    <"schema-attribute" "("> AttributeDeclaration ")"
[130]    AttributeDeclaration    ::=    AttributeName
[131]    ElementTest    ::=    <"element" "("> (ElementNameOrWildcard ("," TypeName "?"?)?)? ")"
[132]    ElementNameOrWildcard    ::=    ElementName | "*"
[133]    SchemaElementTest    ::=    <"schema-element" "("> ElementDeclaration ")"
[134]    ElementDeclaration    ::=    ElementName
[135]    AttributeName    ::=    QName
[136]    ElementName    ::=    QName
[137]    TypeName    ::=    QName
[138]    IntegerLiteral    ::=    Digits
[139]    DecimalLiteral    ::=    ("." Digits) | (Digits "." [0-9]*) /* ws: explicitXQ */
[140]    DoubleLiteral    ::=    (("." Digits) | (Digits ("." [0-9]*)?)) [eE] [+-]? Digits /* ws: explicitXQ */
[141]    URILiteral    ::=    StringLiteral
[142]    StringLiteral    ::=    ('"' (PredefinedEntityRef | CharRef | ('"' '"') | [^"&])* '"') | ("'" (PredefinedEntityRef | CharRef | ("'" "'") | [^'&])* "'") /* ws: explicitXQ */
[143]    PITarget    ::=    [http://www.w3.org/TR/REC-xml#NT-PITarget]XML /* gn: xml-versionXQ */
[144]    VarName    ::=    QName
[145]    ValidationMode    ::=    "lax" | "strict"
[146]    Digits    ::=    [0-9]+
[147]    PredefinedEntityRef    ::=    "&" ("lt" | "gt" | "amp" | "quot" | "apos") ";" /* ws: explicitXQ */
[148]    CharRef    ::=    [http://www.w3.org/TR/REC-xml#NT-CharRef]XML /* gn: xml-versionXQ */
[149]    EscapeQuot    ::=    '""'
[150]    EscapeApos    ::=    "''"
[151]    ElementContentChar    ::=    Char - [{}<&]
[152]    QuotAttrContentChar    ::=    Char - ["{}<&]
[153]    AposAttrContentChar    ::=    Char - ['{}<&]
[154]    Comment    ::=    "(:" (CommentContents | Comment)* ":)" /* ws: explicitXQ */
/* gn: commentsXQ */
[155]    CommentContents    ::=    (Char+ - (Char* ':)' Char*))
[156]    QName    ::=    [http://www.w3.org/TR/REC-xml-names/#NT-QName]Names /* gn: xml-versionXQ */
[157]    NCName    ::=    [http://www.w3.org/TR/REC-xml-names/#NT-NCName]Names /* gn: xml-versionXQ */
[158]    S    ::=    [http://www.w3.org/TR/REC-xml#NT-S]XML /* gn: xml-versionXQ */
[159]    Char    ::=    [http://www.w3.org/TR/REC-xml#NT-Char]XML /* gn: xml-versionXQ */

B References

B.1 Main References

XML 1.0
World Wide Web Consortium. Extensible Markup Language (XML) 1.0. W3C Recommendation. See http://www.w3.org/TR/2000/REC-xml-20001006
XML Path Language (XPath) 2.0
XML Path Language (XPath) 2.0, Anders Berglund, Mary F. Fernández, Scott Boag, et. al., Editors. World Wide Web Consortium, 4 Apr 2005. This version is http://www.w3.org/TR/2005/WD-xpath20-20050404. The latest version is available at http://www.w3.org/TR/xpath20.
XQuery 1.0: An XML Query Language
XQuery 1.0: An XML Query Language, Scott Boag, Don Chamberlin, Mary F. Fernández, et. al., Editors. World Wide Web Consortium, 4 Apr 2005. This version is http://www.w3.org/TR/2005/WD-xquery-20050404/. The latest version is available at http://www.w3.org/TR/xquery.

C Glossary (Non-Normative)

fuzzy token scanner

Fuzzy token scanner. Using this strategy, a typical simple lexical scanner is implemented, but the meaning of the tokens are often left somewhat ambiguous. Thus, the word div might be assigned the symbol QNameOrKeyword. The words can be disambiguated during the context sensitive parse process. This approach would probably be a hand-constructed, Top-down parser.

LALR Parser

LALR Parser (Look Ahead Left to right, Rightmost derivation, hence LALR).

lexical action

A lexical action is an action that occurs as an side-effect of a pattern recognition.

lexical scanner

Lexical scanner, which assigns symbols known as tokens to units in the input, in essence recognizing the basic words and punctuation of the language. Typically, the lexer uses regular expressions as a specification for how to do this tokenization. The rules that govern the lexical structure of the language, sometimes known as microsyntax, should be simple, regular, and as context-free as possible.

lexical state

A lexical state is a condition where a defined set of patterns are recognized.

LL Parser

LL Parser (Left to right, Leftmost derivation, hence LL).

long token

Long tokens. Using this approach, declare namespace would be considered a single token.

parser

Parser, which arranges the tokens into a hierarchical structure. Typically, the parser uses a BNF or variant description (EBNF in the case of XPath/XQuery) as a specification for this parsing.

Scan-while-parse scanner

Scan-while-parse scanner. Using this strategy, the parser would evaluate the character stream into tokens, while at the same time branching to the respective hierarchical context. Thus the words can be discovered in context. Complex parsers are typically not built in this fashion, but, given the issue of context sensitivity, this may be a reasonable approach. This would probably be implemented as a hand-constructed recursive-descent parser.

short token

Short tokens. Using this approach, declare namespace would be considered two tokens, the exact symbols assigned based on lexical look-ahead (and look-behind).

state-driven scanner

State-driven scanner. This approach recognizes sets of words only in certain states, each word or punctuation possibly transitioning to a next state. In other words, it implements lexical recognition as transition tables for a push-down automata.

two-pass scanner

Two-pass scanner. This is the same as the Fuzzy token scanner, but the ambiguous words can be disambiguated by a second scan, using some rules (probably expressed as a context-free grammar) that reflect the basic structure of the EBNF (call this the three-pass approach). The three-pass approach could conceivably be implemented via a parser constructor program, with the addition of a custom disambiguator in between the lexer and parser. This method could be used for either Top-down or Bottom-up techniques.


End Notes

[1]

XQuery does not include the ForExpr production and <"namespace" "::"> axis.