A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO10646] (see also [ISO10646-2000]). Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646, as long as these characters are legal XML characters as defined in the [XML] recommendation.
A lexical pattern is a rule that describes how a sequence of characters can match a grammar unit. A lexeme is the smallest meaningful unit in the grammar that has syntactic interpretation. A token is a symbol that matches lexemes, and is the output of the lexical analyzer. A token symbol is the symbolic name given to that token. A single token may be composed of one or more lexemes. If there is more than one lexeme, they may be separated by whitespace or punctuation. For instance, the token AxisDescendantOrSelf has two lexemes, "descendant-or-self" and "::".
| Pattern | Lexeme(s) | Token Names (for example) |
|---|---|---|
| "or" | "or" | Or |
| "=" | "=" | Equals |
| (Prefix ':')? LocalPart | "p:foo" | QName |
| <"descendant-or-self" "::"> | "descendant-or-self" "::" | AxisDescendantOrSelf |
When patterns are simple string matches, the strings are embedded directly into the BNF. When the pattern is a more complex regular expression, token symbols are used instead (the major cases of these are NCName, QName, and Number and String literals). It is up to an implementation to decide on the exact tokenization strategy, which may be different depending on the parser construction. For instance, an implementation may decide that a token named For is composed of only "for", or may decide that it is composed of ("for" "("). In the first case the implementation may decide to use lexical lookahead to distinguish the "for" lexeme from a QName that has the lexeme "for". In the second case, the implementation may decide to combine the two lexemes into a single "long" token. In either case, the end grammatical result will be the same. A lexeme that can only be recognized by lexical lookahead is written together with the lexemes it must look ahead to, and the whole group is delimited by "<" and ">".
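The two tokenization strategies can be sketched in Python as follows. This is only an illustration, not a prescribed implementation: it follows the `<"for" "$">` pattern listed in the lexical state tables, uses a simplified ASCII-only stand-in for NCName, and the names `FOR_WITH_LOOKAHEAD`, `FOR_LONG_TOKEN`, and `classify_leading_word` are hypothetical.

```python
import re

# Simplified ASCII-only NCName; the normative definition is in [XMLNAMES].
NCNAME = r"[A-Za-z_][A-Za-z0-9._-]*"
QNAME = re.compile(rf"(?:{NCNAME}:)?{NCNAME}")

# Strategy 1: the token is only "for", accepted when lookahead sees "$".
FOR_WITH_LOOKAHEAD = re.compile(r"for(?=\s*\$)")
# Strategy 2: a single "long" token that covers both lexemes.
FOR_LONG_TOKEN = re.compile(r"for\s*\$")


def classify_leading_word(text: str) -> str:
    """Return 'For' or 'QName' for the word at the start of text."""
    if FOR_WITH_LOOKAHEAD.match(text):
        return "For"
    if QNAME.match(text):
        return "QName"
    raise ValueError(f"no token at start of {text!r}")
```

With either strategy, `for $x in ...` begins with the keyword, while `for/child` begins with a QName whose lexeme happens to be "for".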
This grammar implies lexical states, which are lexical constraints on the tokenization process based on grammatical positioning. The exact structure of these states is left to the implementation, but the normative rules for calculating these states are given in the 1.1.2 Lexical Rules section.
When tokenizing, the longest possible match that is valid in the current lexical state is preferred.
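A minimal sketch of the longest-match rule, assuming the set of lexemes valid in the current state is already known. The helper name `longest_match` and the particular subset of operator lexemes are chosen for illustration only.

```python
from typing import List, Optional

# A few lexemes from the OPERATOR state table (illustrative subset).
OPERATOR_LEXEMES = ["/", "//", "<", "<=", "<<", ">", ">=", ">>", "!=", "="]


def longest_match(text: str, pos: int, lexemes: List[str]) -> Optional[str]:
    """Return the longest lexeme that starts at text[pos], or None."""
    candidates = [lex for lex in lexemes if text.startswith(lex, pos)]
    return max(candidates, key=len) if candidates else None
```

For the input `a//b` at offset 1, both "/" and "//" are candidates, and "//" is chosen.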
For readability, Whitespace may be used in most expressions even though it is not explicitly allowed by the grammar: Whitespace may be freely added between lexemes, except in a few cases where whitespace is needed to disambiguate the token. For instance, in XML, "-" is a valid character in an element or attribute name. When used as an operator after the characters of a name, it must be separated from the name, e.g. by using whitespace or parentheses.
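The effect of the "-" rule can be seen in a small Python sketch. It assumes a simplified ASCII-only name pattern and recognizes only names, "-", and parentheses; the `tokenize` helper is hypothetical.

```python
import re

# Simplified ASCII-only name (the normative definition is in [XMLNAMES]),
# tried before "-" so that a hyphen inside a name stays part of the name.
TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9._-]*|-|\(|\))")


def tokenize(expr: str) -> list:
    """Split expr into names, '-', and parentheses, skipping whitespace."""
    tokens, pos = [], 0
    expr = expr.rstrip()
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens
```

Here `a-b` yields the single name `a-b`, whereas `a - b` and `(a)-b` both contain "-" as a separate token.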
All keywords are case sensitive.
Ed. Note: Since currently ExprComments are specified at the lexical level, I leave it as an editorial issue as to how to describe this, or whether to constrain where ExprComments are allowed so they can be described by the BNF. (I would be in favor of this. -sb)
Character Classes
The following basic tokens are defined in [XML].
Identifiers
The following identifier components are defined in [XMLNAMES].
String Literals and Numbers
| [147] | IntegerLiteral | ::= | Digits |
| [148] | DecimalLiteral | ::= | ("." Digits) | (Digits "." [0-9]*) |
| [149] | DoubleLiteral | ::= | (("." Digits) | (Digits ("." [0-9]*)?)) ("e" | "E") ("+" | "-")? Digits |
| [162] | StringLiteral | ::= | ('"' (('"' '"') | [^"])* '"') | ("'" (("'" "'") | [^'])* "'") |
Comments
Comments are lexical constructs only, and do not affect the processing of an expression. [Ed. Note: Need to expand on where, exactly, comments are allowed.]
| [59] | ExprComment | ::= | "{--" [^}]* "--}" |
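The ExprComment production can be transcribed into a regular expression, sketched here in Python. Replacing each comment with a single space is an assumption of this example (it keeps the neighbouring lexemes apart); the `strip_comments` helper is hypothetical.

```python
import re

# "{--" [^}]* "--}" : the comment body cannot contain "}".
EXPR_COMMENT = re.compile(r"\{--[^}]*--\}")


def strip_comments(expr: str) -> str:
    """Replace every ExprComment in expr with a single space."""
    return EXPR_COMMENT.sub(" ", expr)
```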
Defined Tokens
The following is a list of defined tokens for the XPath grammar.
| [59] | S | ::= | WhitespaceChar+ |
| [105] | Nmstart | ::= | Letter | "_" |
| [106] | Nmchar | ::= | Letter | CombiningChar | Extender | Digit | "." | "-" | "_" |
| [145] | Digits | ::= | [0-9]+ |
| [162] | URLLiteral | ::= | ('"' (('"' '"') | [^"])* '"') | ("'" (("'" "'") | [^'])* "'") |
| [166] | VarName | ::= | QName |
| [167] | FuncName | ::= | (Prefix ":")? LocalPart |
| [173] | Char | ::= | ([#x0009] | [#x000D] | [#x000A] | [#x0020-#xFFFD]) |
| [174] | WhitespaceChar | ::= | ([#x0009] | [#x000D] | [#x000A] | [#x0020]) |
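The Char and WhitespaceChar productions amount to simple code point tests. The following Python sketch transcribes them exactly as written above; the function names are hypothetical.

```python
def is_char(ch: str) -> bool:
    """True if ch is matched by the Char production as written above."""
    cp = ord(ch)
    return cp in (0x0009, 0x000D, 0x000A) or 0x0020 <= cp <= 0xFFFD


def is_whitespace_char(ch: str) -> bool:
    """True if ch is matched by the WhitespaceChar production."""
    return ord(ch) in (0x0009, 0x000D, 0x000A, 0x0020)
```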
As discussed above, there are various strategies that can be used by an implementation to disambiguate token symbol choices. Among the choices are lexical look-ahead and look-behind, a two-pass lexical evaluation, and a single recursive descent lexical evaluation and parse. This specification does not dictate what strategy to use. However, this section does describe normative rules to which these decisions must conform.
The lexical contexts and the transitions between them are described in terms of a series of states and transitions between those states. An implementation need not follow this approach in implementing lexer rules, but does need to conform to the results. For instance, instead of using a state automaton, an implementation might use lexical look-behind, or might use a full context-free-grammar parse, or it might make extensive use of parser lookahead (and use a more ambiguous token strategy).
The tables below define the complete lexical rules for XPath. Each table corresponds to a lexical state in which the tokens listed are recognized only in that state. When a given token is recognized in the given state, the transition to the next state is given. In some cases, a transition will "push" the current state or a specific state onto an abstract stack, and will later restore that state by a "pop" when another lexical event occurs. [Ed. Note: pushParenState(...), popParenState(...), and resetParenStateOrSwitch(...) are left unexplained for now. Briefly, these are used to maintain the ITEMTYPE state within a function definition in XQuery, since it is hard to distinguish lexically between a function call and a function definition. Since the lexical state mechanism is shared between XPath and XQuery, these are used in the XPath tables also, even though they are not strictly needed.]
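The "push" and "pop" operations on the abstract stack might be modelled as in the following Python sketch. The class and method names are hypothetical, and the reading of a `DEFAULT pushState(DEFAULT)` transition as "save DEFAULT, then continue in DEFAULT" is an assumption of this example; the paren-state operations left unexplained above are not modelled.

```python
class LexicalStateStack:
    """Current lexical state plus a stack of saved states."""

    def __init__(self, initial="DEFAULT"):
        self.current = initial
        self._saved = []

    def push_state(self, state=None):
        """Save a specific state, or the current state when none is given."""
        self._saved.append(self.current if state is None else state)

    def pop_state(self):
        """Restore the most recently saved state."""
        self.current = self._saved.pop()
```

Under this reading, a "{" recognized in some state calls `push_state("DEFAULT")` and sets `current` to `"DEFAULT"`; the matching "}" calls `pop_state()`, whatever state the lexer has reached in between.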
In many cases the lexical states are closely connected to the parser productions. However, the fact that a token is recognized in a certain lexical state does not mean that it will be legal in the parser state.
This state is for patterns that can be recognized in any state.
| Pattern | Transition To State |
|---|---|
| WhitespaceChar Nmstart Nmchar Digits | (maintain current state) |
This state is for patterns that occur at the beginning of an expression.
| Pattern | Transition To State |
|---|---|
| "?" "[" "+" "-" "empty" | DEFAULT |
| "{" | DEFAULT pushState(DEFAULT) |
| <"type" QName> "validate" "*" <NCName ":" "*"> <"*" ":" NCName> "." ".." StringLiteral | OPERATOR |
| <"of" "type"> "/" "//" <"child" "::"> <"descendant" "::"> <"parent" "::"> <"attribute" "::"> <"self" "::"> <"descendant-or-self" "::"> <"ancestor" "::"> <"following-sibling" "::"> <"preceding-sibling" "::"> <"following" "::"> <"preceding" "::"> <"namespace" "::"> <"ancestor-or-self" "::"> "@" | QNAME |
| <"cast" "as"> <"treat" "as"> | ITEMTYPE |
| "$" <"for" "$"> <"some" "$"> <"every" "$"> | VARNAME |
| "," | resetParenStateOrSwitch(DEFAULT) |
| "(" | pushParenState(DEFAULT) |
| <"text" "("> <"comment" "("> <"node" "("> <"processing-instruction" "("> | pushParenState(DEFAULT) |
| <QName "("> <"if" "("> | pushParenState(DEFAULT) |
| ")" | popParenState(OPERATOR) |
| "}" | popState |
This state is for patterns that are defined for operators.
| Pattern | Transition To State |
|---|---|
| "/" "//" "div" "idiv" "mod" "and" "or" "*" "return" "then" "else" "to" "union" "intersect" "except" "=" "is" "!=" "isnot" "<=" ">=" "<" ">" "|" "<<" ">>" "eq" "ne" "gt" "ge" "lt" "le" "in" "satisfies" "?" "[" "+" "-" "item" "node" "document" "comment" "text" | DEFAULT |
| "{" | DEFAULT pushState(DEFAULT) |
| <NCName ":" "*"> <"*" ":" NCName> "." ".." StringLiteral | OPERATOR |
| <"of" "type"> | QNAME |
| <"instance" "of"> | ITEMTYPE |
| "$" <"for" "$"> <"some" "$"> <"every" "$"> | VARNAME |
| "," | resetParenStateOrSwitch(DEFAULT) |
| "(" | pushParenState(DEFAULT) |
| ")" | popParenState(OPERATOR) |
| "}" | popState |
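The following Python sketch shows how state-dependent recognition resolves the ambiguity of "*", which is a name wildcard in some states and the multiplication operator in the OPERATOR state. It uses only a small subset of the transitions from the state tables, ignores the stack operations, and is an illustration rather than a complete lexer; the names `TRANSITIONS` and `tokenize` are hypothetical.

```python
# (state, lexeme) -> next state, copied from a subset of the state tables.
TRANSITIONS = {
    ("DEFAULT", "*"): "OPERATOR",
    ("DEFAULT", "."): "OPERATOR",
    ("DEFAULT", ".."): "OPERATOR",
    ("DEFAULT", "/"): "QNAME",
    ("DEFAULT", "//"): "QNAME",
    ("OPERATOR", "*"): "DEFAULT",
    ("OPERATOR", "/"): "DEFAULT",
    ("OPERATOR", "//"): "DEFAULT",
    ("OPERATOR", "."): "OPERATOR",
    ("QNAME", "*"): "OPERATOR",
    ("QNAME", "."): "OPERATOR",
    ("QNAME", "/"): "QNAME",
}


def tokenize(expr, state="DEFAULT"):
    """Return (state in which the lexeme was recognized, lexeme) pairs."""
    out, pos = [], 0
    while pos < len(expr):
        if expr[pos] in " \t\r\n":  # WhitespaceChar keeps the current state
            pos += 1
            continue
        valid = [lex for (st, lex) in TRANSITIONS
                 if st == state and expr.startswith(lex, pos)]
        if not valid:
            raise ValueError(f"no pattern at offset {pos} in state {state}")
        lexeme = max(valid, key=len)  # longest match valid in this state
        out.append((state, lexeme))
        state = TRANSITIONS[(state, lexeme)]
        pos += len(lexeme)
    return out
```

For the input `/* * .`, the first "*" is recognized in the QNAME state (a wildcard) and the second in the OPERATOR state (multiplication).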
This state is used when a qualified name is expected and patterns that look like keywords must be disambiguated from it.
| Pattern | Transition To State |
|---|---|
| "*" <NCName ":" "*"> <"*" ":" NCName> "." ".." | OPERATOR |
| "/" "//" <"child" "::"> <"descendant" "::"> <"parent" "::"> <"attribute" "::"> <"self" "::"> <"descendant-or-self" "::"> <"ancestor" "::"> <"following-sibling" "::"> <"preceding-sibling" "::"> <"following" "::"> <"preceding" "::"> <"namespace" "::"> <"ancestor-or-self" "::"> "@" | QNAME |
| "$" | VARNAME |
| "(" | pushParenState(DEFAULT) |
| <"text" "("> <"comment" "("> <"node" "("> <"processing-instruction" "("> | pushParenState(DEFAULT) |
| ")" | popParenState(OPERATOR) |
This state distinguishes tokens that can occur only inside the ItemType production.
| Pattern | Transition To State |
|---|---|
| "attribute" "element" "node" "document" "comment" "text" "processing-instruction" "item" "untyped" <"atomic" "value"> AtomicType | DEFAULT |
| <NCName ":" "*"> <"*" ":" NCName> "." ".." | OPERATOR |
| "$" | VARNAME |
| ")" | popParenState(OPERATOR) |
This state differentiates variable names from qualified names. Only the pattern of a QName is recognized in this state, which avoids ambiguities that could otherwise occur.
| Pattern | Transition To State |
|---|---|
| VarName | OPERATOR |
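A sketch of why this state is useful: after "$", a word such as "div" is a variable name, not an operator. The Python example below models only the "$" and VarName transitions and a handful of operator keywords, uses a simplified unprefixed ASCII name in place of QName, and its labels (`VariableMarker`, `VarName`, `Operator`) are hypothetical.

```python
import re

NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")  # simplified stand-in for QName
OPERATOR_KEYWORDS = {"div", "idiv", "mod", "and", "or"}  # illustrative subset


def label_tokens(expr):
    """Return (lexeme, label) pairs for a '$name op $name ...' expression."""
    labels, state, pos = [], "DEFAULT", 0
    while pos < len(expr):
        if expr[pos] in " \t\r\n":
            pos += 1
            continue
        if state in ("DEFAULT", "OPERATOR") and expr[pos] == "$":
            labels.append(("$", "VariableMarker"))
            state, pos = "VARNAME", pos + 1  # "$" transitions to VARNAME
            continue
        m = NAME.match(expr, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        word = m.group(0)
        if state == "VARNAME":
            labels.append((word, "VarName"))
            state = "OPERATOR"  # VarName transitions to OPERATOR
        elif state == "OPERATOR" and word in OPERATOR_KEYWORDS:
            labels.append((word, "Operator"))
            state = "DEFAULT"
        else:
            raise ValueError(f"{word!r} not handled in state {state}")
        pos = m.end()
    return labels
```

In `$div div $mod`, the first "div" and the final "mod" are variable names, and only the middle "div" is an operator.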
Ed. Note: The "validate" keyword cannot easily be distinguished at lexical evaluation time from a QName whose value is "validate". This should be fixed in the future.
The following grammar uses the same Basic EBNF notation as [XML], except that grammar symbols always have initial capital letters. The EBNF contains the lexemes embedded in the productions.
Note:
Note that the Semicolon character is reserved for future use.