A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO10646] (see also [ISO10646-2000]). Legal characters are those allowed in the [XML] recommendation.
A lexical pattern is a rule that describes how a sequence of characters can match a grammar unit. A lexeme is the smallest meaningful unit in the grammar that has syntactic interpretation. A token is a symbol that matches lexemes, and is the output of the lexical analyzer. A token symbol is the symbolic name given to that token. A single token may be composed of one or more lexemes. If there is more than one lexeme, they may be separated by whitespace or punctuation. For instance, a token AxisDescendantOrSelf might have two lexemes, "descendant-or-self" and "::".
Pattern | Lexeme(s) | Token Names (for example) |
---|---|---|
"or" | "or" | Or |
"=" | "=" | Equals |
(Prefix ':')? LocalPart | "p" ":" "foo" | QName |
<"descendant-or-self" "::"> | "descendant-or-self" "::" | AxisDescendantOrSelf |
When patterns are simple string matches, the strings are embedded directly into the BNF. In other cases, token symbols are used when the pattern is a more complex regular expression (the major cases of these are NCName, QName, and Number and String literals). It is up to an implementation to decide on the exact tokenization strategy, which may be different depending on the parser construction. For example, an implementation may decide that a token named For is composed of only "for", or may decide that it is composed of ("for" "$"). In the first case the implementation may decide to use lexical lookahead to distinguish the "for" lexeme from a QName that has the lexeme "for". In the second case, the implementation may decide to combine the two lexemes into a single "long" token. In either case, the end grammatical result will be the same. In the BNF, the notation "< ... >" is used to indicate and delimit a sequence of lexemes that must be recognized using lexical lookahead or some equivalent means.
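The second strategy above (treating "for" followed by "$" as one "long" token) can be sketched as follows. This is only an illustration under stated assumptions: the function name `first_token` is hypothetical, and the name pattern is simplified to ASCII characters rather than the full NCName definition from [XMLNAMES].

```python
import re

# Simplified name pattern (ASCII only); the real NCName rules are in [XMLNAMES].
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")
# The <"for" "$"> pattern: the two lexemes may be separated by whitespace.
FOR_DOLLAR = re.compile(r"for\s*\$")

def first_token(text):
    """Return (token_symbol, lexemes) for the token at the start of text."""
    if FOR_DOLLAR.match(text):
        # "for" is only treated as a keyword when "$" follows it.
        return ("For", ["for", "$"])
    m = NAME.match(text)
    if m:
        # Otherwise "for" is an ordinary name, like any other.
        return ("QName", [m.group(0)])
    raise ValueError("no token recognized at start of input")
```

With this sketch, `first_token("for $x in y")` yields the For token with two lexemes, while `first_token("for/bar")` yields a QName whose single lexeme is "for".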
This grammar implies lexical states, which are lexical constraints on the tokenization process based on grammatical positioning. The exact structure of these states is left to the implementation, but the normative rules for calculating these states are given in the 1.1.2 Lexical Rules section.
When tokenizing, the longest possible match that is valid in the current lexical state is preferred.
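As a minimal sketch of the longest-match rule, the following picks the longest operator lexeme that matches at a given position. The operator list is a small subset chosen for illustration, and the function name `longest_operator` is hypothetical.

```python
# A few operator lexemes that share prefixes (illustrative subset only).
OPERATORS = ["<", "<=", "<<", ">", ">=", ">>", "=", "!=", "/", "//"]

def longest_operator(text, pos=0):
    """Return the longest operator lexeme starting at pos, or None."""
    candidates = [op for op in OPERATORS if text.startswith(op, pos)]
    # Prefer the longest candidate, so "<<" wins over "<".
    return max(candidates, key=len) if candidates else None
```

For example, `longest_operator("<<b")` returns "<<" rather than "<", and `longest_operator("//a")` returns "//" rather than "/".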
For readability, Whitespace may be used in most expressions even though not explicitly notated in the BNF. Whitespace may be freely added between lexemes, except in a few cases where whitespace is needed to disambiguate the token. For instance, in XML, "-" is a valid character in an element or attribute name. When used as an operator after the characters of a name, it must be separated from the name, e.g. by using whitespace or parentheses.
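The effect of "-" being a name character can be seen in a small sketch. It assumes a simplified ASCII-only name pattern, and the function name `split_names_and_minus` is hypothetical. It handles only names, whitespace, and "-".

```python
import re

# Simplified name pattern (ASCII only); "-" is allowed after the first character.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")

def split_names_and_minus(expr):
    """Split expr into names and "-" operators using longest match."""
    tokens, pos = [], 0
    while pos < len(expr):
        if expr[pos] in " \t\r\n":
            pos += 1                      # skip whitespace between lexemes
            continue
        m = NAME.match(expr, pos)
        if m:
            tokens.append(m.group(0))     # "-" inside a name stays in the name
            pos = m.end()
        elif expr[pos] == "-":
            tokens.append("-")            # "-" not attached to a name is an operator
            pos += 1
        else:
            raise ValueError(f"unexpected character at position {pos}")
    return tokens
```

Here `split_names_and_minus("a-b")` gives a single name `["a-b"]`, whereas `split_names_and_minus("a - b")` gives `["a", "-", "b"]`.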
Special whitespace notation is specified with the BNF productions when it differs from the default rules. "ws: explicit" means that the places where whitespace is allowed must be explicitly notated in the BNF. "ws: significant" means that whitespace is significant as value content.
All keywords are case sensitive.
Character Classes
The following basic tokens are defined in [XML].
Identifiers
The following identifier components are defined in [XMLNAMES].
String Literals and Numbers
[1] | IntegerLiteral | ::= | Digits |
[2] | DecimalLiteral | ::= | ("." Digits) | (Digits "." [0-9]*) |
[3] | DoubleLiteral | ::= | (("." Digits) | (Digits ("." [0-9]*)?)) ("e" | "E") ("+" | "-")? Digits |
[4] | StringLiteral (ws: significant) | ::= | ('"' (('"' '"') | [^"])* '"') | ("'" (("'" "'") | [^'])* "'") |
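Productions [1] through [4] translate fairly directly into regular expressions. The following is a sketch of that translation, not a normative definition; the names `classify_literal`, `INTEGER`, `DECIMAL`, `DOUBLE`, and `STRING` are introduced here for illustration.

```python
import re

DIGITS = r"[0-9]+"                                   # production [9]

INTEGER = re.compile(DIGITS)                         # [1] IntegerLiteral
DECIMAL = re.compile(rf"\.{DIGITS}|{DIGITS}\.[0-9]*")  # [2] DecimalLiteral
DOUBLE = re.compile(                                 # [3] DoubleLiteral
    rf"(?:\.{DIGITS}|{DIGITS}(?:\.[0-9]*)?)[eE][+-]?{DIGITS}"
)
STRING = re.compile(                                 # [4] StringLiteral
    r'"(?:""|[^"])*"' + r"|'(?:''|[^'])*'"           # a doubled quote escapes itself
)

def classify_literal(text):
    """Return the production name that matches all of text, or None."""
    for name, pattern in (
        ("IntegerLiteral", INTEGER),
        ("DecimalLiteral", DECIMAL),
        ("DoubleLiteral", DOUBLE),
        ("StringLiteral", STRING),
    ):
        if pattern.fullmatch(text):
            return name
    return None
```

Because each pattern is tested with `fullmatch`, the order of the checks does not matter: an IntegerLiteral has no ".", a DecimalLiteral has a "." but no exponent, and a DoubleLiteral always has an exponent.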
Comments
Comments are lexical constructs only, and do not affect the processing of an expression. They are allowed wherever whitespace is allowed, as long as the whitespace notation is not 'explicit' or 'significant'.
[5] | ExprComment | ::= | "{--" [^}]* "--}" |
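Production [5] can be sketched as a regular expression. The helper `strip_comments` is hypothetical and deliberately naive: it replaces every match with a space and does not check whether the match lies inside a string literal, where whitespace is significant and a comment would not be recognized.

```python
import re

# [5] ExprComment: "{--", any characters other than "}", then "--}".
EXPR_COMMENT = re.compile(r"\{--[^}]*--\}")

def strip_comments(expr):
    """Replace each comment with a single space (naive: ignores string literals)."""
    return EXPR_COMMENT.sub(" ", expr)
```

Note that, as the production is written, a comment body cannot contain a "}" character.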
Defined Tokens
The following is a list of defined tokens for the XPath grammar.
[6] | S | ::= | WhitespaceChar+ |
[7] | Nmstart | ::= | Letter | "_" |
[8] | Nmchar | ::= | Letter | CombiningChar | Extender | Digit | "." | "-" | "_" |
[9] | Digits | ::= | [0-9]+ |
[10] | VarName | ::= | QName |
[11] | Char | ::= | ([#x0009] | [#x000D] | [#x000A] | [#x0020-#xFFFD]) |
[12] | WhitespaceChar | ::= | ([#x0009] | [#x000D] | [#x000A] | [#x0020]) |
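Productions [11] and [12] are simple code point ranges. The following sketch tests a single character against them exactly as written above; the function names are hypothetical.

```python
def is_whitespace_char(ch):
    """[12] WhitespaceChar: tab, carriage return, line feed, or space."""
    return len(ch) == 1 and ch in "\t\r\n "

def is_char(ch):
    """[11] Char: tab, CR, LF, or any code point from #x0020 to #xFFFD."""
    return len(ch) == 1 and (ch in "\t\r\n" or 0x20 <= ord(ch) <= 0xFFFD)
```

For instance, a NUL character and U+FFFF are rejected by `is_char`, and a no-break space (U+00A0) is a Char but not a WhitespaceChar.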
The lexical contexts and the transitions between them are described in terms of a series of states and transitions between those states.
As discussed above, there are various strategies that can be used by an implementation to disambiguate token symbol choices. Among the choices are lexical look-ahead and look-behind, a two-pass lexical evaluation, and a single recursive descent lexical evaluation and parse. This specification does not dictate what strategy to use. An implementation need not follow this approach in implementing lexer rules, but does need to conform to the results. For instance, instead of using a state automaton, an implementation might use lexical look-behind, or might use a full context-free-grammar parse, or it might make extensive use of parser lookahead (and use a more ambiguous token strategy).
The tables below define the complete lexical rules for XPath. Each table corresponds to a lexical state in which the tokens listed are recognized only in that state. When a given token is recognized in the given state, the transition to the next state is given. In some cases, a transition will "push" the current state or a specific state onto an abstract stack, and will later restore that state by a "pop" when another lexical event occurs.
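The table-driven approach, including "push" and "pop" on an abstract stack, can be sketched as follows. This is a toy under stated assumptions: it covers only two states and a handful of patterns, the names `RULES` and `trace_states` are hypothetical, and it is one possible reading of the state tables rather than a complete or normative lexer.

```python
import re

# Each rule: (pattern, next_state, state_to_push). "pop" restores a saved state.
RULES = {
    "DEFAULT": [
        (re.compile(r"[0-9]+"), "OPERATOR", None),    # IntegerLiteral
        (re.compile(r"\*"), "OPERATOR", None),        # "*" as a wildcard
        (re.compile(r"[+\-(]"), "DEFAULT", None),
        (re.compile(r"\{"), "DEFAULT", "DEFAULT"),    # DEFAULT pushState(DEFAULT)
        (re.compile(r"\}"), "pop", None),             # popState
    ],
    "OPERATOR": [
        (re.compile(r"[+\-*(]"), "DEFAULT", None),    # "*" as an operator
        (re.compile(r"[0-9]+"), "OPERATOR", None),
        (re.compile(r"\{"), "DEFAULT", "DEFAULT"),
        (re.compile(r"\}"), "pop", None),
    ],
}

def trace_states(expr):
    """Return (lexeme, state_after) pairs for expr, starting in DEFAULT."""
    state, stack, pos, trace = "DEFAULT", [], 0, []
    while pos < len(expr):
        if expr[pos] in " \t\r\n":
            pos += 1                       # whitespace maintains the current state
            continue
        for pattern, next_state, pushed in RULES[state]:
            m = pattern.match(expr, pos)
            if m:
                break
        else:
            raise ValueError(f"no pattern matches in state {state} at {pos}")
        if next_state == "pop":
            if not stack:
                raise ValueError("popState with an empty stack")
            state = stack.pop()
        else:
            if pushed is not None:
                stack.append(pushed)
            state = next_state
        trace.append((m.group(0), state))
        pos = m.end()
    return trace
```

Tracing `"* * *"` shows why the states matter: the first and third "*" are recognized in the DEFAULT state and lead to OPERATOR, while the second is recognized in the OPERATOR state and leads back to DEFAULT.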
In many cases the lexical states have a close connection to the parser productions. However, the fact that a token is recognized in a certain lexical state does not mean that it will be legal in the parser state.
This state is for patterns that can be recognized in any state.
Pattern | Transition To State |
---|---|
WhitespaceChar Nmstart Nmchar Digits | (maintain current state) |
This state is for patterns that occur at the beginning of an expression.
Pattern | Transition To State |
---|---|
"?" "[" "+" "-" "(" <"text" "("> <"comment" "("> <"node" "("> <"processing-instruction" "("> <"if" "("> <QName "("> <"validate" "context"> | DEFAULT |
"{" <"validate" "{"> | DEFAULT pushState(DEFAULT) |
"]" IntegerLiteral DecimalLiteral DoubleLiteral <"type" QName> "*" <NCName ":" "*"> <"*" ":" NCName> "." ".." ")" StringLiteral | OPERATOR |
<"of" "type"> "/" "//" <"child" "::"> <"descendant" "::"> <"parent" "::"> <"attribute" "::"> <"self" "::"> <"descendant-or-self" "::"> <"ancestor" "::"> <"following-sibling" "::"> <"preceding-sibling" "::"> <"following" "::"> <"preceding" "::"> <"namespace" "::"> <"ancestor-or-self" "::"> "@" | QNAME |
<"cast" "as"> <"treat" "as"> | ITEMTYPE |
"$" <"for" "$"> <"some" "$"> <"every" "$"> | VARNAME |
"," | resetParenStateOrSwitch(DEFAULT) |
ExprComment | (maintain state) |
"}" | popState |
This state is for patterns that are defined for operators.
Pattern | Transition To State |
---|---|
"/" "//" "div" "idiv" "mod" "and" "or" "*" "return" "then" "else" "to" "union" "intersect" "except" "=" "is" "!=" "isnot" "<=" ">=" "<" ">" "|" "<<" ">>" "eq" "ne" "gt" "ge" "lt" "le" "in" "context" "satisfies" "?" "[" "+" "-" "(" "item" "node" "document" "comment" "text" | DEFAULT |
"{" <"validate" "{"> | DEFAULT pushState(DEFAULT) |
"]" IntegerLiteral DecimalLiteral DoubleLiteral <NCName ":" "*"> <"*" ":" NCName> "." ".." ")" StringLiteral | OPERATOR |
<"of" "type"> | QNAME |
<"instance" "of"> <"castable" "as"> | ITEMTYPE |
"$" <"for" "$"> <"some" "$"> <"every" "$"> | VARNAME |
"," | resetParenStateOrSwitch(DEFAULT) |
ExprComment | (maintain state) |
"}" | popState |
When a qualified name is expected, and it is required to remove ambiguity from patterns that look like keywords, this state is used.
Pattern | Transition To State |
---|---|
"(" <"text" "("> <"comment" "("> <"node" "("> <"processing-instruction" "("> | DEFAULT |
"*" <NCName ":" "*"> <"*" ":" NCName> "." ".." ")" | OPERATOR |
"/" "//" <"child" "::"> <"descendant" "::"> <"parent" "::"> <"attribute" "::"> <"self" "::"> <"descendant-or-self" "::"> <"ancestor" "::"> <"following-sibling" "::"> <"preceding-sibling" "::"> <"following" "::"> <"preceding" "::"> <"namespace" "::"> <"ancestor-or-self" "::"> "@" | QNAME |
"$" | VARNAME |
"," | resetParenStateOrSwitch(DEFAULT) |
ExprComment | (maintain state) |
This state distinguishes tokens that can occur only inside the ItemType production.
Pattern | Transition To State |
---|---|
"attribute" "element" "node" "document" "comment" "text" "processing-instruction" "item" "untyped" <"atomic" "value"> AtomicType "empty" | DEFAULT |
"{" <"validate" "{"> | DEFAULT pushState(DEFAULT) |
<NCName ":" "*"> <"*" ":" NCName> "." ".." ")" | OPERATOR |
"$" | VARNAME |
ExprComment | (maintain state) |
This state differentiates variable names from qualified names. This allows only the pattern of a QName to be recognized when otherwise ambiguities could occur.
Pattern | Transition To State |
---|---|
VarName | OPERATOR |
ExprComment | (maintain state) |
Ed. Note: The "validate" keyword cannot be easily distinguished at lexical evaluation time from a QName whose value is "validate". This should be fixed in the future.
The following grammar uses the same Basic EBNF notation as [XML], except that grammar symbols always have initial capital letters. The EBNF contains the lexemes embedded in the productions.
Note:
Note that the Semicolon character is reserved for future use.