1 XPath Grammar

1.1 Lexical structure

A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO10646] (see also [ISO10646-2000]). Legal characters are those allowed in the [XML] recommendation.

A lexical pattern is a rule that describes how a sequence of characters can match a grammar unit. A lexeme is the smallest meaningful unit in the grammar that has syntactic interpretation. A token is a symbol that matches lexemes, and is the output of the lexical analyzer. A token symbol is the symbolic name given to that token. A single token may be composed of one or more lexemes. If there is more than one lexeme, they may be separated by whitespace or punctuation. For instance, a token AxisDescendantOrSelf might have two lexemes, "descendant-or-self" and "::".

Pattern                        Lexeme(s)                     Token Name (for example)
"or"                           "or"                          Or
"="                            "="                           Equals
(Prefix ':')? LocalPart        "p" ":" "foo"                 QName
<"descendant-or-self" "::">    "descendant-or-self" "::"     AxisDescendantOrSelf

When patterns are simple string matches, the strings are embedded directly into the BNF. When the pattern is a more complex regular expression, token symbols are used instead (the major cases are NCName, QName, and the Number and String literals). It is up to an implementation to decide on the exact tokenization strategy, which may differ depending on how the parser is constructed. For example, an implementation may decide that a token named For is composed of only "for", or may decide that it is composed of ("for" "$"). In the first case the implementation may decide to use lexical lookahead to distinguish the "for" lexeme from a QName that has the lexeme "for". In the second case, the implementation may decide to combine the two lexemes into a single "long" token. In either case, the end grammatical result will be the same. In the BNF, the notation "< ... >" is used to indicate and delimit a sequence of lexemes that must be recognized using lexical lookahead or some equivalent means.
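The "long token" strategy for <"for" "$"> can be sketched as follows. This is only an illustration, not a prescribed implementation: the names first_token, NCNAME and FOR_DOLLAR are hypothetical, and the name pattern is a simplified ASCII-only stand-in for the real NCName definition.

```python
import re

# Simplified, ASCII-only stand-in for NCName (an assumption; the real
# definition admits many more characters).
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")

# <"for" "$">: the two lexemes may be separated by whitespace.
FOR_DOLLAR = re.compile(r"for\s*\$")


def first_token(text):
    """Return (token_name, matched_text) for the token at the start of text."""
    m = FOR_DOLLAR.match(text)
    if m:
        # "for" followed by "$": recognized as one long For token.
        return ("For", m.group(0))
    m = NCNAME.match(text)
    if m:
        # Otherwise "for" is just a name like any other.
        return ("QName", m.group(0))
    raise ValueError("no token at start of input")
```

With this sketch, first_token("for $x in y") yields the For token, while first_token("for/bar") yields a QName whose lexeme is "for".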

This grammar implies lexical states, which are lexical constraints on the tokenization process based on grammatical positioning. The exact structure of these states is left to the implementation, but the normative rules for calculating these states are given in section 1.1.2 Lexical Rules.

When tokenizing, the longest possible match that is valid in the current lexical state is preferred.
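A minimal sketch of the longest-match rule, assuming the patterns valid in the current state are plain literal strings. The function name longest_match and the list OPERATOR_SUBSET are hypothetical, and the sketch ignores the name-boundary checks a real lexer would need for keyword-like operators.

```python
def longest_match(text, pos, patterns):
    """Return the longest literal in patterns that matches text at pos, or None."""
    best = None
    for p in patterns:
        if text.startswith(p, pos) and (best is None or len(p) > len(best)):
            best = p
    return best


# A few literals taken from the OPERATOR state (illustrative subset only).
OPERATOR_SUBSET = ["<", "<=", "<<", "/", "//", "is", "isnot"]
```

For the input "a<=b" at position 1, both "<" and "<=" match, and the longer "<=" is chosen.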

For readability, whitespace may be used in most expressions even though it is not explicitly notated in the BNF. Whitespace may be freely added between lexemes, except in a few cases where whitespace is needed to disambiguate the token. For instance, in XML, "-" is a valid character in an element or attribute name. When used as an operator after the characters of a name, it must be separated from the name, e.g. by using whitespace or parentheses.
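The effect of the "-" rule can be seen in a small sketch. It assumes a simplified ASCII-only name pattern in which "-" is a name character but not a name start character. The names TOKEN and split_names_and_minus are hypothetical.

```python
import re

# A name (simplified, ASCII only) or a lone "-", with optional leading whitespace.
TOKEN = re.compile(r"\s*(?:(?P<name>[A-Za-z_][A-Za-z0-9_.\-]*)|(?P<minus>-))")


def split_names_and_minus(text):
    """Split text into ("name", ...) and ("minus", "-") pairs."""
    text = text.rstrip()
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at offset %d" % pos)
        # lastgroup names whichever alternative matched.
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens
```

Here "a-b" is read as the single name "a-b", whereas "a - b" is read as a name, a minus sign, and another name.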

Special whitespace notation is specified with the BNF productions when it differs from the default rules. "ws: explicit" means that every place where whitespace is allowed must be explicitly notated in the BNF. "ws: significant" means that whitespace is significant as value content.

All keywords are case sensitive.

1.1.1 Syntactic Constructs

1.1.2 Lexical Rules

The lexical contexts and the transitions between them are described in terms of a series of states and transitions between those states.

As discussed above, there are various strategies that can be used by an implementation to disambiguate token symbol choices. Among the choices are lexical look-ahead and look-behind, a two-pass lexical evaluation, and a single recursive descent lexical evaluation and parse. This specification does not dictate what strategy to use. An implementation need not follow this approach in implementing lexer rules, but does need to conform to the results. For instance, instead of using a state automaton, an implementation might use lexical look-behind, or might use a full context-free-grammar parse, or it might make extensive use of parser lookahead (and use a more ambiguous token strategy).

The tables below define the complete lexical rules for XPath. Each table corresponds to a lexical state and shows the tokens that are recognized when in that state. When a given token is recognized in the given state, the transition to the next state is given. In some cases, a transition will "push" the current state or a specific state onto an abstract stack, and will later restore that state by a "pop" when another lexical event occurs.

In many cases the lexical states are closely connected to the parser productions. However, the fact that a token is recognized in a certain lexical state does not mean that it will be legal in the parser state.
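One possible reading of the state tables is a table-driven lexer with an explicit stack. The sketch below covers only a handful of rows from the DEFAULT, OPERATOR and VARNAME tables, uses simplified patterns for IntegerLiteral and VarName, and treats pushState(DEFAULT) as pushing DEFAULT onto the stack. The names RULES, POP and tokenize are hypothetical, and an implementation is free to use a different strategy.

```python
import re

POP = "popState"

# state -> list of (pattern, transition, state to push or None)
RULES = {
    "DEFAULT": [
        (r"validate\s*\{", "DEFAULT", "DEFAULT"),  # <"validate" "{">
        (r"\(", "DEFAULT", None),
        (r"[0-9]+", "OPERATOR", None),             # IntegerLiteral (simplified)
        (r"\)", "OPERATOR", None),
        (r"\$", "VARNAME", None),
        (r"\}", POP, None),
    ],
    "OPERATOR": [
        (r"[+\-*]|div\b", "DEFAULT", None),
        (r"\)", "OPERATOR", None),
        (r"\}", POP, None),
    ],
    "VARNAME": [
        (r"[A-Za-z_][A-Za-z0-9_]*", "OPERATOR", None),  # VarName (simplified)
    ],
}


def tokenize(text, state="DEFAULT"):
    """Return a list of (state in which the lexeme was recognized, lexeme)."""
    stack, out, pos = [], [], 0
    while True:
        # Whitespace is skipped in any state and maintains the current state.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            return out
        for pattern, transition, push in RULES[state]:
            m = re.compile(pattern).match(text, pos)
            if m:
                out.append((state, m.group(0)))
                if push is not None:
                    stack.append(push)
                state = stack.pop() if transition == POP else transition
                pos = m.end()
                break
        else:
            raise ValueError("no pattern matches in state %s at offset %d" % (state, pos))
```

For the input "$div div 2", the first "div" is recognized in the VARNAME state as a variable name and the second in the OPERATOR state as an operator, which is the kind of ambiguity the states exist to resolve.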

The #ANY State

This state is for patterns that can be recognized in any state.

Pattern                           Transition To State

WhitespaceChar                    (maintain current state)
Nmstart
Nmchar
Digits

The DEFAULT State

This state is for patterns that occur at the beginning of an expression.

Pattern                           Transition To State

"?"                               DEFAULT
"["
"+"
"-"
"("
<"text" "(">
<"comment" "(">
<"node" "(">
<"processing-instruction" "(">
<"if" "(">
<QName "(">
<"validate" "context">

"{"                               DEFAULT
<"validate" "{">                  pushState(DEFAULT)

"]"                               OPERATOR
IntegerLiteral
DecimalLiteral
DoubleLiteral
<"type" QName>
"*"
<NCName ":" "*">
<"*" ":" NCName>
"."
".."
")"
StringLiteral

<"of" "type">                     QNAME
"/"
"//"
<"child" "::">
<"descendant" "::">
<"parent" "::">
<"attribute" "::">
<"self" "::">
<"descendant-or-self" "::">
<"ancestor" "::">
<"following-sibling" "::">
<"preceding-sibling" "::">
<"following" "::">
<"preceding" "::">
<"namespace" "::">
<"ancestor-or-self" "::">
"@"

<"cast" "as">                     ITEMTYPE
<"treat" "as">

"$"                               VARNAME
<"for" "$">
<"some" "$">
<"every" "$">

","                               resetParenStateOrSwitch(DEFAULT)

ExprComment                       (maintain state)

"}"                               popState

The OPERATOR State

This state is for patterns that are defined for operators.

Pattern                           Transition To State

"/"                               DEFAULT
"//"
"div"
"idiv"
"mod"
"and"
"or"
"*"
"return"
"then"
"else"
"to"
"union"
"intersect"
"except"
"="
"is"
"!="
"isnot"
"<="
">="
"<"
">"
"|"
"<<"
">>"
"eq"
"ne"
"gt"
"ge"
"lt"
"le"
"in"
"context"
"satisfies"
"?"
"["
"+"
"-"
"("
"item"
"node"
"document"
"comment"
"text"

"{"                               DEFAULT
<"validate" "{">                  pushState(DEFAULT)

"]"                               OPERATOR
IntegerLiteral
DecimalLiteral
DoubleLiteral
<NCName ":" "*">
<"*" ":" NCName>
"."
".."
")"
StringLiteral

<"of" "type">                     QNAME

<"instance" "of">                 ITEMTYPE
<"castable" "as">

"$"                               VARNAME
<"for" "$">
<"some" "$">
<"every" "$">

","                               resetParenStateOrSwitch(DEFAULT)

ExprComment                       (maintain state)

"}"                               popState

The QNAME State

This state is used when a qualified name is expected and patterns that look like keywords must be disambiguated.

Pattern                           Transition To State

"("                               DEFAULT
<"text" "(">
<"comment" "(">
<"node" "(">
<"processing-instruction" "(">

"*"                               OPERATOR
<NCName ":" "*">
<"*" ":" NCName>
"."
".."
")"

"/"                               QNAME
"//"
<"child" "::">
<"descendant" "::">
<"parent" "::">
<"attribute" "::">
<"self" "::">
<"descendant-or-self" "::">
<"ancestor" "::">
<"following-sibling" "::">
<"preceding-sibling" "::">
<"following" "::">
<"preceding" "::">
<"namespace" "::">
<"ancestor-or-self" "::">
"@"

"$"                               VARNAME

","                               resetParenStateOrSwitch(DEFAULT)

ExprComment                       (maintain state)

The ITEMTYPE State

This state distinguishes tokens that can occur only inside the ItemType production.

Pattern                           Transition To State

"attribute"                       DEFAULT
"element"
"node"
"document"
"comment"
"text"
"processing-instruction"
"item"
"untyped"
<"atomic" "value">
AtomicType
"empty"

"{"                               DEFAULT
<"validate" "{">                  pushState(DEFAULT)

<NCName ":" "*">                  OPERATOR
<"*" ":" NCName>
"."
".."
")"

"$"                               VARNAME

ExprComment                       (maintain state)

The VARNAME State

This state differentiates variable names from qualified names. This allows only the pattern of a QName to be recognized when otherwise ambiguities could occur.

Pattern                           Transition To State

VarName                           OPERATOR

ExprComment                       (maintain state)

Ed. Note: The "validate" keyword cannot easily be distinguished at lexical evaluation time from a QName whose value is "validate". This should be fixed in the future.

1.2 BNF

The following grammar uses the same Basic EBNF notation as [XML], except that grammar symbols always have initial capital letters. The EBNF contains the lexemes embedded in the productions.

Note: The semicolon character is reserved for future use.

NON-TERMINALS

[13]   XPath   ::=   ExprSequence?
[14]   ExprSequence   ::=   Expr ("," Expr)*
[15]   Expr   ::=   OrExpr
[16]   OrExpr   ::=   AndExpr ( "or"  AndExpr )*
[17]   AndExpr   ::=   ForExpr ( "and"  ForExpr )*
[18]   ForExpr   ::=   (SimpleForClause "return")* QuantifiedExpr
[19]   QuantifiedExpr   ::=   ((<"some" "$"> |  <"every" "$">) VarName "in" Expr ("," "$" VarName "in" Expr)* "satisfies")* IfExpr
[20]   IfExpr   ::=   (<"if" "("> Expr ")" "then" Expr "else")* InstanceofExpr
[21]   InstanceofExpr   ::=   CastableExpr ( <"instance" "of"> SequenceType )?
[22]   CastableExpr   ::=   ComparisonExpr ( <"castable" "as"> SingleType )?
[23]   ComparisonExpr   ::=   RangeExpr ( (ValueComp |  GeneralComp |  NodeComp |  OrderComp)  RangeExpr )?
[24]   RangeExpr   ::=   AdditiveExpr ( "to"  AdditiveExpr )?
[25]   AdditiveExpr   ::=   MultiplicativeExpr ( ("+" |  "-")  MultiplicativeExpr )*
[26]   MultiplicativeExpr   ::=   UnaryExpr ( ("*" |  "div" |  "idiv" |  "mod")  UnaryExpr )*
[27]   UnaryExpr   ::=   ("-" |  "+")* UnionExpr
[28]   UnionExpr   ::=   IntersectExceptExpr ( ("union" |  "|")  IntersectExceptExpr )*
[29]   IntersectExceptExpr   ::=   ValueExpr ( ("intersect" |  "except")  ValueExpr )*
[30]   ValueExpr   ::=   ValidateExpr |  CastExpr |  TreatExpr |  PathExpr
[31]   PathExpr   ::=   ("/" RelativePathExpr?) |  ("//" RelativePathExpr) |  RelativePathExpr
[32]   RelativePathExpr   ::=   StepExpr (("/" |  "//") StepExpr)*
[33]   StepExpr   ::=   (ForwardStep |  ReverseStep |  PrimaryExpr) Predicates
[34]   SimpleForClause   ::=   <"for" "$"> VarName "in" Expr ("," "$" VarName "in" Expr)*
[35]   ValidateExpr   ::=   (<"validate" "{"> |  (<"validate" "context"> SchemaGlobalContext ("/" SchemaContextStep)* "{")) Expr "}"
[36]   CastExpr   ::=   <"cast" "as"> SingleType ParenthesizedExpr
[37]   TreatExpr   ::=   <"treat" "as"> SequenceType ParenthesizedExpr
[38]   GeneralComp   ::=   "=" |  "!=" |  "<" |  "<=" |  ">" |  ">="
[39]   ValueComp   ::=   "eq" |  "ne" |  "lt" |  "le" |  "gt" |  "ge"
[40]   NodeComp   ::=   "is" |  "isnot"
[41]   OrderComp   ::=   "<<" |  ">>"
[42]   PrimaryExpr   ::=   Literal |  FunctionCall |  ("$" VarName) |  ParenthesizedExpr
[43]   ForwardAxis   ::=   <"child" "::">
      |  <"descendant" "::">
      |  <"attribute" "::">
      |  <"self" "::">
      |  <"descendant-or-self" "::">
      |  <"following-sibling" "::">
      |  <"following" "::">
      |  <"namespace" "::">
[44]   ReverseAxis   ::=   <"parent" "::">
      |  <"ancestor" "::">
      |  <"preceding-sibling" "::">
      |  <"preceding" "::">
      |  <"ancestor-or-self" "::">
[45]   NodeTest   ::=   KindTest |  NameTest
[46]   NameTest   ::=   QName |  Wildcard
[47]   Wildcard   ::=   "*" |  <NCName ":" "*"> |  <"*" ":" NCName>
[48]   KindTest   ::=   ProcessingInstructionTest
      |  CommentTest
      |  TextTest
      |  AnyKindTest
[49]   ProcessingInstructionTest   ::=   <"processing-instruction" "("> StringLiteral? ")"
[50]   CommentTest   ::=   <"comment" "("> ")"
[51]   TextTest   ::=   <"text" "("> ")"
[52]   AnyKindTest   ::=   <"node" "("> ")"
[53]   ForwardStep   ::=   (ForwardAxis NodeTest) |  AbbreviatedForwardStep
[54]   ReverseStep   ::=   (ReverseAxis NodeTest) |  AbbreviatedReverseStep
[55]   AbbreviatedForwardStep   ::=   "." |  ("@" NameTest) |  NodeTest
[56]   AbbreviatedReverseStep   ::=   ".."
[57]   Predicates   ::=   ("[" Expr "]")*
[58]   NumericLiteral   ::=   IntegerLiteral |  DecimalLiteral |  DoubleLiteral
[59]   Literal   ::=   NumericLiteral |  StringLiteral
[60]   ParenthesizedExpr   ::=   "(" ExprSequence? ")"
[61]   FunctionCall   ::=   <QName "("> (Expr ("," Expr)*)? ")"
[62]   SchemaContext   ::=   "context" SchemaGlobalContext ("/" SchemaContextStep)*
[63]   SchemaGlobalContext   ::=   QName |  <"type" QName>
[64]   SchemaContextStep   ::=   QName
[65]   SingleType   ::=   AtomicType "?"?
[66]   SequenceType   ::=   (ItemType OccurrenceIndicator) |  "empty"
[67]   ItemType   ::=   (("element" |  "attribute") ElemOrAttrType?)
      |  "node"
      |  "processing-instruction"
      |  "comment"
      |  "text"
      |  "document"
      |  "item"
      |  AtomicType
      |  "untyped"
      |  <"atomic" "value">
[68]   ElemOrAttrType   ::=   (QName (SchemaType |  SchemaContext?)) |  SchemaType
[69]   SchemaType   ::=   <"of" "type"> QName
[70]   AtomicType   ::=   QName
[71]   OccurrenceIndicator   ::=   ("*" |  "+" |  "?")?

1.3 Reserved Function Names

The following is a list of names that may not be used as user function names, in an unprefixed form.

1.4 Precedence Order

In all cases the grammar defines built-in precedence. Where a number of operators are a choice at the same production level, the expressions are always evaluated from left to right.
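How the productions encode precedence can be illustrated with a small recursive-descent evaluator for a fragment of AdditiveExpr [25], MultiplicativeExpr [26] and UnaryExpr [27]. This is a sketch under simplifying assumptions: only "+", "-" and "*" are supported, operands are integer literals or parenthesized expressions, and characters the tokenizer does not recognize are silently ignored. The function name evaluate is hypothetical.

```python
import re


def evaluate(text):
    """Evaluate a tiny arithmetic subset of the grammar over integers."""
    tokens = re.findall(r"[0-9]+|[-+*()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def additive():          # [25] MultiplicativeExpr (("+" | "-") MultiplicativeExpr)*
        value = multiplicative()
        while peek() in ("+", "-"):
            op = take()
            rhs = multiplicative()
            # The loop folds operands from left to right.
            value = value + rhs if op == "+" else value - rhs
        return value

    def multiplicative():    # [26], restricted to "*"
        value = unary()
        while peek() == "*":
            take()
            value *= unary()
        return value

    def unary():             # [27] ("-" | "+")* operand
        sign = 1
        while peek() in ("-", "+"):
            if take() == "-":
                sign = -sign
        return sign * primary()

    def primary():           # integer literal or parenthesized expression
        tok = take()
        if tok == "(":
            value = additive()
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        return int(tok)

    result = additive()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return result
```

Because MultiplicativeExpr sits below AdditiveExpr in the grammar, "2 + 3 * 4" groups as 2 + (3 * 4), and because each production loops over its operands in order, "10 - 4 - 3" groups as (10 - 4) - 3.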