This document is also available in these non-normative formats: XML.
Copyright © 2008 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This document defines the syntax and formal semantics of XQuery and XPath Full Text 1.0 which is a language that extends XQuery 1.0 [XQuery 1.0: An XML Query Language] and XPath 2.0 [XML Path Language (XPath) 2.0] with full-text search capabilities.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
W3C publishes a Candidate Recommendation, as described in the Process Document, to indicate that the document is believed to be stable and to encourage implementation by the developer community. The publication of this document constitutes a call for implementations of this specification.
This document has been jointly developed by the W3C XML Query Working Group and the W3C XSL Working Group, each of which is part of the XML Activity. It will remain a Candidate Recommendation until at least 15 September 2008. The Working Groups expect to advance this specification to Recommendation Status.
The XML Query Working Group and XSL Working Group intend to submit this document for consideration as a W3C Proposed Recommendation as soon as the following conditions are all met:
A test suite is available that tests each identified XQuery and XPath Full Text 1.0 feature, both required and optional.
Minimal Conformance to this specification, as defined in 5.1 Minimal Conformance, has been demonstrated by at least two distinct implementations, at least one of which uses the XQuery human-readable syntax defined in this specification.
An XPath Full Text parsing applet that generates XQueryX is available.
The Working Groups have responded formally to all issues raised during the CR period against this document.
Once the entrance criteria for Proposed Recommendation have been achieved, the Director will be requested to advance this document to Proposed Recommendation status. Working closely with the developer community, we expect to show evidence of implementations by approximately 15 September 2008.
The 15 optional features are each individually at risk. Optional features for which there are not at least two implementations at the end of the Candidate Recommendation period may be removed from this specification.
The WG believes that this document, published on 16 May 2008, is sufficiently mature and stable for the development community to begin developing implementation experience and reporting on that experience.
The WGs particularly solicit feedback regarding how thesauri are to be used in combination.
No implementation report currently exists. However, a Test Suite for this document is under development. Implementors are encouraged to run this test suite and report their results. The Test Suite can be found at http://dev.w3.org:/cvsweb/2007/xpath-full-text-10-test-suite/.
This document incorporates changes made against the Last Call Working Draft of 18 May 2007. Changes to this document since the Last Call Working Draft are detailed in J Change Log.
Please report errors in this document using W3C's public Bugzilla system (instructions can be found at http://www.w3.org/XML/2005/04/qt-bugzilla). If access to that system is not feasible, you may send your comments to the W3C XSLT/XPath/XQuery public comments mailing list, public-qt-comments@w3.org. It will be very helpful if you include the string “[FT]” in the subject line of your report, whether made in Bugzilla or in email. Please use multiple Bugzilla entries (or, if necessary, multiple email messages) if you have more than one comment to make. Archives of the comments and responses are available at http://lists.w3.org/Archives/Public/public-qt-comments/.
Publication as a Candidate Recommendation does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by groups operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the XML Query Working Group and also maintains a public list of any patent disclosures made in connection with the deliverables of the XSL Working Group; those pages also include instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
1 Introduction
1.1 Full-Text Search and XML
1.2 Organization of this document
1.3 A word about namespaces
2 Full-Text Extensions to XQuery and XPath
2.1 Processing Model
2.2 Full-Text Contains Expression
2.2.1 Description
2.2.2 Examples
2.3 Score Variables
2.3.1 Using Weights Within a Scored FTContainsExpr
2.4 Extensions to the Static Context
3 Full-Text Selections
3.1 Primary Full-Text Selections
3.2 Search Tokens and Phrases
3.3 Cardinality Selection
3.4 Match Options
3.4.1 Language Option
3.4.2 Wildcard Option
3.4.3 Thesaurus Option
3.4.4 Stemming Option
3.4.5 Case Option
3.4.6 Diacritics Option
3.4.7 Stop Word Option
3.4.8 Extension Option
3.5 Logical Full-Text Operators
3.5.1 Or-Selection
3.5.2 And-Selection
3.5.3 Mild-Not Selection
3.5.4 Not-Selection
3.6 Positional Filters
3.6.1 Ordered Selection
3.6.2 Window Selection
3.6.3 Distance Selection
3.6.4 Scope Selection
3.6.5 Anchoring Selection
3.7 Ignore Option
3.8 Extension Selections
4 Semantics
4.1 Tokenization
4.1.1 Examples
4.1.2 Representations of Tokenized Text and Matching
4.2 Evaluation of FTSelections
4.2.1 AllMatches
4.2.1.1 Formal Model
4.2.1.2 Examples
4.2.1.3 XML representation
4.2.2 XML Representation
4.2.3 The evaluate function
4.2.4 Formal semantics functions
4.2.5 FTWords
4.2.6 Match Options Semantics
4.2.6.1 Types
4.2.6.2 High-Level Semantics
4.2.6.3 Formal Semantics Functions
4.2.6.4 FTCaseOption
4.2.6.5 FTDiacriticsOption
4.2.6.6 FTStemOption
4.2.6.7 FTThesaurusOption
4.2.6.8 FTStopWordOption
4.2.6.9 FTLanguageOption
4.2.6.10 FTWildCardOption
4.2.7 Full-Text Operators Semantics
4.2.7.1 FTOr
4.2.7.2 FTAnd
4.2.7.3 FTUnaryNot
4.2.7.4 FTMildNot
4.2.7.5 FTOrder
4.2.7.6 FTScope
4.2.7.7 FTContent
4.2.7.8 FTWindow
4.2.7.9 FTDistance
4.2.7.10 FTTimes
4.3 FTContainsExpr
4.4 Scoring
4.5 Example
5 Conformance
5.1 Minimal Conformance
5.2 Optional Features
5.2.1 FTMildNot Operator
5.2.2 FTUnaryNot Operator
5.2.3 FTUnit and FTBigUnit
5.2.4 FTOrder Operator
5.2.5 FTScope Operator
5.2.6 FTWindow Operator
5.2.7 FTDistance Operator
5.2.8 FTTimes Operator
5.2.9 FTContent Operator
5.2.10 FTCaseOption
5.2.11 FTStopWordOption
5.2.12 FTLanguageOption
5.2.13 FTIgnoreOption
5.2.14 Scoring
5.2.15 Weights
A EBNF for XQuery 1.0 Grammar with Full-Text extensions
A.1 Terminal Symbols
A.2 Extra-grammatical Constraints
B EBNF for XPath 2.0 Grammar with Full-Text extensions
B.1 Terminal Symbols
C Static Context Components
D Error Conditions
E XML Syntax (XQueryX) for XQuery and XPath Full Text 1.0
E.1 XQueryX representation of XQuery and XPath Full Text 1.0
E.2 XQueryX stylesheet for XQuery and XPath Full Text 1.0
E.3 XQueryX for XQuery and XPath Full Text 1.0 example
E.3.1 Example
E.3.1.1 XQuery solution in XQuery and XPath Full Text 1.0 Use Cases:
E.3.1.2 A Solution in Full Text XQueryX:
E.3.1.3 Transformation of Full Text XQueryX Solution into XQuery Full Text
F References
F.1 Normative References
F.2 Non-normative References
G Acknowledgements (Non-Normative)
H Glossary (Non-Normative)
I Checklist of Implementation-Defined Features (Non-Normative)
J Change Log (Non-Normative)
This document defines the language and the formal semantics of XQuery and XPath Full Text 1.0. This language is designed to meet the requirements identified in W3C XQuery and XPath Full Text Requirements [XQuery and XPath Full Text Requirements] and to support the queries in the W3C XQuery and XPath Full Text Use Cases [XQuery and XPath Full Text Use Cases].
XQuery and XPath Full Text 1.0 extends the syntax and semantics of XQuery 1.0 and XPath 2.0.
Additionally, this document defines an XML syntax for XQuery and XPath Full Text 1.0. The most recent versions of the two XQueryX XML Schemas and the XQueryX XSLT stylesheet for XQuery and XPath Full Text 1.0 are available at http://www.w3.org/2007/xpath-full-text/xpath-full-text-10-xqueryx.xsd, http://www.w3.org/2007/xpath-full-text/xpath-full-text-10-xqueryx-ftmatchoption-extensions.xsd, and http://www.w3.org/2007/xpath-full-text/xpath-full-text-10-xqueryx.xsl, respectively.
As XML becomes mainstream, users expect to be able to search their XML documents. This requires a standard way to do full-text search, as well as structured searches, against XML documents. A similar requirement for full-text search led ISO to define the SQL/MM-FT [SQL/MM] standard. SQL/MM-FT defines extensions to SQL to express full-text searches providing functionality similar to that defined in this full-text language extension to XQuery 1.0 and XPath 2.0.
XML documents may contain highly structured data (fixed schemas, known types such as numbers, dates), semi-structured data (flexible schemas and types), markup data (text with embedded tags), and unstructured data (untagged free-flowing text). Where a document contains unstructured or semi-structured data, it is important to be able to search using Information Retrieval techniques such as scoring and weighting.
Full-text search is different from substring search in many ways:
A full-text search searches for tokens and phrases rather than substrings. A substring search for news items that contain the string "lease" will return a news item that contains "Foobar Corporation releases the 20.9 version ...". A full-text search for the token "lease" will not.
There is an expectation that a full-text search will support language-based searches which substring search cannot. An example of a language-based search is "find me all the news items that contain a token with the same linguistic stem as 'mouse'" (finds "mouse" and "mice"). Another example based on token proximity is "find me all the news items that contain the tokens 'XML' and 'Query' allowing up to 3 intervening tokens".
Full-text search must address the vagaries and nuances of language. Search results are often of varying usefulness. When you search a web site for cameras that cost less than $100, this is an exact search. There is a set of cameras that matches this search, and a set that does not. Similarly, when you do a string search across news items for "mouse", there is only 1 expected result set. When you do a full-text search for all the news items that contain the token "mouse", you probably expect to find news items containing the token "mice", and possibly "rodents", or possibly "computers". Not all results are equal. Some results are more "mousey" than others. Because full-text search may be inexact, we have the notion of score or relevance. We generally expect to see the most relevant results at the top of the results list.
Note:
As XQuery and XPath evolve, they may apply the notion of score to querying structured data. For example, when making travel plans or shopping for cameras, it is sometimes useful to get an ordered list of near matches in addition to exact matches. If XQuery and XPath define a generalized inexact match, we expect XQuery and XPath to utilize the scoring framework provided by XQuery and XPath Full Text.
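The difference between substring search and token-based search described above can be illustrated with a minimal Python sketch. The tokenizer here (lower-cased runs of ASCII letters and digits) is an assumption made for illustration only; actual tokenization is implementation-defined, and the function names are hypothetical.

```python
import re

def tokenize(text):
    # Illustrative tokenizer (an assumption): lower-cased runs of ASCII
    # letters and digits. Real tokenizers are implementation-defined.
    return re.findall(r"[a-z0-9]+", text.lower())

def substring_search(text, needle):
    # Matches anywhere inside the character string.
    return needle.lower() in text.lower()

def token_search(text, token):
    # Matches only whole tokens produced by the tokenizer.
    return token.lower() in tokenize(text)

news_item = "Foobar Corporation releases the 20.9 version ..."

found_as_substring = substring_search(news_item, "lease")  # inside "releases"
found_as_token = token_search(news_item, "lease")          # no such token
```

Under this sample tokenization, the substring search finds "lease" inside "releases", while the token search does not.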
[Definition: Full-text queries are performed on tokens and phrases. Tokens and phrases are produced via tokenization.] Informally, tokenization breaks a character string into a sequence of tokens, units of punctuation, and spaces.
Tokenization, in general terms, is the process of converting a text string into smaller units that are used in query processing. Those units, called tokens, are the most basic text units that a full-text search can refer to. Full-text operators typically work on sequences of tokens found in the target text of a search. These tokens are characterized by integers that capture the relative position(s) of the token inside the string, the relative position(s) of the sentence containing the token, and the relative position(s) of the paragraph containing the token. The positions typically comprise a start and an end position.
Tokenization, including the definition of the term "tokens", SHOULD be implementation-defined. Implementations SHOULD expose the rules and sample results of tokenization as much as possible to enable users to predict and interpret the results of tokenization. Tokenization is defined more formally in 4.1 Tokenization.
[Definition: A token is a non-empty sequence of characters returned by a tokenizer as a basic unit to be searched. Beyond that, tokens are implementation-defined.] [Definition: A phrase is an ordered sequence of any number of tokens. Beyond that, phrases are implementation-defined.]
Note:
Consecutive tokens need not be separated by either punctuation or space, and tokens may overlap.
Note:
In some natural languages, tokens and words can be used interchangeably.
[Definition: A sentence is an ordered sequence of any number of tokens. Beyond that, sentences are implementation-defined. A tokenizer is not required to support sentences.]
[Definition: A paragraph is an ordered sequence of any number of tokens. Beyond that, paragraphs are implementation-defined. A tokenizer is not required to support paragraphs.]
Some XML elements represent semantic markup, e.g., <title>. Others represent formatting markup, e.g., <b> to indicate bold. Semantic markup serves well as token boundaries. Some formatting markup serves well as token boundaries, for example, paragraphs are most commonly delimited by formatting markup. Other formatting markup may not serve well as token boundaries. Implementations are free to provide implementation-defined ways to differentiate between the markup's effect on token boundaries during tokenization. In the absence of an implementation-defined way to differentiate, element markup (start tags, end tags, and empty-element tags) creates token boundaries.
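The default rule (element markup creates token boundaries) can be sketched in Python as follows. This is an illustration under stated assumptions, not a normative algorithm: the word pattern `\w+` and the function names are hypothetical, and an implementation may treat formatting markup differently.

```python
import re
import xml.etree.ElementTree as ET

def tokens_with_markup_boundaries(xml_text):
    # Default behaviour sketched here: every start tag, end tag, and
    # empty-element tag ends the current token, so each text chunk is
    # tokenized on its own.
    root = ET.fromstring(xml_text)
    tokens = []
    for chunk in root.itertext():
        tokens.extend(re.findall(r"\w+", chunk))
    return tokens

def tokens_ignoring_markup(xml_text):
    # A hypothetical implementation-defined alternative: markup is
    # transparent, so text on both sides of a tag can form one token.
    root = ET.fromstring(xml_text)
    return re.findall(r"\w+", "".join(root.itertext()))

sample = "<p>over<b>load</b>ed</p>"
with_boundaries = tokens_with_markup_boundaries(sample)
without_boundaries = tokens_ignoring_markup(sample)
```

With boundaries, the sample yields three tokens; when the `<b>` markup is treated as transparent, it yields the single token "overloaded".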
A sample tokenization is used for the examples in this document. The results might be different for other tokenizations.
Tokenization enables functions and operators that operate on a part or the root of the token (e.g., wildcards, stemming).
Tokenization enables functions and operators which work with the relative positions of tokens (e.g., proximity operators).
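As an illustration of how token positions support proximity operators, the following Python sketch assigns each token a relative token, sentence, and paragraph position and then tests whether two tokens occur with at most a given number of intervening tokens. The sentence and paragraph rules (sentence-ending punctuation, blank lines) and all names are assumptions; a conforming tokenizer need not support sentences or paragraphs at all.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    text: str
    pos: int        # relative position of the token
    sentence: int   # relative position of the containing sentence
    para: int       # relative position of the containing paragraph

def tokenize(text):
    # Assumed rules: paragraphs are separated by blank lines, sentences
    # end with '.', '!' or '?', tokens are lower-cased word characters.
    tokens = []
    pos = 0
    sentence = 0
    for para, block in enumerate(re.split(r"\n\s*\n", text), start=1):
        for sent in re.split(r"(?<=[.!?])\s+", block.strip()):
            words = re.findall(r"\w+", sent.lower())
            if not words:
                continue
            sentence += 1
            for word in words:
                pos += 1
                tokens.append(Token(word, pos, sentence, para))
    return tokens

def within_distance(tokens, a, b, max_intervening):
    # True if some occurrence of a and some occurrence of b are separated
    # by at most max_intervening tokens.
    pos_a = [t.pos for t in tokens if t.text == a.lower()]
    pos_b = [t.pos for t in tokens if t.text == b.lower()]
    return any(x != y and abs(x - y) - 1 <= max_intervening
               for x in pos_a for y in pos_b)

text = ("XML is everywhere. A Query language for it exists.\n\n"
        "Full-text search extends the Query language.")
tokens = tokenize(text)
near = within_distance(tokens, "XML", "Query", 3)
```

Here "XML" is at position 1 and the first "Query" at position 5, so exactly three tokens intervene and the proximity test with a limit of 3 succeeds.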
This specification focuses on functionality that serves all languages. It also selectively includes functionalities useful within specific families of languages. For example, searching within sentences and paragraphs is useful to many western languages and to some non-western languages, so that functionality is incorporated into this specification.
Certain aspects of language processing are described in this specification as implementation-defined or implementation-dependent.
[Definition: Implementation-defined indicates an aspect that may differ between implementations, but must be specified by the implementor for each particular implementation.]
[Definition: Implementation-dependent indicates an aspect that may differ between implementations, is not specified by this or any W3C specification, and is not required to be specified by the implementor for any particular implementation.]
This document is organized as follows. We first present a high-level syntax for the XQuery and XPath Full Text 1.0 language along with some examples. Then, we present the syntax and examples of the basic primitives in the XQuery and XPath Full Text 1.0 language. This is followed by the semantics of the XQuery and XPath Full Text 1.0 language. The appendices provide an EBNF for the XPath 2.0 Grammar with Full-Text Extensions, an EBNF for the XQuery 1.0 Grammar with Full-Text Extensions, acknowledgements, and a glossary.
Certain namespace prefixes are predeclared by XQuery 1.0 and, by implication, by this specification, and bound to fixed namespace URIs. These namespace prefixes are as follows:
xml = http://www.w3.org/XML/1998/namespace
xs = http://www.w3.org/2001/XMLSchema
xsi = http://www.w3.org/2001/XMLSchema-instance
fn = http://www.w3.org/2005/xpath-functions
local = http://www.w3.org/2005/xquery-local-functions
In addition to the prefixes in the above list, this document
uses the prefix err to represent the namespace URI
http://www.w3.org/2005/xqt-errors. This namespace
prefix is not predeclared and its use in this document is not
normative. Error codes that are not defined in this document are
defined in other XQuery 1.0 and XPath 2.0 specifications,
particularly [XML Path Language (XPath) 2.0]
and [XQuery 1.0 and XPath 2.0 Functions
and Operators].
Finally, this document uses the prefix fts to
represent a namespace containing a number of functions used in this
document to describe the semantics of XQuery and XPath Full Text
functions. There is no requirement that these functions be
implemented, therefore no URI is associated with that prefix.
XQuery and XPath Full Text extends the languages of XQuery 1.0 and XPath 2.0 in three ways. It:
Adds a new expression called FTContainsExpr;
Enhances the syntax of FLWOR expressions in XQuery 1.0 and
for expressions in XPath 2.0 with optional score
variables; and
Adds static context declarations for full-text match options to the query prolog.
Additionally, it extends the data model and processing models in various ways.
A full-text contains expression (2.2 Full-Text Contains Expression) is composed of several parts:
An XPath 2.0 or XQuery 1.0 expression (RangeExpr) that specifies the sequence of items to be searched. [Definition: Those items are called the search context.]
The full-text selection to be applied (3 Full-Text Selections). Full-text selections are, syntactically and semantically, fully composable and contain:
Required:
Tokens and phrases for which a search is performed (3.2 Search Tokens and Phrases).
Optional:
Match options, such as indicators for case sensitivity and stop words (3.4 Match Options);
Boolean full-text operators, that compose a full-text selection from simpler full-text selections (3.5 Logical Full-Text Operators);
Other full-text operators that are constraints on the positions of matches, such as indicators for distance between tokens and for the cardinality of matches (3.6 Positional Filters and 3.3 Cardinality Selection); and
The weighting information. Each individual search term in a full-text selection may be annotated with optional weight information. This information may be used during the evaluation of the full-text selections to calculate scoring, information that quantifies the relevance of the result to the given search criteria.
An optional XPath 2.0 or XQuery 1.0 expression (UnionExpr) that specifies the set of nodes, descendants of the RangeExpr, whose contents must be ignored for the purpose of determining a match during the search (3.7 Ignore Option).
The results of the evaluation of the full-text selection operators are instances of the AllMatches model, which complements the XQuery Data Model (XDM) for processing full-text queries. An AllMatches instance describes all possible solutions to the full-text query for a given search context item. Each solution is described by a Match instance. A Match instance contains the tokens from the search context item that must be included (described using StringInclude instances, which model the positive terms) and the tokens from the search context item that must be excluded (described using StringExclude instances, which model the negative terms). Each negative or positive term is modeled as a tuple: the position of the query token or phrase in the full-text selection, and a TokenInfo structure that describes a set of tokens in the text string which match the query token or phrase.
Figure 1 provides a schematic overview of the XQuery and XPath Full Text processing steps that are discussed in detail below. Some of these steps are completely outside the domain of XQuery; in Figure 1, these are depicted outside the black line that represents the boundaries of the language. The diagram shows only the central pieces of the XQuery Processing Model (see Section 2.2 Processing Model of [XQuery 1.0: An XML Query Language]), but zooms in on the Execution Engine, where the processing of the full-text extensions takes place. The full-text processing steps are labeled as FTn within the diagram and are referenced within the text.
Like all XQuery expressions, an FTContainsExpr returns an XDM Instance (see Fig. 1). With the exception of FTWords, which consumes TokenInfos, all full-text selections are closed under the AllMatches data model, i.e., their input and output are AllMatches instances. Tokenization transforms an XDM instance into TokenInfos, which ultimately get converted into AllMatches instances by the evaluation of full-text selections. Thus, the evaluation of nested full-text and XQuery expressions moves back and forth between these two models.
The resulting AllMatches instance obtained by the evaluation of an FTContainsExpr is converted into a Boolean value before being returned to the enclosing XPath or XQuery operation as follows. If at least one member of the disjunction contains only positive terms, then the value returned is true. If all members of the disjunction contain negative terms, the result is false.
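The conversion of an AllMatches instance into a Boolean value can be sketched in Python as follows. The class and field names are simplified assumptions that mirror the terms used above (Match, StringInclude, StringExclude); the formal model is given in Section 4 Semantics, and this sketch represents each term by a plain string only.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Match:
    # Positive terms (StringInclude) and negative terms (StringExclude),
    # reduced to plain strings for illustration.
    string_includes: List[str] = field(default_factory=list)
    string_excludes: List[str] = field(default_factory=list)

@dataclass
class AllMatches:
    # A disjunction of Match instances.
    matches: List[Match] = field(default_factory=list)

def to_boolean(all_matches):
    # True if at least one member of the disjunction has no negative terms.
    return any(not m.string_excludes for m in all_matches.matches)

mixed = AllMatches([Match(["dog"], ["cat"]), Match(["dog"])])
only_negative = AllMatches([Match(["dog"], ["cat"])])
```

In this sketch, `mixed` converts to true because its second Match has only positive terms, while `only_negative` converts to false.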
Weighting information, in an implementation-dependent fashion, may be used when calculating the scoring information computed and made available by FTContainsExpr to the optional score construct.
Given the components of a full-text contains expression, the evaluation algorithm will proceed according to the following steps, also referenced in the processing model diagram as steps FTn (see Fig. 1):
Evaluate the search context expression (resulting in the sequence of search context items), the ignore option, if any (resulting in the set of ignored nodes), and any other XQuery/XPath expressions nested within the full-text contains expression. (FT1)
Tokenize the query string(s). (FT2.1)
For each search context item:
Delete the ignored nodes from the search context item.
Tokenize the result of the previous step. This produces a sequence of tokens. (FT2.2) Note that implementations may (as an optimization) perform tokenization as part of the External Processing that is described in the XQuery Processing Model, when an XML document is parsed into an Infoset/PSVI and ultimately into an XQuery Data Model instance.
Evaluate the FTSelection against the tokens of the search context. (FT3, FT4)
Convert the topmost AllMatches instances into a Boolean value. (FT5)
The additional scoring information (also part of FT5) that is produced by the evaluation of the full-text contains expression is implementation-dependent and is not specified in this document. The scoring information is made available at the same time the Boolean value is returned.
(A more detailed version of the above procedure appears in Section 4.3 FTContainsExpr.)
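The evaluation steps listed above can be sketched end to end in Python for a deliberately restricted case. The sketch assumes that the full-text selection is a plain conjunction of single tokens, that ignored nodes are identified by element name (the actual ignore option is an expression selecting nodes), and that tokens are lower-cased word characters. All function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def tokenize(text):
    # Sample tokenization (an assumption): lower-cased word characters.
    return re.findall(r"\w+", text.lower())

def tokens_of(item, ignored_tags):
    # Tokenize an item while skipping the content of ignored elements.
    # Text following an ignored element (its tail) is still searched.
    tokens = tokenize(item.text or "")
    for child in item:
        if child.tag not in ignored_tags:
            tokens.extend(tokens_of(child, ignored_tags))
        tokens.extend(tokenize(child.tail or ""))
    return tokens

def ft_contains(items, query_words, ignored_tags=frozenset()):
    # items and ignored_tags stand for the results of step FT1.
    query_tokens = [t for w in query_words for t in tokenize(w)]  # FT2.1
    for item in items:
        item_tokens = set(tokens_of(item, ignored_tags))          # FT2.2
        if all(q in item_tokens for q in query_tokens):           # FT3, FT4
            return True                                           # FT5
    return False

doc = ET.fromstring(
    "<books>"
    "<book><title>Web usability</title><note>testing</note></book>"
    "<book><title>Cooking</title></book>"
    "</books>"
)
books = doc.findall("book")
```

For example, `ft_contains(books, ["testing"])` succeeds, but the same search with `ignored_tags={"note"}` fails because the only occurrence of "testing" lies inside an ignored element.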
Section 3 Full-Text Selections describes the syntax and the informal semantics of full-text operators. Their formal semantics as well as the formal definition of the AllMatches data model are given in Section 4 Semantics.
[Definition: A full-text contains expression is an expression that evaluates a sequence of items against a full-text selection. ]
As a syntactic construct, a full-text contains expression (grammar symbol: FTContainsExpr) behaves like a comparison expression (see Section 3.5.2 General Comparisons of [XQuery 1.0: An XML Query Language]). This grammar rule introduces FTContainsExpr.
[50] ComparisonExpr ::= FTContainsExpr ( (ValueComp | GeneralComp | NodeComp) FTContainsExpr )?
A full-text contains expression may be used anywhere a
ComparisonExpr may be used. The ftcontains operator
has higher precedence than other comparison operators, so the
results of ftcontains expressions may be compared
without enclosing them in parentheses.
[51] FTContainsExpr ::= RangeExpr ( "ftcontains" FTSelection FTIgnoreOption? )?
A full-text contains expression returns a Boolean value. It returns true if there is some item returned by the RangeExpr that, after tokenization, matches the full-text selection FTSelection. See Section 3 Full-Text Selections for more details. For the purpose of determining a match, certain descendants of nodes (identified by FTIgnoreOption) in the RangeExpr may be ignored, as specified in Section 3.7 Ignore Option.
An XQuery and XPath Full Text processor SHOULD try to use the information available in xml:lang for processing of collations, as well as the various match options defined in Section 3.4 Match Options.
The following example in XQuery Full Text returns the author of
each book with a title containing a token with the same root as
dog and the token cat.
for $b in /books/book
where $b/title ftcontains ("dog" with stemming) ftand "cat"
return $b/author
The same example in XPath Full Text is written as:
/books/book[title ftcontains ("dog" with stemming) ftand "cat"]/author
In the next example a ComparisonExpr is combined with an
FTContainsExpr using the logical XQuery operator and.
The query selects books that have a price of less than 50 and a
title which contains a token with the same root as
train:
/books/book[price < 50 and title ftcontains ("train" with stemming)]
The following example shows the combination of two
ftcontains expressions the results of which are
compared using the not-equals operator. The query selects books
where either the title contains the token dog and the
token cat and the content does not contain a token
with the same root as train, or where the title fails
to have one of the matching tokens but the content does:
/books/book[title ftcontains "dog" ftand "cat" ne
content ftcontains ("train" with stemming)]
Besides specifying a match of a full-text query as a Boolean condition, full-text query applications typically also have the ability to associate scores with the results. [Definition: The score of a full-text query result expresses its relevance to the search conditions.]
XQuery and XPath Full Text extends the languages of XQuery 1.0
and XPath 2.0 further by adding optional score
variables to the for and let clauses of
FLWOR expressions.
The production for the extended for clause in
XQuery 1.0 follows.
[35] ForClause ::= "for" "$" VarName TypeDeclaration? PositionalVar? FTScoreVar? "in" ExprSingle ("," "$" VarName TypeDeclaration? PositionalVar? FTScoreVar? "in" ExprSingle)*
[37] FTScoreVar ::= "score" "$" VarName
In XPath 2.0, the SimpleForClause is extended similarly.
When a score variable is present in a
for clause the evaluation of the expression following
the in keyword not only needs to determine the result
sequence of the expression, i.e., the sequence of items which are
iteratively bound to the for variable. It must also
determine in each iteration the relevance "score" value of the
current item and bind the score variable to that
value.
The semantics of scoring and how it relates to second-order functions is discussed in Section 4.4 Scoring.
In the following example book elements are
determined that satisfy the condition [content ftcontains
"web site" ftand "usability" and .//chapter/title ftcontains
"testing"]. The scores assigned to the book
elements are returned.
for $b score $s
in /books/book[content ftcontains "web site" ftand "usability"
and .//chapter/title ftcontains "testing"]
return $s
The example above is also a legal example of the XPath 2.0 extension.
Scores are typically used to order results, as in the following, more complete example.
for $b score $s
in /books/book[content ftcontains "web site" ftand "usability"]
where $s > 0.5
order by $s descending
return <result>
<title> {$b//title} </title>
<score> {$s} </score>
</result>
Note that the score variable gets one score value for
each item in the value of the expression after the in
keyword, regardless of the number of FTContainsExprs in that
expression. In the following example, two separate full-text
contains expressions are used to select the matching paragraphs.
There is still just one score for each para returned.
The highest scoring paragraphs will be returned first:
for $p score $s in //book[title ftcontains "software"]/para[. ftcontains "usability"]
order by $s descending
return $p
The following more elaborate example uses multiple score variables to return the matching paragraphs ordered so that those from the highest scoring books precede those from the lowest scoring books, where the highest scoring paragraphs of each book are returned before the lower scoring paragraphs of that book:
for $b score $score1 in //book[title ftcontains "software"]
order by $score1 descending
return
for $p score $score2 in $b/para[. ftcontains "usability"]
order by $score2 descending
return $p
The score variable is bound to a value which
reflects the relevance of the match criteria in the full-text
selections to the items returned by the respective RangeExprs. The
calculation of relevance is implementation-dependent,
but score evaluation must follow these rules:
Score values are of type xs:double in the range [0,
1].
For score values greater than 0, a higher score must imply a higher degree of relevance.
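The two rules above constrain only the range and ordering of scores, not how they are computed. As a hedged illustration, the following Python sketch uses one hypothetical relevance measure (the fraction of distinct query tokens found in the text) that stays within [0, 1], and then filters and orders results in the same way as the earlier FLWOR example. The measure, the data, and the names are assumptions; real scoring algorithms are implementation-dependent.

```python
import re

def tokenize(text):
    # Sample tokenization (an assumption): lower-cased word characters.
    return re.findall(r"\w+", text.lower())

def score(text, query_tokens):
    # Hypothetical relevance measure: the fraction of distinct query
    # tokens that occur in the text. Always a float in [0, 1].
    wanted = {q.lower() for q in query_tokens}
    if not wanted:
        return 0.0
    found = wanted & set(tokenize(text))
    return len(found) / len(wanted)

# Hypothetical book contents keyed by title.
books = {
    "A": "web site usability testing",
    "B": "web site design",
    "C": "gardening",
}
query = ["web", "site", "usability"]

scores = {title: score(content, query) for title, content in books.items()}

# Equivalent in spirit to: where $s > 0.5 order by $s descending
ranked = sorted(
    ((title, s) for title, s in scores.items() if s > 0.5),
    key=lambda pair: pair[1],
    reverse=True,
)
```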
Similarly to their use in a for clause, score
variables may be specified in a let clause. A score
variable in a let clause is also bound to the score of
the expression evaluation, but in the let clause one
score is determined for the complete result.
The production for the extended let clause
follows.
[38] LetClause ::= (("let" "$" VarName TypeDeclaration?) | ("let" "score" "$" VarName)) ":=" ExprSingle ("," (("$" VarName TypeDeclaration?) | FTScoreVar) ":=" ExprSingle)*
When using the score option in a for clause the
expression following the in keyword has the dual
purpose of filtering, i.e., driving the iteration, and determining
the scores. It is possible to separately specify expressions for
filtering and scoring by combining a simple for clause
with a let clause that uses scoring. The following is
an example of this.
for $b in /books/book[.//chapter/title ftcontains "testing"]
let score $s := $b/content ftcontains "web site" ftand "usability"
order by $s descending
return <result score="{$s}">{$b}</result>
This example returns book elements with chapter
titles that contain "testing". Along with the book
elements scores are returned. These scores, however, reflect
whether the book content contains "web site" and "usability".
Note that it is not a requirement of the score of an
FTContainsExpr to be 0, if the expression evaluates to false, nor
to be non-zero, if the expression evaluates to true. Hence, in the
example above it is not possible to infer the Boolean value of the
FTContainsExpr in the let clause from the calculated
score of a returned result element. For instance, an
implementation may want to assign a non-zero score to a book that
contained "web site", but not "usability", as this may be
considered more relevant than a book that does not contain "web
site" or "usability".
The expression ExprSingle associated with the score variable is passed to the scoring algorithm. The scoring algorithm calculates the score value based on the passed expression (not on the value returned by evaluating the expression). The set of expressions supported by the scoring algorithm is implementation-defined. If an expression not supported by the scoring algorithm is passed to the scoring algorithm, the result is implementation-defined.
The use of score variables introduces a
second-order aspect to the evaluation of expressions which cannot
be emulated by (first-order) XQuery functions. Consider replacing
the clause let score $s := FTContainsExpr
with the following:
let $s := score(FTContainsExpr)
where a function score is applied to some
FTContainsExpr. If the function score were
first-order, it would only be applied to the result of the
evaluation of its argument, which is one of the Boolean constants
true or false. Hence, there would be at
most two possible values such a score function would
be able to return and no further differentiation would be
possible.
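The difference can be sketched in Python. This is only an illustration under stated assumptions: both functions and the sample strings are hypothetical, and the fraction-of-phrases formula is an invented stand-in for an implementation-defined scoring algorithm.

```python
def first_order_score(result):
    """A first-order function sees only the Boolean result of the expression."""
    return 1.0 if result else 0.0

def second_order_score(phrases, text):
    """A scorer that sees the search phrases can grade partial matches."""
    lowered = text.lower()
    hits = sum(1 for phrase in phrases if phrase in lowered)
    return hits / len(phrases)

phrases = ["web site", "usability"]
contents = [
    "The usability of a Web site",  # contains both phrases
    "A Web site for gardeners",     # contains only "web site"
    "Growing marigolds",            # contains neither phrase
]

# First-order: evaluate the conjunction to a Boolean, then score the Boolean.
first = [first_order_score(all(p in c.lower() for p in phrases)) for c in contents]
# Second-order: score from the phrases themselves.
second = [second_order_score(phrases, c) for c in contents]
```

The first-order variant can never distinguish the second text from the third, because both evaluate to false before the function is applied.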
[Definition: Scoring may be influenced by adding weight declarations to search tokens, phrases, and expressions.] Weight declarations are introduced syntactically in the FTSelection production, described in Section 3 Full-Text Selections.
The weight MUST have an absolute value between 0.0 and 1000.0 inclusive.
The weights assigned are not related to any absolute standard, but typically have a relationship to other weights within the same FTContains expression.
The effect of weights on the resulting score is implementation-dependent. However, scoring algorithms MUST conform to these constraints:
When no explicit weight is specified, the default weight is 1.0; and
Weight declarations in an FTContainsExpr for which no scores are evaluated are ignored.
The following example illustrates how different weights can be used for different search terms.
for $b in /books/book
let score $s := $b/content ftcontains ("web site" weight 0.5)
ftand ("usability" weight 2)
return <result score="{$s}">{$b}</result>
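The two fixed rules for weights (the default of 1.0 and the permitted range of the absolute value) can be sketched as a small Python check. The function name is hypothetical, and ValueError merely stands in for whatever error an implementation raises.

```python
def effective_weight(weight=None):
    """Return the weight to use for a search term, defaulting to 1.0."""
    if weight is None:
        return 1.0  # default when no explicit weight is specified
    value = float(weight)
    # The constraint applies to the absolute value: 0.0 <= |w| <= 1000.0.
    if not 0.0 <= abs(value) <= 1000.0:
        raise ValueError(f"weight {value} is outside the permitted range")
    return value
```

How the accepted weight then influences the score remains implementation-dependent and is deliberately not modelled here.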
The XQuery Static Context is extended with a component for each full-text match option group. The settings of these components can be changed by using the following declaration syntax in the Prolog.
[6] Prolog ::= ((DefaultNamespaceDecl | Setter | NamespaceDecl | Import | FTOptionDecl) Separator)* ((VarDecl | FunctionDecl | OptionDecl) Separator)*
[14] FTOptionDecl ::= "declare" "ft-option" FTMatchOptions
Match options modify the match semantics of full-text expressions. They are described in detail in Section 3.4 Match Options. When a match option is specified explicitly in a full-text expression, it overrides the setting of the respective component in the static context.
This section describes the full-text selections which contain the full-text operators in a full-text contains expression (FTContainsExpr), as well as the match options which modify the matching semantics of the full-text selections. In the following, the syntax for each type of full-text selection is given together with an informal statement of its meaning.
[Definition: A full-text selection specifies the conditions of a full-text search. ]
[144] FTSelection ::= FTOr FTPosFilter* ("weight" RangeExpr)?
As shown in the grammar, a full-text selection consists of
search conditions possibly involving logical operators (FTOr) followed by an arbitrary number of
positional filters (FTPosFilter) optionally followed by a
"weight" value which is specified using a range expression. The
RangeExpr is evaluated, as if it were an argument to a function
with an expected type xs:double; it must be between
0.0 and 1000.0 inclusive.
The syntax and semantics of the individual full-text selection operators follow.
This XML document is the source document for examples in this section.
<books>
<book number="1">
<title shortTitle="Improving Web Site Usability">Improving
the Usability of a Web Site Through Expert Reviews and
Usability Testing</title>
<author>Millicent Marigold</author>
<author>Montana Marigold</author>
<editor>Véra Tudor-Medina</editor>
<content>
<p>The usability of a Web site is how well the
site supports the users in achieving specified
goals. A Web site should facilitate learning,
and enable efficient and effective task
completion, while propagating few errors.
</p>
<note>This book has been approved by the Web Site
Users Association.
</note>
</content>
</book>
</books>
Tokenization is implementation-defined. A sample
tokenization is used for the examples in this section. This sample
tokenization uses white space, punctuation and XML tags as
word-breakers and <p> for paragraph boundaries.
The results may be different for other tokenizations.
The first five tokens in this example using the sample tokenization would be "Improving", "the", "Usability", "of", and "a".
Unless stated otherwise, the results assume a case-insensitive match.
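A minimal Python sketch of such a sample tokenization follows. It is one possible reading of "white space, punctuation and XML tags as word-breakers", not a normative algorithm. The function name is hypothetical, and paragraph boundaries are not modelled.

```python
import re

def sample_tokenize(xml_text):
    """Split text into tokens, treating tags, white space, and punctuation as breakers."""
    # Replace each tag (including its attributes) with a space so it breaks words.
    without_tags = re.sub(r"<[^>]*>", " ", xml_text)
    # Every run of word characters becomes one token.
    return re.findall(r"\w+", without_tags)

title = ('<title shortTitle="Improving Web Site Usability">Improving\n'
         'the Usability of a Web Site Through Expert Reviews and\n'
         'Usability Testing</title>')
tokens = sample_tokenize(title)
first_five = tokens[:5]
```

Because the attribute value sits inside the start tag, it is discarded along with the tag and contributes no tokens in this sketch.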
[150] FTPrimary ::= (FTWords FTTimes?) | ("(" FTSelection ")") | FTExtensionSelection
[Definition: A primary full-text selection is the basic form of a full-text selection. It specifies tokens and phrases as search conditions (FTWords), optionally followed by a cardinality constraint (FTTimes). An FTSelection in parentheses and the FTExtensionSelection are also primary full-text selections.]
[151] FTWords ::= FTWordsValue FTAnyallOption?
[152] FTWordsValue ::= Literal | ("{" Expr "}")
[154] FTAnyallOption ::= ("any" "word"?) | ("all" "words"?) | "phrase"
FTWords finds matches that contain the specified tokens and phrases.
FTWords consists of two parts: a mandatory FTWordsValue part and an optional FTAnyallOption part. FTWordsValue specifies the tokens and phrases that must be contained in the matches. FTAnyallOption specifies how containment is checked.
In general, the tokens and phrases in FTWordsValue are specified using a nested XQuery expression. To simplify notation, the enclosing braces may be omitted if FTWordsValue consists of a single literal.
The following rules specify how an FTWordsValue matches tokens and
phrases. First, the FTWordsValue is converted to a
sequence of strings as though it were an argument to a function
with the expected type of xs:string*. Then, each of
those strings is tokenized into a sequence of tokens as described
in Section 4.1 Tokenization. Then,
FTAnyallOption is
checked.
If FTAnyallOption is "any", the sequence of tokens for each string is considered as a phrase, i.e. a match is found in the tokenized form of the text being searched, whenever that form contains a subsequence of tokens that corresponds to the sequence of query tokens in an implementation-defined way and that subsequence of tokens covers consecutive token positions in the tokenized text. If the value of the FTWordsValue contains more than one string, the different strings are considered to be alternatives, i.e. the resulting matches must contain at least one of the generated phrases.
If FTAnyallOption is "all", the sequence of tokens for each string is considered as a phrase. The resulting matches must contain all of the generated phrases.
If FTAnyallOption is "phrase", the tokens from all the strings are concatenated in a single sequence, which is considered as a phrase. The resulting matches must contain the generated phrase.
If FTAnyallOption is "any word", the tokens from all the strings are combined into a single set. The resulting matches must contain at least one of the tokens in the set.
If FTAnyallOption is "all words", the tokens from all the strings are combined into a single set. The resulting matches must contain all of the tokens in the set.
If the FTWordsValue evaluates to a single string, the use of "any", "all", and "phrase" in FTAnyallOption produces the same results.
If FTAnyallOption is omitted, "any" is the default.
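The five FTAnyallOption rules can be sketched in Python as a Boolean test over token lists. This is a simplified illustration under stated assumptions: the function names are hypothetical, tokenization is reduced to splitting on white space, and matching is case-insensitive string equality rather than an implementation-defined correspondence.

```python
def contains_phrase(text, phrase):
    """True if `phrase` occurs in `text` at consecutive token positions."""
    n = len(phrase)
    return any(text[i:i + n] == phrase for i in range(len(text) - n + 1))

def ft_words(text, strings, option="any"):
    """Check the query strings against the text tokens under one FTAnyallOption."""
    text = [t.lower() for t in text]
    phrases = [s.lower().split() for s in strings]   # one phrase per string
    all_tokens = [t for p in phrases for t in p]     # tokens of all strings
    if option == "any":
        return any(contains_phrase(text, p) for p in phrases)
    if option == "all":
        return all(contains_phrase(text, p) for p in phrases)
    if option == "phrase":
        return contains_phrase(text, all_tokens)
    if option == "any word":
        return any(t in text for t in all_tokens)
    if option == "all words":
        return all(t in text for t in all_tokens)
    raise ValueError(f"unknown option: {option}")
```

With a single query string, "any", "all", and "phrase" all reduce to the same phrase test, which is the equivalence stated above.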
The following expression returns the sample book
element, because its title element contains the token
"Expert":
//book[./title ftcontains "Expert"]
The following expression returns the sample book
element, because its title element contains the phrase
"Expert Reviews":
//book[./title ftcontains "Expert Reviews"]
The following expression returns the sample book
element, because its title element contains the two
tokens "Expert" and "Reviews":
//book[./title ftcontains {"Expert", "Reviews"} all]
The following expression returns false for our sample document,
because the p element doesn't contain the phrase "Web
Site Usability" although it contains all of the tokens in the
phrase:
//book//p ftcontains "Web Site Usability"
The following expression returns book numbers of
book elements by "Marigold" with a title about "Web
Site Usability", sorting them in descending score order:
for $book in /books/book[.//author ftcontains "Marigold"]
let score $score := $book/title ftcontains "Web Site Usability"
where $score > 0.8
order by $score descending
return $book/@number
[155] FTTimes ::= "occurs" FTRange "times"
[Definition: A cardinality selection consists of an FTWords followed by the FTTimes postfix operator.] A cardinality selection selects matches for which the operand FTWords is matched a specified number of times.
A cardinality selection limits the number of different matches of FTWords within the specified range. The semantics of FTRange are described in 3.6.3 Distance Selection.
In the document fragment "very very big":
The FTWords "very
big" has 1 match consisting of the second "very" and
"big".
The FTWords {"very",
"big"} all has 2 matches; one consisting of the first "very"
and "big", and the other containing the second "very" and
"big".
The FTWords {"very",
"big"} any has 3 matches.
The following expression returns the example book
element's number, because the book element contains 2
or more occurrences of "usability":
//book[. ftcontains "usability" occurs at least 2 times]/@number
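The match counts given above for the fragment "very very big" can be reproduced with a short Python sketch. It assumes, as a simplification, that a match for "any" is one occurrence of one phrase and that a match for "all" is one combination of an occurrence of each phrase. The function names are hypothetical.

```python
from itertools import product

def phrase_positions(text, phrase):
    """Start positions at which `phrase` occurs as consecutive tokens of `text`."""
    n = len(phrase)
    return [i for i in range(len(text) - n + 1) if text[i:i + n] == phrase]

def count_matches(text, strings, option="any"):
    """Count the distinct matches of the query strings under "any" or "all"."""
    per_string = [phrase_positions(text, s.split()) for s in strings]
    if option == "any":
        # Each occurrence of each phrase is a separate match.
        return sum(len(positions) for positions in per_string)
    if option == "all":
        # Each way of picking one occurrence per phrase is a separate match.
        return len(list(product(*per_string)))
    raise ValueError(f"unsupported option: {option}")

fragment = "very very big".split()
```

An FTTimes constraint such as "occurs at most 2 times" then amounts to comparing this count against the bounds of the range.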
The following expression returns the empty sequence, because
there are 3 occurrences of {"usability", "testing"}
any in the designated title:
//book[@number="1" and title ftcontains {"usability",
"testing"} any occurs at most 2 times]
Full-text match options modify the matching behaviour of the primary full-text selection to which they are applied.
[149] FTPrimaryWithOptions ::= FTPrimary FTMatchOptions?
[165] FTMatchOptions ::= FTMatchOption+
| [166] | FTMatchOption |
::= | FTLanguageOption |
[Definition: Match options modify the set of tokens in the query, or how they are matched against tokens in the text.]
[Definition: Each of the seven alternatives of production FTMatchOption corresponds to one match option group. ] The match options from any given group are mutually exclusive, i.e., only one of these settings can be in effect, whereas match options of different groups can be combined freely.
Note that, along with the syntax rules above, there is an extra-grammatical constraint, multiple-match-options, which must be considered if multiple match options are specified. It states that within a single FTMatchOptions at most one match option of any given match option group may be specified. For example, if the FTCaseOption "lowercase" is specified, then "uppercase" cannot also be specified as part of the same FTMatchOptions.
Although match options only take effect in the application of
FTWords, the syntax also allows
match options to be specified that modify the non-primitive full-text
selection "(" FTSelection ")". Such a higher-level
match option provides a default for the respective match option
group for any embedded FTPrimary, just as match option declarations in the
Prolog provide default match
options for the whole query.
Match options are propagated through the query via the static
context. For each of the seven match option groups, the static
context has a component that contains one option from that group.
The seven settings are initialized by the implementation in
accordance with the table in Appendix C Static Context
Components, and are modified by any FTOptionDecls in the Prolog. The resulting settings are then
propagated unchanged to every FTContainsExpr in the module
(including those in VarDecls and
FunctionDecls, and including any that happen to be
nested within another FTContainsExpr). At any given
FTContainsExpr, the settings from the static context
are copied to the FTContainsExpr's inner settings,
which are then propagated down the syntax tree. At each FTPrimaryWithOptions, the
locally specified match options (if any) overwrite the
corresponding inner setting(s). At each FTWords, the inner settings are used as
the effective match options for tokenizing the query strings and
matching them against the tokens in the text. (These inner settings
could be seen as a parallel set of components in the static
context, but Section 4 Semantics
models them as structures that get passed as parameters to various
semantic functions.)
Thus, when a match option appears in an FTSelection, it applies to the
associated FTPrimary, but not
to any FTContainsExprs that happen to be embedded
within that FTPrimary. Instead, for a nested
FTContainsExpr, the default match options are those
declared in the Prolog or, if not declared in the
Prolog, then supplied by the implementation's initial
values.
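The propagation rules can be sketched in Python as successive dictionary updates. This is an illustration only: the function names and the dictionary representation are hypothetical, and the initial values are assumptions taken from the example defaults used in this section, with "de" standing in for the implementation-defined default language.

```python
# Assumed initial settings of the seven static-context components.
INITIAL_SETTINGS = {
    "language": "de",
    "wildcards": "without wildcards",
    "thesaurus": "without thesaurus",
    "stemming": "without stemming",
    "case": "case insensitive",
    "diacritics": "diacritics insensitive",
    "stop words": "without stop words",
}

def static_context(prolog_options):
    """Initial settings, modified by any FTOptionDecls in the Prolog."""
    settings = dict(INITIAL_SETTINGS)
    settings.update(prolog_options)
    return settings

def effective_options(static, *local_options):
    """Inner settings at an FTWords: static settings overwritten on the way down."""
    inner = dict(static)              # copied at the FTContainsExpr
    for local in local_options:       # outermost FTPrimaryWithOptions first
        inner.update(local)
    return inner
```

A nested FTContainsExpr would call effective_options with the same static settings and only its own local options, so options specified in the enclosing selection do not reach it.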
[Definition: The order in which effective match options for an FTWords are applied is called the match option application order.] This order is significant because match options are not always commutative. For example, synonym(stem(word)) is not always the same as stem(synonym(word)).
The match option application order is subject to some constraints:
The Language Option must be applied first
The Stemming Option must be applied before the Case Option and the Diacritics Option
Aside from these constraints, the full order of the application of match options is implementation-defined.
More information on their semantics is given in 4.2.6 Match Options Semantics.
If no match options declarations are present in the prolog and the implementation does not define any overwriting of the static context components for the match options, the query:
/books/book/title ftcontains "usability"
is, assuming "de" is the implementation-defined default language, equivalent to the query:
/books/book/title ftcontains "usability"
language "de"
without wildcards
without thesaurus
without stemming
case insensitive
diacritics insensitive
without stop words
We describe each match option group in more detail in the following sections.
[175] FTLanguageOption ::= "language" StringLiteral
[Definition: A language option modifies token matching by specifying the language of search tokens and phrases.]
The StringLiteral following the keyword language
designates one language. It must be castable to
xs:language; otherwise, an error is raised: [err:XPTY0004], as defined in [XML Path Language (XPath) 2.0].
The "language" option influences tokenization, stemming, and stop words in an implementation-defined way. The "language" option MAY influence the behavior of other match options in an implementation-defined way.
The set of standardized language identifiers is defined in [BCP 47]. The set of valid language identifiers among the standardized set is implementation-defined. An implementation MAY choose to use private extensions introduced by a singleton 'x' for additional language identifiers, or other singletons for registered extensions as described in sec. 2.2.6 of [BCP 47]. It is implementation-defined what additional language identifiers, if any, are valid. If an invalid language identifier is specified, then the behavior is implementation-defined. If the implementation chooses to raise an error in that case, it must raise [err:FTST0009].
The default language is specified in the static context.
When an XQuery and XPath Full Text processor evaluates text in a document that is governed by an xml:lang attribute and the portion of the full-text query doing that evaluation contains an FTLanguageOption that specifies a different language from the language specified by the governing xml:lang attribute, the language-related behavior of that full-text query is implementation-defined.
This is an example where the language option is used to select the appropriate stop word list:
//book[@number="1"]//editor ftcontains "salon de the" with default stop words language "fr"
[176] FTWildCardOption ::= ("with" "wildcards") | ("without" "wildcards")
[Definition: A wildcard option modifies token and phrase matching by specifying whether wildcards are used or not.]
When the "with wildcards" option is used, wildcard indicators (represented by periods (.)) and qualifiers may be appended to or inserted into the query tokens. If the period is at the beginning of a query token, the wildcard is a prefix wildcard. If the period is at the end of a query token, it is a suffix wildcard. If the period is inserted into a query token, it is an infix wildcard.
Each indicator and qualifier in a query token will match zero or more characters within a token in the text being searched, as described below. The number of characters matched depends on the qualifier. Qualifiers available are none, question mark, asterisk, plus sign, and two numbers separated by a comma, both enclosed by curly braces.
If a period is present, but there are no qualifiers, one character in the text will match.
If a period is followed by a question mark (.?), zero or one characters in the text being searched will match.
If a period is followed by an asterisk (.*), zero or more characters will match.
If a period is followed by a plus sign (.+), one or more characters will match.
If a period is followed by two numbers separated by a comma, both enclosed by curly braces (.{n,m}), a specified range of characters (at least n characters and no more than m characters) will match.
When "with wildcards" is present and an indicator or qualifier character is intended to be taken literally (as itself), that character must be preceded by ("escaped by") a backslash (\). For example, a period (.) that is intended to be a sentence terminator or a decimal point must be preceded by a backslash so that it is not interpreted as an indicator. Similarly, a question mark (?), asterisk (*), or plus sign (+) that is intended to be interpreted as an ordinary text character must be preceded by a backslash so that it is not interpreted as a qualifier.
The "without wildcards" option finds tokens without recognizing wildcard indicators and qualifiers. Periods, question marks, asterisks, plus signs, and two numbers separated by a comma, both enclosed by curly braces, are always recognized as ordinary text characters.
The default is "without wildcards".
Note: Wildcard indicators and qualifiers may be token boundaries. How text with wildcard indicators and qualifiers is tokenized is implementation-defined.
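The indicator and qualifier rules can be sketched in Python by translating a query token into a regular expression. This is a sketch under assumptions, not the specification's algorithm: the function name is hypothetical, qualifier characters that do not follow a period are treated as ordinary characters, and the pattern is applied to a single, already tokenized text token.

```python
import re

# One piece of a query token: an escaped character, a period with an
# optional qualifier, or a run of ordinary characters.
_PIECE = re.compile(r"\\(.)|\.(\?|\*|\+|\{\d+,\d+\})?|([^.\\]+)", re.S)

def wildcard_to_regex(query_token):
    """Translate a "with wildcards" query token into a compiled pattern."""
    parts = []
    for m in _PIECE.finditer(query_token):
        escaped, qualifier, literal = m.group(1), m.group(2), m.group(3)
        if escaped is not None:
            parts.append(re.escape(escaped))      # backslash-escaped: literal
        elif literal is not None:
            parts.append(re.escape(literal))      # ordinary characters
        else:
            parts.append("." + (qualifier or "")) # indicator plus qualifier
    # Case-insensitive, matching the default assumed for the examples.
    return re.compile("".join(parts), re.IGNORECASE)
```

A text token matches when the pattern covers it completely, for example wildcard_to_regex("w.ll").fullmatch("well").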
The following expression returns true, because the title
element contains "improving":
//book[@number="1"]/title ftcontains "improv.*" with wildcards
The following expression returns true, because the
title element contains "site":
//book[@number="1"]/title ftcontains ".?site" with wildcards
The following expression returns true, because the
p element contains "well":
//book[@number="1"]//p ftcontains "w.ll" with wildcards
The following expression returns false, because the
p element does not contain the phrase "w ll":
//book[@number="1"]//p ftcontains "w.ll" without wildcards
(Note that, without wildcards, the sample tokenization will treat the period in "w.ll" as punctuation, thus producing "w" and "ll" as separate tokens.)
| [170] | FTThesaurusOption |
::= | ("with" "thesaurus" (FTThesaurusID | "default")) |
[171] FTThesaurusID ::= "at" URILiteral ("relationship" StringLiteral)? (FTRange "levels")?
[143] URILiteral ::= StringLiteral
[Definition: A thesaurus option modifies token and phrase matching by specifying whether a thesaurus is used or not.] If thesauri are used, the thesaurus option specifies information to locate the thesauri either by default or through a URI reference. It also states the relationship to be applied and how many levels within the thesaurus to be traversed.
Thesauri add related tokens and phrases to the query or change query tokens. Thus, the user may narrow, broaden, or otherwise modify the query using synonyms, hypernyms (more generic terms), etc. The search is performed as though the user has specified all related query tokens and phrases in a disjunction (FTOr).
Note:
A thesaurus may be standards-based or locally-defined. It may be a traditional thesaurus, or a taxonomy, soundex, ontology, or topic map. How the thesaurus is represented is implementation-dependent.
FTThesaurusID specifies the relationship sought between tokens and phrases written in the query and terms in the thesaurus and the number of levels to be queried in hierarchical relationships by including an FTRange "levels". If no levels are specified, the default is to query all levels in hierarchical relationships.
Relationships include, but are not limited to, the relationships and their abbreviations presented in [ISO 2788] and their equivalents in other languages. The set of relationships supported by an implementation is implementation-defined, but implementations SHOULD support the relationships defined in [ISO 2788]. The terms in the following list have the meanings defined in [ISO 2788]. If a query specifies thesaurus relationships or levels not supported by the thesaurus, or does not specify a relationship, the behavior is implementation-defined.
equivalence relationships (synonyms): PREFERRED TERM (USE), NONPREFERRED USED FOR TERM (UF);
hierarchical relationships: BROADER TERM (BT), NARROWER TERM (NT), BROADER TERM GENERIC (BTG), NARROWER TERM GENERIC (NTG), BROADER TERM PARTITIVE (BTP), NARROWER TERM PARTITIVE (NTP), TOP Terms (TT); and
associative relationships: RELATED TERM (RT).
The "with thesaurus" option specifies that string matches include tokens that can be found in one of the specified thesauri. When "default" is used in place of an FTThesaurusID, the thesauri specified in the static context are used, which are either given by the prolog declaration for the thesaurus option or, if no such declaration exists, a system-defined default thesaurus with a system-defined relationship. The default thesaurus may be used in combination with other explicitly specified thesauri.
The "without thesaurus" option specifies that no thesaurus will be used.
The default is "without thesaurus".
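Thesaurus expansion with a level limit can be sketched in Python as a breadth-first traversal. Everything here is hypothetical: the function name, the dictionary representation, and the thesaurus entries ("menus" and "breadcrumbs" are invented for illustration). A real thesaurus and its relationships are implementation-defined.

```python
def expand(term, thesaurus, relationship, max_levels=None):
    """Collect `term` plus related terms, following at most `max_levels` levels."""
    found = [term]
    frontier = [term]
    level = 0
    # With max_levels=None, all levels of the hierarchy are traversed.
    while frontier and (max_levels is None or level < max_levels):
        next_frontier = []
        for current in frontier:
            for related in thesaurus.get((current, relationship), []):
                if related not in found:
                    found.append(related)
                    next_frontier.append(related)
        frontier = next_frontier
        level += 1
    return found

# Hypothetical narrower-term (NT) entries.
toy_thesaurus = {
    ("web site components", "NT"): ["navigation", "layout"],
    ("navigation", "NT"): ["menus"],
    ("menus", "NT"): ["breadcrumbs"],
}
```

The search then behaves as though the returned terms had been written as alternatives in a disjunction (FTOr).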
The following expression returns true, because it finds a
content element containing "tasks" which the thesaurus
identified as a synonym for "duties":
.//book/content ftcontains "duties" with thesaurus at "http://bstore1.example.com/UsabilityThesaurus.xml" relationship "UF"
The following expression returns book elements,
because it finds a content element containing "web
site components", and narrower terms "navigation" and "layout":
doc("http://bstore1.example.com/full-text.xml")
/books/book[./content ftcontains "web site components" with
thesaurus at "http://bstore1.example.com/UsabilityThesaurus.xml"
relationship "NT" at most 2 levels]
Assuming the thesaurus available at URL
"http://bstore1.example.com/UsabilitySoundex.xml" contains soundex
capabilities, the following query returns a book
element containing "Marigold" which sounds like "Merrygould":
doc("http://bstore1.example.com/full-text.xml")
/books/book[. ftcontains "Merrygould" with thesaurus at
"http://bstore1.example.com/UsabilitySoundex.xml" relationship
"sounds like"]
[169] FTStemOption ::= ("with" "stemming") | ("without" "stemming")
[Definition: A stemming option modifies token and phrase matching by specifying whether stemming is applied or not. ]
The "with stemming" option specifies that matches may contain tokens that have the same stem as the tokens and phrases written in the query. It is implementation-defined what a stem of a token is.
The "without stemming" option specifies that the tokens and phrases are not stemmed.
It is implementation-defined whether the stemming is based on an algorithm, dictionary, or mixed approach.
The default is "without stemming".
The following expression returns true, because the
title of the specified book contains
"improving" which has the same stem as "improve":
/books/book[@number="1"]/title ftcontains "improve" with stemming
| [167] | FTCaseOption |
::= | ("case" "insensitive") |
[Definition: A case option modifies the matching of tokens and phrases by specifying how uppercase and lowercase characters are considered.]
There are four possible character case options:
Using the option "case insensitive", tokens and phrases are matched, regardless of the case of characters of the query tokens and phrases.
Using the option "case sensitive", tokens and phrases are matched, if and only if the case of their characters is the same as written in the query.
Using the option "lowercase", tokens and phrases are matched, if and only if they match the query without regard to character case, but contain only lowercase characters.
Using the option "uppercase", tokens and phrases are matched, if and only if they match the query without regard to character case, but contain only uppercase characters.
The default is "case insensitive".
The effect of the case options is also influenced by the query's default collation (see Section 2.1.1 Static Context and Section 4.4 Default Collation Declaration of [XQuery 1.0: An XML Query Language]). The following table summarizes how these interact.
| Case option \ Default collation | UCC (Unicode Codepoint Collation) | CCS (some generic case-sensitive collation) | CCI (some generic case-insensitive collation) |
|---|---|---|---|
| case insensitive | compare as if both lower | case-insensitive variant of CCS if it exists, else error | CCI |
| case sensitive | UCC | CCS | case-sensitive variant of CCI if it exists, else error |
| lowercase | compare using UCC after applying fn:lower-case() to the query string | compare using CCS after applying fn:lower-case() to the query string | CCI |
| uppercase | compare using UCC after applying fn:upper-case() to the query string | compare using CCS after applying fn:upper-case() to the query string | CCI |
Note:
In this table, "else error" means "Otherwise, an error is raised: [err:FOCH0002]" (an error code defined in XQuery 1.0 and XPath 2.0 Functions and Operators). The phrase "if it exists" is used, because the case-sensitive collation CCS does not always have a case-insensitive variant (and, even if one exists, it may not be possible to determine it algorithmically), and because the case-insensitive collation CCI does not always have a case-sensitive variant (and, even if one exists, it may not be possible to determine it algorithmically).
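The UCC column of the table can be sketched in Python for a single pair of tokens. This is an approximation under assumptions: the function name is hypothetical, and Python's str.lower() and str.upper() stand in for fn:lower-case() and fn:upper-case(). The other two columns depend on collations that are not modelled.

```python
def case_match(query_token, text_token, option="case insensitive"):
    """Compare one query token with one text token under a case option (UCC)."""
    if option == "case insensitive":
        # Compare as if both were lowercase.
        return query_token.lower() == text_token.lower()
    if option == "case sensitive":
        # Plain codepoint comparison.
        return query_token == text_token
    if option == "lowercase":
        # Lower-case the query string only, then compare codepoints.
        return query_token.lower() == text_token
    if option == "uppercase":
        # Upper-case the query string only, then compare codepoints.
        return query_token.upper() == text_token
    raise ValueError(f"unknown case option: {option}")
```

Under "lowercase", the query "Usability" therefore matches only a text token spelled "usability", not the capitalized "Usability" found in the sample title.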
The following expression returns false, because the
title element doesn't contain "usability" in
lower-case characters:
//book[@number="1"]/title ftcontains "Usability" lowercase
The following expression returns true, because the character case is not considered:
//book[@number="1"]/title ftcontains "usability" case insensitive
| [168] | FTDiacriticsOption |
::= | ("diacritics" "insensitive") |
[Definition: A diacritics option modifies token and phrase matching by specifying how diacritics are considered. ]
There are two possible diacritics options:
The option "diacritics" "insensitive" matches tokens and phrases with and without diacritics. Whether diacritics are written in the query or not is not considered.
The option "diacritics" "sensitive" matches tokens and phrases only if they contain the diacritics as they are written in the query.
The default is "diacritics insensitive".
The effect of the diacritics options is also influenced by the query's default collation (see Section 2.1.1 Static Context and Section 4.4 Default Collation Declaration of [XQuery 1.0: An XML Query Language]). The following table summarizes how these interact.
| Diacritics option \ Default collation | UCC (Unicode Codepoint Collation) | CDS (some generic diacritics-sensitive collation) | CDI (some generic diacritics-insensitive collation) |
|---|---|---|---|
| diacritics insensitive | UCC comparison, but without considering diacritics | diacritics-insensitive variant of CDS if it exists, else error | CDI |
| diacritics sensitive | UCC | CDS | diacritics-sensitive variant of CDI if it exists, else error |
Note:
In this table, "else error" means "Otherwise, an error is raised: [err:FOCH0002]" (an error code defined in XQuery 1.0 and XPath 2.0 Functions and Operators). The phrase "if it exists" is used, because the diacritics-sensitive collation CDS does not always have a diacritics-insensitive variant (and, even if one exists, it may not be possible to determine it algorithmically), and because the diacritics-insensitive collation CDI does not always have a diacritics-sensitive variant (and, even if one exists, it may not be possible to determine it algorithmically).
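One way to approximate "UCC comparison, but without considering diacritics" is sketched below in Python. The technique (decomposing to NFD and dropping combining marks) is an assumption chosen for illustration, not something this specification prescribes, and the function names are hypothetical.

```python
import unicodedata

def strip_diacritics(token):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def diacritics_match(query_token, text_token, option="diacritics insensitive"):
    """Compare one query token with one text token under a diacritics option."""
    if option == "diacritics insensitive":
        return strip_diacritics(query_token) == strip_diacritics(text_token)
    if option == "diacritics sensitive":
        return query_token == text_token
    raise ValueError(f"unknown diacritics option: {option}")
```

With this sketch the query "Vera" matches the token "Véra" only under "diacritics insensitive".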
The following expression returns true, because the token "Véra"
in the editor element is matched, as the acute accent
is not considered in the comparison:
//book[@number="1"]//editor ftcontains "Vera" diacritics insensitive
This returns false, because the editor element does
not contain the token "Vera" in this exact form, i.e. without any
diacritics:
//book[@number="1"]//editor ftcontains "Vera" diacritics sensitive
| [172] | FTStopWordOption |
::= | ("with" "stop" "words" FTStopWords FTStopWordsInclExcl*) |
| [173] | FTStopWords |
::= | ("at" URILiteral) |
[174] FTStopWordsInclExcl ::= ("union" | "except") FTStopWords
[Definition: A stop word option controls matching of FTWords by specifying whether stop words are used or not. Stop words are tokens in the query that match any token in the text being searched. ] Normally a stop word matches exactly one token, but there may be implementation-defined conditions, under which a stop word may match a different number of tokens.
FTStopWords specifies the
list of stop words either explicitly as a comma-separated list of
string literals, or by the keyword at followed by a
literal URI. If the URI specifies a list of stop words that is not
found in the statically known stop word lists, an error is raised
[err:FTST0008].
Whether the stop word list is resolved from the statically known
stop word lists or given explicitly, no tokenization is performed
on the stop words: they are used as they occur in the list.
The "with stop words" option specifies that if a token is within the specified collection of stop words, it is removed from the search and any token may be substituted for it. Stop words retain their position numbers and are counted in FTDistance and FTWindow searches.
Multiple stop word lists may be combined using "union" or "except". The keywords "union" and "except" are applied from left to right. If "union" is specified, every string occurring in the lists specified by the left-hand side or the right-hand side is a stop word. If "except" is specified, only strings occurring in the list specified by the left-hand side but not in the list specified by the right-hand side are stop words.
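The left-to-right combination of stop word lists can be sketched with Python sets. The function name and the (keyword, words) pair representation are hypothetical.

```python
def combine_stop_words(first, *operations):
    """Apply ("union" | "except", words) pairs to `first`, from left to right."""
    result = set(first)
    for keyword, words in operations:
        if keyword == "union":
            result |= set(words)   # strings from either side are stop words
        elif keyword == "except":
            result -= set(words)   # strings on the right-hand side are removed
        else:
            raise ValueError(f"unknown keyword: {keyword}")
    return result
```

Because the keywords are applied from left to right, the order of the operations matters: removing a word and then adding it back yields a different list than adding it and then removing it.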
The "with default stop words" option specifies that an implementation-defined collection of stop words is used.
The "without stop words" option specifies that no stop words are used. This is equivalent to specifying an empty list of stop words.
The default is "without stop words".
Note:
Some implementations may apply stop word lists during indexing and be unable to comply with query-time requests to not apply those stop words. An implementation may still support stop-word options (and therefore not raise [err:FTST0006]) by applying any additional stop words specified in the query. Pre-application of irrevocable stop word lists falls under implementation-defined tokenization behavior in this case, and a query that specifies "without stop words" may still have some words ignored.
The following expression returns true, because the document contains the phrase "propagating few errors":
/books/book[@number="1"]//p ftcontains "propagation of errors"
with stemming with stop words ("a", "the", "of")
Note the asymmetry in the stop word semantics: the property of being a stop word is only relevant to query terms, not to document terms. Hence, it is irrelevant for the above-mentioned match whether "few" is a stop word or not, and on the other hand we do not want the query above to match "propagation" followed by 2 stop words, or even a sequence of 3 stop words in the document.
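This asymmetry can be sketched in Python: only query tokens are checked against the stop word list, and a query stop word accepts any text token at its position. The function names are hypothetical, and toy_stem (truncation to eight characters) is an invented stand-in for an implementation-defined stemmer that happens to work for this example.

```python
def phrase_matches(text, query, stop_words=(), stem=lambda token: token):
    """True if `query` matches consecutive tokens of `text`.

    A query token that is a stop word matches any single text token;
    text tokens are never checked against the stop word list.
    """
    stop = set(stop_words)
    n = len(query)
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        if all(q in stop or stem(q) == stem(t) for q, t in zip(query, window)):
            return True
    return False

def toy_stem(token):
    """Hypothetical stemmer: keep at most the first eight characters."""
    return token[:8]

paragraph = "completion while propagating few errors".split()
```

Stop words keep their token position in this sketch, so "propagation of errors" needs exactly one text token between the matches for "propagation" and "errors".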
The following expression returns false. In this case specifying "few" as a stop word has no effect, since "few" does not appear in the query. Although the words "propagating" and "errors" appear in the text being searched, the phrase "propagating errors" cannot be matched, since that phrase does not occur.
/books/book[@number="1"]//p ftcontains "propagating errors"
with stop words ("few")
The following expression returns false, because "of" is not in
the p element between "propagating" and "errors":
/books/book[@number="1"]//p ftcontains "propagation of errors" with stemming without stop words
The following expression uses the stop word list specified at
the URL. Assuming that the specified stop word list contains the
word "then", this query is reduced to a query on the phrase
"planning X conducting", allowing any token as a substitute for X.
It returns a book element, because its
content element contains "planning then conducting".
It would also return the book if the phrase "planning
and conducting" or "planning before conducting" had been in its
content:
doc("http://bstore1.example.com/full-text.xml")
/books/book[count(.//content ftcontains "planning then
conducting" with stop words at
"http://bstore1.example.com/StopWordList.xml")>0]
The following expression returns books containing
"planning then conducting", but does not return
books containing "planning and conducting", since it
exempts "then" from being a stop word:
doc("http://bstore1.example.com/full-text.xml")
/books/book[count(.//content ftcontains "planning then conducting"
with stop words at "http://bstore1.example.com/StopWordList.xml"
except ("the", "then"))>0]
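The two preceding examples can be modeled informally as follows. This Python sketch treats a query token that is a stop word as a placeholder matching any single document token, and never tests document tokens against the stop word list. The function name `phrase_matches` is hypothetical, the content of the stop word list at the URL is an assumption, and tokenization is reduced to splitting on whitespace.

```python
def phrase_matches(query_tokens, document_tokens, stop_words):
    """Return True if the query phrase occurs in the document tokens.

    A query token that is a stop word matches any single document token;
    document tokens are never checked against the stop word list.
    """
    n = len(query_tokens)
    for start in range(len(document_tokens) - n + 1):
        window = document_tokens[start:start + n]
        if all(q in stop_words or q == d for q, d in zip(query_tokens, window)):
            return True
    return False


query = "planning then conducting".split()
stop_list = {"the", "then"}  # assumed content of the list at the URL
document = "planning and conducting".split()

# "then" is a stop word, so the query behaves like "planning X conducting".
print(phrase_matches(query, document, stop_list))                      # True
# After "except", "then" is an ordinary query token again.
print(phrase_matches(query, document, stop_list - {"the", "then"}))    # False
```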
[Definition: An extension option is a match option that acts in an implementation-defined way. ]
| [177] | FTExtensionOption | ::= | "option" QName StringLiteral |
An extension option consists of an identifying QName and a StringLiteral. Typically, a particular option will be recognized by some implementations and not by others. The syntax is designed so that option declarations can be successfully parsed by all implementations.
The QName of an extension option must resolve to a namespace URI and local name, using the statically known namespaces.
Note:
There is no default namespace for options.
Each implementation recognizes an implementation-defined set of namespace URIs used to denote extension options.
If the namespace part of the QName is not a namespace recognized by the implementation as one used to denote extension options, then the extension option is ignored.
Otherwise, the effect of the extension option, including its error behavior, is implementation-defined. For example, if the local part of the QName is not recognized, or if the StringLiteral does not conform to the rules defined by the implementation for the particular extension option, the implementation may choose whether to report an error, ignore the extension option, or take some other action.
Implementations may impose rules on where particular extension options may appear relative to other match options, and the interpretation of an option declaration may depend on its position.
An extension option must not be used to change the syntax accepted by the processor, or to suppress the detection of static errors. However, it may be used without restriction to modify the set of tokens in the query or how they are matched against tokens in the text being searched. An extension option has the same scope as other match options.
The following examples illustrate several possible uses for extension options:
This extension option is set as part of the static context of all full-text expressions in the module and might be used to ensure that queries are insensitive to Arabic short-vowels.
declare namespace exq = "http://example.org/XQueryImplementation";
declare ft-option option exq:diacritics "short-vowel insensitive";
This extension option applies only to the matching in the full-text selection in which it is found and might be used to specify how compound words should be matched.
declare namespace exq = "http://example.org/XQueryImplementation";
//para[. ftcontains
("Kinder" ftand "Platz" distance exactly 1 words)
with stemming
option exq:compounds "distance=1" ]
Full-text selections can be combined with the logical
connectives ftor (full-text or), ftand
(full-text and), not in (mild not), and
ftnot (unary full-text not).
| [145] | FTOr | ::= | FTAnd ( "ftor" FTAnd )* |
| [146] | FTAnd | ::= | FTMildNot ( "ftand" FTMildNot )* |
| [147] | FTMildNot | ::= | FTUnaryNot ( "not" "in" FTUnaryNot )* |
| [148] | FTUnaryNot | ::= | ("ftnot")? FTPrimaryWithOptions |
[Definition: An or-selection combines
two full-text selections using the ftor operator.]
An or-selection finds all matches that satisfy at least one of the operand full-text selections.
The following expression returns the book element
written by "Millicent":
//book[.//author ftcontains "Millicent" ftor "Voltaire"]
[Definition: An and-selection combines
two full-text selections using the ftand
operator.]
An and-selection finds matches that satisfy all of the operand full-text selections simultaneously. A match of an and-selection is formed by combining matches for each of the operand full-text selections as described in 4.2.7.2 FTAnd.
For example, "usability" ftand "testing" will find
two matches in //book[@number="1"]/title: each of the
two matches for the FTWords selection "usability" (the
two occurrences of "usability" in the string value of the title
element) is combined with the single match for the FTWords
"testing" (only one occurrence of "testing" in the
title). Since the above and-selection has at least one match, the
following expression will return "true".
//book[@number="1"]/title ftcontains ("usability" ftand "testing")
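The counting in this example can be pictured as a cross product of the operand matches. The Python sketch below is only an informal model in which a match is a tuple of token positions; the function name `and_selection` and the positions are invented for illustration, and the normative construction remains the one in 4.2.7.2 FTAnd.

```python
from itertools import product


def and_selection(left_matches, right_matches):
    """Pair every match of the left operand with every match of the right."""
    return [left + right for left, right in product(left_matches, right_matches)]


usability = [(1,), (5,)]  # two occurrences of "usability" (positions assumed)
testing = [(6,)]          # one occurrence of "testing" (position assumed)

matches = and_selection(usability, testing)
print(matches)            # [(1, 6), (5, 6)]
print(len(matches) > 0)   # True

# If either operand has no matches, the and-selection has none either.
print(and_selection(usability, []))  # []
```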
The following expression returns false, because "Millicent" and
"Montana" are not contained by the same author element
in any book element:
//book/author ftcontains "Millicent" ftand "Montana"
No author element in any book element
contains both "Millicent" and "Montana". Therefore, for any such
author element, there is either one match for the
FTWords "Millicent" and zero matches for the FTWords
"Montana", or vice versa, or no matches for either of
them. In each of these cases, the and-selection has zero
matches.
[Definition: A mild-not selection
combines two full-text selections using the not in
operator.]
The not in operator is a milder form of the
operator combination ftand ftnot. The selection
A not in B matches a token sequence that matches
A, but not when it is a part of a match of
B. In contrast, A ftand ftnot B only
finds matches when the token sequence contains A and
does not contain B.
As an example, consider a search for "Mexico" not in "New
Mexico". This may return, among others, a document which is
all about "Mexico" but mentions at the end that "New Mexico was
named after Mexico". The occurrence of "Mexico" in "New Mexico" is
not considered, but other occurrences of "Mexico" are matched. Note
that this document would not be matched by the full-text selection
"Mexico" ftand ftnot "New Mexico".
A match to a mild-not selection must contain at least one token that satisfies the first condition and does not satisfy the second condition. If it contains a token that satisfies both the first and the second condition, the token is not considered as a match.
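As an informal model of this rule, the Python sketch below keeps a match of the first operand only if it shares no token position with any match of the second operand. The helper names `find_phrase` and `mild_not` are hypothetical, tokens are assumed to be lower-cased and split on whitespace, and the sketch does not replace the formal semantics of FTMildNot.

```python
def find_phrase(tokens, phrase):
    """Return each occurrence of phrase as a set of token positions."""
    n = len(phrase)
    return [set(range(i, i + n))
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == phrase]


def mild_not(tokens, a, b):
    """Keep matches of a that share no position with any match of b."""
    excluded = set().union(*find_phrase(tokens, b))
    return [m for m in find_phrase(tokens, a) if not (m & excluded)]


tokens = "new mexico was named after mexico".split()
result = mild_not(tokens, ["mexico"], ["new", "mexico"])
print(result)  # [{5}]
```

Only the last occurrence of "mexico" is kept; the occurrence at position 1 is discarded because it lies inside the match of "new mexico".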
The following expression returns true, because "usability"
appears in the title and the p elements
and the token within the phrase "Usability Testing" in the
title element is not considered:
/books/book ftcontains "usability" not in "usability testing"
Operands of a mild-not selection may not contain a full-text
selection that evaluates to an AllMatches that contains a
StringExclude. Such full-text selections are not-selections
and FTWords with a cardinality
constraint that uses an at most, from ... to,
or exactly occurrences range. If such an expression
is encountered, an error [err:FTDY0017] is raised.
[Definition: A not-selection is
a full-text selection starting with the prefix operator
ftnot.]
A not-selection selects matches that do not satisfy the operand full-text selection. Details about how such matches are constructed are given in 4.2.7.3 FTUnaryNot.
The following expression returns the empty sequence, because all
book elements contain "usability":
//book[. ftcontains ftnot "usability"]
The following expression returns true, because book
elements contain "information" and "retrieval" but not "information
retrieval":
//book ftcontains "information" ftand "retrieval" ftand ftnot "information retrieval"
The following expression returns book elements
containing "web site usability" but not "usability testing":
//book[. ftcontains "web site usability" ftand ftnot "usability testing"]
| [157] | FTPosFilter | ::= | FTOrder | FTWindow | FTDistance | FTScope | FTContent |
[Definition: Positional filters are postfix operators that serve to filter matches based on various constraints on their positional information.]
Recall that the grammar rule for FTSelection allows an arbitrary number of positional filters to follow an FTOr. Multiple adjacent positional filters are applied from left to right, i.e., the first filter is applied to the result of the FTOr, the second is applied to the result of that first application, and so on.
| [158] | FTOrder | ::= | "ordered" |
[Definition: An ordered selection consists of a full-text selection followed by the postfix operator "ordered".] An ordered selection constrains the order of tokens and phrases to be the same as the order in which they are written in the operand selection.
The default is unordered. Unordered is in effect when ordered is not specified in the query. Unordered cannot be written explicitly in the query.
An ordered selection selects matches which satisfy the operand full-text selection and which also satisfy the following constraint: the order that the matching tokens or phrases have in the text being searched is the same order that the corresponding query tokens or phrases have in the operand selection. In both cases, the ordering is determined from the minimum start positions of the constituent tokens.
The following expression returns true, because titles of
book elements contain "web site" and "usability" in
the order in which they are written in the query, i.e., "web site"
must precede "usability":
//book/title ftcontains ("web site" ftand "usability") ordered
The following expression returns false, because although
"Montana" and "Millicent" both appear in the book
element, they do not appear in the order they are written in the
query:
//book[@number="1"] ftcontains ("Montana" ftand "Millicent") ordered
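The ordering constraint in the two examples above can be sketched informally in Python. Each query item is represented by the list of token positions it matched, listed in query order, and the check compares minimum start positions. The function name `is_ordered` and all positions are assumptions made for illustration only.

```python
def is_ordered(match):
    """Check the "ordered" constraint for one match.

    match holds one list of token positions per query item, in the order
    the items are written in the query. The constraint holds when the
    minimum start positions do not decrease from one item to the next.
    """
    starts = [min(positions) for positions in match]
    return all(a <= b for a, b in zip(starts, starts[1:]))


# "web site" at positions 3-4, "usability" at position 5 (assumed)
print(is_ordered([[3, 4], [5]]))  # True

# "Montana" at position 20, "Millicent" at position 8 (assumed)
print(is_ordered([[20], [8]]))    # False
```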
| [159] | FTWindow | ::= | "window" AdditiveExpr FTUnit |
| [161] | FTUnit | ::= | "words" | "sentences" | "paragraphs" |
[Definition: A window selection consists of a full-text selection followed by one of the (complex) postfix operators derived from FTWindow.] A window selection selects matches which satisfy the operand full-text selection and for which the matched tokens and phrases, more precisely the individual StringIncludes of that match, are found within a number of FTUnits (words, sentences, and paragraphs). The number of FTUnits is specified by an AdditiveExpr that is converted as though it were an argument to a functio