W3C

XQuery and XPath Full Text 1.0

W3C Candidate Recommendation 16 May 2008

This version:
http://www.w3.org/TR/2008/CR-xpath-full-text-10-20080516/
Latest version:
http://www.w3.org/TR/xpath-full-text-10/
Previous versions:
http://www.w3.org/TR/2006/WD-xquery-full-text-20060501/
http://www.w3.org/TR/2005/WD-xquery-full-text-20051103/
http://www.w3.org/TR/2005/WD-xquery-full-text-20050915/
http://www.w3.org/TR/2005/WD-xquery-full-text-20050404/
http://www.w3.org/TR/2004/WD-xquery-full-text-20040709/
Editors:
Sihem Amer-Yahia, AT&T Labs - Research
Chavdar Botev, Invited Expert
Stephen Buxton, Mark Logic Corporation
Pat Case, Library of Congress
Jochen Doerre, IBM
Mary Holstege, Mark Logic Corporation
Jim Melton, Oracle
Michael Rys, Microsoft
Jayavel Shanmugasundaram, Invited Expert

This document is also available in these non-normative formats: XML.


Abstract

This document defines the syntax and formal semantics of XQuery and XPath Full Text 1.0 which is a language that extends XQuery 1.0 [XQuery 1.0: An XML Query Language] and XPath 2.0 [XML Path Language (XPath) 2.0] with full-text search capabilities.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

W3C publishes a Candidate Recommendation, as described in the Process Document, to indicate that the document is believed to be stable and to encourage implementation by the developer community. The publication of this document constitutes a call for implementations of this specification.

This document has been jointly developed by the W3C XML Query Working Group and the W3C XSL Working Group, each of which is part of the XML Activity. It will remain a Candidate Recommendation until at least 15 September 2008. The Working Groups expect to advance this specification to Recommendation Status.

The XML Query Working Group and XSL Working Group intend to submit this document for consideration as a W3C Proposed Recommendation as soon as the following conditions are all met:

  1. A test suite is available that tests each identified XQuery and XPath Full Text 1.0 feature, both required and optional.

  2. Minimal Conformance to this specification, as defined in 5.1 Minimal Conformance, has been demonstrated by at least two distinct implementations, at least one of which uses the XQuery human-readable syntax defined in this specification.

  3. An XPath Full Text parsing applet that generates XQueryX is available.

  4. The Working Groups have responded formally to all issues raised during the CR period against this document.

Once the entrance criteria for Proposed Recommendation have been achieved, the Director will be requested to advance this document to Proposed Recommendation status. Working closely with the developer community, we expect to show evidence of implementations by approximately 15 September 2008.

The 15 optional features are each individually at risk. Optional features for which there are not at least two implementations at the end of the Candidate Recommendation period may be removed from this specification.

The Working Groups believe that this document, published on 16 May 2008, is sufficiently mature and stable for the development community to begin developing implementation experience and reporting on that experience.

The WGs particularly solicit feedback regarding how thesauri are to be used in combination.

No implementation report currently exists. However, a Test Suite for this document is under development. Implementors are encouraged to run this test suite and report their results. The Test Suite can be found at http://dev.w3.org:/cvsweb/2007/xpath-full-text-10-test-suite/.

This document incorporates changes made against the Last Call Working Draft of 18 May 2007. Changes to this document since the Last Call Working Draft are detailed in J Change Log.

Please report errors in this document using W3C's public Bugzilla system (instructions can be found at http://www.w3.org/XML/2005/04/qt-bugzilla). If access to that system is not feasible, you may send your comments to the W3C XSLT/XPath/XQuery public comments mailing list, public-qt-comments@w3.org. It will be very helpful if you include the string “[FT]” in the subject line of your report, whether made in Bugzilla or in email. Please use multiple Bugzilla entries (or, if necessary, multiple email messages) if you have more than one comment to make. Archives of the comments and responses are available at http://lists.w3.org/Archives/Public/public-qt-comments/.

Publication as a Candidate Recommendation does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by groups operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the XML Query Working Group and also maintains a public list of any patent disclosures made in connection with the deliverables of the XSL Working Group; those pages also include instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

1 Introduction
    1.1 Full-Text Search and XML
    1.2 Organization of this document
    1.3 A word about namespaces
2 Full-Text Extensions to XQuery and XPath
    2.1 Processing Model
    2.2 Full-Text Contains Expression
        2.2.1 Description
        2.2.2 Examples
    2.3 Score Variables
        2.3.1 Using Weights Within a Scored FTContainsExpr
    2.4 Extensions to the Static Context
3 Full-Text Selections
    3.1 Primary Full-Text Selections
    3.2 Search Tokens and Phrases
    3.3 Cardinality Selection
    3.4 Match Options
        3.4.1 Language Option
        3.4.2 Wildcard Option
        3.4.3 Thesaurus Option
        3.4.4 Stemming Option
        3.4.5 Case Option
        3.4.6 Diacritics Option
        3.4.7 Stop Word Option
        3.4.8 Extension Option
    3.5 Logical Full-Text Operators
        3.5.1 Or-Selection
        3.5.2 And-Selection
        3.5.3 Mild-Not Selection
        3.5.4 Not-Selection
    3.6 Positional Filters
        3.6.1 Ordered Selection
        3.6.2 Window Selection
        3.6.3 Distance Selection
        3.6.4 Scope Selection
        3.6.5 Anchoring Selection
    3.7 Ignore Option
    3.8 Extension Selections
4 Semantics
    4.1 Tokenization
        4.1.1 Examples
        4.1.2 Representations of Tokenized Text and Matching
    4.2 Evaluation of FTSelections
        4.2.1 AllMatches
            4.2.1.1 Formal Model
            4.2.1.2 Examples
            4.2.1.3 XML representation
        4.2.2 XML Representation
        4.2.3 The evaluate function
        4.2.4 Formal semantics functions
        4.2.5 FTWords
        4.2.6 Match Options Semantics
            4.2.6.1 Types
            4.2.6.2 High-Level Semantics
            4.2.6.3 Formal Semantics Functions
            4.2.6.4 FTCaseOption
            4.2.6.5 FTDiacriticsOption
            4.2.6.6 FTStemOption
            4.2.6.7 FTThesaurusOption
            4.2.6.8 FTStopWordOption
            4.2.6.9 FTLanguageOption
            4.2.6.10 FTWildCardOption
        4.2.7 Full-Text Operators Semantics
            4.2.7.1 FTOr
            4.2.7.2 FTAnd
            4.2.7.3 FTUnaryNot
            4.2.7.4 FTMildNot
            4.2.7.5 FTOrder
            4.2.7.6 FTScope
            4.2.7.7 FTContent
            4.2.7.8 FTWindow
            4.2.7.9 FTDistance
            4.2.7.10 FTTimes
    4.3 FTContainsExpr
    4.4 Scoring
    4.5 Example
5 Conformance
    5.1 Minimal Conformance
    5.2 Optional Features
        5.2.1 FTMildNot Operator
        5.2.2 FTUnaryNot Operator
        5.2.3 FTUnit and FTBigUnit
        5.2.4 FTOrder Operator
        5.2.5 FTScope Operator
        5.2.6 FTWindow Operator
        5.2.7 FTDistance Operator
        5.2.8 FTTimes Operator
        5.2.9 FTContent Operator
        5.2.10 FTCaseOption
        5.2.11 FTStopWordOption
        5.2.12 FTLanguageOption
        5.2.13 FTIgnoreOption
        5.2.14 Scoring
        5.2.15 Weights

Appendices

A EBNF for XQuery 1.0 Grammar with Full-Text extensions
    A.1 Terminal Symbols
    A.2 Extra-grammatical Constraints
B EBNF for XPath 2.0 Grammar with Full-Text extensions
    B.1 Terminal Symbols
C Static Context Components
D Error Conditions
E XML Syntax (XQueryX) for XQuery and XPath Full Text 1.0
    E.1 XQueryX representation of XQuery and XPath Full Text 1.0
    E.2 XQueryX stylesheet for XQuery and XPath Full Text 1.0
    E.3 XQueryX for XQuery and XPath Full Text 1.0 example
        E.3.1 Example
            E.3.1.1 XQuery solution in XQuery and XPath Full Text 1.0 Use Cases:
            E.3.1.2 A Solution in Full Text XQueryX:
            E.3.1.3 Transformation of Full Text XQueryX Solution into XQuery Full Text
F References
    F.1 Normative References
    F.2 Non-normative References
G Acknowledgements (Non-Normative)
H Glossary (Non-Normative)
I Checklist of Implementation-Defined Features (Non-Normative)
J Change Log (Non-Normative)


1 Introduction

This document defines the language and the formal semantics of XQuery and XPath Full Text 1.0. This language is designed to meet the requirements identified in W3C XQuery and XPath Full Text Requirements [XQuery and XPath Full Text Requirements] and to support the queries in the W3C XQuery and XPath Full Text Use Cases [XQuery and XPath Full Text Use Cases].

XQuery and XPath Full Text 1.0 extends the syntax and semantics of XQuery 1.0 and XPath 2.0.

Additionally, this document defines an XML syntax for XQuery and XPath Full Text 1.0. The most recent versions of the two XQueryX XML Schemas and the XQueryX XSLT stylesheet for XQuery and XPath Full Text 1.0 are available at http://www.w3.org/2007/xpath-full-text/xpath-full-text-10-xqueryx.xsd, http://www.w3.org/2007/xpath-full-text/xpath-full-text-10-xqueryx-ftmatchoption-extensions.xsd, and http://www.w3.org/2007/xpath-full-text/xpath-full-text-10-xqueryx.xsl, respectively.

1.1 Full-Text Search and XML

As XML becomes mainstream, users expect to be able to search their XML documents. This requires a standard way to do full-text search, as well as structured searches, against XML documents. A similar requirement for full-text search led ISO to define the SQL/MM-FT [SQL/MM] standard. SQL/MM-FT defines extensions to SQL to express full-text searches providing functionality similar to that defined in this full-text language extension to XQuery 1.0 and XPath 2.0.

XML documents may contain highly structured data (fixed schemas, known types such as numbers, dates), semi-structured data (flexible schemas and types), markup data (text with embedded tags), and unstructured data (untagged free-flowing text). Where a document contains unstructured or semi-structured data, it is important to be able to search using Information Retrieval techniques such as scoring and weighting.

Full-text search is different from substring search in many ways:

  1. A full-text search searches for tokens and phrases rather than substrings. A substring search for news items that contain the string "lease" will return a news item that contains "Foobar Corporation releases the 20.9 version ...". A full-text search for the token "lease" will not.

  2. There is an expectation that a full-text search will support language-based searches which substring search cannot. An example of a language-based search is "find me all the news items that contain a token with the same linguistic stem as 'mouse'" (finds "mouse" and "mice"). Another example based on token proximity is "find me all the news items that contain the tokens 'XML' and 'Query' allowing up to 3 intervening tokens".

  3. Full-text search must address the vagaries and nuances of language. Search results are often of varying usefulness. When you search a web site for cameras that cost less than $100, this is an exact search. There is a set of cameras that matches this search, and a set that does not. Similarly, when you do a string search across news items for "mouse", there is only one expected result set. When you do a full-text search for all the news items that contain the token "mouse", you probably expect to find news items containing the token "mice", and possibly "rodents", or possibly "computers". Not all results are equal. Some results are more "mousey" than others. Because full-text search may be inexact, we have the notion of score or relevance. We generally expect to see the most relevant results at the top of the results list.
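
The difference between substring search and token-based search described in item 1 can be illustrated with a short, non-normative sketch in Python. The tokenizer shown here (`simple_tokens`) is a hypothetical stand-in chosen for illustration; actual tokenization is implementation-defined.

```python
import re

def simple_tokens(text):
    """Hypothetical tokenizer: lowercase runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def substring_search(text, needle):
    """Substring search: true if the characters of needle occur anywhere in text."""
    return needle.lower() in text.lower()

def token_search(text, needle):
    """Token search: true only if needle equals one whole token of text."""
    return needle.lower() in simple_tokens(text)

news_item = "Foobar Corporation releases the 20.9 version"

# "lease" occurs inside "releases", so the substring search succeeds,
# while the token search fails because no token equals "lease".
substring_hit = substring_search(news_item, "lease")
token_hit = token_search(news_item, "lease")
```

Under this sample tokenization, `substring_hit` is true and `token_hit` is false, mirroring the "lease" example above.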

Note:

As XQuery and XPath evolve, they may apply the notion of score to querying structured data. For example, when making travel plans or shopping for cameras, it is sometimes useful to get an ordered list of near matches in addition to exact matches. If XQuery and XPath define a generalized inexact match, we expect XQuery and XPath to utilize the scoring framework provided by XQuery and XPath Full Text.

[Definition: Full-text queries are performed on tokens and phrases. Tokens and phrases are produced via tokenization.] Informally, tokenization breaks a character string into a sequence of tokens, units of punctuation, and spaces.

Tokenization, in general terms, is the process of converting a text string into smaller units that are used in query processing. Those units, called tokens, are the most basic text units that a full-text search can refer to. Full-text operators typically work on sequences of tokens found in the target text of a search. These tokens are characterized by integers that capture the relative position(s) of the token inside the string, the relative position(s) of the sentence containing the token, and the relative position(s) of the paragraph containing the token. The positions typically comprise a start and an end position.
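
As a non-normative illustration, the following Python sketch produces tokens annotated with a token position, a sentence position, and a paragraph position. The `Token` record, the splitting rules (blank lines for paragraphs, terminal punctuation for sentences), and all names are assumptions made for this example; they are not part of this specification, and a single position is recorded per token rather than a start and end pair.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    text: str
    position: int   # relative position of the token in the string (1-based)
    sentence: int   # relative position of the containing sentence (1-based)
    paragraph: int  # relative position of the containing paragraph (1-based)

def tokenize(text):
    """Hypothetical tokenizer: paragraphs end at blank lines, sentences at . ! ?"""
    tokens = []
    position = 0
    sentence = 0
    paragraphs = re.split(r"\n\s*\n", text.strip())
    for para_index, para in enumerate(paragraphs, start=1):
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            words = re.findall(r"\w+", sent)
            if not words:
                continue  # skip empty fragments
            sentence += 1
            for word in words:
                position += 1
                tokens.append(Token(word, position, sentence, para_index))
    return tokens

sample = tokenize("The dog barked. The cat ran.\n\nMice hide.")
```

In this sketch the token "cat" is the fifth token, in the second sentence of the first paragraph.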

Tokenization, including the definition of the term "tokens", SHOULD be implementation-defined. Implementations SHOULD expose the rules and sample results of tokenization as much as possible to enable users to predict and interpret the results of tokenization. Tokenization is defined more formally in 4.1 Tokenization.

[Definition: A token is a non-empty sequence of characters returned by a tokenizer as a basic unit to be searched. Beyond that, tokens are implementation-defined.] [Definition: A phrase is an ordered sequence of any number of tokens. Beyond that, phrases are implementation-defined.]

Note:

Consecutive tokens need not be separated by either punctuation or space, and tokens may overlap.

Note:

In some natural languages, tokens and words can be used interchangeably.

[Definition: A sentence is an ordered sequence of any number of tokens. Beyond that, sentences are implementation-defined. A tokenizer is not required to support sentences.]

[Definition: A paragraph is an ordered sequence of any number of tokens. Beyond that, paragraphs are implementation-defined. A tokenizer is not required to support paragraphs.]

Some XML elements represent semantic markup, e.g., <title>. Others represent formatting markup, e.g., <b> to indicate bold. Semantic markup serves well as token boundaries. Some formatting markup also serves well as token boundaries; for example, paragraphs are most commonly delimited by formatting markup. Other formatting markup may not serve well as token boundaries. Implementations are free to provide implementation-defined ways to differentiate between the effects of different markup on token boundaries during tokenization. In the absence of an implementation-defined way to differentiate, element markup (start tags, end tags, and empty-element tags) creates token boundaries.
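
The effect of element markup on token boundaries can be sketched, non-normatively, in Python. The first function models the default behavior (every element tag creates a token boundary); the second models a hypothetical implementation-defined alternative in which markup is transparent. The function names and the word pattern are assumptions for illustration only.

```python
import re
import xml.etree.ElementTree as ET

def tokens_with_markup_boundaries(xml_text):
    """Default behavior: each text segment between tags is tokenized separately."""
    root = ET.fromstring(xml_text)
    tokens = []
    for segment in root.itertext():
        tokens.extend(re.findall(r"\w+", segment))
    return tokens

def tokens_ignoring_markup(xml_text):
    """Hypothetical alternative: tags do not create token boundaries."""
    root = ET.fromstring(xml_text)
    return re.findall(r"\w+", "".join(root.itertext()))

fragment = "<p>foot<b>ball</b> game</p>"
default_tokens = tokens_with_markup_boundaries(fragment)
transparent_tokens = tokens_ignoring_markup(fragment)
```

With the default rule, the formatting element <b> splits "football" into two tokens; with the transparent rule it remains a single token.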

A sample tokenization is used for the examples in this document. The results might be different for other tokenizations.

Tokenization enables functions and operators that operate on a part or the root of the token (e.g., wildcards, stemming).

Tokenization enables functions and operators which work with the relative positions of tokens (e.g., proximity operators).
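
For instance, a proximity test such as "the tokens 'XML' and 'Query' with up to 3 intervening tokens" reduces to arithmetic on token positions. The following non-normative Python sketch assumes a simple hypothetical tokenizer and case-insensitive comparison; neither is mandated by this specification.

```python
import re

def tokenize(text):
    """Hypothetical tokenizer: lowercase word-character runs."""
    return re.findall(r"\w+", text.lower())

def within_distance(text, first, second, max_intervening):
    """True if some occurrence of first and some occurrence of second
    have at most max_intervening tokens between them, in either order."""
    tokens = tokenize(text)
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(
        abs(i - j) - 1 <= max_intervening
        for i in first_positions
        for j in second_positions
        if i != j  # the same token occurrence cannot match both terms
    )

near = within_distance("XML and the Query", "XML", "Query", 3)
far = within_distance("XML is a language and Query languages exist", "XML", "Query", 3)
```

Here `near` is true (two intervening tokens) and `far` is false (four intervening tokens).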

This specification focuses on functionality that serves all languages. It also selectively includes functionalities useful within specific families of languages. For example, searching within sentences and paragraphs is useful to many western languages and to some non-western languages, so that functionality is incorporated into this specification.

Certain aspects of language processing are described in this specification as implementation-defined or implementation-dependent.

  • [Definition: Implementation-defined indicates an aspect that may differ between implementations, but must be specified by the implementor for each particular implementation.]

  • [Definition: Implementation-dependent indicates an aspect that may differ between implementations, is not specified by this or any W3C specification, and is not required to be specified by the implementor for any particular implementation.]

1.2 Organization of this document

This document is organized as follows. We first present a high-level syntax for the XQuery and XPath Full Text 1.0 language along with some examples. Then, we present the syntax and examples of the basic primitives in the XQuery and XPath Full Text 1.0 language. This is followed by the semantics of the XQuery and XPath Full Text 1.0 language. The appendices include an EBNF for the XQuery 1.0 Grammar with Full-Text Extensions, an EBNF for the XPath 2.0 Grammar with Full-Text Extensions, acknowledgements, and a glossary.

1.3 A word about namespaces

Certain namespace prefixes are predeclared by XQuery 1.0 and, by implication, by this specification, and bound to fixed namespace URIs. These namespace prefixes are as follows:

  • xml = http://www.w3.org/XML/1998/namespace

  • xs = http://www.w3.org/2001/XMLSchema

  • xsi = http://www.w3.org/2001/XMLSchema-instance

  • fn = http://www.w3.org/2005/xpath-functions

  • local = http://www.w3.org/2005/xquery-local-functions

In addition to the prefixes in the above list, this document uses the prefix err to represent the namespace URI http://www.w3.org/2005/xqt-errors. This namespace prefix is not predeclared and its use in this document is not normative. Error codes that are not defined in this document are defined in other XQuery 1.0 and XPath 2.0 specifications, particularly [XML Path Language (XPath) 2.0] and [XQuery 1.0 and XPath 2.0 Functions and Operators].

Finally, this document uses the prefix fts to represent a namespace containing a number of functions used in this document to describe the semantics of XQuery and XPath Full Text functions. There is no requirement that these functions be implemented, therefore no URI is associated with that prefix.

2 Full-Text Extensions to XQuery and XPath

XQuery and XPath Full Text extends the languages of XQuery 1.0 and XPath 2.0 in three ways. It:

  1. Adds a new expression called FTContainsExpr;

  2. Enhances the syntax of FLWOR expressions in XQuery 1.0 and for expressions in XPath 2.0 with optional score variables; and

  3. Adds static context declarations for full-text match options to the query prolog.

Additionally, it extends the data model and processing models in various ways.

2.1 Processing Model

A full-text contains expression (2.2 Full-Text Contains Expression) is composed of several parts:

  1. An XPath 2.0 or XQuery 1.0 expression (RangeExpr) that specifies the sequence of items to be searched. [Definition: Those items are called the search context.]

  2. The full-text selection to be applied (3 Full-Text Selections). Full-text selections are, syntactically and semantically, fully composable and contain:

    • Required:

      • The tokens and phrases to be searched for (3.2 Search Tokens and Phrases).

    • Optional:

      • Match options, such as indicators for case sensitivity and stop words (3.4 Match Options);

      • Boolean full-text operators, that compose a full-text selection from simpler full-text selections (3.5 Logical Full-Text Operators);

      • Other full-text operators that are constraints on the positions of matches, such as indicators for distance between tokens and for the cardinality of matches (3.6 Positional Filters and 3.3 Cardinality Selection); and

      • The weighting information. Each individual search term in a full-text selection may be annotated with optional weight information. This information may be used during the evaluation of the full-text selections to calculate scoring, information that quantifies the relevance of the result to the given search criteria.

  3. An optional XPath 2.0 or XQuery 1.0 expression (UnionExpr) that specifies the set of nodes, descendants of the RangeExpr, whose contents must be ignored for the purpose of determining a match during the search (3.7 Ignore Option).

The results of the evaluation of the full-text selection operators are instances of the AllMatches model, which complements the XQuery Data Model (XDM) for processing full-text queries. An AllMatches instance describes all possible solutions to the full-text query for a given search context item. Each solution is described by a Match instance. A Match instance contains the tokens from the search context item that must be included (described using StringInclude instances, which model the positive terms) and the tokens from the search context item that must be excluded (described using StringExclude instances, which model the negative terms). Each negative or positive term is modeled as a tuple: the position of the query token or phrase in the full-text selection, and a TokenInfo structure that describes a set of tokens in the text string which match the query token or phrase.
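
The shape of this model can be sketched, non-normatively, with a few Python data classes. The class and field names below are simplifications chosen for this illustration (for example, `TokenInfo` here records a single token with start and end positions); the formal definition of the AllMatches model is given in Section 4 Semantics.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TokenInfo:
    word: str
    start_pos: int  # position of the first matched token in the search context
    end_pos: int    # position of the last matched token in the search context

@dataclass
class StringMatch:
    query_pos: int         # position of the query token or phrase in the selection
    token_info: TokenInfo  # the tokens in the text that match it

@dataclass
class Match:
    includes: List[StringMatch] = field(default_factory=list)  # positive terms
    excludes: List[StringMatch] = field(default_factory=list)  # negative terms

@dataclass
class AllMatches:
    matches: List[Match] = field(default_factory=list)  # one entry per solution

# Search context item (tokenized): the(1) dog(2) chased(3) the(4) cat(5)
# Hypothetical selection: require "dog" (query position 1), exclude "cat" (query position 2).
example = AllMatches(matches=[
    Match(
        includes=[StringMatch(query_pos=1, token_info=TokenInfo("dog", 2, 2))],
        excludes=[StringMatch(query_pos=2, token_info=TokenInfo("cat", 5, 5))],
    )
])
```

The single Match in `example` carries one positive term and one negative term, each pairing a query position with the matching token in the search context item.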

[Figure 1: Processing Model Extensions]

Figure 1 provides a schematic overview of the XQuery and XPath Full Text processing steps that are discussed in detail below. Some of these steps are completely outside the domain of XQuery; in Figure 1, these are depicted outside the black line that represents the boundaries of the language. The diagram shows only the central pieces of the XQuery Processing Model (see Section 2.2 Processing Model of [XQuery 1.0: An XML Query Language]), but zooms in on the Execution Engine, where the processing of the full-text extensions takes place. The full-text processing steps are labeled as FTn within the diagram and are referenced within the text.

Like all XQuery expressions, an FTContainsExpr returns an XDM Instance (see Fig. 1). With the exception of FTWords, which consumes TokenInfos, all full-text selections are closed under the AllMatches data model, i.e., their input and output are AllMatches instances. Tokenization transforms an XDM instance into TokenInfos, which ultimately get converted into AllMatches instances by the evaluation of full-text selections. Thus, the evaluation of nested full-text and XQuery expressions moves back and forth between these two models.

The AllMatches instance obtained by the evaluation of an FTContainsExpr is converted into a Boolean value before being returned to the enclosing XPath or XQuery operation, as follows. If at least one member of the disjunction (that is, at least one Match instance) contains only positive terms, then the returned value is true. If all members of the disjunction contain negative terms, the result is false.
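
This conversion rule can be written down in a few lines. The following non-normative Python sketch uses a deliberately minimal, hypothetical `Match` record (lists of included and excluded terms) rather than the full model defined in Section 4 Semantics.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Match:
    includes: List[str] = field(default_factory=list)  # positive terms
    excludes: List[str] = field(default_factory=list)  # negative terms

def all_matches_to_boolean(matches):
    """True if at least one Match contains only positive terms."""
    return any(len(m.excludes) == 0 for m in matches)

only_negative = [Match(includes=["dog"], excludes=["cat"])]
with_positive = only_negative + [Match(includes=["dog"])]
```

An empty list of matches, or a list in which every Match still carries a negative term, yields false; a list with at least one purely positive Match yields true.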

Weighting information, in an implementation-dependent fashion, may be used when calculating the scoring information computed and made available by FTContainsExpr to the optional score construct.

Given the components of a given full-text contains expression, the evaluation algorithm will proceed according to the following steps, also referenced in the processing model diagram as steps FTn (see Fig. 1):

  1. Evaluate the search context expression (resulting in the sequence of search context items), the ignore option, if any (resulting in the set of ignored nodes), and any other XQuery/XPath expressions nested within the full-text contains expression. (FT1)

  2. Tokenize the query string(s). (FT2.1)

  3. For each search context item:

    1. Delete the ignored nodes from the search context item.

    2. Tokenize the result of the previous step. This produces a sequence of tokens. (FT2.2) Note that implementations may (as an optimization) perform tokenization as part of the External Processing that is described in the XQuery Processing Model, when an XML document is parsed into an Infoset/PSVI and ultimately into an XQuery Data Model instance.

    3. Evaluate the FTSelection against the tokens of the search context. (FT3, FT4)

  4. Convert the topmost AllMatches instances into a Boolean value. (FT5)

    The additional scoring information (also part of FT5) that is produced by the evaluation of the full-text contains expression is implementation-dependent and is not specified in this document. The scoring information is made available at the same time the Boolean value is returned.

(A more detailed version of the above procedure appears in Section 4.3 FTContainsExpr.)
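
The steps above can be sketched, non-normatively, in Python. This sketch makes several simplifying assumptions that are not part of this specification: the search context is supplied as a list of already-evaluated elements (FT1 is performed by the caller), ignored nodes are identified by element name, the full-text selection is limited to a conjunction of single tokens, and no AllMatches instances or scores are built.

```python
import re
import xml.etree.ElementTree as ET

def tokenize(text):
    """Hypothetical tokenizer: lowercase word-character runs."""
    return re.findall(r"\w+", text.lower())

def strip_ignored(element, ignored_tags):
    """Text of element with ignored descendant subtrees deleted (step 3.1).
    Segments are joined with spaces so that markup creates token boundaries."""
    parts = [element.text or ""]
    for child in element:
        if child.tag not in ignored_tags:
            parts.append(strip_ignored(child, ignored_tags))
        parts.append(child.tail or "")  # the tail belongs to the parent, so keep it
    return " ".join(parts)

def ft_contains(search_context, query_strings, ignored_tags=()):
    """True if some search context item contains every query token."""
    query_tokens = [tok for q in query_strings for tok in tokenize(q)]  # FT2.1
    for item in search_context:                                         # step 3
        text = strip_ignored(item, set(ignored_tags))                   # step 3.1
        item_tokens = set(tokenize(text))                               # FT2.2
        if all(tok in item_tokens for tok in query_tokens):             # FT3, FT4
            return True                                                 # FT5
    return False

doc = ET.fromstring(
    "<books>"
    "<book><title>Dog and cat <note>training</note> stories</title></book>"
    "<book><title>Bird watching</title></book>"
    "</books>"
)
titles = doc.findall("book/title")  # FT1: evaluate the search context expression
```

For example, `ft_contains(titles, ["training"])` is true, whereas `ft_contains(titles, ["training"], ignored_tags=["note"])` is false because the content of the ignored element is deleted before tokenization.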

Section 3 Full-Text Selections describes the syntax and the informal semantics of full-text operators. Their formal semantics as well as the formal definition of the AllMatches data model are given in Section 4 Semantics.

2.2 Full-Text Contains Expression

[Definition: A full-text contains expression is an expression that evaluates a sequence of items against a full-text selection. ]

As a syntactic construct, a full-text contains expression (grammar symbol: FTContainsExpr) behaves like a comparison expression (see Section 3.5.2 General Comparisons of [XQuery 1.0: An XML Query Language]). The following grammar rule introduces FTContainsExpr.

[50]    ComparisonExpr    ::=    FTContainsExpr ( (ValueComp | GeneralComp | NodeComp) FTContainsExpr )?

A full-text contains expression may be used anywhere a ComparisonExpr may be used. The ftcontains operator has higher precedence than other comparison operators, so the results of ftcontains expressions may be compared without enclosing them in parentheses.

2.2.1 Description

[51]    FTContainsExpr    ::=    RangeExpr ( "ftcontains" FTSelection FTIgnoreOption? )?

A full-text contains expression returns a Boolean value. It returns true if there is some item returned by the RangeExpr that, after tokenization, matches the full-text selection FTSelection. See Section 3 Full-Text Selections for more details. For the purpose of determining a match, certain descendants of nodes (identified by FTIgnoreOption) in the RangeExpr may be ignored, as specified in Section 3.7 Ignore Option.

An XQuery and XPath Full Text processor SHOULD try to use the information available in xml:lang for processing of collations, as well as the various match options defined in Section 3.4 Match Options.

2.2.2 Examples

The following example in XQuery Full Text returns the author of each book with a title containing a token with the same root as dog and the token cat.

for $b in /books/book
where $b/title ftcontains ("dog" with stemming) ftand "cat" 
return $b/author

The same example in XPath Full Text is written as:


/books/book[title ftcontains ("dog" with stemming) ftand "cat"]/author

In the next example a ComparisonExpr is combined with an FTContainsExpr using the logical XQuery operator and. The query selects books that have a price of less than 50 and a title which contains a token with the same root as train:

/books/book[price < 50 and title ftcontains ("train" with stemming)]

The following example shows the combination of two ftcontains expressions the results of which are compared using the not-equals operator. The query selects books where either the title contains the token dog and the token cat and the content does not contain a token with the same root as train, or where the title fails to have one of the matching tokens but the content does:

/books/book[title ftcontains "dog" ftand "cat" ne
            content ftcontains ("train" with stemming)]

2.3 Score Variables

Besides specifying a match of a full-text query as a Boolean condition, full-text query applications typically also have the ability to associate scores with the results. [Definition: The score of a full-text query result expresses its relevance to the search conditions.]

XQuery and XPath Full Text extends the languages of XQuery 1.0 and XPath 2.0 further by adding optional score variables to the for and let clauses of FLWOR expressions.

The production for the extended for clause in XQuery 1.0 follows.

[35]    ForClause    ::=    "for" "$" VarName TypeDeclaration? PositionalVar? FTScoreVar? "in" ExprSingle ("," "$" VarName TypeDeclaration? PositionalVar? FTScoreVar? "in" ExprSingle)*
[37]    FTScoreVar    ::=    "score" "$" VarName

In XPath 2.0, the SimpleForClause is extended similarly.

When a score variable is present in a for clause, the evaluation of the expression following the in keyword must determine not only the result sequence of the expression, i.e., the sequence of items that are iteratively bound to the for variable, but also, in each iteration, the relevance "score" value of the current item, to which the score variable is bound.

The semantics of scoring and how it relates to second-order functions is discussed in Section 4.4 Scoring.

In the following example book elements are determined that satisfy the condition [content ftcontains "web site" ftand "usability" and .//chapter/title ftcontains "testing"]. The scores assigned to the book elements are returned.

for $b score $s 
    in /books/book[content ftcontains "web site" ftand "usability" 
                   and .//chapter/title ftcontains "testing"]
return $s

The example above is also a legal example of the XPath 2.0 extension.

Scores are typically used to order results, as in the following, more complete example.

for $b score $s 
    in /books/book[content ftcontains "web site" ftand "usability"]
where $s > 0.5
order by $s descending
return <result>  
          <title> {$b//title} </title> 
          <score> {$s} </score> 
       </result>

Note that the score variable gets one score value for each item in the value of the expression after the in keyword, regardless of the number of FTContainsExprs in that expression. In the following example, two separate full-text contains expressions are used to select the matching paragraphs. There is still just one score for each para returned. The highest scoring paragraphs will be returned first:

for $p score $s in //book[title ftcontains "software"]/para[. ftcontains "usability"]
order by $s descending
return $p

The following more elaborate example uses multiple score variables to return the matching paragraphs ordered so that those from the highest scoring books precede those from the lowest scoring books, where the highest scoring paragraphs of each book are returned before the lower scoring paragraphs of that book:

for $b score $score1 in //book[title ftcontains "software"]
    order by $score1 descending
return
    for $p score $score2 in $b/para[. ftcontains "usability"]
       order by $score2 descending
    return $p

The score variable is bound to a value which reflects the relevance of the match criteria in the full-text selections to the items returned by the respective RangeExprs. The calculation of relevance is implementation-dependent, but score evaluation must follow these rules:

  1. Score values are of type xs:double in the range [0, 1].

  2. For score values greater than 0, a higher score must imply a higher degree of relevance.
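
As a non-normative illustration of these two rules, the following Python sketch computes a toy score as the fraction of distinct query tokens found in an item, which always lies in the range [0, 1], and orders items by descending score. The scoring formula and all names are assumptions made for this example; the actual calculation of relevance is implementation-dependent.

```python
import re

def tokenize(text):
    """Hypothetical tokenizer: lowercase word-character runs."""
    return re.findall(r"\w+", text.lower())

def toy_score(text, query_tokens):
    """Fraction of distinct query tokens present in text; always in [0, 1]."""
    wanted = {t.lower() for t in query_tokens}
    if not wanted:
        return 0.0
    found = wanted & set(tokenize(text))
    return len(found) / len(wanted)

def rank(items, query_tokens):
    """Pairs of (score, item), highest score first."""
    scored = [(toy_score(item, query_tokens), item) for item in items]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

query = ["web", "site", "usability"]
ranked = rank(["A web site", "Web site usability testing", "Gardening"], query)
```

In `ranked`, the item containing all three query tokens comes first, followed by the item containing two of them, followed by the item containing none.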

Similarly to their use in a for clause, score variables may be specified in a let clause. A score variable in a let clause is also bound to the score of the expression evaluation, but in the let clause one score is determined for the complete result.

The production for the extended let clause follows.

[38]    LetClause    ::=    (("let" "$" VarName TypeDeclaration?) | ("let" "score" "$" VarName)) ":=" ExprSingle ("," (("$" VarName TypeDeclaration?) | FTScoreVar) ":=" ExprSingle)*

When using the score option in a for clause the expression following the in keyword has the dual purpose of filtering, i.e., driving the iteration, and determining the scores. It is possible to separately specify expressions for filtering and scoring by combining a simple for clause with a let clause that uses scoring. The following is an example of this.

for $b in /books/book[.//chapter/title ftcontains "testing"]
let score $s := $b/content ftcontains "web site" ftand "usability" 
order by $s descending
return <result score="{$s}">{$b}</result>

This example returns book elements with chapter titles that contain "testing". Scores are returned along with the book elements. These scores, however, reflect whether the book content contains "web site" and "usability".

Note that the score of an FTContainsExpr is not required to be 0 if the expression evaluates to false, nor to be non-zero if the expression evaluates to true. Hence, in the example above it is not possible to infer the Boolean value of the FTContainsExpr in the let clause from the calculated score of a returned result element. For instance, an implementation may want to assign a non-zero score to a book that contains "web site", but not "usability", as this may be considered more relevant than a book that contains neither "web site" nor "usability".

The expression ExprSingle associated with the score variable is passed to the scoring algorithm. The scoring algorithm calculates the score value based on the passed expression (not on the value returned by evaluating the expression). The set of expressions supported by the scoring algorithm is implementation-defined. If an expression not supported by the scoring algorithm is passed to the scoring algorithm, the result is implementation-defined.

The use of score variables introduces a second-order aspect to the evaluation of expressions which cannot be emulated by (first-order) XQuery functions. Consider replacing the clause let score $s := FTContainsExpr with the following:

let $s := score(FTContainsExpr)

where a function score is applied to some FTContainsExpr. If the function score were first-order, it would only be applied to the result of the evaluation of its argument, which is one of the Boolean constants true or false. Hence, there would be at most two possible values such a score function would be able to return and no further differentiation would be possible.

2.3.1 Using Weights Within a Scored FTContainsExpr

[Definition: Scoring may be influenced by adding weight declarations to search tokens, phrases, and expressions.] Weight declarations are introduced syntactically in the FTSelection production, described in Section 3 Full-Text Selections.

The weight MUST have an absolute value between 0.0 and 1000.0 inclusive.

The weights assigned are not related to any absolute standard, but typically have a relationship to other weights within the same FTContains expression.

The effect of weights on the resulting score is implementation-dependent. However, scoring algorithms MUST conform to these constraints:

  1. When no explicit weight is specified, the default weight is 1.0; and

  2. Weight declarations in an FTContainsExpr for which no scores are evaluated are ignored.

The following example illustrates how different weights can be used for different search terms.

for $b in /books/book
let score $s := $b/content ftcontains ("web site" weight 0.5)
                                ftand ("usability" weight 2)
return <result score="{$s}">{$b}</result>

2.4 Extensions to the Static Context

The XQuery Static Context is extended with a component for each full-text match option group. The settings of these components can be changed by using the following declaration syntax in the Prolog.

[6]    Prolog    ::=    ((DefaultNamespaceDecl | Setter | NamespaceDecl | Import | FTOptionDecl) Separator)* ((VarDecl | FunctionDecl | OptionDecl) Separator)*
[14]    FTOptionDecl    ::=    "declare" "ft-option" FTMatchOptions

Match options modify the match semantics of full-text expressions. They are described in detail in Section 3.4 Match Options. When a match option is specified explicitly in a full-text expression, it overrides the setting of the respective component in the static context.

3 Full-Text Selections

This section describes the full-text selections which contain the full-text operators in a full-text contains expression (FTContainsExpr), as well as the match options which modify the matching semantics of the full-text selections. In the following, the syntax for each type of full-text selection is given together with an informal statement of its meaning.

[Definition: A full-text selection specifies the conditions of a full-text search. ]

[144]    FTSelection    ::=    FTOr FTPosFilter* ("weight" RangeExpr)?

As shown in the grammar, a full-text selection consists of search conditions possibly involving logical operators (FTOr) followed by an arbitrary number of positional filters (FTPosFilter) optionally followed by a "weight" value which is specified using a range expression. The RangeExpr is evaluated, as if it were an argument to a function with an expected type xs:double; it must be between 0.0 and 1000.0 inclusive.

The syntax and semantics of the individual full-text selection operators follow.

This XML document is the source document for examples in this section.

<books>
  <book number="1">
    <title shortTitle="Improving Web Site Usability">Improving  
        the Usability of a Web Site Through Expert Reviews and
        Usability Testing</title>
    <author>Millicent Marigold</author>
    <author>Montana Marigold</author>
    <editor>Véra Tudor-Medina</editor>
    <content>
      <p>The usability of a Web site is how well the  
          site supports the users in achieving specified  
          goals. A Web site should facilitate learning,  
          and enable efficient and effective task  
          completion, while propagating few errors.
      </p>
      <note>This book has been approved by the Web Site  
          Users Association.
      </note>
    </content>
  </book>
</books>

Tokenization is implementation-defined. A sample tokenization is used for the examples in this section. This sample tokenization uses white space, punctuation and XML tags as word-breakers and <p> for paragraph boundaries. The results may be different for other tokenizations.

The first five tokens in this example using the sample tokenization would be "Improving", "the", "Usability", "of", and "a".

Unless stated otherwise, the results assume a case-insensitive match.

3.1 Primary Full-Text Selections

[150]    FTPrimary    ::=    (FTWords FTTimes?) | ("(" FTSelection ")") | FTExtensionSelection

[Definition: A primary full-text selection is the basic form of a full-text selection. It specifies tokens and phrases as search conditions (FTWords), optionally followed by a cardinality constraint (FTTimes). An FTSelection in parentheses and the FTExtensionSelection are also primary full-text selections.]

3.2 Search Tokens and Phrases

[151]    FTWords    ::=    FTWordsValue FTAnyallOption?
[152]    FTWordsValue    ::=    Literal | ("{" Expr "}")
[154]    FTAnyallOption    ::=    ("any" "word"?) | ("all" "words"?) | "phrase"

FTWords finds matches that contain the specified tokens and phrases.

FTWords consists of two parts: a mandatory FTWordsValue part and an optional FTAnyallOption part. FTWordsValue specifies the tokens and phrases that must be contained in the matches. FTAnyallOption specifies how containment is checked.

In general, the tokens and phrases in FTWordsValue are specified using a nested XQuery expression. To simplify notation, the enclosing braces may be omitted if FTWordsValue consists of a single literal.

The following rules specify how an FTWordsValue matches tokens and phrases. First, the FTWordsValue is converted to a sequence of strings as though it were an argument to a function with the expected type of xs:string*. Then, each of those strings is tokenized into a sequence of tokens as described in Section 4.1 Tokenization. Then, FTAnyallOption is checked.

If FTAnyallOption is "any", the sequence of tokens for each string is considered as a phrase, i.e. a match is found in the tokenized form of the text being searched, whenever that form contains a subsequence of tokens that corresponds to the sequence of query tokens in an implementation-defined way and that subsequence of tokens covers consecutive token positions in the tokenized text. If the value of the FTWordsValue contains more than one string, the different strings are considered to be alternatives, i.e. the resulting matches must contain at least one of the generated phrases.

If FTAnyallOption is "all", the sequence of tokens for each string is considered as a phrase. The resulting matches must contain all of the generated phrases.

If FTAnyallOption is "phrase", the tokens from all the strings are concatenated in a single sequence, which is considered as a phrase. The resulting matches must contain the generated phrase.

If FTAnyallOption is "any word", the tokens from all the strings are combined into a single set. The resulting matches must contain at least one of the tokens in the set.

If FTAnyallOption is "all words", the tokens from all the strings are combined into a single set. The resulting matches must contain all of the tokens in the set.

If the FTWordsValue evaluates to a single string, the use of "any", "all", and "phrase" in FTAnyallOption produces the same results.

If FTAnyallOption is omitted, "any" is the default.
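The five FTAnyallOption settings can be modelled with a short Python sketch. This is an illustration under stated assumptions, not the normative semantics: the `tokenize` helper is a stand-in for the implementation-defined tokenization (it lower-cases and splits on non-word characters), and the names `contains_phrase` and `ft_words` are hypothetical.

```python
import re

def tokenize(text):
    # Assumed sample tokenization: case-insensitive, split on non-word characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]

def contains_phrase(tokens, phrase):
    # True if `phrase` occupies consecutive token positions in `tokens`.
    n = len(phrase)
    return n > 0 and any(tokens[i:i + n] == phrase
                         for i in range(len(tokens) - n + 1))

def ft_words(text, strings, option="any"):
    tokens = tokenize(text)
    phrases = [tokenize(s) for s in strings]
    if option == "any":        # at least one string matches as a phrase
        return any(contains_phrase(tokens, p) for p in phrases)
    if option == "all":        # every string matches as a phrase
        return all(contains_phrase(tokens, p) for p in phrases)
    if option == "phrase":     # all strings concatenated into one phrase
        return contains_phrase(tokens, [t for p in phrases for t in p])
    words = {t for p in phrases for t in p}
    if option == "any word":   # at least one token from the combined set
        return any(w in tokens for w in words)
    if option == "all words":  # every token from the combined set
        return all(w in tokens for w in words)
    raise ValueError("unknown FTAnyallOption: " + option)
```

With a single string, "any", "all", and "phrase" take the same code path through `contains_phrase`, which mirrors the statement above that they produce the same results in that case.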

The following expression returns the sample book element, because its title element contains the token "Expert":

//book[./title ftcontains "Expert"]

The following expression returns the sample book element, because its title element contains the phrase "Expert Reviews":

//book[./title ftcontains "Expert Reviews"]

The following expression returns the sample book element, because its title element contains the two tokens "Expert" and "Reviews":

//book[./title ftcontains {"Expert", "Reviews"} all]

The following expression returns false for our sample document, because the p element doesn't contain the phrase "Web Site Usability" although it contains all of the tokens in the phrase:

//book//p ftcontains "Web Site Usability"

The following expression returns book numbers of book elements by "Marigold" with a title about "Web Site Usability", sorting them in descending score order:

for $book in /books/book[.//author ftcontains "Marigold"] 
let score $score := $book/title ftcontains "Web Site Usability" 
where $score > 0.8 
order by $score descending
return $book/@number

3.3 Cardinality Selection

[155]    FTTimes    ::=    "occurs" FTRange "times"

[Definition: A cardinality selection consists of an FTWords followed by the FTTimes postfix operator.] A cardinality selection selects matches for which the operand FTWords is matched a specified number of times.

A cardinality selection restricts the number of different matches of FTWords to the specified range. The semantics of FTRange are described in 3.6.3 Distance Selection.

In the document fragment "very very big":

  1. The FTWords "very big" has 1 match consisting of the second "very" and "big".

  2. The FTWords {"very", "big"} all has 2 matches; one consisting of the first "very" and "big", and the other containing the second "very" and "big".

  3. The FTWords {"very", "big"} any has 3 matches.
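The three counts above can be reproduced with a small Python sketch. It assumes a simplified counting model that is consistent with this example but is not taken from the formal semantics: an "any" selection contributes one match per occurrence of each phrase, and an "all" selection contributes one match per combination of occurrences. The function names are hypothetical.

```python
from itertools import product

def phrase_positions(tokens, phrase):
    # Start positions at which `phrase` occurs as consecutive tokens.
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def count_matches(tokens, phrases, option):
    per_phrase = [phrase_positions(tokens, p) for p in phrases]
    if option == "any":
        # One match per occurrence of any phrase.
        return sum(len(positions) for positions in per_phrase)
    if option == "all":
        # One match per way of choosing one occurrence of every phrase.
        return len(list(product(*per_phrase)))
    raise ValueError("unsupported option: " + option)

def occurs(count, at_least=0, at_most=None):
    # Simplified stand-in for "occurs FTRange times".
    return count >= at_least and (at_most is None or count <= at_most)
```

For the token list `["very", "very", "big"]`, the sketch yields 1 match for the phrase "very big", 2 for `{"very", "big"} all`, and 3 for `{"very", "big"} any`.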

The following expression returns the example book element's number, because the book element contains 2 or more occurrences of "usability":

//book[. ftcontains "usability" occurs at least 2 times]/@number

The following expression returns the empty sequence, because there are 3 occurrences of {"usability", "testing"} any in the designated title:

//book[@number="1" and title ftcontains {"usability", 
"testing"} any occurs at most 2 times] 

3.4 Match Options

Full-text match options modify the matching behaviour of the primary full-text selection to which they are applied.

[149]    FTPrimaryWithOptions    ::=    FTPrimary FTMatchOptions?
[165]    FTMatchOptions    ::=    FTMatchOption+
[166]    FTMatchOption    ::=    FTLanguageOption
| FTWildCardOption
| FTThesaurusOption
| FTStemOption
| FTCaseOption
| FTDiacriticsOption
| FTStopWordOption
| FTExtensionOption

[Definition: Match options modify the set of tokens in the query, or how they are matched against tokens in the text.]

[Definition: Each of the seven alternatives of production FTMatchOption corresponds to one match option group. ] The match options from any given group are mutually exclusive, i.e., only one of these settings can be in effect, whereas match options of different groups can be combined freely.

Note that, along with the syntax rules above, there is an extra-grammatical constraint, multiple-match-options , which needs to be considered, if multiple match options are specified. It states that within a single FTMatchOptions at most one match option of any given match option group may be specified. For example, if the FTCaseOption "lowercase" is specified, then "uppercase" cannot also be specified as part of the same FTMatchOptions.

Although match options only take effect in the application of FTWords, the syntax also allows match options to be specified on the non-primitive full-text selection "(" FTSelection ")". Such a higher-level match option provides a default for the respective match option group for any embedded FTPrimary, just as match option declarations in the Prolog provide default match options for the whole query.

Match options are propagated through the query via the static context. For each of the seven match option groups, the static context has a component that contains one option from that group. The seven settings are initialized by the implementation in accordance with the table in Appendix C Static Context Components, and are modified by any FTOptionDecls in the Prolog. The resulting settings are then propagated unchanged to every FTContainsExpr in the module (including those in VarDecls and FunctionDecls, and including any that happen to be nested within another FTContainsExpr). At any given FTContainsExpr, the settings from the static context are copied to the FTContainsExpr's inner settings, which are then propagated down the syntax tree. At each FTPrimaryWithOptions, the locally specified match options (if any) overwrite the corresponding inner setting(s). At each FTWords, the inner settings are used as the effective match options for tokenizing the query strings and matching them against the tokens in the text. (These inner settings could be seen as a parallel set of components in the static context, but Section 4 Semantics models them as structures that get passed as parameters to various semantic functions.)

Thus, when a match option appears in an FTSelection, it applies to the associated FTPrimary, but not to any FTContainsExprs that happen to be embedded within that FTPrimary. Instead, for a nested FTContainsExpr, the default match options are those declared in the Prolog or, if not declared in the Prolog, then supplied by the implementation's initial values.
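The propagation rules can be pictured as successive dictionary overlays. The following Python sketch is illustrative only: the group names, the default values (including "de" as the default language, borrowed from the example later in this section), and the function names are assumptions.

```python
# Assumed implementation-supplied initial values, one per match option group.
IMPLEMENTATION_DEFAULTS = {
    "language": "de",
    "wildcards": "without wildcards",
    "thesaurus": "without thesaurus",
    "stemming": "without stemming",
    "case": "case insensitive",
    "diacritics": "diacritics insensitive",
    "stop words": "without stop words",
}

def static_context(prolog_declarations):
    # FTOptionDecls in the Prolog modify the initial settings.
    settings = dict(IMPLEMENTATION_DEFAULTS)
    settings.update(prolog_declarations)
    return settings

def effective_options(static_settings, local_layers):
    # Each FTContainsExpr starts from a copy of the static context settings.
    # `local_layers` lists the options given at each enclosing
    # FTPrimaryWithOptions, outermost first; inner layers overwrite outer ones.
    inner = dict(static_settings)
    for layer in local_layers:
        inner.update(layer)
    return inner
```

A nested FTContainsExpr corresponds to a fresh call with its own layers: it starts again from the static settings and does not inherit the options of the enclosing FTPrimary.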

[Definition: The order in which effective match options for an FTWords are applied is called the match option application order.] This order is significant because match options are not always commutative. For example, synonym(stem(word)) is not always the same as stem(synonym(word)).

The match option application order is subject to some constraints:

  1. The Language Option must be applied first

  2. The Stemming Option must be applied before the Case Option and the Diacritics Option

Aside from these constraints, the full order of the application of match options is implementation-defined.
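As a minimal sketch, the two ordering constraints can be checked mechanically. The group labels and the function name below are hypothetical, and the candidate order is assumed to list at least the language, stemming, case, and diacritics groups.

```python
def satisfies_constraints(order):
    # `order` is a list of match option group labels in application order.
    position = {group: index for index, group in enumerate(order)}
    return (position["language"] == 0                        # constraint 1
            and position["stemming"] < position["case"]      # constraint 2
            and position["stemming"] < position["diacritics"])
```

Any order that passes this check is permitted, and which of the permitted orders is used remains implementation-defined.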

More information on their semantics is given in 4.2.6 Match Options Semantics.

If no match options declarations are present in the prolog and the implementation does not define any overwriting of the static context components for the match options, the query:

/books/book/title ftcontains "usability" 

is, assuming "de" is the implementation-defined default language, equivalent to the query:

/books/book/title ftcontains "usability" 
    language "de"
    without wildcards
    without thesaurus
    without stemming
    case insensitive 
    diacritics insensitive 
    without stop words

We describe each match option group in more detail in the following sections.

3.4.1 Language Option

[175]    FTLanguageOption    ::=    "language" StringLiteral

[Definition: A language option modifies token matching by specifying the language of search tokens and phrases.]

The StringLiteral following the keyword language designates one language. It must be castable to xs:language; otherwise, an error is raised: [err:XPTY0004] (defined in [XML Path Language (XPath) 2.0]).

The "language" option influences tokenization, stemming, and stop words in an implementation-defined way. The "language" option MAY influence the behavior of other match options in an implementation-defined way.

The set of standardized language identifiers is defined in [BCP 47]. The set of valid language identifiers among the standardized set is implementation-defined. An implementation MAY choose to use private extensions introduced by a singleton 'x' for additional language identifiers, or other singletons for registered extensions as described in sec. 2.2.6 of [BCP 47]. It is implementation-defined what additional language identifiers, if any, are valid. If an invalid language identifier is specified, then the behavior is implementation-defined. If the implementation chooses to raise an error in that case, it must raise [err:FTST0009].

The default language is specified in the static context.

When an XQuery and XPath Full Text processor evaluates text in a document that is governed by an xml:lang attribute and the portion of the full-text query doing that evaluation contains an FTLanguageOption that specifies a different language from the language specified by the governing xml:lang attribute, the language-related behavior of that full-text query is implementation-defined.

This is an example where the language option is used to select the appropriate stop word list:

//book[@number="1"]//editor ftcontains "salon de the"
with default stop words language "fr"

3.4.2 Wildcard Option

[176]    FTWildCardOption    ::=    ("with" "wildcards") | ("without" "wildcards")

[Definition: A wildcard option modifies token and phrase matching by specifying whether wildcards are used or not.]

When the "with wildcards" option is used, wildcard indicators (represented by periods (.)) and qualifiers may be appended to or inserted into the query tokens. If the period is at the beginning of a query token, the wildcard is a prefix wildcard. If the period is at the end of a query token, it is a suffix wildcard. If the period is inserted into a query token, it is an infix wildcard.

Each indicator and qualifier in a query token will match zero or more characters within a token in the text being searched, as described below. The number of characters matched depends on the qualifier. Qualifiers available are none, question mark, asterisk, plus sign, and two numbers separated by a comma, both enclosed by curly braces.

  1. If a period is present, but there are no qualifiers, one character in the text will match.

  2. If a period is followed by a question mark (.?), zero or one characters in the text being searched will match.

  3. If a period is followed by an asterisk (.*), zero or more characters will match.

  4. If a period is followed by a plus sign (.+), one or more characters will match.

  5. If a period is followed by two numbers separated by a comma, both enclosed by curly braces (.{n,m}), a specified range of characters (at least n characters and no more than m characters) will match.

When "with wildcards" is present and an indicator or qualifier character is intended to be taken literally (as itself), that character must be preceded by ("escaped by") a backslash (\). For example, a period (.) that is intended to be a sentence terminator or a decimal point must be preceded by a backslash so that it is not interpreted as an indicator. Similarly, a question mark (?), asterisk (*), or plus sign (+) that is intended to be interpreted as an ordinary text character must be preceded by a backslash so that it is not interpreted as a qualifier.
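One way to picture the indicator, qualifier, and escape rules is a translation of a query token into a regular expression. The Python sketch below is an illustration, not a prescribed implementation: it assumes the query token has already been isolated by tokenization, assumes case-insensitive matching, and uses the hypothetical names `wildcard_to_regex` and `token_matches`.

```python
import re

# A qualifier directly after a period: ?, *, + or {n,m}.
_QUALIFIER = re.compile(r"[?*+]|\{\d+,\d+\}")

def wildcard_to_regex(query_token):
    parts = []
    i = 0
    while i < len(query_token):
        ch = query_token[i]
        if ch == "\\" and i + 1 < len(query_token):
            # Escaped character: taken literally.
            parts.append(re.escape(query_token[i + 1]))
            i += 2
        elif ch == ".":
            # Wildcard indicator, optionally followed by a qualifier.
            m = _QUALIFIER.match(query_token, i + 1)
            qualifier = m.group(0) if m else ""
            parts.append("." + qualifier)
            i += 1 + len(qualifier)
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts), re.IGNORECASE)

def token_matches(query_token, text_token):
    # The whole text token must be covered by the pattern.
    return wildcard_to_regex(query_token).fullmatch(text_token) is not None
```

The wildcard notation maps directly onto regular expression syntax here because the indicator and qualifiers coincide with the regex `.` and its quantifiers; every other character is escaped so that it matches only itself.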

The "without wildcards" option finds tokens without recognizing wildcard indicators and qualifiers. Periods, question marks, asterisks, plus signs, and two numbers separated by a comma, both enclosed by curly braces, are always recognized as ordinary text characters.

The default is "without wildcards".

Note: Wildcard indicators and qualifiers may be token boundaries. How text with wildcard indicators and qualifiers is tokenized is implementation-defined.

The following expression returns true, because the title element contains "improving":

//book[@number="1"]/title ftcontains "improv.*" with wildcards

The following expression returns true, because the title element contains "site":

//book[@number="1"]/title ftcontains ".?site" with wildcards

The following expression returns true, because the p element contains "well":

//book[@number="1"]//p ftcontains "w.ll" with wildcards

The following expression returns false, because the p element does not contain the phrase "w ll":

//book[@number="1"]//p ftcontains "w.ll" without wildcards

(Note that, without wildcards, the sample tokenization will treat the period in "w.ll" as punctuation, thus producing "w" and "ll" as separate tokens.)

3.4.3 Thesaurus Option

[170]    FTThesaurusOption    ::=    ("with" "thesaurus" (FTThesaurusID | "default"))
| ("with" "thesaurus" "(" (FTThesaurusID | "default") ("," FTThesaurusID)* ")")
| ("without" "thesaurus")
[171]    FTThesaurusID    ::=    "at" URILiteral ("relationship" StringLiteral)? (FTRange "levels")?
[143]    URILiteral    ::=    StringLiteral

[Definition: A thesaurus option modifies token and phrase matching by specifying whether a thesaurus is used or not.] If thesauri are used, the thesaurus option specifies information to locate the thesauri either by default or through a URI reference. It also states the relationship to be applied and how many levels within the thesaurus to be traversed.

Thesauri add related tokens and phrases to the query or change query tokens. Thus, the user may narrow, broaden, or otherwise modify the query using synonyms, hypernyms (more generic terms), etc. The search is performed as though the user has specified all related query tokens and phrases in a disjunction (FTOr).

Note:

A thesaurus may be standards-based or locally-defined. It may be a traditional thesaurus, or a taxonomy, soundex, ontology, or topic map. How the thesaurus is represented is implementation-dependent.

FTThesaurusID specifies the relationship sought between tokens and phrases written in the query and terms in the thesaurus and the number of levels to be queried in hierarchical relationships by including an FTRange "levels". If no levels are specified, the default is to query all levels in hierarchical relationships.

Relationships include, but are not limited to, the relationships and their abbreviations presented in [ISO 2788] and their equivalents in other languages. The set of relationships supported by an implementation is implementation-defined, but implementations SHOULD support the relationships defined in [ISO 2788]. If a query specifies thesaurus relationships or levels not supported by the thesaurus, or does not specify a relationship, the behavior is implementation-defined. The terms in the following list have the meanings defined in [ISO 2788].

  1. equivalence relationships (synonyms): PREFERRED TERM (USE), NONPREFERRED USED FOR TERM (UF);

  2. hierarchical relationships: BROADER TERM (BT), NARROWER TERM (NT), BROADER TERM GENERIC (BTG), NARROWER TERM GENERIC (NTG), BROADER TERM PARTITIVE (BTP), NARROWER TERM PARTITIVE (NTP), TOP Terms (TT); and

  3. associative relationships: RELATED TERM (RT).

The "with thesaurus" option specifies that string matches include tokens that can be found in one of the specified thesauri. When "default" is used in place of an FTThesaurusID, the thesauri specified in the static context are used. These are either given by the prolog declaration for the thesaurus option or, if no such declaration exists, a system-defined default thesaurus with a system-defined relationship. The default thesaurus may be used in combination with other explicitly specified thesauri.

The "without thesaurus" option specifies that no thesaurus will be used.

The default is "without thesaurus".
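The effect of a relationship and an FTRange "levels" bound can be sketched as a breadth-first expansion of the query term. Everything in this Python sketch is hypothetical: the thesaurus is represented as a dictionary keyed by (term, relationship), the toy entries are invented for illustration, and only an upper bound on levels is modelled.

```python
def expand(term, thesaurus, relationship, max_levels=None):
    # Returns the query term followed by related terms, level by level.
    # max_levels=None models the default of traversing all levels.
    seen = {term}
    found = []
    frontier = [term]
    level = 0
    while frontier and (max_levels is None or level < max_levels):
        next_frontier = []
        for current in frontier:
            for related in thesaurus.get((current, relationship), []):
                if related not in seen:
                    seen.add(related)
                    next_frontier.append(related)
        found.extend(next_frontier)
        frontier = next_frontier
        level += 1
    return [term] + found

# Invented toy thesaurus using the "NT" (narrower term) relationship.
TOY_THESAURUS = {
    ("web site components", "NT"): ["navigation", "layout"],
    ("navigation", "NT"): ["menus"],
    ("menus", "NT"): ["dropdowns"],
}
```

The returned list corresponds to the alternatives that the search treats as a disjunction (FTOr) of tokens and phrases.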

The following expression returns true, because it finds a content element containing "tasks" which the thesaurus identified as a synonym for "duties":

count(.//book/content ftcontains "duties" with
thesaurus at "http://bstore1.example.com/UsabilityThesaurus.xml"
relationship "UF")>0

The following expression returns book elements, because it finds a content element containing "web site components", and narrower terms "navigation" and "layout":

doc("http://bstore1.example.com/full-text.xml")
/books/book[count(./content ftcontains "web site components" with
thesaurus at "http://bstore1.example.com/UsabilityThesaurus.xml"
relationship "NT" at most 2 levels)>0]

Assuming the thesaurus available at URL "http://bstore1.example.com/UsabilitySoundex.xml" contains soundex capabilities, the following query returns a book element containing "Marigold" which sounds like "Merrygould":

doc("http://bstore1.example.com/full-text.xml")
/books/book[count(. ftcontains "Merrygould" with thesaurus at
"http://bstore1.example.com/UsabilitySoundex.xml" relationship
"sounds like")>0]

3.4.4 Stemming Option

[169]    FTStemOption    ::=    ("with" "stemming") | ("without" "stemming")

[Definition: A stemming option modifies token and phrase matching by specifying whether stemming is applied or not. ]

The "with stemming" option specifies that matches may contain tokens that have the same stem as the tokens and phrases written in the query. It is implementation-defined what a stem of a token is.

The "without stemming" option specifies that the tokens and phrases are not stemmed.

It is implementation-defined whether the stemming is based on an algorithm, dictionary, or mixed approach.

The default is "without stemming".

The following expression returns true, because the title of the specified book contains "improving" which has the same stem as "improve":

/books/book[@number="1"]/title ftcontains "improve" with stemming 

3.4.5 Case Option

[167]    FTCaseOption    ::=    ("case" "insensitive")
| ("case" "sensitive")
| "lowercase"
| "uppercase"

[Definition: A case option modifies the matching of tokens and phrases by specifying how uppercase and lowercase characters are considered.]

There are four possible character case options:

  1. Using the option "case insensitive", tokens and phrases are matched, regardless of the case of characters of the query tokens and phrases.

  2. Using the option "case sensitive", tokens and phrases are matched, if and only if the case of their characters is the same as written in the query.

  3. Using the option "lowercase", tokens and phrases are matched, if and only if they match the query without regard to character case, but contain only lowercase characters.

  4. Using the option "uppercase", tokens and phrases are matched, if and only if they match the query without regard to character case, but contain only uppercase characters.

The default is "case insensitive".

The effect of the case options is also influenced by the query's default collation (see Section 2.1.1 Static Context and Section 4.4 Default Collation Declaration of [XQuery 1.0: An XML Query Language]). The following table summarizes how these interact.

Case Matrix

Case option \ Default collation | UCC (Unicode Codepoint Collation) | CCS (some generic case-sensitive collation) | CCI (some generic case-insensitive collation)
case insensitive | compare as if both lower | case-insensitive variant of CCS if it exists, else error | CCI
case sensitive | UCC | CCS | case-sensitive variant of CCI if it exists, else error
lowercase | compare using UCC after applying fn:lower-case() to the query string | compare using CCS after applying fn:lower-case() to the query string | CCI
uppercase | compare using UCC after applying fn:upper-case() to the query string | compare using CCS after applying fn:upper-case() to the query string | CCI

Note:

In this table, "else error" means "Otherwise, an error is raised: [err:FOCH0002]" (an error defined in [XQuery 1.0 and XPath 2.0 Functions and Operators]). The phrase "if it exists" is used, because the case-sensitive collation CCS does not always have a case-insensitive variant (and, even if one exists, it may not be possible to determine it algorithmically), and because the case-insensitive collation CCI does not always have a case-sensitive variant (and, even if one exists, it may not be possible to determine it algorithmically).
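For the Unicode Codepoint Collation column only, the four case options can be sketched in Python as follows. This is a simplified model: Python's `str.lower()` and `str.upper()` stand in for fn:lower-case() and fn:upper-case(), plain string equality stands in for codepoint comparison, and the function name is hypothetical.

```python
def case_match(query, token, option):
    # `query` is a query token, `token` a token from the text being searched.
    if option == "case insensitive":
        return query.lower() == token.lower()   # compare as if both lower
    if option == "case sensitive":
        return query == token                   # plain codepoint comparison
    if option == "lowercase":
        return query.lower() == token           # text token must be lower-case
    if option == "uppercase":
        return query.upper() == token           # text token must be upper-case
    raise ValueError("unknown case option: " + option)
```

Under "lowercase" and "uppercase" the case of the query string is irrelevant, because it is normalized before comparison; only the case of the text token decides the outcome.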

The following expression returns false, because the title element doesn't contain "usability" in lower-case characters:

//book[@number="1"]/title ftcontains "Usability" lowercase 

The following expression returns true, because the character case is not considered:

//book[@number="1"]/title ftcontains "usability" case insensitive

3.4.6 Diacritics Option

[168]    FTDiacriticsOption    ::=    ("diacritics" "insensitive")
| ("diacritics" "sensitive")

[Definition: A diacritics option modifies token and phrase matching by specifying how diacritics are considered. ]

There are two possible diacritics options:

  1. The option "diacritics" "insensitive" matches tokens and phrases with and without diacritics. Whether diacritics are written in the query or not is not considered.

  2. The option "diacritics" "sensitive" matches tokens and phrases only if they contain the diacritics as they are written in the query.

The default is "diacritics insensitive".

The effect of the diacritics options is also influenced by the query's default collation (see Section 2.1.1 Static Context and Section 4.4 Default Collation Declaration of [XQuery 1.0: An XML Query Language]). The following table summarizes how these interact.

Diacritics Matrix

Diacritics option \ Default collation | UCC (Unicode Codepoint Collation) | CDS (some generic diacritics-sensitive collation) | CDI (some generic diacritics-insensitive collation)
diacritics insensitive | UCC comparison, but without considering diacritics | diacritics-insensitive variant of CDS if it exists, else error | CDI
diacritics sensitive | UCC | CDS | diacritics-sensitive variant of CDI if it exists, else error

Note:

In this table, "else error" means "Otherwise, an error is raised: [err:FOCH0002]" (an error defined in [XQuery 1.0 and XPath 2.0 Functions and Operators]). The phrase "if it exists" is used, because the diacritics-sensitive collation CDS does not always have a diacritics-insensitive variant (and, even if one exists, it may not be possible to determine it algorithmically), and because the diacritics-insensitive collation CDI does not always have a diacritics-sensitive variant (and, even if one exists, it may not be possible to determine it algorithmically).
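A common way to approximate "UCC comparison, but without considering diacritics" is to decompose both strings and drop combining marks before comparing. The Python sketch below takes that approach as an assumption; this specification does not prescribe it, and the function names are hypothetical.

```python
import unicodedata

def strip_diacritics(text):
    # Decompose to NFD, then drop combining marks such as the acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def diacritics_match(query, token, sensitive):
    if sensitive:
        # Normalize to NFC so precomposed and decomposed forms compare equal.
        return (unicodedata.normalize("NFC", query)
                == unicodedata.normalize("NFC", token))
    return strip_diacritics(query) == strip_diacritics(token)
```

Case handling is deliberately left out of this sketch so that the diacritics option can be seen in isolation.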

The following expression returns true, because the token "Véra" in the editor element is matched, as the acute accent is not considered in the comparison:

//book[@number="1"]//editor ftcontains "Vera" diacritics insensitive

The following expression returns false, because the editor element does not contain the token "Vera" in this exact form, i.e. without any diacritics:

//book[@number="1"]/editor ftcontains "Vera" diacritics sensitive

3.4.7 Stop Word Option

[172]    FTStopWordOption    ::=    ("with" "stop" "words" FTStopWords FTStopWordsInclExcl*)
| ("without" "stop" "words")
| ("with" "default" "stop" "words" FTStopWordsInclExcl*)
[173]    FTStopWords    ::=    ("at" URILiteral)
| ("(" StringLiteral ("," StringLiteral)* ")")
[174]    FTStopWordsInclExcl    ::=    ("union" | "except") FTStopWords

[Definition: A stop word option controls matching of FTWords by specifying whether stop words are used or not. Stop words are tokens in the query that match any token in the text being searched. ] Normally a stop word matches exactly one token, but there may be implementation-defined conditions under which a stop word matches a different number of tokens.

FTStopWords specifies the list of stop words either explicitly as a comma-separated list of string literals, or by the keyword at followed by a literal URI. If the URI specifies a list of stop words that is not found in the statically known stop word lists, an error is raised [err:FTST0008]. Whether the stop word list is resolved from the statically known stop word lists or given explicitly, no tokenization is performed on the stop words: they are used as they occur in the list.

The "with stop words" option specifies that if a token is within the specified collection of stop words, it is removed from the search and any token may be substituted for it. Stop words retain their position numbers and are counted in FTDistance and FTWindow searches.

Multiple stop word lists may be combined using "union" or "except". The keywords "union" and "except" are applied from left to right. If "union" is specified, every string occurring in the lists specified by the left-hand side or the right-hand side is a stop word. If "except" is specified, only strings occurring in the list specified by the left-hand side but not in the list specified by the right-hand side are stop words.
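
The left-to-right application of "union" and "except" can be sketched as a fold over sets. This is an illustrative model only; the function name combine_stop_words and the representation of each FTStopWordsInclExcl as a (keyword, words) pair are assumptions made for the example. In line with the rule that no tokenization is performed on stop words, the strings are used exactly as given.

```python
def combine_stop_words(base, operations):
    """Apply ("union" | "except", words) pairs to a base list, left to right."""
    result = set(base)
    for keyword, words in operations:
        if keyword == "union":
            result |= set(words)   # every string from either side is a stop word
        elif keyword == "except":
            result -= set(words)   # strings on the right-hand side are removed
        else:
            raise ValueError(f"unknown keyword: {keyword}")
    return result

# Order matters: removing "a" and then adding it back leaves it in the list.
stop_words = combine_stop_words(["a"], [("except", ["a"]), ("union", ["a"])])
```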

The "with default stop words" option specifies that an implementation-defined collection of stop words is used.

The "without stop words" option specifies that no stop words are used. This is equivalent to specifying an empty list of stop words.

The default is "without stop words".

Note:

Some implementations may apply stop word lists during indexing and be unable to comply with query-time requests to not apply those stop words. An implementation may still support stop-word options (and therefore not raise [err:FTST0006]) by applying any additional stop words specified in the query. Pre-application of irrevocable stop word lists falls under implementation-defined tokenization behavior in this case, and a query that specifies "without stop words" may still have some words ignored.

The following expression returns true, because the document contains the phrase "propagating few errors":

/books/book[@number="1"]//p ftcontains "propagation of errors"
with stemming with stop words ("a", "the", "of") 

Note the asymmetry in the stop word semantics: the property of being a stop word is only relevant to query terms, not to document terms. Hence, it is irrelevant for the above-mentioned match whether "few" is a stop word or not, and on the other hand we do not want the query above to match "propagation" followed by 2 stop words, or even a sequence of 3 stop words in the document.
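
The asymmetry can be made concrete with a small Python model of phrase matching. It is a simplified sketch: stemming, case handling, and the implementation-defined cases in which a stop word matches a different number of tokens are all omitted, tokens are compared by plain string equality, and the function name phrase_matches is hypothetical. Because stemming is left out, the query below uses "propagating" directly.

```python
def phrase_matches(query_tokens, text_tokens, stop_words=frozenset()):
    """Return the start positions at which the query phrase matches the text.

    A query token listed in stop_words acts as a wildcard for exactly one
    text token. Text tokens are never treated as stop words.
    """
    n = len(query_tokens)
    starts = []
    for start in range(len(text_tokens) - n + 1):
        window = text_tokens[start:start + n]
        if all(q in stop_words or q == t for q, t in zip(query_tokens, window)):
            starts.append(start)
    return starts

text = ["propagating", "few", "errors"]

# "of" is a stop word in the query, so it may stand for "few" in the text.
with_stop_words = phrase_matches(["propagating", "of", "errors"], text, {"a", "the", "of"})

# Declaring "few" a stop word does not remove it from the text.
few_as_stop_word = phrase_matches(["propagating", "errors"], text, {"few"})
```

In this model the first call finds one match at position 0, and the second finds none, because a stop word only relaxes a query token and never deletes a token from the text being searched.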

The following expression returns false. In this case specifying "few" as a stop word has no effect, since "few" does not appear in the query. Although the words "propagating" and "errors" appear in the text being searched, the phrase "propagating errors" cannot be matched, since that phrase does not occur.

/books/book[@number="1"]//p ftcontains "propagating errors" 
with stop words ("few")

The following expression returns false, because "of" is not in the p element between "propagating" and "errors":

/books/book[@number="1"]//p ftcontains "propagation of errors" 
with stemming without stop words

The following expression uses the stop words list specified at the URL. Assuming that the specified stop word list contains the word "then", this query is reduced to a query on the phrase "planning X conducting", allowing any token as a substitute for X. It returns a book element, because its content element contains "planning then conducting". It would also return the book if the phrases "planning and conducting" and "planning before conducting" had been in its content:

doc("http://bstore1.example.com/full-text.xml")
/books/book[count(.//content ftcontains "planning then 
conducting" with stop words at 
"http://bstore1.example.com/StopWordList.xml")>0]

The following expression returns books containing "planning then conducting", but does not return books containing "planning and conducting", since it exempts "then" from being a stop word:

doc("http://bstore1.example.com/full-text.xml")
/books/book[count(.//content ftcontains "planning then conducting"
with stop words at "http://bstore1.example.com/StopWordList.xml"
except ("the", "then"))>0]

3.4.8 Extension Option

[Definition: An extension option is a match option that acts in an implementation-defined way. ]

[177]    FTExtensionOption    ::=    "option" QName StringLiteral

An extension option consists of an identifying QName and a StringLiteral. Typically, a particular option will be recognized by some implementations and not by others. The syntax is designed so that option declarations can be successfully parsed by all implementations.

The QName of an extension option must resolve to a namespace URI and local name, using the statically known namespaces.

Note:

There is no default namespace for options.

Each implementation recognizes an implementation-defined set of namespace URIs used to denote extension options.

If the namespace part of the QName is not a namespace recognized by the implementation as one used to denote extension options, then the extension option is ignored.

Otherwise, the effect of the extension option, including its error behavior, is implementation-defined. For example, if the local part of the QName is not recognized, or if the StringLiteral does not conform to the rules defined by the implementation for the particular extension option, the implementation may choose whether to report an error, ignore the extension option, or take some other action.
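
The two rules above (ignore options in unrecognized namespaces; otherwise behave in an implementation-defined way) might be dispatched as in the following Python sketch. Everything here is hypothetical: the function apply_extension_option, the handlers registry, and the decision to raise an error for an unrecognized local name are choices made for illustration, and a real implementation is free to ignore the option or take some other action instead.

```python
def apply_extension_option(namespace_uri, local_name, value, handlers):
    """Dispatch one extension option.

    handlers maps each recognized namespace URI to a dict that maps
    local names to callables taking the option's string value.
    """
    if namespace_uri not in handlers:
        # Namespace not recognized: the extension option is ignored.
        return None
    handler = handlers[namespace_uri].get(local_name)
    if handler is None:
        # Implementation-defined; this sketch chooses to report an error.
        raise ValueError(f"unrecognized extension option: {local_name}")
    return handler(value)

# Hypothetical registry with a single recognized option.
example_handlers = {
    "http://example.org/XQueryImplementation": {
        "compounds": lambda value: ("compounds", value),
    }
}
```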

Implementations may impose rules on where particular extension options may appear relative to other match options, and the interpretation of an option declaration may depend on its position.

An extension option must not be used to change the syntax accepted by the processor, or to suppress the detection of static errors. However, it may be used without restriction to modify the set of tokens in the query or how they are matched against tokens in the text being searched. An extension option has the same scope as other match options.

The following examples illustrate several possible uses for extension options:

This extension option is set as part of the static context of all full-text expressions in the module and might be used to ensure that queries are insensitive to Arabic short-vowels.

declare namespace exq = "http://example.org/XQueryImplementation";

declare ft-option option exq:diacritics "short-vowel insensitive"

This extension option applies only to the matching in the full-text selection in which it is found and might be used to specify how compound words should be matched.

declare namespace exq = "http://example.org/XQueryImplementation";

//para[. ftcontains
         ("Kinder" ftand "Platz" distance exactly 1 words)
         with stemming
         option exq:compounds "distance=1" ]

3.5 Logical Full-Text Operators

Full-text selections can be combined with the logical connectives ftor (full-text or), ftand (full-text and), not in (mild not), and ftnot (unary full-text not).

[145]    FTOr    ::=    FTAnd ( "ftor" FTAnd )*
[146]    FTAnd    ::=    FTMildNot ( "ftand" FTMildNot )*
[147]    FTMildNot    ::=    FTUnaryNot ( "not" "in" FTUnaryNot )*
[148]    FTUnaryNot    ::=    ("ftnot")? FTPrimaryWithOptions

3.5.1 Or-Selection

[Definition: An or-selection combines two full-text selections using the ftor operator.]

An or-selection finds all matches that satisfy at least one of the operand full-text selections.

The following expression returns the book element written by "Millicent":

//book[.//author ftcontains "Millicent" ftor "Voltaire"]

3.5.2 And-Selection

[Definition: An and-selection combines two full-text selections using the ftand operator.]

An and-selection finds matches that satisfy all of the operand full-text selections simultaneously. A match of an and-selection is formed by combining matches for each of the operand full-text selections as described in 4.2.7.2 FTAnd.

For example, "usability" ftand "testing" will find two matches in //book[@number="1"]/title: each of the two matches for the FTWords selection "usability" (the two occurrences of "usability" in the string value of the title element) is combined with the single match for the FTWords "testing" (only one occurrence of "testing" in the title). Since the above and-selection has at least one match, the following expression will return "true".

//book[@number="1"]/title ftcontains ("usability" ftand "testing")
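
The way matches are combined can be illustrated with a small Python model in which each match is a tuple of token positions and an and-selection pairs every match of the left operand with every match of the right operand. This is a simplification of the AllMatches model, not the normative definition. The names find_token and ft_and are hypothetical, and the lowercased title string is an assumed stand-in for the title of book 1, chosen so that it contains "usability" twice and "testing" once.

```python
from itertools import product

def find_token(token, text_tokens):
    """Return one single-position match per occurrence of token."""
    return [(i,) for i, t in enumerate(text_tokens) if t == token]

def ft_and(matches_a, matches_b):
    """Combine every match of A with every match of B."""
    return [a + b for a, b in product(matches_a, matches_b)]

# Assumed stand-in for the title of book 1 (lowercased, already tokenized).
title = ("improving the usability of a web site through "
         "expert reviews and usability testing").split()

combined = ft_and(find_token("usability", title), find_token("testing", title))
```

With two matches for "usability" and one for "testing", the product contains two combined matches. If either operand has no matches, the product is empty, which is why an and-selection over an element lacking one of the terms has zero matches.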

The following expression returns false, because "Millicent" and "Montana" are not contained by the same author element in any book element:

//book/author ftcontains "Millicent" ftand "Montana"

No author element in any book element contains both "Millicent" and "Montana". Therefore, for any such author element, there is either one match for the FTWords "Millicent" and zero matches for the FTWords "Montana", or vice versa, or no match for either of them. In each of these cases, the and-selection has zero matches.

3.5.3 Mild-Not Selection

[Definition: A mild-not selection combines two full-text selections using the not in operator.]

The not in operator is a milder form of the operator combination ftand ftnot. The selection A not in B matches a token sequence that matches A, but not when it is a part of a match of B. In contrast, A ftand ftnot B only finds matches when the token sequence contains A and does not contain B.

As an example, consider a search for "Mexico" not in "New Mexico". This may return, among others, a document which is all about "Mexico" but mentions at the end that "New Mexico was named after Mexico". The occurrence of "Mexico" in "New Mexico" is not considered, but other occurrences of "Mexico" are matched. Note that this document would not be matched by the full-text selection "Mexico" ftand ftnot "New Mexico".

A match to a mild-not selection must contain at least one token that satisfies the first condition and does not satisfy the second condition. If it contains a token that satisfies both the first and the second condition, the token is not considered as a match.
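
The "Mexico" not in "New Mexico" example can be traced with the following Python sketch. It models each match as a tuple of token positions and keeps a match of the first operand only if, compared with each match of the second operand, it has at least one position that the second does not cover. This is one simplified reading of the rule above, not the formal semantics; the names find_phrase and mild_not are hypothetical.

```python
def find_phrase(phrase, text_tokens):
    """Return one match (a tuple of consecutive positions) per occurrence."""
    n = len(phrase)
    return [
        tuple(range(i, i + n))
        for i in range(len(text_tokens) - n + 1)
        if text_tokens[i:i + n] == phrase
    ]

def mild_not(matches_a, matches_b):
    """Keep matches of A that are not entirely covered by any match of B."""
    return [
        a for a in matches_a
        if all(not set(a) <= set(b) for b in matches_b)
    ]

text = "new mexico was named after mexico".split()
kept = mild_not(find_phrase(["mexico"], text), find_phrase(["new", "mexico"], text))
```

Here "mexico" occurs at positions 1 and 5 (counting from 0). The occurrence at position 1 lies inside the match for "new mexico" and is dropped, while the occurrence at position 5 is kept, so the selection still has a match.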

The following expression returns true, because "usability" appears in the title and the p elements and the token within the phrase "Usability Testing" in the title element is not considered:

/books/book ftcontains "usability" not in "usability testing"

Operands of a mild-not selection may not contain a full-text selection that evaluates to an AllMatches that contains a StringExclude. Such full-text selections are not-selections and FTWords with a cardinality constraint that uses an at most, from ... to, or exactly occurrences range. If such an expression is encountered, an error [err:FTDY0017] is raised.

3.5.4 Not-Selection

[Definition: A not-selection is a full-text selection starting with the prefix operator ftnot.]

A not-selection selects matches that do not satisfy the operand full-text selection. Details about how such matches are constructed are given in 4.2.7.3 FTUnaryNot.

The following expression returns the empty sequence, because all book elements contain "usability":

//book[. ftcontains ftnot "usability"]

The following expression returns true, because book elements contain "information" and "retrieval" but not "information retrieval":

//book ftcontains "information" ftand
"retrieval" ftand ftnot "information retrieval"

The following expression returns book elements containing "web site usability" but not "usability testing":

//book[. ftcontains "web site usability" ftand 
ftnot "usability testing"]

3.6 Positional Filters

[157]    FTPosFilter    ::=    FTOrder | FTWindow | FTDistance | FTScope | FTContent

[Definition: Positional filters are postfix operators that serve to filter matches based on various constraints on their positional information.]

Recall that the grammar rule for FTSelection allows an arbitrary number of positional filters to follow an FTOr. Multiple adjacent positional filters are applied from left to right, i.e., the first filter is applied to the result of the FTOr, the second is applied to the result of that first application, and so on.

3.6.1 Ordered Selection

[158]    FTOrder    ::=    "ordered"

[Definition: An ordered selection consists of a full-text selection followed by the postfix operator "ordered".] An ordered selection constrains the order of tokens and phrases to be the same as the order in which they are written in the operand selection.

The default is unordered. Unordered is in effect when ordered is not specified in the query. Unordered cannot be written explicitly in the query.

An ordered selection selects matches which satisfy the operand full-text selection and which also satisfy the following constraint: the order that the matching tokens or phrases have in the text being searched is the same order that the corresponding query tokens or phrases have in the operand selection. In both cases, the ordering is determined from the minimum start positions of the constituent tokens.
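
The ordering constraint can be sketched in Python as a filter over matches. In this simplified model, each match is a list of (query_pos, start_pos) pairs, one per matched token or phrase, where query_pos is its position in the query and start_pos is its minimum start position in the text. The names is_ordered and ordered_filter are hypothetical, and the positions in the example are assumed values for a title in which "web site" starts at position 5 and "usability" occurs at positions 2 and 11.

```python
def is_ordered(string_includes):
    """True if text order agrees with query order for every pair."""
    return all(
        s1 <= s2
        for (q1, s1) in string_includes
        for (q2, s2) in string_includes
        if q1 < q2
    )

def ordered_filter(matches):
    """Keep only the matches that satisfy the ordering constraint."""
    return [m for m in matches if is_ordered(m)]

# ("web site" ftand "usability"): "web site" is query position 1,
# "usability" is query position 2 (assumed text positions, see above).
candidate_matches = [
    [(1, 5), (2, 2)],    # "usability" precedes "web site" in the text
    [(1, 5), (2, 11)],   # "web site" precedes "usability" in the text
]
surviving = ordered_filter(candidate_matches)
```

Only the second candidate survives the filter. Since at least one match remains, an ftcontains expression over this selection would evaluate to true in this model.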

The following expression returns true, because titles of book elements contain "web site" and "usability" in the order in which they are written in the query, i.e., "web site" must precede "usability":

//book/title ftcontains ("web site" ftand "usability") ordered

The following expression returns false, because although "Montana" and "Millicent" both appear in the book element, they do not appear in the order they are written in the query:

//book[@number="1"] ftcontains ("Montana" ftand "Millicent") ordered

3.6.2 Window Selection

[159]    FTWindow    ::=    "window" AdditiveExpr FTUnit
[161]    FTUnit    ::=    "words" | "sentences" | "paragraphs"

[Definition: A window selection consists of a full-text selection followed by one of the (complex) postfix operators derived from FTWindow.] A window selection selects matches which satisfy the operand full-text selection and for which the matched tokens and phrases, more precisely the individual StringIncludes of that match, are found within a number of FTUnits (words, sentences, and paragraphs). The number of FTUnits is specified by an AdditiveExpr that is converted as though it were an argument to a function w