W3C

XQuery 1.0 and XPath 2.0 Full-Text 1.0

W3C Working Draft 18 May 2007

This version:
http://www.w3.org/TR/2007/WD-xpath-full-text-10-20070518/
Latest version:
http://www.w3.org/TR/xpath-full-text-10/
Previous versions:
http://www.w3.org/TR/2006/WD-xquery-full-text-20060501/
http://www.w3.org/TR/2005/WD-xquery-full-text-20051103/
http://www.w3.org/TR/2005/WD-xquery-full-text-20050915/
http://www.w3.org/TR/2005/WD-xquery-full-text-20050404/
http://www.w3.org/TR/2004/WD-xquery-full-text-20040709/
Editors:
Sihem Amer-Yahia, AT&T Labs - Research
Chavdar Botev, Invited Expert
Stephen Buxton, Mark Logic Corporation
Pat Case, Library of Congress
Jochen Doerre, IBM
Mary Holstege, Mark Logic Corporation
Jim Melton, Oracle
Michael Rys, Microsoft
Jayavel Shanmugasundaram, Invited Expert

This document is also available in these non-normative formats: XML.


Abstract

This document defines the syntax and formal semantics of XQuery 1.0 and XPath 2.0 Full-Text 1.0 which is a language that extends XQuery 1.0 [XQuery 1.0: An XML Query Language] and XPath 2.0 [XML Path Language (XPath) 2.0] with full-text search capabilities.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is a Last Call Working Draft for review by W3C Members and other interested parties. This document was produced following the procedures set out for the W3C Process and was defined jointly by the XSL Working Group and the XML Query Working Group (both part of the XML Activity). It is designed to be read in conjunction with the following documents: [XQuery 1.0: An XML Query Language], [XQuery 1.0 and XPath 2.0 Full-Text Requirements], and [XQuery 1.0 and XPath 2.0 Full-Text Use Cases].

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document defines a language for expressing full-text queries on XML documents; the language is specified in the form of extensions to both XPath 2.0 and XQuery 1.0. Organizations and individuals should review this document to determine the degree to which the language specified meets the needs of the full-text community. The Working Groups believe that this work is essentially complete and intend to advance it as soon as possible.

This is the sixth version of this document. Since the last version was published several technical and editorial changes have been made. Among the most significant changes are:

  • The formal semantics diagrams have been redrawn.

  • A conformance statement has been added.

  • XML Schemas that together define the XML representation of XQuery 1.0 and XPath 2.0 Full-Text have been added, along with a stylesheet to transform that XML representation to the ordinary XQuery syntax.

  • Section 3 has been significantly restructured for clarity and readability.

  • The semantics of nesting FTDistance selections have been made more useful.

  • The semantics for FTMildNot now properly handle phrases.

See Appendix J Change Log for more information on these and other changes.

Of the XQuery 1.0 and XPath 2.0 Full Text documents, only this document, XQuery 1.0 and XPath 2.0 Full-Text 1.0, is a Last Call document. The XQuery and XPath Full-Text Requirements [XQuery 1.0 and XPath 2.0 Full-Text Requirements], although not on the Recommendation track, is being republished concurrently with this document in order to demonstrate the degree to which this document satisfies those Requirements. The XQuery Full-Text Use Cases [XQuery 1.0 and XPath 2.0 Full-Text Use Cases] document, although not on the Recommendation track, is being republished concurrently with this document in order to illustrate various use cases that guided the design of the XQuery 1.0 and XPath 2.0 Full Text specification.

Public Last Call comments on this document and its open issues are invited. Comments on this document are due by 22 June 2007. Comments on this document should be made in W3C's public Bugzilla system for this specification (instructions can be found at http://www.w3.org/XML/2005/04/qt-bugzilla). When entering comments, select the Product named "XPath / XQuery / XSLT", the Component named "Full Text", and the Version named "Last Call drafts". This repository includes open issues recorded by the Query Working Group as well as by members of the public. If access to the Bugzilla system is not feasible, you may send your comments to the W3C XSLT/XPath/XQuery mailing list, public-qt-comments@w3.org. It will be very helpful if you include the string [FT] in the subject line of your comment, whether made in Bugzilla or in email. Each Bugzilla entry and email message should contain only one comment. Archives of the comments and responses are available at http://lists.w3.org/Archives/Public/public-qt-comments/.

This document was produced by groups operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the XML Query Working Group and also maintains a public list of any patent disclosures made in connection with the deliverables of the XSL Working Group; those pages also include instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

1 Introduction
    1.1 Full-Text Search and XML
    1.2 Organization of this document
    1.3 A word about namespaces
2 Full-Text Extensions to XQuery and XPath
    2.1 Processing Model
    2.2 Full-text Contains Expression
        2.2.1 Description
        2.2.2 Examples
    2.3 Score Variables
        2.3.1 Using Weights Within a Scored FTContainsExpr
    2.4 Extensions to the Static Context
3 Full-Text Selections
    3.1 Primary Full-Text Selections
    3.2 Search Tokens and Phrases
    3.3 Match Options
        3.3.1 Case Option
        3.3.2 Diacritics Option
        3.3.3 Stemming Option
        3.3.4 Thesaurus Option
        3.3.5 Stop Word Option
        3.3.6 Language Option
        3.3.7 Wildcard Option
        3.3.8 Extension Option
    3.4 Logical Full-Text Operators
        3.4.1 Or-Selection
        3.4.2 And-Selection
        3.4.3 Mild-Not Selection
        3.4.4 Not-Selection
    3.5 Positional Filters
        3.5.1 Ordered Selection
        3.5.2 Window Selection
        3.5.3 Distance Selection
        3.5.4 Scope Selection
        3.5.5 Anchoring Selection
    3.6 Cardinality Selection
    3.7 Ignore Option
    3.8 Extension Selections
4 Semantics
    4.1 Tokenization
        4.1.1 Examples
        4.1.2 Representations of Tokenized Text and Matching
    4.2 Evaluation of FTSelections
        4.2.1 AllMatches
            4.2.1.1 Formal Model
            4.2.1.2 Examples
            4.2.1.3 XML representation
        4.2.2 XML Representation
        4.2.3 The evaluate function
        4.2.4 Formal semantics functions
        4.2.5 FTWords
        4.2.6 Match Options Semantics
            4.2.6.1 Types
            4.2.6.2 High-Level Semantics
            4.2.6.3 Formal Semantics Functions
            4.2.6.4 FTCaseOption
            4.2.6.5 FTDiacriticsOption
            4.2.6.6 FTStemOption
            4.2.6.7 FTThesaurusOption
            4.2.6.8 FTStopWordOption
            4.2.6.9 FTLanguageOption
            4.2.6.10 FTWildCardOption
        4.2.7 Full-Text Operators Semantics
            4.2.7.1 FTOr
            4.2.7.2 FTAnd
            4.2.7.3 FTUnaryNot
            4.2.7.4 FTMildNot
            4.2.7.5 FTOrder
            4.2.7.6 FTScope
            4.2.7.7 FTContent
            4.2.7.8 FTWindow
            4.2.7.9 FTDistance
            4.2.7.10 FTTimes
    4.3 XQuery 1.0 and XPath 2.0 Full-Text 1.0 and Scoring Expressions
        4.3.1 FTContainsExpr
            4.3.1.1 Semantics of FTContainsExpr
        4.3.2 Scoring
        4.3.3 Example
5 Conformance
    5.1 Minimal Conformance
    5.2 Optional Operators and Match Options
        5.2.1 FTMildNot Operator
        5.2.2 FTUnaryNot Operator
        5.2.3 FTUnit and FTBigUnit
        5.2.4 FTOrder Operator
        5.2.5 FTScope Operator
        5.2.6 FTWindow Operator
        5.2.7 FTDistance Operator
        5.2.8 FTTimes Operator
        5.2.9 FTContent Operator
        5.2.10 FTCaseOption
        5.2.11 FTStopwordOption
        5.2.12 FTLanguageOption
        5.2.13 FTIgnoreOption
        5.2.14 Scoring

Appendices

A EBNF for XQuery 1.0 Grammar with Full-Text extensions
    A.1 Terminal Symbols
    A.2 Extra-grammatical Constraints
B EBNF for XPath 2.0 Grammar with Full-Text extensions
    B.1 Terminal Symbols
C Static Context Components
D Error Conditions
E XML Syntax (XQueryX) for XQuery 1.0 and XPath 2.0 Full-Text 1.0
    E.1 XQueryX representation of XQuery 1.0 and XPath 2.0 Full-Text 1.0
    E.2 XQueryX stylesheet for XQuery 1.0 and XPath 2.0 Full-Text 1.0
    E.3 XQueryX for XQuery 1.0 and XPath 2.0 Full-Text 1.0 example
        E.3.1 Example
            E.3.1.1 XQuery solution in XQuery 1.0 and XPath 2.0 Full-Text 1.0 Use Cases:
            E.3.1.2 A Solution in Full-Text XQueryX:
            E.3.1.3 Transformation of Full-Text XQueryX Solution into XQuery Full-Text
F References
    F.1 Normative References
    F.2 Non-normative References
G Acknowledgements (Non-Normative)
H Glossary (Non-Normative)
I Checklist of Implementation-Defined Features (Non-Normative)
J Change Log (Non-Normative)


1 Introduction

This document defines the language and the formal semantics of XQuery 1.0 and XPath 2.0 Full-Text 1.0. This language is designed to meet the requirements identified in W3C XQuery and XPath Full-Text Requirements [XQuery 1.0 and XPath 2.0 Full-Text Requirements] and to support the queries in the W3C XQuery Full-Text Use Cases [XQuery 1.0 and XPath 2.0 Full-Text Use Cases].

XQuery 1.0 and XPath 2.0 Full-Text 1.0 extends the syntax and semantics of XQuery 1.0 and XPath 2.0.

1.1 Full-Text Search and XML

As XML becomes mainstream, users expect to be able to search their XML documents. This requires a standard way to do full-text search, as well as structured searches, against XML documents. A similar requirement for full-text search led ISO to define the SQL/MM-FT [SQL/MM] standard. SQL/MM-FT defines extensions to SQL to express full-text searches providing functionality similar to that defined in this full-text language extension to XQuery 1.0 and XPath 2.0.

XML documents may contain highly structured data (fixed schemas, known types such as numbers, dates), semi-structured data (flexible schemas and types), markup data (text with embedded tags), and unstructured data (untagged free-flowing text). Where a document contains unstructured or semi-structured data, it is important to be able to search using Information Retrieval techniques such as scoring and weighting.

Full-text search is different from substring search in many ways:

  1. A full-text search searches for tokens and phrases rather than substrings. A substring search for news items that contain the string "lease" will return a news item that contains "Foobar Corporation releases the 20.9 version ...". A full-text search for the token "lease" will not.

  2. There is an expectation that a full-text search will support language-based searches which substring search cannot. An example of a language-based search is "find me all the news items that contain a token with the same linguistic stem as 'mouse'" (finds "mouse" and "mice"). Another example, based on token proximity, is "find me all the news items that contain the tokens 'XML' and 'Query' allowing up to 3 intervening words".

  3. Full-text search must address the vagaries and nuances of language. Search results are often of varying usefulness. When you search a web site for cameras that cost less than $100, this is an exact search. There is a set of cameras that matches this search, and a set that does not. Similarly, when you do a string search across news items for "mouse", there is only 1 expected result set. When you do a full-text search for all the news items that contain the token "mouse", you probably expect to find news items containing the token "mice", and possibly "rodents", or possibly "computers". Not all results are equal. Some results are more "mousey" than others. Because full-text search may be inexact, we have the notion of score or relevance. We generally expect to see the most relevant results at the top of the results list.

    As XQuery and XPath evolve, they may apply the notion of score to querying structured data. For example, when making travel plans or shopping for cameras, it is sometimes useful to get an ordered list of near matches in addition to exact matches. If XQuery and XPath define a generalized inexact match, we expect XQuery and XPath to utilize the scoring framework provided by XQuery and XPath Full-Text.
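The first difference above can be illustrated with a minimal sketch. The tokenizer here (`simple_tokens`) is a hypothetical stand-in that lowercases the text and extracts runs of letters and digits; real tokenizers are implementation-defined and may behave differently.

```python
import re


def simple_tokens(text):
    """Hypothetical tokenizer: lowercase runs of ASCII letters and digits.

    This is an assumption made for illustration only; the specification
    leaves tokenization implementation-defined.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


news_item = "Foobar Corporation releases the 20.9 version"

# Substring search: "lease" occurs inside "releases", so this is a hit.
substring_hit = "lease" in news_item

# Token search: no token of the news item is exactly "lease", so no hit.
token_hit = "lease" in simple_tokens(news_item)

print(substring_hit, token_hit)
```

Under this assumed tokenization, the substring search finds the news item and the token search does not, which matches the behavior described in item 1.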

The following definitions apply to full-text search:

  1. [Definition: Full-text queries are performed on tokens and phrases. Tokens and phrases are produced via tokenization.] Informally, tokenization breaks a character string into a sequence of words, units of punctuation, and spaces.

  2. [Definition: A token is defined as a character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be searched. Each instance of a token consists of one or more consecutive characters. Beyond that, tokens are implementation-defined.] Note that consecutive tokens need not be separated by either punctuation or space, and tokens may overlap. [Definition: A phrase is an ordered sequence of any number of tokens. Beyond that, phrases are implementation-defined.]

    Note:

    In some natural languages, tokens and words can be used interchangeably.

  3. Tokenization enables functions and operators that operate on a part or the root of the token (e.g., wildcards, stemming).

    Tokenization enables functions and operators which work with the relative positions of tokens (e.g., proximity operators).

    Tokenization also uniquely identifies sentences and paragraphs in which tokens appear. [Definition: A sentence is an ordered sequence of any number of tokens. Beyond that, sentences are implementation-defined. A tokenizer is not required to support sentences.] [Definition: A paragraph is an ordered sequence of any number of tokens. Beyond that, paragraphs are implementation-defined. A tokenizer is not required to support paragraphs.] Whatever a tokenizer for a particular language chooses to do, it must preserve the containment hierarchy: paragraphs contain sentences, which contain tokens.

    The tokenizer has to process two codepoint-equal strings in the same way, i.e., it should identify the same tokens. Everything else about the behavior of the tokenizer is implementation-defined.

  4. This specification focuses on functionality that serves all languages. It also selectively includes functionalities useful within specific families of languages. For example, searching within sentences and paragraphs is useful to many western languages and to some non-western languages, so that functionality is incorporated into this specification.

  5. Some XML elements represent semantic markup, e.g., <title>. Others represent formatting markup, e.g., <b> to indicate bold. Semantic markup serves well as token boundaries. Some formatting markup serves well as token boundaries, for example, paragraphs are most commonly delimited by formatting markup. Other formatting markup may not serve well as token boundaries. Implementations are free to provide implementation-defined ways to differentiate between the markup's effect on token boundaries during tokenization.

Certain aspects of language processing are described in this specification as implementation-defined or implementation-dependent.

  • [Definition: Implementation-defined indicates an aspect that may differ between implementations, but must be specified by the implementor for each particular implementation.]

  • [Definition: Implementation-dependent indicates an aspect that may differ between implementations, is not specified by this or any W3C specification, and is not required to be specified by the implementor for any particular implementation.]

1.2 Organization of this document

This document is organized as follows. We first present a high level syntax for the XQuery 1.0 and XPath 2.0 Full-Text 1.0 language along with some examples. Then, we present the syntax and examples of the basic primitives in the XQuery 1.0 and XPath 2.0 Full-Text 1.0 language. This is followed by the semantics of the XQuery 1.0 and XPath 2.0 Full-Text 1.0 language. The appendix contains a section that provides an EBNF for the XPath 2.0 Grammar with Full-Text extensions, an EBNF for XQuery 1.0 Grammar with Full-Text extensions, acknowledgements and a glossary.

1.3 A word about namespaces

Certain namespace prefixes are predeclared by XQuery 1.0 and, by implication, by this specification, and bound to fixed namespace URIs. These namespace prefixes are as follows:

  • xml = http://www.w3.org/XML/1998/namespace

  • xs = http://www.w3.org/2001/XMLSchema

  • xsi = http://www.w3.org/2001/XMLSchema-instance

  • fn = http://www.w3.org/2005/xpath-functions

  • local = http://www.w3.org/2005/xquery-local-functions

In addition to the prefixes in the above list, this document uses the prefix err to represent the namespace URI http://www.w3.org/2005/xqt-errors. This namespace prefix is not predeclared and its use in this document is not normative. Error codes that are not defined in this document are defined in other XQuery 1.0 and XPath 2.0 specifications, particularly [XML Path Language (XPath) 2.0] and [XQuery 1.0 and XPath 2.0 Functions and Operators].

Finally, this document uses the prefix fts to represent a namespace containing a number of functions used in this document to describe the semantics of XQuery 1.0 and XPath 2.0 Full-Text functions. There is no requirement that these functions be implemented, therefore no URI is associated with that prefix.

2 Full-Text Extensions to XQuery and XPath

XQuery 1.0 and XPath 2.0 Full-Text extends the languages of XQuery 1.0 and XPath 2.0 in three ways. It:

  1. Adds a new expression called FTContainsExpr;

  2. Enhances the syntax of FLWOR expressions in XQuery 1.0 and for expressions in XPath 2.0 with optional score variables; and

  3. Adds static context declarations for full-text match options to the query prolog.

Additionally, it extends the data model and processing models in various ways.

2.1 Processing Model

As part of the External Processing that is described in the XQuery Processing Model, when an XML document is parsed into an Infoset/PSVI and ultimately into an XQuery Data Model instance, a full-text process called tokenization is usually executed.

Tokenization, in general terms, is the process of converting a text string into smaller units that are used in query processing. Those units, called tokens, are the most basic text units that a full-text search can refer to. Full-text operators typically work on sequences of token occurrences found in the target text (nodes) of a search. These token occurrences are characterized by unique identifiers that capture the relative position of the token inside the string, the relative position of the sentence containing the token, and the relative position of the paragraph containing the token.

Tokenization, including the definition of the term "words", SHOULD be implementation-defined. Implementations SHOULD expose the rules and sample results of tokenization as much as possible to enable users to predict and interpret the results of tokenization. Tokenization MUST only conform to these constraints:

  1. Each word MUST consist of one or more consecutive characters;

  2. The tokenizer MUST preserve the containment hierarchy (e.g., paragraphs contain sentences, which contain words); and

  3. The tokenizer MUST, when tokenizing two equal strings, identify the same tokens in each.

A sample tokenization is used for the examples in this document. The results might be different for other tokenizations.
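As an illustration of the three constraints, the following is a minimal sketch of one possible tokenizer. The splitting rules (blank lines end paragraphs, sentence-final punctuation followed by whitespace ends a sentence, runs of word characters form words) and the names `TokenOccurrence` and `tokenize` are assumptions made for this sketch; they are not defined by this specification.

```python
import re
from typing import List, NamedTuple


class TokenOccurrence(NamedTuple):
    word: str
    position: int   # relative position of the token in the string
    sentence: int   # relative position of the containing sentence
    paragraph: int  # relative position of the containing paragraph


def tokenize(text: str) -> List[TokenOccurrence]:
    """Hypothetical tokenizer; all splitting rules here are assumptions."""
    occurrences = []
    position = sentence = paragraph = 0
    for para in re.split(r"\n\s*\n", text):
        if not re.search(r"\w", para):
            continue  # skip paragraphs that contain no words
        paragraph += 1
        for sent in re.split(r"(?<=[.!?])\s+", para):
            words = re.findall(r"\w+", sent)
            if not words:
                continue
            sentence += 1
            for word in words:  # each word is one or more characters
                position += 1
                occurrences.append(
                    TokenOccurrence(word, position, sentence, paragraph))
    return occurrences


sample = "Dogs chase cats. Cats run.\n\nMice hide."
tokens = tokenize(sample)
```

Because sentences are only ever produced inside a paragraph and words inside a sentence, the containment hierarchy is preserved by construction, and because the function is deterministic, two equal strings always yield the same tokens.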

A full-text contains expression (2.2 Full-text Contains Expression), evaluated within the normal Query Processing (XQuery Processing Model), is composed of several parts:

  1. An XPath 2.0 or XQuery 1.0 expression (RangeExpr) that specifies the sequence of items to be searched. [Definition: Those items are called the search context.]

  2. The full-text selection to be applied (3 Full-Text Selections). Full-text selections are, syntactically and semantically, fully composable and contain:

    • Required:

    • Optional:

      • Match options, such as indicators for case sensitivity and stop words (3.3 Match Options);

      • Boolean full-text operators, that compose a full-text selection from simpler full-text selections (3.4 Logical Full-Text Operators);

      • Other full-text operators that are constraints on the positions of matches, such as indicators for distance between tokens and for the cardinality of matches (3.5 Positional Filters and 3.6 Cardinality Selection); and

      • The weighting information. Each individual search term in a full-text selection may be annotated with optional weight information. This information may be used during the evaluation of the full-text selections to calculate scoring, information that quantifies the relevance of the result to the given search criteria.

  3. An optional XPath 2.0 or XQuery 1.0 expression (UnionExpr) that specifies the set of nodes, descendants of the RangeExpr, whose contents may be ignored for the purpose of determining a match during the search (3.7 Ignore Option).

The results of the evaluation of the full-text selection operators are instances of the AllMatches model, which complements the XQuery Data Model (XDM) for processing full-text queries. An AllMatches instance describes all possible solutions to the full-text query for a given search context item. Each solution is described by a Match instance. A Match instance contains the tokens from the search context item that must be included (described using StringInclude instances, which model the positive terms) and the tokens from the search context item that must be excluded (described using StringExclude instances, which model the negative terms). Each negative or positive term is modeled as a tuple: the position of the query word or phrase in the full-text selection, and a TokenInfo structure that describes a consecutive sequence of token occurrences in the text string which match the query word or phrase.

[Figure 1: Processing Model Extensions]

Figure 1 provides a schematic overview of the XQuery 1.0 and XPath 2.0 Full-Text processing steps that are discussed in detail below. Some of these steps are completely outside the domain of XQuery; in Figure 1, these are depicted outside the black line that represents the boundaries of the language. The diagram shows only the central pieces of the XQuery Processing Model (see Section 2.2 Processing Model of [XQuery 1.0: An XML Query Language]), but zooms in on the Execution Engine, where the processing of the Full-Text extensions takes place. The full-text processing steps are labeled as FTn within the diagram and are referenced within the text.

Like all XQuery expressions, an FTContainsExpr returns an XDM Instance (see Fig. 1). With the exception of FTWords, which consumes TokenInfos, all full-text selections are closed under the AllMatches data model, i.e., their input and output are AllMatches instances. Tokenization normally occurs at the time of parsing of the original XML documents, for example, during the Data Model Generation process (see Figure 1). But here it may also occur "on-the-fly", transforming an XDM instance into TokenInfos, which ultimately get converted into AllMatches instances by the evaluation of full-text selections. Thus, the evaluation of nested full-text and XQuery expressions moves back and forth between these two models.

The resulting AllMatches instance obtained by the evaluation of a Full Text expression is converted into a Boolean value before being returned to the enclosing XPath or XQuery operation as follows. If at least one member of the disjunction contains only positive terms, then the value returned is true. If all members of the disjunction contain negative terms, the result is false.
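The conversion rule can be sketched as follows. This is a simplified illustration, not the formal model of Section 4: the classes `Match` and `AllMatches` and the function `to_boolean` are hypothetical names, and plain strings stand in for the (query position, TokenInfo) tuples. Treating an AllMatches with no members as false is also an assumption of this sketch, on the reasoning that no solution was found.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Match:
    # Simplification: strings stand in for StringInclude/StringExclude entries.
    string_includes: List[str] = field(default_factory=list)
    string_excludes: List[str] = field(default_factory=list)


@dataclass
class AllMatches:
    # The members of the disjunction: each Match is one possible solution.
    matches: List[Match] = field(default_factory=list)


def to_boolean(all_matches: AllMatches) -> bool:
    """True if some Match has only positive terms (no excludes)."""
    return any(not m.string_excludes for m in all_matches.matches)


one_positive = AllMatches([
    Match(["dog"], []),
    Match(["dog"], ["cat"]),
])
all_negative = AllMatches([Match(["dog"], ["cat"])])

print(to_boolean(one_positive), to_boolean(all_negative))
```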

Weighting information, in an implementation-dependent fashion, may be used when calculating the scoring information computed and made available by FTContainsExpr to the optional score construct.

Given the components of a given Full Text expression, the evaluation algorithm will proceed according to the following steps, also referenced in the processing model diagram as steps FTn (see Fig. 1):

  1. Evaluate the search context expression, resulting in the set of search context items; (FT1 provides the evaluation of any XPath 2.0 or XQuery 1.0 expressions that generates or modifies the search context, as well as the query string(s) in a partially evaluated full-text selection)

  2. Evaluate the (optional) ignore expression, resulting in the set of ignored nodes, and virtually delete the ignored nodes from the search context node tree. (Included in FT1)

  3. Apply the tokenization algorithm to query string(s).

  4. For each search context item:

    1. Apply the tokenization algorithm in order to extract potentially matching terms together with their positional information. This step results in a sequence of token occurrences.

    2. Evaluate the simple FTWords operators in the full-text selection against the tokenized input. This results in a set of AllMatches instances. (FT3)

    3. Evaluate the rest of the full-text selection operator tree in a bottom-up fashion. At each step the AllMatches instances produced by the previous steps are given as input, and a new AllMatches instance is obtained as output. At each step the FTMatchOptions control the semantics of the application of the FTWords operator. (FT4)

  5. Convert the AllMatches instance into a Boolean value. (FT5)

    The additional scoring information (also part of FT5) that is produced by the evaluation of the Full Text expression is implementation-dependent and is not specified in this document. The scoring information is made available at the same time the Boolean value is returned.
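The steps above can be sketched for the simplest case: a conjunction of phrases with no match options, no ignore expression (step 2 is omitted) and no negative terms. Every name below (`tokenize`, `ft_words`, `ft_and`, `ft_contains`) is hypothetical, the tokenizer is an assumption, and a Match is reduced to a tuple of token positions; the formal definitions are those of Section 4.

```python
import re
from itertools import product


def tokenize(text):
    """Hypothetical tokenizer: lowercased words numbered from 1."""
    words = re.findall(r"\w+", text)
    return [(w.lower(), i) for i, w in enumerate(words, start=1)]


def ft_words(context_tokens, query):
    """One Match per occurrence of the query phrase in the context."""
    phrase = [w for w, _ in tokenize(query)]          # step 3
    if not phrase:
        return []
    words = [w for w, _ in context_tokens]
    matches = []
    for start in range(len(words) - len(phrase) + 1):
        if words[start:start + len(phrase)] == phrase:
            positions = tuple(context_tokens[start + k][1]
                              for k in range(len(phrase)))
            matches.append((positions,))
    return matches


def ft_and(left, right):
    """Each pairing of a left Match with a right Match is a solution."""
    return [l + r for l, r in product(left, right)]


def ft_contains(search_context, *phrases):
    for item in search_context:                       # step 4
        tokens = tokenize(item)                       # step 4.1
        all_matches = ft_words(tokens, phrases[0])    # step 4.2
        for phrase in phrases[1:]:                    # step 4.3
            all_matches = ft_and(all_matches, ft_words(tokens, phrase))
        # Step 5: with positive terms only, any Match means true.
        if all_matches:
            return True
    return False


dog_positions = ft_words(tokenize("a dog saw a dog"), "dog")
both = ft_contains(["The dog and the cat"], "dog", "cat")
only_dog = ft_contains(["The dog"], "dog", "cat")
```

In this sketch the result is true as soon as one search context item yields at least one Match, which corresponds to the description of FTContainsExpr as returning true if some node matches the full-text selection.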

Section 3 Full-Text Selections describes the syntax and the informal semantics of Full Text operators. Their formal semantics as well as the formal definition of the AllMatches data model are given in Section 4 Semantics.

2.2 Full-text Contains Expression

[Definition: A full-text contains expression is an expression that evaluates a sequence of nodes against a full-text selection. ]

As a syntactic construct, a full-text contains expression (grammar symbol: FTContainsExpr) behaves like a comparison expression (see Section 3.5.2 General Comparisons of [XQuery 1.0: An XML Query Language]). This grammar rule introduces FTContainsExpr.

[50]    ComparisonExpr    ::=    FTContainsExpr ( (ValueComp | GeneralComp | NodeComp) FTContainsExpr )?

A full-text contains expression may be used anywhere a ComparisonExpr may be used. The ftcontains operator has higher precedence than other comparison operators, so the results of ftcontains expressions may be compared without enclosing them in parentheses.

2.2.1 Description

[51]    FTContainsExpr    ::=    RangeExpr ( "ftcontains" FTSelection FTIgnoreOption? )?

A full-text contains expression returns a Boolean value. It returns true if there is some node in the RangeExpr that, after tokenization, matches the full-text selection FTSelection. See Section 3 Full-Text Selections for more details. For the purpose of determining a match, certain descendants of nodes (identified by FTIgnoreOption) in the RangeExpr may be ignored, as specified in Section 3.7 Ignore Option.

An XQuery 1.0 and XPath 2.0 Full-Text processor SHOULD try to use the information available in xml:lang for processing of collations, as well as the various match options defined in Section 3.3 Match Options.

2.2.2 Examples

The following example in XQuery 1.0 Full-Text returns the author of each book with a title containing a token with the same root as dog and the token cat.

for $b in /books/book
where $b/title ftcontains ("dog" with stemming) ftand "cat" 
return $b/author

The same example in XPath 2.0 Full-Text is written as:


/books/book[title ftcontains ("dog" with stemming) ftand "cat"]/author

This example selects books where either the title contains the token dog and the token cat and the content does not contain a token with the same root as train, or where the title fails to have one of the matching tokens but the content does:

/books/book[title ftcontains "dog" ftand "cat" ne
            content ftcontains ("train" with stemming)]
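Because each ftcontains expression yields a Boolean and ftcontains binds more tightly than ne, the predicate in the last example compares two Boolean values for inequality. The following sketch, with the hypothetical helper `book_selected`, spells out the resulting truth table.

```python
def book_selected(title_matches: bool, content_matches: bool) -> bool:
    # "ne" applied to two Boolean values is true exactly when they differ.
    return title_matches != content_matches


# Enumerate all four combinations of the two ftcontains results.
truth_table = {
    (title, content): book_selected(title, content)
    for title in (True, False)
    for content in (True, False)
}

for (title, content), selected in truth_table.items():
    print(f"title={title!s:5} content={content!s:5} selected={selected}")
```

A book is selected only in the two mixed cases, which is the "either ... or" reading given in the description of the example.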

2.3 Score Variables

Besides specifying a match of a full-text search as a Boolean condition, full-text search applications typically also have the ability to associate scores with the results. [Definition: Scores express the relevance of those results to the full-text search conditions.]

XQuery 1.0 and XPath 2.0 Full-Text extends the languages of XQuery 1.0 and XPath 2.0 further by adding optional score variables to the for and let clauses of FLWOR expressions.

The production for the extended for clause follows.

[35]    ForClause    ::=    "for" "$" VarName TypeDeclaration? PositionalVar? FTScoreVar? "in" ExprSingle ("," "$" VarName TypeDeclaration? PositionalVar? FTScoreVar? "in" ExprSingle)*
[37]    FTScoreVar    ::=    "score" "$" VarName

When a score variable is present in a for clause, the evaluation of the expression following the in keyword must determine not only the result sequence of the expression, i.e., the sequence of items which are iteratively bound to the for variable, but also, in each iteration, the relevance "score" value of the current item, to which the score variable is bound.

The semantics of scoring and how it relates to second-order functions is discussed in Section 4.3.2 Scoring.

The following example finds the book elements that satisfy the condition [content ftcontains "web site" ftand "usability" and .//chapter/title ftcontains "testing"] and returns the scores assigned to them.

for $b score $s 
    in /books/book[content ftcontains "web site" ftand "usability" 
                   and .//chapter/title ftcontains "testing"]
return $s

XPath 2.0 Full-Text extends the language of XPath 2.0 in the for expression in the same way: with optional score variables. The example above is also a legal example of the XPath 2.0 extension.

Scores are typically used to order results, as in the following, more complete example.

for $b score $s 
    in /books/book[content ftcontains "web site" ftand "usability"]
where $s > 0.5
order by $s descending
return <result>  
          <title> {$b//title} </title> 
          <score> {$s} </score> 
       </result>

Note that the score applies to the entire for expression. In the following example, two separate full-text contains expressions are used to select the matching paragraphs. There is still just one score for each para returned. The highest scoring paragraphs will be returned first:

for $p score $s in //book[title ftcontains "software"]/para[. ftcontains "usability"]
order by $s descending
return $p

The following more elaborate example uses multiple score variables to return the matching paragraphs ordered so that those from the highest scoring books precede those from the lowest scoring books, where the highest scoring paragraphs of each book are returned before the lower scoring paragraphs of that book:

for $b score $score1 in //book[title ftcontains "software"]
    order by $score1 descending
return
    for $p score $score2 in $b/para[. ftcontains "usability"]
       order by $score2 descending
    return $p

The score variable is bound to a value which reflects the relevance of the match criteria in the full-text selections to the nodes in the respective RangeExprs. The calculation of relevance is implementation-dependent, but score evaluation must follow these rules:

  1. Score values are of type xs:double in the range [0, 1].

  2. For score values greater than 0, a higher score must imply a higher degree of relevance.

Similarly to their use in a for clause, score variables may be specified in a let clause. A score variable in a let clause is also bound to the score of the expression evaluation, but in the let clause one score is determined for the complete result. The let variable may be dropped from the let clause, if the score variable is present.

The production for the extended let clause follows.

[38]    LetClause    ::=    (("let" "$" VarName TypeDeclaration?) | ("let" "score" "$" VarName)) ":=" ExprSingle ("," (("$" VarName TypeDeclaration?) | FTScoreVar) ":=" ExprSingle)*

When the score option is used in a for clause, the expression following the in keyword serves two purposes: it filters, i.e., drives the iteration, and it determines the scores. The two purposes can be specified separately by combining a simple for clause with a let clause that uses scoring. The following is an example of this.

for $b in /books/book[.//chapter/title ftcontains "testing"]
let score $s := $b/content ftcontains "web site" ftand "usability" 
order by $s descending
return <result score="{$s}">{$b}</result>

This example returns book elements with chapter titles that contain "testing". Scores are returned along with the book elements. These scores, however, reflect whether the book content contains "web site" and "usability".

Note that the score of an FTContainsExpr is not required to be 0 if the expression evaluates to false, nor to be non-zero if the expression evaluates to true. Hence, in the example above it is not possible to infer the Boolean value of the FTContainsExpr in the let clause from the calculated score of a returned result element. For instance, an implementation may want to assign a non-zero score to a book that contains only "web site", but not "usability", as this may be considered more relevant than a book that contains neither.

The expression ExprSingle assigned to the score variable is passed to the scoring algorithm and is not evaluated directly. The scoring algorithm calculates the score value based on the passed expression (not on the value returned by evaluating the expression). The set of supported expressions is implementation-defined.

The use of score variables introduces a second-order aspect to the evaluation of expressions which cannot be emulated by (first-order) XQuery functions. Consider the following replacement of the clause let score $s := FTContainsExpr:

let $s := score(FTContainsExpr)

where a function score is applied to some FTContainsExpr. If the function score were first-order, it would only be applied to the result of the evaluation of its argument, which is one of the Boolean constants true or false. Hence, there would be at most two possible values such a score function would be able to return and no further differentiation would be possible.
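
The limitation can be illustrated outside XQuery. The following Python sketch is non-normative. The token-set model of a document, the function names, and the "fraction of matched tokens" measure are assumptions made only for illustration; actual score calculation is implementation-dependent. The sketch contrasts a function that sees only the Boolean result with one that sees the search condition itself.

```python
# Hypothetical model: a document is a set of lowercase tokens and a
# query is a list of tokens that must all be present (similar to ftand).

def ft_contains(doc_tokens, query_tokens):
    """Boolean result of the full-text condition."""
    return all(t in doc_tokens for t in query_tokens)

def first_order_score(result):
    """Sees only the Boolean value, so it can return at most two values."""
    return 1.0 if result else 0.0

def second_order_score(doc_tokens, query_tokens):
    """Sees the condition itself and can therefore grade partial matches.
    The fraction of matched tokens is an arbitrary illustrative measure."""
    matched = sum(1 for t in query_tokens if t in doc_tokens)
    return matched / len(query_tokens)

docs = [{"web", "usability"}, {"web"}, {"testing"}]
query = ["web", "usability"]

# Distinct values a first-order function can produce for these documents.
first = {first_order_score(ft_contains(d, query)) for d in docs}
# One graded value per document when the condition itself is inspected.
second = [second_order_score(d, query) for d in docs]
```

Under these assumptions the first-order function cannot distinguish the second document from the third, while the second-order function can.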

2.3.1 Using Weights Within a Scored FTContainsExpr

[Definition: Scoring may be influenced by adding weight declarations to search tokens, phrases, and expressions.] Syntactically weight declarations are introduced in the FTSelection production, described in Section 3 Full-Text Selections.

The effect of weights on the result score is implementation-dependent. However, weight declarations must follow these rules:

  1. Weights in an FTContainsExpr are significant only in relation to each other.

  2. When no explicit weight is specified, the default weight is 1.0.

  3. The weight must be between 0.0 and 1000.0 inclusive.

Weight declarations in an FTContainsExpr for which no scores are evaluated are ignored.

The following example illustrates how different weights can be used for different search terms.

for $b in /books/book
let score $s := $b/content ftcontains ("web site" weight 0.5)
                                ftand ("usability" weight 2)
return <result score="{$s}">{$b}</result>
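
The weight rules can be modelled in a few lines of Python. This is a non-normative sketch: the function names are hypothetical, raising ValueError for an out-of-range weight is an assumption (this section does not name the error), and expressing weights as shares of their sum is only one possible reading of "significant only in relation to each other".

```python
DEFAULT_WEIGHT = 1.0

def check_weight(weight=None):
    """Return the effective weight, applying the default and the range rule."""
    if weight is None:
        return DEFAULT_WEIGHT
    w = float(weight)
    if not 0.0 <= w <= 1000.0:
        # Assumption: the kind of error is not specified in this section.
        raise ValueError("weight must be between 0.0 and 1000.0 inclusive")
    return w

def relative_weights(weights):
    """Express weights relative to each other as shares that sum to 1."""
    effective = [check_weight(w) for w in weights]
    total = sum(effective)
    if total == 0:
        return [0.0 for _ in effective]
    return [w / total for w in effective]

# Weights from the example above: "web site" weight 0.5, "usability" weight 2.
shares = relative_weights([0.5, 2])
```

With this reading, multiplying every weight by the same factor leaves the shares unchanged.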

2.4 Extensions to the Static Context

The XQuery Static Context is extended by a component for each of the full-text match options. Thus, the default of a match option in a query may be changed by providing a setting in the static context using the following declaration syntax.

[6]    Prolog    ::=    ((DefaultNamespaceDecl | Setter | NamespaceDecl | Import) Separator)* ((VarDecl | FunctionDecl | OptionDecl | FTOptionDecl) Separator)*
[14]    FTOptionDecl    ::=    "declare" "ft-option" FTMatchOptions

Match options modify the match semantics of full-text expressions. They are described in detail in Section 3.3 Match Options. When a match option is specified explicitly in a query, that setting overrides the setting of the respective match option in the static context.

3 Full-Text Selections

This section describes the full-text selections which contain the full-text operators in a full-text contains expression (FTContainsExpr), as well as the match options which modify the matching semantics of the full-text selections. In the following the syntax for each type of full-text selection is given together with an informal statement of its meaning.

[Definition: A full-text selection specifies the possible full-text search conditions. ]

[144]    FTSelection    ::=    FTOr FTPosFilter* ("weight" RangeExpr)?

As shown in the grammar, a full-text selection consists of search conditions possibly involving logical operators (FTOr) followed by an arbitrary number of positional filters (FTPosFilter) optionally followed by a "weight" value which is specified using a range expression. The RangeExpr is evaluated, as if it were an argument to a function with an expected type "xs:double"; it must be between 0.0 and 1000.0 inclusive.

The syntax and semantics of the individual full-text selection operators follow.

This XML document fragment is the source document for examples in this section.

<book number="1">
  <title shortTitle="Improving Web Site Usability">Improving  
      the Usability of a Web Site Through Expert Reviews and
      Usability Testing</title>
   <author>Millicent Marigold</author>
   <author>Montana Marigold</author>
   <editor>Véra Tudor-Medina</editor>
   <content>
     <p>The usability of a Web site is how well the  
         site supports the users in achieving specified  
         goals. A Web site should facilitate learning,  
         and enable efficient and effective task  
         completion, while propagating few errors.
     </p>
     <note>This book has been approved by the Web Site  
         Users Association.
     </note>
   </content>
 </book>

Tokenization is implementation-defined. A sample tokenization is used for the examples in this section. This sample tokenization uses white space, punctuation and XML tags as word-breakers and <p> for paragraph boundaries. The results may be different for other tokenizations.

The first five tokens in this example using the sample tokenization would be "Improving", "the", "Usability", "of", and "a".
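
The sample tokenization can be approximated with two regular expressions. The Python sketch below is non-normative and rests on assumptions: the function name is hypothetical, attribute values are discarded together with the tags, and \w+ is used as the definition of a token (it also accepts digits and underscores). Tokens keep the character case of the source text.

```python
import re

def sample_tokenize(xml_text):
    """Split on XML tags, white space, and punctuation."""
    # Replace every tag (including its attributes) with a word break.
    without_tags = re.sub(r"<[^>]*>", " ", xml_text)
    # \w+ keeps runs of letters and digits together, including accented letters.
    return re.findall(r"\w+", without_tags)

title = ('<title shortTitle="Improving Web Site Usability">Improving '
         'the Usability of a Web Site Through Expert Reviews and '
         'Usability Testing</title>')
tokens = sample_tokenize(title)
first_five = tokens[:5]
```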

Unless stated otherwise, the results assume a case-insensitive match.

3.1 Primary Full-Text Selections

[150]    FTPrimary    ::=    (FTWords FTTimes?) | ("(" FTSelection ")") | FTExtensionSelection

[Definition: A primary full-text selection is the basic form of a full-text selection. It specifies words and phrases as search conditions (FTWords), optionally followed by a cardinality constraint (FTTimes). An FTSelection in parentheses is also a primary full-text selection.]

3.2 Search Tokens and Phrases

[151]    FTWords    ::=    FTWordsValue FTAnyallOption?
[152]    FTWordsValue    ::=    Literal | ("{" Expr "}")
[154]    FTAnyallOption    ::=    ("any" "word"?) | ("all" "words"?) | "phrase"

FTWords finds matches that contain the specified tokens and phrases.

FTWords consists of two parts: a mandatory FTWordsValue part and an optional FTAnyallOption part. FTWordsValue specifies the tokens and phrases that must be contained in the matches. FTAnyallOption specifies how containment is checked.

The FTWordsValue is converted as though it were an argument to a function with the expected type of "xs:string*".

In general, the tokens and phrases in FTWordsValue are specified using a nested XQuery expression. To simplify notation, the enclosing braces may be omitted if FTWordsValue consists of a single literal.

The following rules specify how the containment of the strings from the FTWordsValue sequence is checked. First, every string is tokenized into a sequence of tokens as described in Section 4.1 Tokenization. Then, FTAnyallOption is checked.

If FTAnyallOption is "any", the sequence of tokens for every string is considered as a phrase, i.e. the tokens must occur consecutively in the text in the specified order. If the sequence contains more than one string, the different strings are considered to be alternatives, i.e. the resulting matches must contain at least one of the generated phrases.

If FTAnyallOption is "all", the sequence of tokens for every string is considered as a phrase. The resulting matches must contain all of the generated phrases.

If FTAnyallOption is "phrase", the tokens from all the strings are concatenated in a single sequence, which is considered as a phrase. The resulting matches must contain the generated phrase.

If FTAnyallOption is "any word", the tokens from all the strings are combined into a single set. The resulting matches must contain at least one of the tokens in the set.

If FTAnyallOption is "all words", the tokens from all the strings are combined into a single set. The resulting matches must contain all of the tokens in the set.

If the FTWordsValue evaluates to a single string, the use of "any", "all", and "phrase" in FTAnyallOption produces the same results.

If FTAnyallOption is omitted, "any" is the default.
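
The five containment rules can be summarized in a short, non-normative Python sketch. It assumes white-space tokenization and a case-insensitive match, and it ignores all other match options; the function names are hypothetical.

```python
def contains_phrase(text, phrase):
    """True if the tokens of phrase occur consecutively, in order, in text."""
    n = len(phrase)
    return n > 0 and any(text[i:i + n] == phrase
                         for i in range(len(text) - n + 1))

def ft_words(text, strings, option="any"):
    """text: list of tokens; strings: the FTWordsValue sequence."""
    norm = [t.lower() for t in text]                 # case-insensitive (assumed)
    phrases = [s.lower().split() for s in strings]   # one token list per string
    all_tokens = [t for p in phrases for t in p]
    if option == "any":          # at least one string matches as a phrase
        return any(contains_phrase(norm, p) for p in phrases)
    if option == "all":          # every string matches as a phrase
        return all(contains_phrase(norm, p) for p in phrases)
    if option == "phrase":       # all tokens form one long phrase
        return contains_phrase(norm, all_tokens)
    if option == "any word":     # at least one token of the combined set
        return any(t in norm for t in set(all_tokens))
    if option == "all words":    # every token of the combined set
        return all(t in norm for t in set(all_tokens))
    raise ValueError("unknown FTAnyallOption: " + option)

p = ("The usability of a Web site is how well the site supports "
     "the users in achieving specified goals").split()
```

For example, ft_words(p, ["Web Site Usability"]) is false because the three tokens are not consecutive in that order, while the same call with the option "all words" is true.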

The following expression returns the book element whose number is 1, because its title element contains the token "Expert":

/book[@number="1" and ./title ftcontains "Expert"]

The following expression returns the book element whose number is 1, because its title element contains the phrase "Expert Reviews":

/book[@number="1" and ./title ftcontains "Expert Reviews"]

The following expression returns the book element whose number is 1, because its title element contains the two tokens "Expert" and "Reviews":

/book[@number="1" and ./title ftcontains {"Expert",
"Reviews"} all]

The following expression returns false, because the p element doesn't contain the phrase "Web Site Usability" although it contains all of the tokens in the phrase:

/book[@number="1"]//p ftcontains "Web Site Usability"

The following expression returns book numbers of book elements by "Marigold" with a title about "Web Site Usability", sorting them in descending score order:

for $book in /book[.//author ftcontains "Marigold"] 
let score $score := $book/title ftcontains "Web Site Usability" 
where $score > 0.8 
order by $score descending
return $book/@number

3.3 Match Options

Full-text match options modify the matching behaviour of the primary full-text selection to which they are applied.

[149]    FTPrimaryWithOptions    ::=    FTPrimary FTMatchOptions?
[165]    FTMatchOptions    ::=    FTMatchOption+
[166]    FTMatchOption    ::=    FTLanguageOption
| FTWildCardOption
| FTThesaurusOption
| FTStemOption
| FTCaseOption
| FTDiacriticsOption
| FTStopwordOption
| FTExtensionOption

[Definition: Match options modify the set of tokens in the query, or how they are matched against tokens in the text.]

[Definition: Each of the seven alternatives of production FTMatchOption corresponds to one match option group. ] The match options from any given group are mutually exclusive, i.e., only one of these settings can be in effect, whereas match options of different groups can be combined freely.

Note that, along with the syntax rules above, there is an extra-grammatical constraint, multiple-match-options, which needs to be considered if multiple match options are specified. It states that within a single FTMatchOptions at most one match option of any given match option group may be specified. For example, if the FTCaseOption "lowercase" is specified, then "uppercase" cannot also be specified as part of the same FTMatchOptions.

Although match options only take effect in the application of FTWords, the syntax also allows match options to be specified on the non-primitive full-text selection "(" FTSelection ")". Such a higher-level match option provides a default for the respective match option group for any embedded FTPrimary, just as the static context components corresponding to the match option groups provide default match options for the whole query. Details about these context components, including their default values, are given in Appendix C Static Context Components.

In other words, there is a tuple of seven effective match options, one from each group, which are propagated from top to bottom in the query syntax tree. For the top-level query the seven values are given by the static context and at each FTPrimary the locally (like postfix operators) specified match options may override these propagated values. Thus, any occurrence of an FTWords in a query is associated with seven effective match options, one from each group, that influence its matching.
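
The top-down propagation can be sketched as follows. This Python model is non-normative: the tuple representation of the query tree, the group names used as dictionary keys, and the function names are assumptions for illustration, and "de" stands in for an implementation-defined default language.

```python
# Assumed static context: one setting per match option group.
STATIC_CONTEXT = {
    "language": 'language "de"',
    "wildcards": "without wildcards",
    "thesaurus": "without thesaurus",
    "stemming": "without stemming",
    "case": "case insensitive",
    "diacritics": "diacritics insensitive",
    "stop words": "without stop words",
}

def effective_options(inherited, local):
    """Locally specified options override inherited ones, group by group."""
    merged = dict(inherited)
    merged.update(local)
    return merged

def resolve(node, inherited):
    """node is ("words", local_options, text) or ("group", local_options, children).
    Returns a list of (text, effective_options) pairs, one per FTWords."""
    kind, local, payload = node
    options = effective_options(inherited, local)
    if kind == "words":
        return [(payload, options)]
    resolved = []
    for child in payload:
        resolved.extend(resolve(child, options))
    return resolved

# Models: ("usability" ftand "Web" case sensitive without stemming) with stemming
query = ("group", {"stemming": "with stemming"}, [
    ("words", {}, "usability"),
    ("words", {"case": "case sensitive", "stemming": "without stemming"}, "Web"),
])
resolved = resolve(query, STATIC_CONTEXT)
```

In this model "usability" inherits "with stemming" from the enclosing parentheses, whereas "Web" overrides it locally; every FTWords ends up with exactly seven effective options.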

The order in which effective match options for an FTWords are applied is subject to some constraints:

  1. The Language Option must be applied first

  2. The Stemming Option must be applied before the Case Option and the Diacritics Option

Aside from these constraints, the full order of the application of match options is implementation-defined. [Definition: This order is called the match option application order.]

More information on their semantics is given in 4.2.6 Match Options Semantics.

If no match options declarations are present in the prolog and the implementation does not define any overwriting of the static context components for the match options, the query:

/book/title ftcontains "usability" 

is, assuming "de" is the implementation-defined default language, equivalent to the query:

/book/title ftcontains "usability" case insensitive 
    diacritics insensitive 
    without stemming without thesaurus  
    without stop words language "de" without wildcards

We describe each match option group in more detail in the following sections.

3.3.1 Case Option

[167]    FTCaseOption    ::=    ("case" "insensitive")
| ("case" "sensitive")
| "lowercase"
| "uppercase"

[Definition: A case option modifies the matching of tokens and phrases by specifying how uppercase and lowercase characters are considered.]

There are four possible character case options:

  1. Using the option "case insensitive" tokens and phrases are matched, regardless of the case of characters of the query tokens and phrases.

  2. Using the option "case sensitive" tokens and phrases are matched, if and only if the case of their characters is the same as written in the query.

  3. Using the option "lowercase" tokens and phrases are matched, if and only if they match the query without regard to character case, but contain only lowercase characters.

  4. Using the option "uppercase" tokens and phrases are matched, if and only if they match the query without regard to character case, but contain only uppercase characters.

The default is "case insensitive".

The following table summarizes the interactions between the case match options and the use of the default collation.

Case Matrix

Case options / Default collation options | UCC (Unicode Codepoint Collation) | CCS (some generic case-sensitive collation) | CCI (some generic case-insensitive collation)
insensitive | compare as if both lower | case-insensitive variant of CCS if it exists, else error | CCI
sensitive | UCC | CCS | case-sensitive variant of CCI if it exists, else error
lowercase | lowercase(Expr) + UCC | lowercase(Expr) + CCS | CCI
uppercase | uppercase(Expr) + UCC | uppercase(Expr) + CCS | CCI

Note:

In this table, "else error" means "Otherwise, an error is raised: [err:FOCH0002]FO". The phrase "if it exists" is used, because the case-sensitive collation CCS does not always have a case-insensitive variant (and, even if one exists, it may not be possible to determine it algorithmically), and because the case-insensitive collation CCI does not always have a case-sensitive variant (and, even if one exists, it may not be possible to determine it algorithmically).

Note:

Using the "lowercase" (respectively "uppercase") option is equivalent to using the option "case sensitive", while converting the query strings to their lowercase (respectively uppercase) form before matching.
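
For the Unicode Codepoint Collation column, the four case options can be modelled with plain string comparison. The Python sketch below is non-normative; the function name is hypothetical, and the use of Python's lower() and upper() as the case mappings is an assumption.

```python
def case_match(query_token, text_token, option="case insensitive"):
    """Compare one query token with one text token under a case option."""
    if option == "case insensitive":
        # Compare as if both were lowercase.
        return query_token.lower() == text_token.lower()
    if option == "case sensitive":
        return query_token == text_token
    if option == "lowercase":
        # Lowercase the query token, then compare case-sensitively.
        return query_token.lower() == text_token
    if option == "uppercase":
        # Uppercase the query token, then compare case-sensitively.
        return query_token.upper() == text_token
    raise ValueError("unknown case option: " + option)
```

For instance, case_match("Usability", "Usability", "lowercase") is false because the text token is not entirely in lowercase, whereas case_match("usability", "Usability") is true.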

The following expression returns false, because the title element doesn't contain "usability" in lower-case characters:

/book[@number="1"]/title ftcontains "Usability" lowercase 

The following expression returns true, because the character case is not considered:

/book[@number="1"]/title ftcontains "usability" 
case insensitive 

3.3.2 Diacritics Option

[168]    FTDiacriticsOption    ::=    ("diacritics" "insensitive")
| ("diacritics" "sensitive")

[Definition: A diacritics option modifies token and phrase matching by specifying how diacritics are considered. ]

There are two possible diacritics options:

  1. The option "diacritics" "insensitive" matches tokens and phrases with and without diacritics. Whether diacritics are written in the query or not is not considered.

  2. The option "diacritics" "sensitive" matches tokens and phrases only if they contain the diacritics as they are written in the query.

The default is "diacritics insensitive".

The following table summarizes the interactions between the diacritics match options and the use of the default collations.

Diacritics Matrix

Diacritics options / Default collation options | UCC (Unicode Codepoint Collation) | CDS (some generic diacritics-sensitive collation) | CDI (some generic diacritics-insensitive collation)
insensitive | UCC comparison, but without considering diacritics | diacritics-insensitive variant of CDS if it exists, else error | CDI
sensitive | UCC | CDS | diacritics-sensitive variant of CDI if it exists, else error

Note:

In this table, "else error" means "Otherwise, an error is raised: [err:FOCH0002]FO". The phrase "if it exists" is used, because the diacritics-sensitive collation CDS does not always have a diacritics-insensitive variant (and, even if one exists, it may not be possible to determine it algorithmically), and because the diacritics-insensitive collation CDI does not always have a diacritics-sensitive variant (and, even if one exists, it may not be possible to determine it algorithmically).
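
One way to approximate "UCC comparison, but without considering diacritics" is to decompose each token and drop the combining marks. The Python sketch below is non-normative and makes that assumption explicit; letters that do not decompose into a base letter plus a combining mark (for example "ø") are left unchanged. The function names are hypothetical.

```python
import unicodedata

def strip_diacritics(token):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def diacritics_match(query_token, text_token, option="diacritics insensitive"):
    """Compare one query token with one text token under a diacritics option."""
    if option == "diacritics insensitive":
        return strip_diacritics(query_token) == strip_diacritics(text_token)
    if option == "diacritics sensitive":
        # Normalize so that composed and decomposed spellings compare equal.
        return (unicodedata.normalize("NFC", query_token)
                == unicodedata.normalize("NFC", text_token))
    raise ValueError("unknown diacritics option: " + option)
```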

The following expression returns true, because the token "Véra" in the editor element is matched, as the acute accent is not considered in the comparison:

/book[@number="1"]//editor ftcontains "Vera" diacritics insensitive

This returns false, because the editor element does not contain the token "Vera" in this exact form, i.e. without any diacritics:

/book[@number="1"]/editor ftcontains "Vera" diacritics sensitive

3.3.3 Stemming Option

[169]    FTStemOption    ::=    ("with" "stemming") | ("without" "stemming")

[Definition: A stemming option modifies token and phrase matching by specifying whether stemming is applied or not. ]

The "with stemming" option specifies that matches may contain tokens that have the same stem as the tokens and phrases written in the query. It is implementation-defined what a stem of a token is.

The "without stemming" option specifies that the tokens and phrases are not stemmed.

It is implementation-defined whether the stemming is based on an algorithm, dictionary, or mixed approach.

The default is "without stemming".

The following expression returns true, because the title of the specified book contains "improving" which has the same stem as "improve":

/book[@number="1"]/title ftcontains "improve" with stemming 

3.3.4 Thesaurus Option

[170]    FTThesaurusOption    ::=    ("with" "thesaurus" (FTThesaurusID | "default"))
| ("with" "thesaurus" "(" (FTThesaurusID | "default") ("," FTThesaurusID)* ")")
| ("without" "thesaurus")
[171]    FTThesaurusID    ::=    "at" URILiteral ("relationship" StringLiteral)? (FTRange "levels")?
[143]    URILiteral    ::=    StringLiteral

[Definition: A thesaurus option modifies token and phrase matching by specifying whether a thesaurus is used or not.] If thesauri are used, the thesaurus option specifies information to locate the thesauri either by default or through a URI reference. It also states the relationship to be applied and how many levels within the thesaurus to be traversed.

The value of the FTThesaurusID must be a URILiteral.

Thesauri add related tokens and phrases to the search. Thus, the user may narrow, broaden, or otherwise modify the search using synonyms, hypernyms (more generic terms), etc. The search is performed as though the user has specified all related search tokens and phrases in a disjunction (FTOr).

Note:

A thesaurus may be standards-based or locally-defined. It may be a traditional thesaurus, or a taxonomy, soundex, ontology, or topic map. How the thesaurus is represented is implementation-dependent.

FTThesaurusID specifies the relationship sought between tokens and phrases written in the query and terms in the thesaurus and the number of levels to be queried in hierarchical relationships by including an FTRange "levels". If no levels are specified, the default is to query all levels in hierarchical relationships.

Relationships include, but are not limited to, the relationships and their abbreviations presented in [ISO 2788] and their equivalents in other languages. The set of relationships supported by an implementation is implementation-defined, but implementations SHOULD support the relationships defined in [ISO 2788]. The terms in the following list have the meanings defined in [ISO 2788]. If a query specifies thesaurus relationships or levels not supported by the thesaurus, the behavior is implementation-defined.

  1. equivalence relationships (synonyms): PREFERRED TERM (USE), NONPREFERRED USED FOR TERM (UF);

  2. hierarchical relationships: BROADER TERM (BT), NARROWER TERM (NT), BROADER TERM GENERIC (BTG), NARROWER TERM GENERIC (NTG), BROADER TERM PARTITIVE (BTP), NARROWER TERM PARTITIVE (NTP), TOP TERM (TT); and

  3. associative relationships: RELATED TERM (RT).

The "with thesaurus" option specifies that string matches include tokens that can be found in one of the specified thesauri.

The "without thesaurus" option specifies that no thesaurus will be used.

The "with thesaurus default" option specifies that a system-defined default thesaurus with a system-defined relationship is used. The default thesaurus may be used in combination with other explicitly specified thesauri.

The default is "without thesaurus".
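
Thesaurus expansion with a relationship and a level limit can be sketched as a bounded traversal. The Python model below is non-normative: the dictionary-based thesaurus and its entries are invented for illustration, the function name is hypothetical, and only the "at most N levels" form of FTRange is modelled. The returned terms are the ones that would be searched as though combined in a disjunction (FTOr).

```python
# Hypothetical thesaurus: term -> {relationship: [related terms]}
THESAURUS = {
    "web site components": {"NT": ["navigation", "layout"]},
    "navigation": {"NT": ["menus"]},
    "menus": {"NT": ["drop-down menus"]},
    "duties": {"UF": ["tasks"]},
}

def expand(term, relationship, max_levels=None):
    """Return the term plus the terms reachable through the relationship,
    following at most max_levels steps (None means all levels)."""
    found = [term]
    frontier = [term]
    level = 0
    while frontier and (max_levels is None or level < max_levels):
        next_frontier = []
        for current in frontier:
            for related in THESAURUS.get(current, {}).get(relationship, []):
                if related not in found:
                    found.append(related)
                    next_frontier.append(related)
        frontier = next_frontier
        level += 1
    return found

two_levels = expand("web site components", "NT", max_levels=2)
all_levels = expand("web site components", "NT")
```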

The following expression returns true, because it finds a content element containing "tasks" which the thesaurus identified as a synonym for "duties":

.//book/content ftcontains "duties" with
thesaurus at "http://bstore1.example.com/UsabilityThesaurus.xml"
relationship "UF"

The following expression returns book elements, because it finds a content element containing "web site components", and narrower terms "navigation" and "layout":

doc("http://bstore1.example.com/full-text.xml")
/books/book[./content ftcontains "web site components" with
thesaurus at "http://bstore1.example.com/UsabilityThesaurus.xml"
relationship "NT" at most 2 levels]

Assuming that there is a locally defined thesaurus that contains soundex capabilities, the following query returns a book element containing "Marigold", which sounds like "Merrygould":

doc("http://bstore1.example.com/full-text.xml")
/books/book[. ftcontains "Merrygould" with thesaurus at
"http://bstore1.example.com/UsabilitySoundex.xml" relationship
"sounds like"]

3.3.5 Stop Word Option

[172]    FTStopwordOption    ::=    ("with" "stop" "words" FTRefOrList FTInclExclStringLiteral*)
| ("without" "stop" "words")
| ("with" "default" "stop" "words" FTInclExclStringLiteral*)
[173]    FTRefOrList    ::=    ("at" URILiteral)
| ("(" StringLiteral ("," StringLiteral)* ")")
[174]    FTInclExclStringLiteral    ::=    ("union" | "except") FTRefOrList

[Definition: A stop word option controls word matching by specifying whether stop words are used or not. Stop words are tokens in the query that match any token in the text. ] Normally a stop word matches exactly one token, but there may be implementation-defined conditions, under which a stop word may match a different number of tokens.

FTRefOrList specifies the list of stop words either explicitly as a comma-separated list of string literals, or by the keyword at followed by a literal URI. If the URI specifies a list of stop words that is not found in the statically known stop word lists, an error is raised [err:FTST0008]. Whether the stop word list is resolved from the statically known stop word lists or given explicitly, no tokenization is performed on the stop words: they are used as they occur in the sequence.

The "with stop words" option specifies that if a token is within the specified collection of stop words, it is removed from the search and any token may be substituted for it. Stop words retain their position numbers and are counted in FTDistance and FTWindow searches.

Multiple stop word lists may be combined using "union" or "except". The keywords "union" and "except" are applied from left to right. If "union" is specified, every string occurring in the lists specified by the left-hand side or the right-hand side is a stop word. If "except" is specified, only strings occurring in the list specified by the left-hand side but not in the list specified by the right-hand side are stop words.
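
The left-to-right application of "union" and "except" corresponds to a fold over sets. The following Python sketch is non-normative; the function name and the representation of the operations as (keyword, list) pairs are assumptions.

```python
def combine_stop_words(first, operations):
    """first: the initial stop word list.
    operations: [("union" | "except", list), ...] applied left to right."""
    result = set(first)
    for keyword, words in operations:
        if keyword == "union":
            result |= set(words)      # words from either side are stop words
        elif keyword == "except":
            result -= set(words)      # right-hand side words are removed
        else:
            raise ValueError("expected 'union' or 'except': " + keyword)
    return result

# Models: with stop words ("a", "the", "of") union ("then") except ("the", "then")
combined = combine_stop_words(["a", "the", "of"],
                              [("union", ["then"]), ("except", ["the", "then"])])
```

Because the keywords are applied from left to right, swapping the order of a "union" and an "except" over the same word changes the result.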

The "with default stop words" option specifies that an implementation-defined collection of stop words is used.

The "without stop words" option specifies that no stop words are used. This is equivalent to specifying an empty list of stop words.

The default is "without stop words".

Note:

Stop word lists may be applied during indexing. If they are applied during indexing, asking for stop words not to be used during a query will have no effect.

The following expression returns true, because the document contains the phrase "propagating few errors":

/book[@number="1"]//p ftcontains "propagation of errors"
with stemming with stop words ("a", "the", "of") 

Note the asymmetry in the stop word semantics: the property of being a stop word is only relevant to query terms, not to document terms. Hence, it is irrelevant for the above-mentioned match whether "few" is a stop word or not, and on the other hand we do not want the query above to match "propagation" followed by 2 stop words, or even a sequence of 3 stop words in the document.

The following expression returns false, because "of" is not in the p element between "propagating" and "errors":

/book[@number="1"]//p ftcontains "propagation of errors" 
with stemming without stop words
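
The asymmetry of the stop word semantics can be made concrete with a small phrase matcher. This Python sketch is non-normative: it assumes a case-insensitive match, models only the normal case in which a stop word matches exactly one token, and leaves stemming out; the function name is hypothetical.

```python
def phrase_match(text_tokens, query_tokens, stop_words=()):
    """Query tokens that are stop words match any single text token.
    Text tokens are never treated as stop words."""
    stops = {s.lower() for s in stop_words}
    text = [t.lower() for t in text_tokens]
    query = [q.lower() for q in query_tokens]
    n = len(query)
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        # A stop word in the query keeps its position but accepts any token.
        if all(q in stops or q == w for q, w in zip(query, window)):
            return True
    return False

text = "planning and conducting a usability test".split()
query = "planning then conducting".split()
with_stops = phrase_match(text, query, stop_words=["then"])
without_stops = phrase_match(text, query)
```

Here the query stop word "then" accepts the text token "and". A text containing "planning then conducting" is not matched by the query "planning and conducting" with the same stop word list, because "and" is not a stop word in the query.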

The following expression uses the stop word list specified at the URL. Assuming that the specified stop word list contains the word "then", this query is reduced to a query on the phrase "planning X conducting", allowing any token as a substitute for X. It returns a book element, because its content element contains "planning then conducting". It would also return the book if the phrases "planning and conducting" and "planning before conducting" had been in its content:

doc("http://bstore1.example.com/full-text.xml")
/books/book[.//content ftcontains "planning then conducting"
with stop words at
"http://bstore1.example.com/StopWordList.xml"]

The following expression returns books containing "planning then conducting", but does not return books containing "planning and conducting", since it is exempting "then" from being a stop word:

doc("http://bstore1.example.com/full-text.xml")
/books/book[.//content ftcontains "planning then conducting"
with stop words at "http://bstore1.example.com/StopWordList.xml"
except ("the", "then")]

3.3.6 Language Option

[175]    FTLanguageOption    ::=    "language" StringLiteral

[Definition: A language option modifies token matching by specifying the language of search tokens and phrases.]

The StringLiteral following the keyword language designates one language. It must be castable to "xs:language"; otherwise, an error is raised: [err:XPTY0004]XP.

The "language" option influences tokenization, stemming, and stop words in an implementation-defined way. The "language" option MAY influence the behavior of other match options in an implementation-defined way.

The set of standardized language identifiers is defined in [BCP 47]. The set of valid language identifiers among the standardized set is implementation-defined. An implementation MAY choose to use private extensions introduced by a singleton 'x' for additional language identifiers, or other singletons for registered extensions as described in sec. 2.2.6 of [BCP 47]. It is implementation-defined what additional language identifiers, if any, are valid. If an invalid language identifier is specified, then the behavior is implementation-defined. If the implementation chooses to raise an error in that case, it must raise [err:FTST0009].

The default language is specified in the static context.

When an XQuery 1.0 and XPath 2.0 Full-Text processor evaluates text in a document that is governed by an xml:lang attribute and the portion of the full-text query doing that evaluation contains an FTLanguageOption that specifies a different language than the language specified by the governing xml:lang attribute, the language-related behavior of that full-text query is implementation-defined.

This is an example where the language option is used to select the appropriate stop word list:

/book[@number="1"]//editor ftcontains "salon de the"
with default stop words language "fr"

3.3.7 Wildcard Option

[176]    FTWildCardOption    ::=    ("with" "wildcards") | ("without" "wildcards")

[Definition: A wildcard option modifies token and phrase matching by specifying whether wildcards are used or not.]

When the "with wildcards" option is used, wildcard indicators (represented by periods (.)) and qualifiers may be appended to or inserted into the query tokens. If the period is at the beginning of a query token, the wildcard is a prefix wildcard. If the period is at the end of a query token, it is a suffix wildcard. If the period is inserted into a query token, it is an infix wildcard.

Each indicator and qualifier in a query token will match zero or more characters within a token in the text, as described below. The number of characters matched depends on the qualifier. Qualifiers available are none, question mark, asterisk, plus sign, and two numbers separated by a comma, both enclosed by curly braces.

  1. If a period is present, but there are no qualifiers, one character in the text will match.

  2. If a period is followed by a question mark (.?), zero or one characters in the text will match.

  3. If a period is followed by an asterisk (.*), zero or more characters will match.

  4. If a period is followed by a plus sign (.+), one or more characters will match.

  5. If a period is followed by two numbers separated by a comma, both enclosed by curly braces (.{n,m}), a specified range of characters (at least n characters and no more than m characters) will match.

When "with wildcards" is present and an indicator or qualifier character is intended to be taken literally (as itself), that character must be preceded by ("escaped by") a backslash (\). For example, a period (.) that is intended to be a sentence terminator or a decimal point must be preceded by a backslash so that it is not interpreted as an indicator. Similarly, a question mark (?), asterisk (*), or plus sign (+) that is intended to be interpreted as an ordinary text character must be preceded by a backslash so that it is not interpreted as a qualifier.

The "without wildcards" option finds tokens without recognizing wildcard indicators and qualifiers. Periods, question marks, asterisks, plus signs, and two numbers separated by a comma, both enclosed by curly braces, are always recognized as ordinary text characters.

The default is "without wildcards".

Note: Wildcard indicators and qualifiers may be token boundaries. How text with wildcard indicators and qualifiers is tokenized is implementation-defined.

The following expression returns true, because the title element contains "improving":

/book[@number="1"]/title ftcontains "improv.*" with
wildcards

The following expression returns true, because the title element contains "site":

/book[@number="1"]/title ftcontains ".?site" with
wildcards

The following expression returns true, because the p element contains "well":

/book[@number="1"]/p ftcontains "w.ll" with
wildcards

The following expression returns false, because the p element does not contain "w.ll":

/book[@number="1"]/p ftcontains "w.ll" without wildcards
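
Note: The following non-normative Python sketch illustrates one way the indicator, qualifier, and escaping rules of this section could be mapped onto a conventional regular expression. The function names wildcard_to_regex and token_matches are hypothetical. The sketch assumes that the text has already been tokenized and that tokens are compared case-insensitively; how text containing indicators and qualifiers is tokenized remains implementation-defined.

```python
import re


def wildcard_to_regex(query_token: str) -> str:
    """Translate a query token written for "with wildcards" into a regex string."""
    out = []
    i = 0
    n = len(query_token)
    while i < n:
        ch = query_token[i]
        if ch == "\\" and i + 1 < n:
            # A backslash makes the next character an ordinary text character.
            out.append(re.escape(query_token[i + 1]))
            i += 2
        elif ch == ".":
            # A period is an indicator, optionally followed by one qualifier:
            # ?, *, + or {n,m}. With no qualifier it matches one character.
            m = re.match(r"[?*+]|\{\d+,\d+\}", query_token[i + 1:])
            qualifier = m.group(0) if m else ""
            out.append("." + qualifier)
            i += 1 + len(qualifier)
        else:
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


def token_matches(query_token: str, text_token: str, wildcards: bool = False) -> bool:
    """Compare one query token with one text token (assumed case-insensitive)."""
    if wildcards:
        pattern = wildcard_to_regex(query_token)
    else:
        # "without wildcards": every character is an ordinary text character.
        pattern = re.escape(query_token)
    return re.fullmatch(pattern, text_token, flags=re.IGNORECASE) is not None


print(token_matches("improv.*", "improving", wildcards=True))
print(token_matches(".?site", "site", wildcards=True))
print(token_matches("w.ll", "well", wildcards=True))
print(token_matches("w.ll", "well", wildcards=False))
```

Under these assumptions, the first three calls print True and the last prints False, mirroring the four examples above.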

3.3.8 Extension Option

[Definition: An extension option is a match option that acts in an implementation-defined way. ]

[177]    FTExtensionOption    ::=    "option" QName StringLiteral

An extension option consists of an identifying QName and a StringLiteral. Typically, a particular option will be recognized by some implementations and not by others. The syntax is designed so that option declarations can be successfully parsed by all implementations.

The QName of an option must resolve to a namespace URI and local name, using the statically known namespaces.

Note:

There is no default namespace for options.

Each implementation recognizes an implementation-defined set of namespace URIs used to denote extension options.

If the namespace part of the QName is not a namespace recognized by the implementation as one used to denote extension options, then the extension option is ignored.

Otherwise, the effect of the extension option, including its error behavior, is implementation-defined. For example, if the local part of the QName is not recognized, or if the StringLiteral does not conform to the rules defined by the implementation for the particular extension option, the implementation may choose whether to report an error, ignore the extension option, or take some other action.

Implementations may impose rules on where particular extension options may appear relative to other match options, and the interpretation of an option declaration may depend on its position.

An extension option must not be used to change the syntax accepted by the processor, or to suppress the detection of static errors. However, it may be used without restriction to modify the set of tokens in the query or how they are matched against tokens in the text. An extension option has the same scope as other match options.

The following examples illustrate several possible uses for extension options:

This extension option is set as part of the static context of all full-text expressions in the module and might be used to ensure that queries are insensitive to Arabic short-vowels.

declare namespace exq = "http://example.org/XQueryImplementation";

declare ft-option option exq:diacritics "short-vowel insensitive";

This extension option applies only to the matching in the full-text selection in which it is found and might be used to specify how compound words should be matched.

declare namespace exq = "http://example.org/XQueryImplementation";

//para[. ftcontains "Kinder" ftand "Platz" 
        distance 1 words with stemming option exq:compounds "distance=1"]

3.4 Logical Full-Text Operators

Full-text selections can be combined with the logical connectives ftor (full-text or), ftand (full-text and), not in (mild not), and ftnot (unary full-text not).

[145]    FTOr    ::=    FTAnd ( "ftor" FTAnd )*
[146]    FTAnd    ::=    FTMildNot ( "ftand" FTMildNot )*
[147]    FTMildNot    ::=    FTUnaryNot ( "not" "in" FTUnaryNot )*
[148]    FTUnaryNot    ::=    ("ftnot")? FTPrimaryWithOptions

3.4.1 Or-Selection

[Definition: An or-selection combines two full-text selections using the ftor operator.]

An or-selection finds all matches that satisfy at least one of the operand full-text selections.

The following expression returns the book element written by "Millicent":

/book[.//author ftcontains "Millicent" ftor "Voltaire"]

3.4.2 And-Selection

[Definition: An and-selection combines two full-text selections using the ftand operator.]

An and-selection finds matches that satisfy all of the operand full-text selections simultaneously. A match of an and-selection is formed by combining matches for each of the operand full-text selections as described in 4.2.7.2 FTAnd.

For example, "usability" ftand "testing" will find two matches in /book[@number="1"]/title: each of the two matches for the FTWords selection "usability" (the two occurrences of the token "usability" in the string value of the title element) is combined with the single match for the FTWords "testing" (only one occurrence of the token "testing" in the title). Since the above and-selection has at least one match, the following expression will return "true".

/book[@number="1"]/title ftcontains ("usability" ftand "testing")

The following expression returns false, because "Millicent" and "Montana" are not contained by the same author element in any book element:

/book/author ftcontains "Millicent" ftand "Montana"

No author element in any book element contains both "Millicent" and "Montana". Therefore, for any such author element, there is either one match for the FTWords "Millicent" and zero matches for the FTWords "Montana", or vice versa, or no match for either of them. In any of these cases, the and-selection will have zero matches.
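
Note: The following non-normative Python sketch illustrates why the number of matches of an and-selection is the product of the numbers of matches of its operands. Each match is modeled simply as a list of token positions; the function name combine_and and the positions used are hypothetical, and the normative construction is the one given in 4.2.7.2 FTAnd.

```python
from itertools import product


def combine_and(matches_a, matches_b):
    """Pair every match of the first operand with every match of the second."""
    return [a + b for a, b in product(matches_a, matches_b)]


# Hypothetical token positions: two occurrences of "usability", one of "testing".
usability = [[1], [5]]
testing = [[6]]
both = combine_and(usability, testing)
print(len(both))  # 2

# If either operand has zero matches, the and-selection has zero matches.
print(len(combine_and([[3]], [])))  # 0
```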

3.4.3 Mild-Not Selection

[Definition: A mild-not selection combines two full-text selections using the not in operator.]

The not in operator is a milder form of the operator combination ftand ftnot. The selection A not in B matches a token sequence that matches A, but not when it is a part of a match of B. In contrast, A ftand ftnot B only finds matches when the token sequence contains A and does not contain B.

As an example, consider a search for "Mexico" not in "New Mexico". This may return, among others, a document which is all about "Mexico" but mentions at the end that "New Mexico was named after Mexico". The occurrence of "Mexico" in "New Mexico" is not considered, but other occurrences of "Mexico" are matched. Note that this document would not be matched by the full-text selection "Mexico" ftand ftnot "New Mexico".

A match to a mild-not selection must contain at least one token occurrence that satisfies the first condition and does not satisfy the second condition. If it contains a token occurrence that satisfies both the first and the second condition, the occurrence is not considered as a match.

The following expression returns true, because "usability" appears in the title and the p elements and the occurrence within the phrase "Usability Testing" in the title element is not considered:

/book ftcontains "usability" not in "usability testing"
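
Note: The following non-normative Python sketch illustrates a simplified reading of the mild-not rule, using the "Mexico" not in "New Mexico" example. Each occurrence is modeled as a (first, last) pair of token positions, and an occurrence of the first operand is dropped when it lies inside an occurrence of the second operand. The function name mild_not and this containment test are assumptions of the sketch, not the normative construction.

```python
def mild_not(matches_a, matches_b):
    """Keep the occurrences of A that are not part of any occurrence of B."""

    def inside(inner, outer):
        return outer[0] <= inner[0] and inner[1] <= outer[1]

    return [a for a in matches_a if not any(inside(a, b) for b in matches_b)]


# Tokens of "New Mexico was named after Mexico", numbered from 1.
mexico = [(2, 2), (6, 6)]
new_mexico = [(1, 2)]

remaining = mild_not(mexico, new_mexico)
print(remaining)  # [(6, 6)]
```

Because one occurrence of "Mexico" remains, the mild-not selection has a match for this text, whereas "Mexico" ftand ftnot "New Mexico" would not.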

Operands of a mild-not selection may not contain a full-text selection that evaluates to an AllMatches that contains a StringExclude. Such full-text selections are not-selections and FTWords with a cardinality constraint that uses an at most, from ... to, or exactly occurrences range.

3.4.4 Not-Selection

[Definition: A not-selection is a full-text selection starting with the prefix operator ftnot.]

A not-selection selects matches that do not satisfy the operand full-text selection. Details about how such matches are constructed are given in 4.2.7.3 FTUnaryNot.

The following expression returns the empty sequence, because all book elements contain "usability":

/book[. ftcontains ftnot "usability"]

The following expression returns true, because book elements contain "information" and "retrieval" but not "information retrieval":

/book ftcontains "information" ftand
"retrieval" ftand ftnot "information retrieval"

The following expression returns book elements containing "web site usability" but not "usability testing":

/book[. ftcontains "web site usability" ftand 
ftnot "usability testing"]

3.5 Positional Filters

[157]    FTPosFilter    ::=    FTOrder | FTWindow | FTDistance | FTScope | FTContent

[Definition: Positional filters are postfix operators that serve to filter matches based on various constraints on their positional information.]

Recall that the grammar rule for FTSelection allows an arbitrary number of positional filters to follow an FTOr. Multiple adjacent positional filters are applied from left to right, i.e., the first filter is applied to the result of the FTOr, the second is applied to the result of that first application, and so on.

3.5.1 Ordered Selection

[158]    FTOrder    ::=    "ordered"

[Definition: An ordered selection consists of a full-text selection followed by the postfix operator "ordered".] An ordered selection controls the order of tokens and phrases to be the same as the order in which they are written in the operand selection.

The default is unordered. Unordered is in effect when ordered is not specified in the query. Unordered cannot be written explicitly in the query.

An ordered selection selects matches which satisfy the operand full-text selection and for which the order the matching tokens have in the text is the same order that the corresponding query tokens have in the operand selection.

The following expression returns true, because titles of book elements contain "web site" and "usability" in the order in which they are written in the query, i.e., "web site" must precede "usability":

/book/title ftcontains ("web site" ftand "usability")
ordered 

The following expression returns false, because although "Montana" and "Millicent" both appear in the book element, they do not appear in the order they are written in the query:

/book[@number="1"] ftcontains ("Montana" ftand
"Millicent") ordered 
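
Note: The following non-normative Python sketch illustrates the ordering test. Each matched token or phrase is modeled as a (query position, text position) pair; the function name is_ordered and the positions used are hypothetical. A match passes when sorting by query position leaves the text positions in non-decreasing order.

```python
def is_ordered(string_includes):
    """string_includes: (query_pos, text_pos) pairs belonging to one match."""
    by_query = sorted(string_includes)  # sorts by query position first
    text_positions = [text_pos for _, text_pos in by_query]
    return all(a <= b for a, b in zip(text_positions, text_positions[1:]))


# "web site" written first in the query and found first in the text.
print(is_ordered([(1, 3), (2, 5)]))  # True

# "Montana" written first in the query but found after "Millicent" in the text.
print(is_ordered([(1, 9), (2, 4)]))  # False
```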

3.5.2 Window Selection

[159]    FTWindow    ::=    "window" AdditiveExpr FTUnit
[161]    FTUnit    ::=    "words" | "sentences" | "paragraphs"

[Definition: A window selection consists of a full-text selection followed by one of the (complex) postfix operators derived from FTWindow.] A window selection selects matches which satisfy the operand full-text selection and for which the matched tokens and phrases, more precisely the individual StringIncludes of that match, are found within a number of FTUnits (words, sentences, and paragraphs). The number of FTUnits is specified by an AdditiveExpr that is converted as though it were an argument to a function with the expected parameter type of "xs:integer".

A window selection may cross element boundaries. The size of the window is not affected by the presence or absence of element boundaries. Stop words are included in the computation of the window size whether they are ignored by the query or not.

A match of an FTSelection is considered a match within a window, if there exists a window of at most the given number of consecutive units (tokens, sentences, or paragraphs) in the document within which all StringIncludes of the match lie.

The following expression returns true, because "web", "site", and "usability" are within a window of 5 tokens in the title element:

/book/title ftcontains "web" ftand "site"
ftand "usability" window 5 words

The following expression returns true, because "web" and "site" in the order they are written in the query and either "usability" or "testing" are within a window of at most 10 tokens:

/book ftcontains ("web" ftand "site" ordered)
ftand ("usability" ftor "testing") window 10 words

The following expression returns true, because the title element contains "Web Site Usability". A similar query on the p element would not return true, because its occurrences of "web site" and "usability" are not within a window of 3:

/book//title ftcontains "web site" ftand
"usability" window 3 words

The following expression returns the empty sequence, because in the selected book element, there is no occurrence of "efficient" within a window of 3 tokens which would not also contain an occurrence of "and":

/book[@number="1" and . ftcontains "efficient" 
ftand ftnot "and" window 3 words]

In order to allow meaningful results for nested positional filters, e.g., a window selection embedded inside a distance selection, the resulting matches for window selections are formed from the input matches that satisfy the window constraint as follows. All StringIncludes of such a match are coerced into a single StringInclude that spans all token positions from the smallest to the largest position of any input StringIncludes. This is explained in more detail in Section 3.5.3 Distance Selection.
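
Note: The following non-normative Python sketch illustrates the window test and the coercion described above for the unit "words" only. Each StringInclude is modeled as a (first, last) pair of token positions; the function names within_window and coerce and the positions used are hypothetical. Sentence and paragraph windows would need sentence or paragraph numbers instead of token positions.

```python
def within_window(string_includes, size):
    """True if all StringIncludes lie in some window of `size` consecutive words."""
    first = min(start for start, _ in string_includes)
    last = max(end for _, end in string_includes)
    return last - first + 1 <= size


def coerce(string_includes):
    """Merge the StringIncludes of a match into one spanning StringInclude."""
    first = min(start for start, _ in string_includes)
    last = max(end for _, end in string_includes)
    return (first, last)


# Three single tokens at positions 1, 2 and 3 fit in a window of 5 words.
print(within_window([(1, 1), (2, 2), (3, 3)], 5))  # True

# A two-token phrase at 1-2 and a token at 10 do not fit in a window of 3 words.
print(within_window([(1, 2), (10, 10)], 3))  # False

print(coerce([(1, 1), (2, 2), (3, 3)]))  # (1, 3)
```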

3.5.3 Distance Selection

[160]    FTDistance    ::=    "distance" FTRange FTUnit
[156]    FTRange    ::=    ("exactly" AdditiveExpr)
| ("at" "least" AdditiveExpr)
| ("at" "most" AdditiveExpr)
| ("from" AdditiveExpr "to" AdditiveExpr)

[Definition: A distance selection consists of a full-text selection followed by one of the (complex) postfix operators derived from FTDistance.]

A distance selection selects matches which satisfy the operand full-text selection and for which the matched tokens and phrases satisfy the specified distance conditions. Distance is specified in units of FTUnits (words, sentences, and paragraphs). The number of intervening FTUnits is specified in the integer value of FTRange.

FTRange specifies a range of integer values, providing a minimum and maximum value. Each one of the AdditiveExpr specified in an FTRange is converted as though it were an argument to a function with the expected parameter type of "xs:integer".

Let the value of the first (or only) operand be M. If "from" is specified, let the value of the second operand be N. A distance selection may cross element boundaries when computing distance.

The following rule applies to the computation of distance:

  • Zero words (sentences, paragraphs) means adjacent tokens (sentences, paragraphs).

If "exactly" is specified, then the range is the closed interval [M, M]. If "at least" is specified, then the range is the half-closed interval [M, unbounded). If "at most" is specified, then the range is the closed interval [0, M]. If "from-to" is specified, then the range is the closed interval [M, N]. Note: If M is greater than N, the range is empty.

Here are some examples of FTRanges:

  1. 'exactly 0' specifies the range [0, 0].

  2. 'at least 1' specifies the range [1, unbounded).

  3. 'at most 1' specifies the range [0, 1].

  4. 'from 5 to 10' specifies the range [5, 10].

The distances computed by a distance selection are not affected by the presence or absence of element boundaries in the text. Stop words are counted in those computations whether they are ignored or not.
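
Note: The following non-normative Python sketch illustrates how the four forms of FTRange map to intervals and how a distance in words can be counted as the number of intervening tokens, so that adjacent tokens are at distance zero. The names ft_range, in_range, and word_distance are hypothetical, and an unbounded upper limit is represented by None as an assumption of the sketch.

```python
from typing import Optional, Tuple

Range = Tuple[int, Optional[int]]  # upper bound None means unbounded


def ft_range(kind: str, m: int, n: Optional[int] = None) -> Range:
    """Return the interval denoted by an FTRange with operands M and N."""
    if kind == "exactly":
        return (m, m)
    if kind == "at least":
        return (m, None)
    if kind == "at most":
        return (0, m)
    if kind == "from":
        if n is None:
            raise ValueError('"from" requires a second operand')
        return (m, n)  # empty when M is greater than N
    raise ValueError(f"unknown range kind: {kind}")


def in_range(value: int, rng: Range) -> bool:
    low, high = rng
    return value >= low and (high is None or value <= high)


def word_distance(end_of_first: int, start_of_second: int) -> int:
    """Number of tokens between two matches; zero means adjacent tokens."""
    return start_of_second - end_of_first - 1


print(ft_range("exactly", 0))      # (0, 0)
print(ft_range("at least", 1))     # (1, None)
print(ft_range("at most", 1))      # (0, 1)
print(ft_range("from", 5, 10))     # (5, 10)
print(in_range(word_distance(3, 4), ft_range("exactly", 0)))  # True
```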

The following expression returns false,