This document is also available in these non-normative formats: XML.
Copyright © 2005 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This document defines the syntax and formal semantics of XQuery 1.0 and XPath 2.0 Full-Text which is a language that extends XQuery 1.0 [XQuery 1.0: An XML Query Language] and XPath 2.0 [XML Path Language (XPath) 2.0] with full-text search capabilities.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document has been produced following the procedures set out for the W3C Process. This document was produced through the efforts of XML Query Working Group and the XSL Working Group (both part of the XML Activity). It is designed to be read in conjunction with the following documents: W3C XQuery and XPath Full-Text Requirements [XQuery and XPath Full-Text Requirements] and the W3C XQuery Full-Text Use Cases [XQuery 1.0 and XPath 2.0 Full-Text Use Cases].
This is the fourth version of this document. Since the last version was published, several technical and editorial changes have been made to all the sections of the document. Among the most significant changes are: a reformulation of FTIgnore, including alignment of the specifications in Section 3 and Section 4; a more complete normalization of the rules for matching, significantly simplifying the behavior; a thorough pass through the entire document to correct grammar, spelling, and punctuation, resulting in significantly higher document quality; the distance between sentences and between paragraphs has been respecified to align with the distance between words (that is, adjacent sentences or paragraphs now have a distance between them of zero sentences or paragraphs, respectively); and the addition of two new appendices, one summarizing the error codes used in the Full-Text document and the other summarizing all items specified in the document to be implementation-defined.
The text of the XQuery functions used to define the semantics has not been completely syntax checked; that work is ongoing.
This is a public W3C Working Draft for review by W3C members and other interested parties. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
Public comments on this document and its open issues are invited. Comments should be entered into the last-call issue tracking system for this specification (instructions can be found at http://www.w3.org/XML/2005/04/qt-bugzilla). If access to that system is not feasible, you may send your comments to the W3C mailing list, public-qt-comments@w3.org (http://lists.w3.org/Archives/Public/public-qt-comments/) with "[FT]" at the beginning of the subject field of email messages involving such comments.
The patent policy for this document is specified in the 5 February 2004 W3C Patent Policy. Patent disclosures relevant to this specification may be found on the XML Query Working Group's patent disclosure page and the XSL Working Group's patent disclosure page. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy.
1 Introduction
1.1 Full-Text Search and XML
1.2 Organization of this document
1.3 A word about namespaces
2 Full-Text Extensions to XQuery and XPath
2.1 Expression FTContainsExpr
2.1.1 FTContainsExpr Description
2.1.2 FTContainsExpr Examples
2.2 Score Variables
2.2.1 Using Weights Within a Scored FTContainsExpr
2.3 Extensions to the Static Context
3 FTSelections, FTMatchOptions, and FTIgnoreOption
3.1 FTSelection
3.1.1 FTSelection Example
3.1.2 FTWords
3.1.3 FTOr
3.1.4 FTAnd
3.1.5 FTMildNot
3.1.6 FTUnaryNot
3.1.7 FTOrder
3.1.8 FTScope
3.1.9 FTDistance
3.1.10 FTWindow
3.1.11 FTTimes
3.1.12 FTContent
3.2 FTMatchOptions
3.2.1 FTCaseOption
3.2.2 FTDiacriticsOption
3.2.3 FTStemOption
3.2.4 FTThesaurusOption
3.2.5 FTStopwordOption
3.2.6 FTLanguageOption
3.2.7 FTWildCardOption
3.3 FTIgnoreOption
4 Semantics
4.1 Introduction
4.2 Nested XQuery 1.0 and XPath 2.0 Expressions
4.2.1 Left-hand Side of a FTContainsExpr
4.2.2 FTWords
4.2.3 FTRangeSpec
4.2.4 FTStopWordOption
4.2.5 FTThesaurusOption
4.2.6 FTLanguageOption
4.2.7 Tokenization
4.3 Evaluation of FTSelections
4.3.1 AllMatches
4.3.1.1 Formal Model
4.3.1.2 Examples
4.3.1.3 XML representation
4.3.1.4 Match and AllMatches Normal Form
4.3.1.5 The normalizeAllMatches function
4.3.2 FTSelections
4.3.2.1 XML Representation
4.3.2.2 The evaluate function
4.3.2.3 Formal semantics functions
4.3.2.4 FTWords
4.3.2.5 FTOr
4.3.2.6 FTAnd
4.3.2.7 FTUnaryNot
4.3.2.8 FTMildNot
4.3.2.9 FTOrder
4.3.2.10 FTScope
4.3.2.11 FTContent
4.3.2.12 FTDistance
4.3.2.13 FTWindow
4.3.2.14 FTTimes
4.3.3 Match Options Semantics
4.3.3.1 Types
4.3.3.2 High-Level Semantics
4.3.3.3 Formal Semantics Functions
4.3.3.4 FTCaseOption
4.3.3.5 FTDiacriticsOption
4.3.3.6 FTStemOption
4.3.3.7 FTStopWordOption
4.3.3.8 FTLanguageOption
4.3.3.9 FTWildCardOption
4.4 XQuery 1.0 and XPath 2.0 Full-Text and Scoring Expressions
4.4.1 FTContainsExpr
4.4.1.1 Semantics of FTContainsExpr
4.4.1.2 Example
4.4.2 Scoring
A EBNF for XQuery 1.0 Grammar with Full-Text extensions
A.1 Terminal Symbols
B EBNF for XPath 2.0 Grammar with Full-Text extensions
B.1 Terminal Symbols
C Static Context Components
D Error Conditions
E References
E.1 Normative References
E.2 Non-normative References
F Acknowledgements (Non-Normative)
G Glossary (Non-Normative)
H Checklist of Implementation-Defined Features (Non-Normative)
I Issues List (Non-Normative)
J Change Log (Non-Normative)
This document defines the language and the formal semantics of XQuery 1.0 and XPath 2.0 Full-Text. This language is designed to meet the requirements identified in W3C XQuery and XPath Full-Text Requirements [XQuery and XPath Full-Text Requirements] and to support the queries in the W3C XQuery Full-Text Use Cases [XQuery 1.0 and XPath 2.0 Full-Text Use Cases].
XQuery 1.0 and XPath 2.0 Full-Text extends the syntax and semantics of XQuery 1.0 and XPath 2.0.
As XML becomes mainstream, users expect to be able to search their XML documents. This requires a standard way to do full-text search, as well as structured searches, against XML documents. A similar requirement for full-text search led ISO to define the SQL/MM-FT [SQL/MM] standard. SQL/MM-FT [SQL/MM] defines extensions to SQL to express full-text searches, providing functionality similar to that of this full-text language extension to XQuery 1.0 and XPath 2.0.
XML documents may contain highly-structured data (numbers, dates), unstructured data (untagged free-flowing text), and semi-structured data (text with embedded tags). Where a document contains unstructured or semi-structured data, it is important to be able to search using Information Retrieval techniques such as scoring and weighting.
Full-text search is different from substring search in many ways:
A full-text search searches for words and phrases rather than substrings. A substring search for news items that contain the string "lease" will return a news item that contains "Foobar Corporation releases the 20.9 version ...". A full-text search for the word "lease" will not.
There is an expectation that a full-text search will support language-based searches which substring search cannot. An example of a language-based search is "find me all the news items that contain a word with the same linguistic stem as 'mouse'" (finds "mouse" and "mice"). Another example, based on word proximity, is "find me all the news items that contain the words 'XML' and 'Query', allowing up to 3 intervening words".
Full-text search must address the vagaries and nuances of language. Search results are often of varying usefulness. When you search a web site for cameras that cost less than $100, this is an exact search. There is a set of cameras that matches this search, and a set that does not. Similarly, when you do a string search across news items for "mouse", there is only 1 expected result set. When you do a full-text search for all the news items that contain the word "mouse", you probably expect to find news items containing the word "mice", and possibly "rodents", or possibly "computers". Not all results are equal. Some results are more "mousey" than others. Because full-text search may be inexact, we have the notion of score or relevance. We generally expect to see the most relevant results at the top of the results list.
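The difference between substring search and word-based search described above can be illustrated with a short, non-normative sketch. It is written in Python rather than XQuery, and the tokenizer and the function names (tokenize, substring_search, word_search) are assumptions made for illustration only; real tokenization is implementation-defined.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: lower-cased runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def substring_search(text, needle):
    # Substring search: matches anywhere, even inside a longer word.
    return needle.lower() in text.lower()

def word_search(text, word):
    # Full-text style search: matches only whole tokens.
    return word.lower() in tokenize(text)

news_item = "Foobar Corporation releases the 20.9 version"
print(substring_search(news_item, "lease"))  # True: "lease" is inside "releases"
print(word_search(news_item, "lease"))       # False: no token equals "lease"
```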
As XQuery and XPath evolve, they may apply the notion of score to querying structured data. For example, when making travel plans or shopping for cameras, it is sometimes useful to get an ordered list of near matches in addition to exact matches. If XQuery and XPath define a generalized inexact match, we expect XQuery and XPath to utilize the scoring framework provided by XQuery and XPath Full-Text.
The following definitions apply to full-text search:
Full-text queries are performed on text which has been tokenized, i.e., broken into a sequence of words, units of punctuation, and spaces.
A word is defined as a character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be searched. Each instance of a word consists of one or more consecutive characters. Beyond that, words are implementation-defined. Note that consecutive words need not be separated by either punctuation or space, and words may overlap. A phrase is a sequence of ordered words which may contain any number of words.
Tokenization enables functions and operators which work with the relative positions of words (e.g., proximity operators). It also uniquely identifies sentences and paragraphs in which words appear. It enables functions and operators which operate on a part or the root of the word (e.g., wildcards, stemming). Whatever a tokenizer for a particular language chooses to do, it must preserve the containment hierarchy: paragraphs contain sentences which contain words. The tokenizer has to evaluate two equal strings in the same way, i.e., it should identify the same tokens. Everything else is implementation-defined.
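As a non-normative illustration of the two constraints above (the containment hierarchy and the identical treatment of equal strings), the following Python sketch assigns each word a position, a sentence number, and a paragraph number. The splitting rules (blank lines end paragraphs; ".", "!" and "?" end sentences) and all names are assumptions, not part of this specification.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    word: str
    pos: int        # absolute word position in the text
    sentence: int   # index of the containing sentence
    paragraph: int  # index of the containing paragraph

def tokenize(text):
    tokens = []
    pos = sentence = 0
    paragraphs = (p for p in text.split("\n\n") if p.strip())
    for para_no, para in enumerate(paragraphs):
        # Assumed sentence rule: split after ".", "!" or "?".
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            words = re.findall(r"\w+", sent.lower())
            if not words:
                continue
            for w in words:
                tokens.append(Token(w, pos, sentence, para_no))
                pos += 1
            sentence += 1
    return tokens
```

Because every token records both its sentence and its paragraph, and a sentence never spans two paragraphs in this sketch, paragraphs contain sentences which contain words. The function is deterministic, so two equal strings always yield the same tokens.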
This specification focuses on functionality that serves all languages. It also selectively includes functionalities useful within specific families of languages. For example, searching within sentences and paragraphs is useful to many western languages and to some non-western languages, so that functionality is incorporated into this specification.
Some XML elements represent semantic markup, e.g., <title>. Others represent formatting markup, e.g., <b> to indicate bold. Semantic markup serves well as token boundaries, while formatting markup sometimes does not. Implementations are free to provide implementation-defined ways to differentiate between the markup's effect on token boundaries during tokenization.
This document is organized as follows. We first present a high-level syntax for the XQuery 1.0 and XPath 2.0 Full-Text language along with some examples. Then, we present the syntax and examples of the basic primitives in the XQuery 1.0 and XPath 2.0 Full-Text language. This is followed by the semantics of the XQuery 1.0 and XPath 2.0 Full-Text language. The appendices provide an EBNF for the XQuery 1.0 grammar with Full-Text extensions, an EBNF for the XPath 2.0 grammar with Full-Text extensions, a list of issues, acknowledgements, and a glossary.
Certain namespace prefixes are predeclared by XQuery 1.0 and, by implication, by this specification, and bound to fixed namespace URIs. These namespace prefixes are as follows:
xml = http://www.w3.org/XML/1998/namespace
xs = http://www.w3.org/2001/XMLSchema
xsi = http://www.w3.org/2001/XMLSchema-instance
fn = http://www.w3.org/2005/xpath-functions
xdt = http://www.w3.org/2005/xpath-datatypes
local = http://www.w3.org/2005/xquery-local-functions
In addition to the prefixes in the above list, this document uses the prefix err to represent the namespace URI http://www.w3.org/2005/xqt-errors. This namespace prefix is not predeclared and its use in this document is not normative. Error codes that are not defined in this document are defined in other XQuery 1.0 and XPath 2.0 specifications, particularly [XML Path Language (XPath) 2.0] and [XQuery 1.0 and XPath 2.0 Functions and Operators].
Finally, this document uses the prefix fts to represent a namespace containing a number of functions used in this document to describe the semantics of XQuery 1.0 and XPath 2.0 Full-Text functions. There is no requirement that these functions be implemented, therefore no URI is associated with that prefix.
XQuery 1.0 and XPath 2.0 Full-Text extends the languages of XQuery 1.0 and XPath 2.0 in three ways. It:
Adds a new expression called FTContainsExpr;
Enhances the syntax of FLWOR expressions in XQuery 1.0 and for expressions in XPath 2.0 with optional score variables; and
Adds static context declarations for full-text match options to the query prolog.
XQuery 1.0 and XPath 2.0 Full-Text extends the languages of XQuery 1.0 and XPath 2.0 by adding the expression FTContainsExpr. An FTContainsExpr is similar to a comparison expression (see Section 3.5.2 General ComparisonsXQ). This grammar rule introduces FTContainsExpr.
| [50] | ComparisonExpr |
::= | FTContainsExpr ( (ValueComp | GeneralComp | NodeComp) FTContainsExpr )? |
An FTContainsExpr may be used anywhere a ComparisonExpr may be used. FTContainsExprs have higher precedence than comparison operators, so the results of FTContainsExpr may be compared without enclosing them in parentheses.
| [51] | FTContainsExpr |
::= | RangeExpr ( "ftcontains" FTSelection FTIgnoreOption? )? |
An FTContainsExpr returns a Boolean value. It returns true, if there is some node in RangeExpr that matches FTSelection. For the purpose of determining a match, certain descendants of nodes in RangeExpr may be ignored, as specified in FTIgnoreOption.
FTSelections are composed of the following ingredients:
Words and phrases that are the strings to be found as matches;
Match options, such as indicators for case sensitivity and stop words;
Boolean operators, that compose an FTSelection from simpler FTSelections; and
Constraints on the positions of matches, such as indicators for distance between words and for the cardinality of matches.
The following example in extended XQuery 1.0 returns the author of each book with a title containing a word with the same root as dog and the word cat.
for $b in /books/book
where $b/title ftcontains ("dog" with stemming) && "cat"
return $b/author
The same example in extended XPath 2.0 is written as:
/books/book[title ftcontains ("dog" with stemming) && "cat"]/author
Besides specifying a match of a full-text search as a Boolean condition, full-text search applications typically also have the ability to associate scores with the results. Such scores express the relevance of those results to the full-text search conditions.
XQuery 1.0 and XPath 2.0 Full-Text extends the languages of XQuery 1.0 and XPath 2.0 further by adding optional score variables to the for and let clauses of FLWOR expressions.
The production for the extended for clause follows.
| [35] | ForClause |
::= | "for" "$" VarName TypeDeclaration? PositionalVar? FTScoreVar? "in" ExprSingle ("," "$" VarName TypeDeclaration? PositionalVar? FTScoreVar? "in" ExprSingle)* |
| [37] | FTScoreVar |
::= | "score" "$" VarName |
When a score variable is present in a for clause, the evaluation of the expression following the in keyword must determine not only the result sequence of the expression, i.e., the sequence of items which are iteratively bound to the for variable, but also, in each iteration, the relevance "score" value of the current item. The score variable is bound to that value.
In the following example book elements are determined that satisfy the condition [content ftcontains "web site" && "usability" and .//chapter/title ftcontains "testing"]. The scores assigned to the book elements are returned.
for $b score $s
in /books/book[content ftcontains "web site" && "usability"
and .//chapter/title ftcontains "testing"]
return $s
XPath 2.0 Full-Text extends the language of XPath 2.0 in the for expression in the same way: with optional score variables. The example above is also a legal example of the XPath 2.0 extension.
Scores are typically used to order results, as in the following, more complete example.
for $b score $s
in /books/book[content ftcontains "web site" && "usability"]
where $s > 0.5
order by $s descending
return <result>
<title> {$b//title} </title>
<score> {$s} </score>
</result>
The score variable is bound to a value which reflects the relevance of the match criteria in the FTSelections to the nodes in the respective RangeExprs. The calculation of relevance is implementation-dependent, but score evaluation must follow these rules:
Score values are of type xs:float in the range [0, 1].
For score values greater than 0, a higher score must imply a higher degree of relevance.
Similar to their use in a for clause, score variables may be specified in a let clause. A score variable in a let clause is also bound to the score of the expression evaluation, but in the let clause one score is determined for the complete result. The let variable may be dropped from the let clause, if the score variable is present.
The production for the extended let clause follows.
| [38] | LetClause |
::= | (("let" "$" VarName TypeDeclaration? FTScoreVar?) | ("let" "score" "$" VarName)) ":=" ExprSingle ("," (("$" VarName TypeDeclaration? FTScoreVar?) | FTScoreVar) ":=" ExprSingle)* |
While the score option in a for clause conveniently lets the filtering expression that drives the iteration also determine the scores, the let clause syntax makes it possible to separate the filtering expression from the scoring expression. The following is an example of this.
for $b in /books/book[.//chapter/title ftcontains "testing"]
let score $s := $b/content ftcontains "web site" && "usability"
order by $s descending
return <result score="{$s}">{$b}</result>
This example returns book elements with chapter titles that contain "testing". Along with the book elements scores are returned. These scores, however, reflect whether the book content contains "web site" and "usability".
Note that the score of an FTContainsExpr is not required to be 0 if the expression evaluates to false, nor to be non-zero if the expression evaluates to true. Hence, in the example above it is not possible to infer the Boolean value of the FTContainsExpr in the let clause from the calculated score of a returned result element. For instance, an implementation may want to assign a non-zero score to a book that contains only "web site", but not "usability", as this may be considered more relevant than a book that contains neither.
The use of score variables introduces a second-order aspect to the evaluation of expressions which cannot be emulated by (first-order) XQuery functions. Consider replacing the clause let score $s := FTContainsExpr with the following:
let $s := score(FTContainsExpr)
where a function score is applied to some FTContainsExpr. If the function score were first-order, it would only be applied to the result of the evaluation of its argument, which is one of the Boolean constants true or false. Hence, there would be at most two possible values such a score function would be able to return and no further differentiation would be possible.
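The argument above can be made concrete with a small, non-normative Python sketch. The function names and the term-frequency formula are assumptions chosen only for illustration; this specification leaves the calculation of relevance implementation-dependent.

```python
def contains(text, word):
    # Stand-in for an FTContainsExpr: returns only a Boolean.
    return word in text.lower().split()

def first_order_score(result: bool) -> float:
    # A first-order function sees only the Boolean result of its argument,
    # so it can return at most two distinct values.
    return 1.0 if result else 0.0

def second_order_score(text, word):
    # A scoring mechanism that inspects the expression's operands can
    # differentiate further (assumed formula: term frequency in [0, 1]).
    words = text.lower().split()
    return words.count(word) / len(words) if words else 0.0

docs = ["web web site", "web site usability testing"]
print({first_order_score(contains(d, "web")) for d in docs})  # {1.0}
print(second_order_score(docs[0], "web") > second_order_score(docs[1], "web"))  # True
```

Both documents contain "web", so the first-order function cannot tell them apart, while the second computation ranks the first document higher.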
Scoring may be influenced by adding weight declarations to individual search words, phrases, and expressions. Weight declarations are described in detail in Section 3.1.
for $b in /books/book
let score $s := $b/content ftcontains ("web site" weight 0.2)
&& ("usability" weight 0.8)
return <result score="{$s}">{$b}</result>
The effect of weights on the result score is implementation-dependent. However, weight declarations must follow these rules:
Weights in an FTContainsExpr are significant only in relation to each other; and
When no explicit weight is specified, the default weight is 0.5.
Weight declarations in an FTContainsExpr for which no scores are evaluated are ignored.
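One way an implementation might honour these rules is sketched below in Python (non-normative). The normalized weighted sum, the whitespace tokenization, and the names toy_score and contains_phrase are assumptions; the effect of weights on scores remains implementation-dependent.

```python
DEFAULT_WEIGHT = 0.5  # default weight when none is specified

def contains_phrase(text, phrase):
    # Word-based phrase match on a simple whitespace tokenization.
    words, target = text.lower().split(), phrase.lower().split()
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))

def toy_score(text, weighted_terms):
    # weighted_terms: list of (phrase, weight or None) pairs.
    weights = [DEFAULT_WEIGHT if w is None else w for _, w in weighted_terms]
    total = sum(weights)
    matched = sum(w for (term, _), w in zip(weighted_terms, weights)
                  if contains_phrase(text, term))
    # Dividing by the total makes weights significant only relative to each other.
    return matched / total if total else 0.0
```

In this sketch, scaling every weight by the same factor leaves the score unchanged, and a text that contains only "web site" still receives a non-zero score.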
The XQuery Static Context is extended by a component for each of the full-text match options. Thus, the default of a match option in a query may be changed by providing a setting in the static context using the following declaration syntax.
| [6] | Prolog |
::= | ((DefaultNamespaceDecl | Setter | NamespaceDecl | Import) Separator)* ((VarDecl | FunctionDecl | OptionDecl | FTOptionDecl) Separator)* |
| [14] | FTOptionDecl |
::= | "declare" "ft-option" FTMatchOption |
Match options modify the match semantics of full-text expressions. They are described in detail in Section 3.2 FTMatchOptions. When a match option is specified explicitly in a query, that setting overrides the setting of the respective match option in the static context.
This section describes FTSelection which contains the full-text operators in the FTContainsExpr, and the match options in FTMatchOptions which modify the matching semantics of the full-text selection expressions.
The FTSelection production specifies the possible full-text search conditions.
| [144] | FTSelection |
::= | FTOr (FTMatchOption | FTProximity)* ("weight" DecimalLiteral)? |
The syntax and semantics of the individual full-text selection operators follow.
This XML document fragment is the source document for examples in this section.
Tokenization is implementation-defined. A sample tokenization is used for the examples in this section. The results may be different for other tokenizations.
Unless stated otherwise, the results assume a case-insensitive match.
<book number="1">
<title shortTitle="Improving Web Site Usability">Improving
the Usability of a Web Site Through Expert Reviews and
Usability Testing</title>
<author>Millicent Marigold</author>
<author>Montana Marigold</author>
<editor>Véra Tudor-Medina</editor>
<content>
<p>The usability of a Web site is how well the
site supports the users in achieving specified
goals. A Web site should facilitate learning,
and enable efficient and effective task
completion, while propagating few errors.
</p>
<note>This book has been approved by the Web Site
Users Association.
</note>
</content>
</book>
FTWords specifies the words and phrases to be searched for in the left-hand side argument of FTContainsExpr.
| [150] | FTWords |
::= | (Literal | VarRef | ContextItemExpr | FunctionCall | ("{" Expr "}")) FTAnyallOption? |
The right-hand side of the above production must evaluate to a sequence of string values or nodes of type "xs:string". The result is then atomized into a sequence of strings which is tokenized into a sequence of words and phrases. If the atomized sequence is not a subtype of "xs:string*", an error is raised: [err:XPTY0004]XP.
If the "any" option is specified, a match occurs, if and only if at least one word or phrase in the sequence has a match in the searched text.
If the "all" option is specified, a match occurs, if and only if all of the words and phrases in the sequence are matched in the searched text.
If the "phrase" option is specified, the sequence of words and phrases is used to create a single phrase by concatenating the words and phrases and interleaving whitespace. A match occurs, if and only if the resulting phrase is matched in the searched text.
If the "any word" option is specified, a match occurs, if and only if at least one word in the sequence of words and phrases is matched in the searched text.
If the "all word" option is specified, a match occurs, if and only if all words in the sequence of words and phrases are matched in the searched text.
If no option is specified, "any" is the default.
If the result is a single string, "any", "all", and "phrase" are equivalent.
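The five options can be sketched as follows. This is a non-normative Python illustration, not the formal semantics given in Section 4; the tokenizer, the case-insensitive matching, and the names ft_words and has_phrase are assumptions.

```python
import re

def tokenize(s):
    # Hypothetical tokenizer: lower-cased word characters.
    return re.findall(r"\w+", s.lower())

def has_phrase(text_tokens, phrase_tokens):
    n = len(phrase_tokens)
    return n > 0 and any(text_tokens[i:i + n] == phrase_tokens
                         for i in range(len(text_tokens) - n + 1))

def ft_words(text, strings, option="any"):
    tokens = tokenize(text)
    phrases = [tokenize(s) for s in strings]     # one phrase per string
    words = [w for p in phrases for w in p]      # all individual words
    if option == "any":
        return any(has_phrase(tokens, p) for p in phrases)
    if option == "all":
        return all(has_phrase(tokens, p) for p in phrases)
    if option == "phrase":
        return has_phrase(tokens, words)         # concatenated into one phrase
    if option == "any word":
        return any(w in tokens for w in words)
    if option == "all word":
        return all(w in tokens for w in words)
    raise ValueError(f"unknown option: {option}")
```

For a single string, "any", "all", and "phrase" reduce to the same phrase test in this sketch, which mirrors the equivalence stated above.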
/book[@number="1" and ./title ftcontains "Expert"]
returns the book element whose number is 1, because its title element contains the word "Expert".
/book[@number="1" and ./title ftcontains "Expert Reviews"]
returns the book element whose number is 1, because its title element contains the phrase "Expert Reviews".
/book[@number="1" and ./title ftcontains {"Expert",
"Reviews"} all]
returns the book element whose number is 1, because its title element contains two words "Expert" and "Reviews".
/book[@number="1"]//p ftcontains "Web Site Usability"
returns false, because the p element doesn't contain the phrase "Web Site Usability" although it contains all of the words in the phrase.
for $book in /book[.//author ftcontains "Marigold"]
let score $score := $book/title ftcontains "Web Site Usability"
where $score > 0.8
order by $score descending
return $book/@number
returns book numbers of book elements by "Marigold" with a title about "Web Site Usability" sorting them in descending score order.
| [145] | FTOr |
::= | FTAnd ( "||" FTAnd )* |
FTOr finds matches that satisfy at least one of the selection criteria.
A match must satisfy at least one of the FTSelection criteria.
/book[.//author ftcontains "Millicent" || "Voltaire"]
returns the book element written by "Millicent".
| [146] | FTAnd |
::= | FTMildnot ( "&&" FTMildnot )* |
FTAnd finds matches that satisfy all of the selection criteria.
A match must satisfy all of the FTSelection criteria which are specified by one or more FTMildNot expressions.
/book[@number="1"]/title ftcontains ("usability" && "testing")
returns true, since the book title contains "usability" and "testing".
/book/author ftcontains "Millicent" && "Montana"
returns false, because "Millicent" and "Montana" are not contained by the same author element in any book element.
| [147] | FTMildnot |
::= | FTUnaryNot ( "not" "in" FTUnaryNot )* |
FTMildNot is a milder form of && ! (and not). 'a not in b' matches an expression that contains "a", but not when it is a part of "b". For example, a search for "Mexico" not in "New Mexico" returns, among others, a document which is all about "Mexico" but mentions at the end that "New Mexico was named after Mexico", which would not be returned by an "and not" search.
A match to FTMildNot must contain at least one word occurrence that satisfies the first condition and does not satisfy the second condition. If it contains a word occurrence that satisfies both the first and the second condition, the occurrence is not considered as a result.
/book ftcontains "usability" not in "usability testing"
returns true, because "usability" appears in the title and the p elements and the occurrence within the phrase "Usability Testing" in the title element is not considered.
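The behaviour of 'a not in b' can be approximated by the following non-normative Python sketch. It simplifies the formal semantics by discarding an occurrence of "a" only when it lies entirely inside an occurrence of "b". The tokenizer and the names mild_not and occurrences are assumptions.

```python
import re

def tokenize(s):
    return re.findall(r"\w+", s.lower())

def occurrences(tokens, phrase):
    # Returns (first, last) word positions of every occurrence of phrase.
    n = len(phrase)
    return [(i, i + n - 1) for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == phrase]

def mild_not(text, a, b):
    tokens = tokenize(text)
    b_spans = occurrences(tokens, tokenize(b))
    # Keep occurrences of a that are not part of any occurrence of b.
    return [span for span in occurrences(tokens, tokenize(a))
            if not any(lo <= span[0] and span[1] <= hi for lo, hi in b_spans)]
```

A non-empty result corresponds to a match. For the sample title, only the first occurrence of "usability" survives, because the second is part of "Usability Testing".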
The right-hand side of an FTMildNot may not contain an FTSelection that evaluates to an AllMatches that contains a StringExclude. Such FTSelections are FTUnaryNot and FTTimes with at most, from-to, and exactly occurrence ranges.
| [148] | FTUnaryNot |
::= | ("!")? FTWordsSelection |
FTUnaryNot finds matches that do not satisfy the selection criteria.
/book[. ftcontains ! "usability"]
returns the empty sequence, because all book elements contain "usability".
/book ftcontains "information" && "retrieval" && ! "information retrieval"
returns true, because book elements contain "information" and "retrieval" but not "information retrieval".
/book[. ftcontains "web site usability" && !"usability testing"]
returns book elements containing "web site usability" but not "usability testing".
| [152] | FTOrderedIndicator |
::= | "ordered" |
FTOrder controls the order of words and phrases to be the same as the order in which they are written in the query.
The default is unordered. Unordered is in effect when ordered is not specified in the query. Unordered cannot be written explicitly in the query.
FTOrder finds matches which must satisfy the nested selection condition and the match must contain the words in the order specified in the query.
/book/title ftcontains ("web site" && "usability")
ordered
returns true, because titles of book elements contain "web site" and "usability" in the order in which they are written in the query, i.e., "web site" must precede "usability".
/book[@number="1"]/title ftcontains ("Montana" &&
"Millicent") ordered
returns false, because although "Montana" and "Millicent" appear in the title element, they do not appear in the order they are written in the query.
| [170] | FTScope |
::= | ("same" | "different") FTBigUnit |
| [172] | FTBigUnit |
::= | "sentence" | "paragraph" |
FTScope finds words and phrases contained in the same or a different scope.
Possible scopes are sentences and paragraphs.
By default, there is no restriction on the scope of the matches.
If two words appear in the same sentence and in different sentences, then both same sentence and different sentence return true. The same is true for same paragraph and different paragraph.
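The observation that both scopes can hold at once follows from the fact that each test only requires some pair of occurrences to qualify. A non-normative Python sketch, with an assumed sentence splitter and hypothetical function names:

```python
import re
from itertools import product

def sentence_ids(text, word):
    # Sentence index of every occurrence of word (assumed rule: ".", "!", "?" end a sentence).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [i for i, s in enumerate(sentences)
            for w in re.findall(r"\w+", s.lower()) if w == word.lower()]

def same_sentence(text, w1, w2):
    pairs = product(sentence_ids(text, w1), sentence_ids(text, w2))
    return any(a == b for a, b in pairs)

def different_sentence(text, w1, w2):
    pairs = product(sentence_ids(text, w1), sentence_ids(text, w2))
    return any(a != b for a, b in pairs)
```

When a word occurs in more than one sentence, one pair of occurrences may satisfy "same sentence" while another pair satisfies "different sentence".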
/book ftcontains "usability" && "Marigold" same sentence
returns false, because the words "usability" and "Marigold" are not contained within the same sentence.
/book ftcontains "usability" && "Marigold" different sentence
returns true, because the words "usability" and "Marigold" are contained within different sentences.
/book[. ftcontains "usability" && "testing" same paragraph]
returns a book element, because it contains "usability" and "testing" in the same paragraph.
/book[. ftcontains "site" && "errors" same sentence]
returns a book element, because "site" and "errors" appear in the same sentence.
Some subtle relationships between FTScope and FTDistance will be discussed in Section 4.
| [167] | FTDistance |
::= | "distance" FTRange FTUnit |
| [166] | FTRange |
::= | ("exactly" UnionExpr) | ("at" "least" UnionExpr) | ("at" "most" UnionExpr) | ("from" UnionExpr "to" UnionExpr) |
| [171] | FTUnit |
::= | "words" | "sentences" | "paragraphs" |
FTDistance finds matches by specifying the distance between words and phrases in FTUnit (words, sentences, and paragraphs). The number of intervening FTUnits is specified in the integer value of FTRange.
FTRange specifies a range of integer values, providing a minimum and maximum value. Each UnionExpr in an FTRange must evaluate (after atomization) to a singleton sequence with an atomic value of type "xs:integer". Otherwise, an error is raised [err:XPTY0004]XP.
Let the value of the first (or only) UnionExpr be M. If "from" is specified, let the value of the second UnionExpr be N. FTDistance may cross element boundaries when computing distance.
The following rule applies to FTDistance:
Zero words (sentences, paragraphs) means adjacent words (sentences, paragraphs).
If "exactly" is specified, then the range is the closed interval [M, M]. If "at least" is specified, then the range is the half-closed interval [M, unbounded). If "at most" is specified, then the range is the closed interval [0, M]. If "from-to" is specified, then the range is the closed interval [M, N].
Here are some examples of FTRanges:
'exactly 0' specifies the range [0, 0].
'at least 1' specifies the range [1, unbounded).
'at most 1' specifies the range [0, 1].
'from 5 to 10' specifies the range [5, 10].
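The mapping from an FTRange to an interval can be written down directly. The following Python sketch is non-normative; the function names are hypothetical, and a Python TypeError stands in for the type error raised for non-integer operands.

```python
import math

def ft_range(kind, m, n=None):
    # Returns (lower, upper) bounds; math.inf models "unbounded".
    for value in (m,) if n is None else (m, n):
        if not isinstance(value, int):
            raise TypeError("FTRange operands must be integers")
    if kind == "exactly":
        return (m, m)
    if kind == "at least":
        return (m, math.inf)
    if kind == "at most":
        return (0, m)
    if kind == "from-to":
        if n is None:
            raise ValueError("from-to requires two values")
        return (m, n)
    raise ValueError(f"unknown range kind: {kind}")

def in_range(value, bounds):
    lower, upper = bounds
    return lower <= value <= upper
```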
The distances computed by FTDistance are not affected by the presence or absence of element boundaries in the text. Stop words are counted in those computations whether they are ignored or not.
/book ftcontains ("information" &&
"retrieval") not in ("information" && "retrieval"
distance at least 11 words)
returns false, because "information" and "retrieval" are at least 11 words apart.
/book ftcontains "web" && "site" && "usability" distance at most 2 words
returns true, because "web", "site", and "usability" have at most 2 intervening words between them.
/book[. ftcontains "web site" && "usability" distance at most 1 words]/title
returns the book title. A similar query for the p element would return false because "web site" and "usability" have two intervening words between them.
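As a rough illustration of the distance rule, the following Python sketch counts the intervening words between consecutive matched positions and tests them against a range. It assumes single-word matches identified by their relative word positions; the function names and the sample positions are hypothetical, and phrase matches are not modelled.

```python
def intervening_words(positions):
    """Number of words between consecutive matched positions (0 = adjacent)."""
    ordered = sorted(positions)
    return [b - a - 1 for a, b in zip(ordered, ordered[1:])]


def satisfies_distance(positions, low, high):
    """True if every gap between consecutive matches lies in [low, high]."""
    return all(low <= gap <= high for gap in intervening_words(positions))
```

With matches at positions 6, 7, and 9, `satisfies_distance([6, 7, 9], 0, 2)` holds, because the gaps are 0 and 1 words.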
| [168] | FTWindow |
::= | "window" UnionExpr FTUnit |
FTWindow finds matches within a number of FTUnits (words, sentences, or paragraphs). The number of FTUnits is specified as an integer.
FTWindow may cross element boundaries. The size of the window is not affected by the presence or absence of element boundaries. Stop words are included in those computations whether they are ignored or not.
UnionExpr must evaluate to an atom of type "xs:integer".
A match of an FTSelection is considered a match within a window, if there exists a window of the given number of consecutive units (words, sentences, or paragraphs) in the document within which the match lies.
/book/title ftcontains "web" && "site" && "usability" window 5 words
returns true, because "web", "site", and "usability" are within a window of 5 words in the title element.
/book ftcontains ("web" && "site" ordered)
&& ("usability" || "testing") window 10 words
returns true, because "web" and "site" in the order they are written in the query and either "usability" or "testing" are within a window of at most 10 words.
/book//title ftcontains "web site" && "usability" window 3 words
returns true, because the title element contains "Web Site Usability". A similar query on the p element would not return true, because its occurrences of "web site" and "usability" are not within a window of 3 words.
/book[@number="1" and . ftcontains "efficient" && ! "and" window 3 words]
returns the empty sequence, because in the selected book element, there is no occurrence of "efficient" within a window of 3 words which would not also contain an occurrence of "and".
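The window condition can be sketched in Python as follows. The sketch assumes that a match is given as the relative word positions of its matched words, and that a window of N words covers N consecutive positions; `within_window` is a hypothetical name.

```python
def within_window(positions, size):
    """True if all matched positions fit in `size` consecutive word positions."""
    if not positions:
        return True  # nothing to place, so any window suffices
    return max(positions) - min(positions) + 1 <= size
```

A match at positions 5, 6, and 7 fits a window of 3 words, whereas a match at positions 5, 6, and 9 needs a window of at least 5 words.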
| [169] | FTTimes |
::= | "occurs" FTRange "times" |
FTTimes finds matches in which an FTSelection occurs a specified number of times.
FTTimes limits the number of different occurrences of FTSelection, within the specified range.
In the document fragment "very very big":
The FTSelection "very big" has 1 occurrence consisting of the second "very" and "big".
The FTSelection "very && big" has 2 occurrences; one consisting of the first "very" and "big", and the other containing the second "very" and "big".
The FTSelection "very || big" has 3 occurrences.
The FTSelection ! "small" has 1 occurrence.
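The occurrence counts for the first three FTSelections above can be reproduced with a small Python sketch. It is a simplified model, not the formal semantics: tokens are compared as plain strings, and the helper names are hypothetical.

```python
from itertools import product

TOKENS = ["very", "very", "big"]


def positions(tokens, word):
    """1-based positions at which word occurs."""
    return [i for i, t in enumerate(tokens, start=1) if t == word]


def phrase_occurrences(tokens, phrase):
    """Each occurrence is the tuple of positions covered by the phrase."""
    words = phrase.split()
    n = len(words)
    return [tuple(range(i + 1, i + n + 1))
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == words]


def and_occurrences(tokens, a, b):
    """Every pairing of an occurrence of a with an occurrence of b."""
    return list(product(positions(tokens, a), positions(tokens, b)))


def or_occurrences(tokens, a, b):
    """Every occurrence of a, followed by every occurrence of b."""
    return [(p,) for p in positions(tokens, a) + positions(tokens, b)]
```

An FTTimes condition such as "occurs at least 2 times" then amounts to comparing the length of the relevant list with the FTRange.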
/book[. ftcontains "usability" occurs at least 2 times]/@number
returns book numbers because book elements contain 2 or more occurrences of "usability".
/book[@number="1" and title ftcontains "usability" || "testing" occurs at most 3 times]
returns the empty sequence, because there are 4 occurrences of "usability" || "testing" in the designated title.
/book ftcontains "usability" occurs at least 2 times
returns true, because the book element contains 3 occurrences of "usability" in its title element although its p element contains only 1 occurrence.
| [164] | FTContent |
::= | ("at" "start") | ("at" "end") | ("entire" "content") |
FTContent finds matches in which the words and phrases are the first, last or all of the words and phrases in the tokenized string value of the element being searched.
The "at" "start" option finds matches in which the words or phrases are the first words or phrases in the tokenized string value of the element being searched.
The "at" "end" option finds matches in which the words or phrases are the last words or phrases in the tokenized string value of the element being searched.
The "entire" "content" option finds matches in which the words or phrases are the entire content of the tokenized string value of the element being searched.
/books//title[. ftcontains "improving the usability of a web site" at start]
returns each title element starting with the phrase "improving the usability of a web site".
/books//p[. ftcontains "propagat*" && "few errors" distance at most 2 words at end]
returns each p element ending with the phrase "propagating few errors".
/books//note[. ftcontains "this site has been approved by the web site users association" entire content]
returns each note element whose entire content is "this site has been approved by the web site users association".
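The three FTContent options can be illustrated on token lists. The Python sketch below assumes that both the element's string value and the query phrase have already been tokenized and case-normalized; the function names and the sample title are hypothetical.

```python
def at_start(tokens, phrase):
    """The phrase tokens are the first tokens of the element."""
    return tokens[:len(phrase)] == phrase


def at_end(tokens, phrase):
    """The phrase tokens are the last tokens of the element."""
    start = len(tokens) - len(phrase)
    return start >= 0 and tokens[start:] == phrase


def entire_content(tokens, phrase):
    """The phrase tokens are all of the element's tokens."""
    return tokens == phrase


# Hypothetical tokenized title, used only for illustration.
title = "improving the usability of a web site through expert reviews".split()
query = "improving the usability of a web site".split()
```

Here `at_start(title, query)` holds, while `at_end(title, query)` and `entire_content(title, query)` do not.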
FTMatchOptions modify the operational semantics of the FTSelection on which they are applied.
| [153] | FTMatchOption |
::= | FTCaseOption |
FTMatchOptions set environments for the matching options of FTSelection. If a match option isn't specified explicitly in the query, its value is given by its static context component. Details about these context components, including their default values, are given in Appendix C Static Context Components.
If no match options declarations are present in the prolog and the implementation does not define any overwriting of the static context components for the match options, the query:
/book/title ftcontains "usability"
is equivalent to the query
/book/title ftcontains "usability" case insensitive
diacritics insensitive
without stemming without thesaurus
without stop words language "none" without wildcards
FTMatchOptions are applied in the order in which they are written in the query. More information on their semantics is given in 4.3.3 Match Options Semantics.
We describe each match option in more detail in the following sections.
| [154] | FTCaseOption |
::= | "lowercase" |
FTCaseOption modifies word and phrase matching by specifying how uppercase and lowercase characters are considered.
FTCaseOption influences the way FTWords is applied.
There are four possible character case options:
The option "uppercase" matches words and phrases with uppercase characters, regardless of the case of characters of the words and phrases as they are written in the query.
The option "lowercase" matches words and phrases with lowercase characters, regardless of the case of characters of the words and phrases as they are written in the query.
The option "case" "insensitive" matches the uppercase and lowercase characters of words and phrases. The case of characters as they are written in the query is not considered.
The option "case" "sensitive" matches the case of the characters in words and phrases as they are written in the query.
The default is "case insensitive".
The following table summarizes the interactions between the case match options and the use of the default collations.
| Default collation options/Case options | UCC (Unicode Codepoint Collation) | CCS (some generic case-sensitive collation) | CCI (some generic case-insensitive collation) |
| insensitive | compare as if both lower | case-insensitive variant of CCS if it exists, else error | CCI |
| sensitive | UCC | CCS | case-sensitive variant of CCI if it exists, else error |
| uppercase | uppercase(Expr) + UCC | uppercase(Expr) + CCS | CCI |
| lowercase | lowercase(Expr) + UCC | lowercase(Expr) + CCS | CCI |
Note:
In this table, "else error" means "Otherwise, an error is raised: [err:FOCH0002]FO". The phrase "if it exists" is used, because the case-sensitive collation CCS does not always have a case-insensitive variant (and, even if one exists, it may not be possible to determine it algorithmically), and because the case-insensitive collation CCI does not always have a case-sensitive variant (and, even if one exists, it may not be possible to determine it algorithmically).
/book[@number="1"]/title ftcontains "Usability" lowercase
returns false, because the title element doesn't contain "usability" in lower-case characters.
/book[@number="1"]/title ftcontains "usability" case insensitive
returns true, because the character case is not considered.
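The UCC column of the table can be modelled with a short Python sketch. It assumes that Python's `str.lower` and `str.upper` are adequate stand-ins for the case mappings and that plain string equality stands in for the Unicode Codepoint Collation; `case_match` is a hypothetical name.

```python
def case_match(query_word, doc_word, option="case insensitive"):
    """Compare one query word with one document word under a case option."""
    if option == "case insensitive":
        return query_word.lower() == doc_word.lower()
    if option == "case sensitive":
        return query_word == doc_word
    if option == "uppercase":
        # The query word is uppercased, then compared exactly.
        return query_word.upper() == doc_word
    if option == "lowercase":
        # The query word is lowercased, then compared exactly.
        return query_word.lower() == doc_word
    raise ValueError(f"unknown case option: {option}")
```

Under this model, the query word "Usability" with the "lowercase" option does not match a document word written "Usability", but it does match "usability".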
| [155] | FTDiacriticsOption |
::= | ("with" "diacritics") |
FTDiacriticsOption modifies word and phrase matching by specifying how diacritics are considered.
There are four possible diacritics options:
The option "with" "diacritics" matches words and phrases with diacritics, regardless of whether the diacritics are written in the query.
The option "without" "diacritics" matches words and phrases without diacritics, regardless of whether the diacritics are written in the query.
The option "diacritics" "insensitive" matches words and phrases with and without diacritics. Whether diacritics are written in the query or not is not considered.
The option "diacritics" "sensitive" matches words and phrases only if they contain the diacritics as they are written in the query.
The default is "diacritics insensitive".
The following table summarizes the interactions between the diacritics match options and the use of the default collations.
| Default collation options/Diacritics options | UCC (Unicode Codepoint Collation) | CDS (some generic diacritics-sensitive collation) | CDI (some generic diacritics-insensitive collation) |
| insensitive | compare as if with and without | diacritics-insensitive variant of CDS if it exists, else error | CDI |
| sensitive | UCC | CDS | diacritics-sensitive variant of CDI if it exists, else error |
| with diacritics | "resume diacritic insensitive" not in "resume" | "resume diacritic insensitive" not in "resume" | CDI |
| without diacritics | "resume" not in "resume diacritic sensitive" | "resume" not in "resume diacritic sensitive" | CDI |
Note:
In this table, "else error" means "Otherwise, an error is raised: [err:FOCH0002]FO". The phrase "if it exists" is used, because the diacritics-sensitive collation CDS does not always have a diacritics-insensitive variant (and, even if one exists, it may not be possible to determine it algorithmically), and because the diacritics-insensitive collation CDI does not always have a diacritics-sensitive variant (and, even if one exists, it may not be possible to determine it algorithmically).
/book[@number="1"]//editor ftcontains "Vera" with diacritics
returns true, because the editor element contains the word "Vera" with an acute accent.
/book[@number="1"]/editors ftcontains "Véra" without diacritics
returns false, because the editor element does not contain the word "Vera" without an acute accent.
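The "diacritics insensitive" and "diacritics sensitive" options can be approximated in Python with the standard `unicodedata` module. This sketch assumes that removing Unicode combining marks after canonical decomposition is an acceptable notion of "without diacritics"; it does not model the "with diacritics" and "without diacritics" options, and the function names are hypothetical.

```python
import unicodedata


def strip_diacritics(word):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def diacritics_match(query_word, doc_word, option="diacritics insensitive"):
    """Compare two words under one of the two modelled diacritics options."""
    if option == "diacritics insensitive":
        return strip_diacritics(query_word) == strip_diacritics(doc_word)
    if option == "diacritics sensitive":
        # Normalize both sides so equivalent encodings compare equal.
        return (unicodedata.normalize("NFC", query_word)
                == unicodedata.normalize("NFC", doc_word))
    raise ValueError(f"option not modelled in this sketch: {option}")
```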
| [156] | FTStemOption |
::= | ("with" "stemming") | ("without" "stemming") |
FTStemOption modifies word and phrase matching by specifying whether stemming is applied or not.
FTStemOption influences the way FTWords is applied. It produces a disjunction of the query words by expanding the words into the list of words that share the same stem. By definition, the query words are included in that disjunction.
The "with stemming" option specifies that matches may contain words that have the same stem as the words and phrases written in the query. It is implementation-defined what a stem of a word is.
The "without stemming" option specifies that the words and phrases are not stemmed.
It is implementation-defined whether the stemming is based on an algorithm, dictionary, or mixed approach.
The default is "without stemming".
/book[@number="1"]/title ftcontains "improve" with stemming
returns true, because the title of the specified book contains "improving", which has the same stem as "improve".
| [157] | FTThesaurusOption |
::= | ("with" "thesaurus" (FTThesaurusID | "default")) |
| [158] | FTThesaurusID |
::= | "at" StringLiteral ("relationship" StringLiteral)? (FTRange "levels")? |
FTThesaurusOption modifies word and phrase matching by specifying whether a thesaurus is used or not. If thesauri are used, it locates the thesauri by default or URI reference. It also states the relationship to be applied and how many levels within the thesaurus are to be traversed.
FTThesaurusOption influences the way FTWords is applied.
The StringLiteral following the keyword at in FTThesaurusID is of the form of a URI Reference.
Thesauri add related words and phrases to the search. Thus, the user may narrow, broaden, or otherwise modify the search using synonyms, hypernyms (more generic terms), etc. The search is performed as though the user has specified all related search words and phrases in a disjunction (FTOr).
Note:
A thesaurus may be standards-based or locally-defined. It may be a traditional thesaurus, or a taxonomy, soundex, ontology, or topic map. How the thesaurus is represented is implementation-dependent.
FTThesaurusID specifies the relationship sought between words and phrases written in the query and terms in the thesaurus and the number of levels to be queried in hierarchical relationships by including an FTRange "levels". If no levels are specified, the default is to query all levels in hierarchical relationships.
Relationships include, but are not limited to, the relationships and their abbreviations presented in [ISO 2788] and their equivalents in other languages:
equivalence relationships (synonyms): PREFERRED TERM (USE), NONPREFERRED USED FOR TERM (UF);
hierarchical relationships: BROADER TERM (BT), NARROWER TERM (NT), BROADER TERM GENERIC (BTG), NARROWER TERM GENERIC (NTG), BROADER TERM PARTITIVE (BTP), NARROWER TERM PARTITIVE (NTP), TOP Terms (TT); and
associative relationships: RELATED TERM (RT).
The "with thesaurus" option specifies that string matches include words that can be found in one of the specified thesauri.
The "without thesaurus" option specifies that no thesaurus will be used.
The "with default thesaurus" option specifies that a system-defined default thesaurus with a system-defined relationship is used. The default thesaurus may be used in combination with other explicitly specified thesauri.
The default is "without thesaurus".
count(.//book/content ftcontains "duties" with thesaurus at "http://bstore1.example.com/UsabilityThesaurus.xml" relationship "synonyms")>0
returns true, because it finds a content element containing "tasks" which the thesaurus identified as a synonym for "duties".
doc("http://bstore1.example.com/full-text.xml")
/books/book[count(./content ftcontains "web site components" with
thesaurus at "http://bstore1.example.com/UsabilityThesaurus.xml"
relationship "narrower terms" at most 2 levels)>0]
returns book elements, because it finds a content element containing "web site components", and narrower terms "navigation" and "layout".
doc("http://bstore1.example.com/full-text.xml")
/books/book[count(. ftcontains "Merrygould" with thesaurus at
"http://bstore1.example.com/UsabilitySoundex.xml" relationship
"sounds like")>0]
returns a book element containing "Marigold", which sounds like "Merrygould".
| [159] | FTStopwordOption |
::= | ("with" "stop" "words" FTRefOrList FTInclExclStringLiteral*) |
| [160] | FTRefOrList |
::= | ("at" StringLiteral) |
| [161] | FTInclExclStringLiteral |
::= | ("union" | "except") FTRefOrList |
FTStopWordOption controls word matching by specifying whether stop words are used or not.
FTStopWordOption influences the way FTWords is applied.
FTRefOrList specifies the list of stop words either explicitly as a comma-separated list of string literals, or by a URI following the keyword at. If a URI is used, it must point to a sequence of string atoms or nodes of type "xs:string". In both cases, no tokenization is performed on the strings: they are used as they occur in the sequence.
The "with stop words" option specifies that if a word is within the specified collection of stop words, it is removed from the search and any word may be substituted for it. Stop words retain their position numbers and are counted in FTDistance and FTWindow searches.
Stop word lists can be combined using the usual semantics of "except" and "union".
The "with default stop words" option specifies that an implementation-defined collection of stop words is used.
The "without stop words" option specifies that no stop words are used. This is equivalent to specifying an empty list of stop words.
The default is "without stop words".
/book[@number="1"]//p ftcontains "propagation of errors"
with stemming with stop words ("a", "the", "of")
returns true, because the document contains the phrase "propagating few errors".
Note the asymmetry in the stop word semantics: the property of being a stop word is only relevant to query terms, not to document terms. Hence, it is irrelevant for the above-mentioned match whether "few" is a stop word or not, and on the other hand we do not want the query above to match "propagation" followed by 2 stop words, or even a sequence of 3 stop words in the document.
/book[@number="1"]//p ftcontains "propagation of errors" with stemming without stop words
returns false, because "of" is not in the p element between "propagating" and "errors".
doc("http://bstore1.example.com/full-text.xml")
/books/book[count(.//content ftcontains "planning then
conducting" with stop words at
"http://bstore1.example.com/StopWordList.xml")>0]
uses the stop words list specified at the URL. Assuming that the specified stop word list contains the word "then", this query is reduced to a query on the phrase "planning X conducting", allowing any word as a substitute for X. It returns a book element, because its content element contains "planning then conducting". It would also have returned the book if the phrases "planning and conducting" or "planning before conducting" had been in its content.
doc("http://bstore1.example.com/full-text.xml")
/books/book[count(.//content ftcontains "planning then conducting"
with stop words at "http://bstore1.example.com/StopWordList.xml"
except ("the", "then"))>0]
returns books containing "planning then conducting", but does not return books containing "planning and conducting", since it is exempting "then" from being a stop word.
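The stop word behaviour described above, including its asymmetry, can be sketched in Python. The sketch assumes pre-tokenized, case-normalized input and leaves stemming out; `phrase_matches` and the sample content are hypothetical. Stop word lists are modelled as sets, so "union" and "except" become set union and set difference.

```python
def phrase_matches(query_tokens, doc_tokens, stop_words=frozenset()):
    """Return 1-based start positions at which the query phrase matches.

    A query token that is a stop word matches any document token; the
    document tokens themselves are never treated as stop words.
    """
    n = len(query_tokens)
    hits = []
    for start in range(len(doc_tokens) - n + 1):
        window = doc_tokens[start:start + n]
        if all(q in stop_words or q == d
               for q, d in zip(query_tokens, window)):
            hits.append(start + 1)
    return hits


query = "planning then conducting".split()
content = "planning and conducting usability tests".split()  # hypothetical

stop_list = {"the", "then", "a"}           # stands in for the list at the URI
exempted = stop_list - {"the", "then"}     # "except" as set difference
```

With `stop_list`, the query matches the content at position 1; with `exempted`, "then" must match literally, so there is no match.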
| [162] | FTLanguageOption |
::= | "language" StringLiteral |
FTLanguageOption modifies word matching by specifying the language of search words and phrases.
FTLanguageOption influences the way FTWords is applied.
The StringLiteral following the keyword language designates one language. It must either be castable to "xs:language", or be the value "none". Otherwise, an error is raised: [err:XPTY0004]XP.
The "language" option influences tokenization, stemming, and stop words.
If the language "none" option is specified, no language is selected.
The set of valid language identifiers is implementation-defined.
By default, there is no language selected.
/book[@number="1"]//editor ftcontains "salon de the" with default stop words language "fr"
This is an example where the language option is used to select the appropriate stop word list.
| [163] | FTWildCardOption |
::= | ("with" "wildcards") | ("without" "wildcards") |
FTWildCardOption modifies word and phrase matching by specifying whether wildcards are used or not.
FTWildCardOption influences the way FTWords is applied.
In addition to specifying the "with wildcards" option, indicators (represented by periods (.)) and qualifiers are appended to or inserted into words being searched. Zero or more characters replace each indicator and qualifier.
Indicators are mandatory. When the "with wildcards" option is present, one or more periods (.) must be appended at the beginning or end of words or inserted into words. If the period is at the beginning of a word, the wildcard is a prefix wildcard. If the period is at the end of a word, it is a suffix wildcard. If the period is inserted into a word, it is an infix wildcard.
When the "with wildcards" option and one or more periods (.) appended to or inserted into words are present, characters are appended or inserted at each of the periods. Any characters may be appended or inserted except newline characters (#xA), return characters (#xD), and tab characters (#x9). The number of characters depends on the qualifier. Qualifiers available are none, question mark, asterisk, plus sign, and two numbers separated by a comma, both enclosed by curly braces.
If a period is present, but no qualifiers, one character is appended or inserted.
If a period is followed by a question mark (.?), zero or one characters are appended or inserted.
If a period is followed by an asterisk (.*), zero or more characters are appended or inserted.
If a period is followed by a plus sign (.+), one or more characters are appended or inserted.
If a period is followed by two numbers separated by a comma, both enclosed by curly braces (.{n,m}), a specified range of characters is appended or inserted.
The "without wildcards" option finds words without recognizing wildcard indicators and qualifiers. Periods, question marks, asterisks, plus signs, and two numbers separated by a comma, both enclosed by curly braces, are recognized as regular characters.
The default is "without wildcards".
/book[@number="1"]/title ftcontains "improv.*" with wildcards
returns true, because the title element contains "improving".
/book[@number="1"]/title ftcontains ".?site" with wildcards
returns true, because the title element contains "site".
/book[@number="1"]/p ftcontains "w.ll" with wildcards
returns true, because the p element contains "well".
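The wildcard syntax maps closely onto regular expressions. The following Python sketch translates a query word into a pattern for the standard `re` module: each period becomes a character class that excludes newline, return, and tab, and a qualifier that follows a period is carried over unchanged. The function names are hypothetical, and the default case-insensitive matching is assumed.

```python
import re

# Any character except newline (#xA), return (#xD), and tab (#x9).
ANY_CHAR = r"[^\n\r\t]"
# The qualifiers that may follow a period: ?, *, +, or {n,m}.
QUALIFIER = re.compile(r"\?|\*|\+|\{\d+,\d+\}")


def wildcard_to_regex(word):
    """Translate a query word written with wildcards into a compiled regex."""
    parts = []
    i = 0
    while i < len(word):
        if word[i] == ".":
            parts.append(ANY_CHAR)
            i += 1
            qualifier = QUALIFIER.match(word, i)
            if qualifier:
                parts.append(qualifier.group())
                i = qualifier.end()
        else:
            parts.append(re.escape(word[i]))
            i += 1
    return re.compile("".join(parts), re.IGNORECASE)


def wildcard_match(query_word, doc_word):
    """True if the whole document word matches the wildcard query word."""
    return wildcard_to_regex(query_word).fullmatch(doc_word) is not None
```

Under this translation, "improv.*" matches "improving", ".?site" matches "site", and "w.ll" matches "well" but not "wll".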
| [173] | FTIgnoreOption |
::= | "without" "content" UnionExpr |
FTIgnoreOption specifies a set of element nodes whose content is ignored. Ignored nodes are identified by the XQuery expression UnionExpr. Let N1, N2, ..., Nk be the sequence of nodes of the search context. The expression UnionExpr is evaluated in the context of each node Ni being searched. That is, the search context expression of the ftcontains predicate creates a new focus for the evaluation of the UnionExpr given with FTIgnoreOption, similar to the creation of the dynamic context of a path expression E1/E2 or a filter expression E1[E2] (see Section 2.1.2 Dynamic ContextXQ).
Now, let I1, I2, ..., In be the sequence of items that UnionExpr evaluates to. For each Ni (i=1..k) a copy is made that omits each node Ij (j=1..n) that is not Ni. Those copies form the new search context. If UnionExpr evaluates to an empty sequence no nodes are omitted.
In the following fragment, if .//annotation is ignored, "Web Usability" will be found 2 times: once in the title element and once in the editor element. The 2 occurrences in the 2 annotation elements are ignored. On the other hand, "expert" will not be found, as it appears only in an annotation element.
<book>
<title>Web Usability and Practice</title>
<author>Montana <annotation> this author is an expert in Web Usability</annotation>
Marigold
</author>
<editor>Véra Tudor-Medina on Web <annotation> best editor on Web Usability</annotation>
Usability
</editor>
</book>
By default, no element content is ignored.
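The effect of ignoring .//annotation in the fragment above can be checked with a Python sketch based on the standard `xml.etree.ElementTree` module. It is a simplification: ignored nodes are identified by element name rather than by an arbitrary UnionExpr, tokens are taken to be runs of word characters, and the function names are hypothetical. Text that follows an ignored element (its tail) is kept, because it belongs to the parent.

```python
import re
import xml.etree.ElementTree as ET

FRAGMENT = """<book>
<title>Web Usability and Practice</title>
<author>Montana <annotation> this author is an expert in Web Usability</annotation>
Marigold
</author>
<editor>Véra Tudor-Medina on Web <annotation> best editor on Web Usability</annotation>
Usability
</editor>
</book>"""


def string_value(node, ignored_tag=None):
    """String value of node, omitting the content of ignored elements."""
    parts = [node.text or ""]
    for child in node:
        if child.tag != ignored_tag:
            parts.append(string_value(child, ignored_tag))
        parts.append(child.tail or "")  # the tail belongs to the parent
    return "".join(parts)


def count_phrase(text, phrase):
    """Count case-insensitive occurrences of a phrase in tokenized text."""
    tokens = re.findall(r"\w+", text.lower())
    words = phrase.lower().split()
    n = len(words)
    return sum(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


book = ET.fromstring(FRAGMENT)
searched = string_value(book, ignored_tag="annotation")
```

In this model, `count_phrase(searched, "Web Usability")` is 2 and `count_phrase(searched, "expert")` is 0, in line with the description above.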
This section describes the formal semantics of XQuery 1.0 and XPath 2.0 Full-Text. The figure below shows how XQuery 1.0 and XPath 2.0 Full-Text integrates with XQuery 1.0 and XPath 2.0.
The following diagram represents the interaction of XQuery 1.0 and XPath 2.0 Full-Text with the rest of the XQuery 1.0 and XPath 2.0 languages. It specifies how full-text expressions can be nested within XQuery 1.0 and XPath 2.0 expressions and vice versa.
Arrow 1 represents the composability of the XQuery 1.0 and XPath 2.0 expressions by showing that XQuery 1.0 expressions are nested inside FTSelections and evaluated to a sequence of items.
Arrow 2 shows how regular XQuery expressions can be nested inside FTSelections by evaluating them to a sequence of items and then converting them to tokenized text. The process is described in Nested XQuery and XPath Expressions.
Arrow 3 represents the composability of FTSelections. The composability is achieved by evaluating the FTSelections to AllMatches. Each FTSelection operates on zero or more AllMatches and returns AllMatches. The process is described in the Evaluation of FTSelections section.
Arrow 4 shows how the result of the evaluation of XQuery 1.0 and XPath 2.0 Full-Text and scoring expressions needs to be integrated in the XPath and XQuery model. The section XQuery 1.0 and XPath 2.0 Full-Text and scoring expressions describes how this is achieved.
The functions and schemas defined in this section are considered to be within the fts: namespace. These functions and schemas are used only for describing the semantics. There is no requirement that these functions and schemas be implemented, so no URI is associated with the fts: prefix.
XQuery 1.0 and XPath 2.0 expressions can be nested inside FTContainsExprs.
Nested XQuery 1.0 and XPath 2.0 expressions are evaluated to a sequence of items before the evaluation of FTContainsExpr. The sequence of items must satisfy certain constraints depending on the context in which it is used. These constraints are described below.
Full-text queries are performed on text which has been tokenized, i.e., broken into a sequence of words, units of punctuation, and spaces. The tokenization is applied on the string value of the evaluation of the left-hand side of the FTContainsExpr expression.
The XQuery 1.0 and XPath 2.0 expression nested inside an FTWords must evaluate to a sequence of string values after applying atomization. Otherwise, an error is raised: [err:XPTY0004]XP. Then, FTWords performs tokenization on the string values from the sequence.
The XQuery 1.0 and XPath 2.0 expression, or expressions in the case of a "from-to" range must evaluate to a singleton sequence of integers after applying atomization. Otherwise, an error is raised: [err:XPTY0004]XP. The resulting integer values are treated as boundaries for the range.
The XQuery 1.0 and XPath 2.0 expression must evaluate to a sequence of string values after applying atomization. Otherwise, an error is raised: [err:XPTY0004]XP. The resulting string values are treated as stop words for which any word may be substituted during string matching.
The XQuery 1.0 and XPath 2.0 expression must evaluate to a sequence of string values after applying atomization. Otherwise, an error is raised: [err:XPTY0004]XP. The resulting string values are treated as names of thesauri to use during string matching.
The XQuery 1.0 and XPath 2.0 expression must evaluate to either an empty sequence or a singleton sequence of a string value after applying atomization. Otherwise, an error is raised: [err:XPTY0004]XP. The resulting string value is treated as a language identifier. It specifies the language of the words and phrases in the query.
[Definition: Tokenization is the process of converting a string to a sequence of TokenInfos.]
A [Definition: TokenInfo is the identity of a word occurrence inside an XML document. ] Each TokenInfo is associated with:
the word it identifies: word
the unique identifier that captures the relative position of the word in the document order: pos
the relative position of the sentence containing the word: sentence
the relative position of the paragraph containing the word: para
The tokenization is performed by the formal semantics functions.
function fts:getTokenInfo(
$searchContext as node(),
$matchOptions as fts:FTMatchOptions,
$searchToken as fts:TokenInfo)
as fts:TokenInfo*
The above function returns the TokenInfos in nodes in $searchContext that match the search string in $searchToken when using the match options in $matchOptions. The match options that occur at the beginning of the list should be applied before match options that occur later in the list.
function fts:getSearchTokenInfo(
$searchString as xs:string,
$matchOptions as fts:FTMatchOptions)
as fts:TokenInfo*
The above function tokenizes the search string $searchString and returns a sequence of TokenInfos that describes the sequence of tokens in the search string. If $searchString is the empty string, the function is required to return the empty sequence.
This document fragment is the source document for examples in this section. Tokenization is implementation-defined. A sample tokenization is used for the examples in this section. The results might be different for other tokenizations.
Unless stated otherwise, the results assume a case-insensitive match.
<offers>
<offer id="1000" price="10000">
Ford Mustang 2000, 65K, excellent condition, runs
great, AC, CC, power all
</offer>
<offer id="1001" price="8000">
Honda Accord 1999, 78K, A/C, cruise control, runs
and looks great, excellent condition
</offer>
<offer id="1005" price="5500">
Ford Mustang, 1995, 150K highway mileage, no rust,
excellent condition
</offer>
</offers>
In this sample tokenization, words are delimited by punctuation and whitespace symbols.
The word "Ford" will be assigned a TokenInfo with relative position of 1.
The word "Mustang" will be assigned a TokenInfo with relative position of 2.
The word "2000" will be assigned a TokenInfo with a relative position of 3.
Relative position numbers are assigned sequentially through the end of the document.
The relative positions of the TokenInfos are shown below in parentheses.
<offers>
<offer id="1000" price="10000">
Ford(1) Mustang(2) 2000(3), 65K(4), excellent(5)
condition(6), runs(7) great(8), AC(9), CC(10),
power(11) all(12)
</offer>
<offer id="1001" price="8000">
Honda(13) Accord(14) 1999(15), 78K(16), A(17)/C(18),
cruise(19) control(20), runs(21) and(22) looks(23)
great(24), excellent(25) condition(26)
</offer>
<offer id="1005" price="5500">
Ford(27) Mustang(28), 1995(29), 150K(30) highway(31)
mileage(32), no(33) rust(34), excellent(35)
condition(36)
</offer>
</offers>
The relative positions of paragraphs are determined similarly. In this sample tokenization, the paragraph delimiters are start tags, end tags, and end of line characters.
The words in the first element will be assigned relative paragraph number 1.
The words from the next element will be assigned relative paragraph number 2.
Relative paragraph numbers are assigned sequentially through the end of the document.
The relative positions of sentences are determined similarly using sentence delimiters.
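The sample tokenization can be reproduced with a short Python sketch. It assumes that a word is a run of ASCII letters and digits, and it assigns one paragraph number per offer element, which is a simplification of the delimiter rule given above. Sentence numbers are not modelled. The class mirrors the TokenInfo definition; the constant `OFFERS` and the function `tokenize` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenInfo:
    word: str
    pos: int   # relative word position in document order
    para: int  # relative paragraph position


# The text content of the three offer elements of the sample document.
OFFERS = [
    "Ford Mustang 2000, 65K, excellent condition, runs great, AC, CC, power all",
    "Honda Accord 1999, 78K, A/C, cruise control, runs and looks great, "
    "excellent condition",
    "Ford Mustang, 1995, 150K highway mileage, no rust, excellent condition",
]


def tokenize(paragraphs):
    """Assign sequential word positions across all paragraphs."""
    tokens = []
    pos = 0
    for para, text in enumerate(paragraphs, start=1):
        for word in re.findall(r"[A-Za-z0-9]+", text):
            pos += 1
            tokens.append(TokenInfo(word, pos, para))
    return tokens


TOKENS = tokenize(OFFERS)
MUSTANG_POSITIONS = [t.pos for t in TOKENS if t.word == "Mustang"]
```

With this tokenization, "Mustang" is found at positions 2 and 28, and "A/C" yields two tokens at positions 17 and 18.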
The "sequence of nodes" in the XQuery 1.0 and XPath 2.0 Data Model is inadequate to support fully composable FTSelections. Full-text operations, such as FTSelections, operate on linguistic units, such as positions of words, which are not captured in the XQuery 1.0 and XPath 2.0 Data Model (XDM).
XQuery 1.0 and XPath 2.0 Full-Text adds relative word, sentence, and paragraph position numbers via AllMatches. AllMatches make FTSelections fully composable.
An [Definition: AllMatches describes the possible results of an FTSelection.] The UML Static Class diagram of AllMatches is shown in the diagram below.
The AllMatches object contains zero or more Matches.
Each [Definition: Match describes one result to the FTSelection.] The result is described in terms of zero or more StringIncludes and zero or more StringExcludes.
[Definition: StringIncludes and StringExcludes are known collectively as StringMatch, which describes a possible match of a search token with a word in a document.] The queryString attribute of StringMatch stores the search token. The queryPos attribute specifies the position of this search token in the query. This attribute is needed for FTOrders. The matched document word is described in the TokenInfo associated with the StringMatch.
[Definition: A StringInclude is a StringMatch that describes a TokenInfo that must be contained in the document.]
[Definition: A StringExclude is a StringMatch that describes a TokenInfo that must not be contained in the document.]
Intuitively, AllMatches specifies the TokenInfos that a node must contain and must not contain to satisfy an FTSelection.
The AllMatches structure resembles the Disjunctive Normal Form (DNF) in propositional and first-order logic. The AllMatches is a disjunction of Matches. Each Match is a conjunction of StringIncludes and StringExcludes.
The simplest example of an FTSelection is an FTWords such as "Mustang". The AllMatches corresponding to this FTWords is given below.
As shown, the AllMatches consists of two Matches. Each Match represents one possible result of the FTWords "Mustang". The result described by the first Match, represented as a StringInclude, contains the word "Mustang" at position 2. The result described by the second Match contains the word "Mustang" at position 28.
A more complex example of an FTSelection is an FTWords such as "Ford Mustang". The AllMatches for this FTWords is given below.
There are two possible results for this FTWords, and these are represented by the two Matches. Each of the Matches requires two words to be matched. The first Match is obtained by matching "Ford" at position 1 and matching "Mustang" at position 2. Similarly, the second Match is obtained by matching "Ford" at position 27 and "Mustang" at position 28.
An even more complex example of an FTSelection is an FTSelection such as "Mustang" && ! "rust" that searches for "Mustang" but not "rust". The AllMatches for this FTSelection is given below.
This example introduces StringExclude. StringExclude corresponds to negation in DNF. It specifies that the result described by the corresponding Match must not match the word at the specified position. In this example, the first Match specifies that "Mustang" is matched at position 2, and that the word "rust" at position 34 is not matched.
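The three examples above can be modeled as plain data to make the DNF reading concrete. The following is a minimal Python sketch, not part of the formal semantics: the class names, the `node_satisfies` helper, and the queryPos values are assumptions made for illustration, and only the positions stated in the text (1, 2, 27, 28, 34) are taken from the examples. A node is represented simply as the set of word positions it contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringMatch:
    query_string: str  # the search token (queryString)
    query_pos: int     # position of the token in the query (queryPos, assumed values)
    pos: int           # position of the matched document word (TokenInfo pos)


@dataclass(frozen=True)
class Match:
    includes: tuple = ()  # StringIncludes: positions the node must contain
    excludes: tuple = ()  # StringExcludes: positions the node must not contain


def node_satisfies(all_matches, node_positions):
    """AllMatches is a disjunction of Matches; each Match is a conjunction."""
    return any(
        all(si.pos in node_positions for si in m.includes)
        and all(se.pos not in node_positions for se in m.excludes)
        for m in all_matches
    )


# FTWords "Mustang": two Matches, one StringInclude each.
mustang = [
    Match(includes=(StringMatch("Mustang", 1, 2),)),
    Match(includes=(StringMatch("Mustang", 1, 28),)),
]

# FTWords "Ford Mustang": two Matches, two StringIncludes each.
ford_mustang = [
    Match(includes=(StringMatch("Ford", 1, 1), StringMatch("Mustang", 2, 2))),
    Match(includes=(StringMatch("Ford", 1, 27), StringMatch("Mustang", 2, 28))),
]

# First Match of "Mustang" && ! "rust" as described in the text.
mustang_not_rust_first = Match(
    includes=(StringMatch("Mustang", 1, 2),),
    excludes=(StringMatch("rust", 2, 34),),
)
```

Under this model, a node covering positions 1 to 10 satisfies the first Match of the third example, while a node covering positions 1 to 40 does not, because it also contains the excluded position 34.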
AllMatches has a well-defined hierarchical structure. Therefore, the AllMatches can be easily modeled in XML. This XML representation and those which follow formally describe the semantics of FTSelections. For example, the XML representation of AllMatches formally specifies how an FTSelection operates on zero or more AllMatches to produce a resulting AllMatches.
The XML schema for representing AllMatches is given below.
<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    elementFormDefault="qualified"
    attributeFormDefault="unqualified">
  <xs:complexType name="AllMatches">
    <xs:sequence>
      <xs:element name="match"
                  type="fts:Match"
                  minOccurs="0"
                  maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="stokenNum" type="xs:string" use="required" />
  </xs:complexType>
  <xs:complexType name="Match">
    <xs:sequence>
      <xs:element name="stringInclude"
                  type="fts:StringMatch"
                  minOccurs="0"
                  maxOccurs="unbounded"/>
      <xs:element name="stringExclude"
                  type="fts:StringMatch"
                  minOccurs="0"
                  maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="StringMatch">
    <xs:sequence>
      <xs:element name="tokenInfo" type="fts:TokenInfo"/>
    </xs:sequence>
    <xs:attribute name="queryString"
                  type="xs:string"
                  use="required"/>
    <xs:attribute name="queryPos"
                  type="xs:integer"
                  use="required"/>
  </xs:complexType>
  <xs:complexType name="TokenInfo">
    <xs:attribute name="word"
                  type="xs:string"
                  use="required"/>
    <xs:attribute name="pos"
                  type="xs:integer"
                  use="required"/>
    <xs:attribute name="para"
                  type="xs:integer"
                  use="required"/>
    <xs:attribute name="sentence"
                  type="xs:integer"
                  use="required"/>
  </xs:complexType>
</xs:schema>
The stokenNum attribute in AllMatches is related to the representation of the semantics as XQuery functions. Therefore, it is not considered part of the AllMatches model. The stokenNum attribute stores the number of search tokens used when evaluating the AllMatches. This value is used to compute the correct value for the queryPos attribute in new StringMatches.
[Definition: A Match M is in Match Normal Form if and only if it satisfies the following properties:]
[Definition: (Match minimality) M does not contain any duplicate StringIncludes or duplicate StringExcludes], and
[Definition: (Match non-contradiction) M does not contain a StringInclude and a StringExclude containing the same TokenInfo.]
Note: Two StringMatches are duplicates of each other if they have the same queryPos attribute value and their TokenInfos have the same pos attribute value. Testing these attributes is sufficient, since all attributes in a TokenInfo are functionally dependent on the pos attribute, and queryString depends on queryPos.
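The two properties of Match Normal Form can be illustrated with a short Python sketch. It is an informal model under stated assumptions, not the normative definition: each StringMatch is reduced to a `(query_pos, pos)` pair, which the note above says is sufficient to detect duplicates, and the function name `normalize_match` is hypothetical. A contradictory Match is reported as `None`.

```python
def normalize_match(includes, excludes):
    """Return (includes, excludes) without duplicates, or None if contradictory.

    Each StringMatch is modeled as a (query_pos, pos) pair (an assumption
    of this sketch; the real structure also carries queryString and the
    remaining TokenInfo attributes).
    """
    def dedupe(seq):
        # Keep the first occurrence of each pair, preserving order.
        seen, out = set(), []
        for sm in seq:
            if sm not in seen:
                seen.add(sm)
                out.append(sm)
        return out

    inc, exc = dedupe(includes), dedupe(excludes)
    # Non-contradiction: the same pair must not be both included and excluded.
    if set(inc) & set(exc):
        return None
    return inc, exc
```

For example, a Match that lists the pair `(1, 2)` twice among its StringIncludes is reduced to a single occurrence, and a Match that both includes and excludes `(1, 2)` is rejected.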
[Definition: (Match subsumption) We say that a Match M1 subsumes a Match M2 if both of the following hold:]
The set of StringIncludes in M1 is a subset of the set of StringIncludes in M2, and
The set of StringExcludes in M1 is a subset of the set of StringExcludes in M2.
[Definition: An AllMatches object A is in AllMatches Normal Form if and only if it satisfies the following properties. ]
Every Match M in A is in Match Normal Form, and
No Match M contained in A is subsumed by another Match M' contained in A.
In other words, in normal-form AllMatches the representations of the contained Matches can be viewed as sets, as opposed to multi-sets, of StringIncludes and StringExcludes. The representations of such AllMatches themselves can be considered as sets of alternatives of Matches, where Matches that are subsumed by others need not be represented, because such subsumed Matches only embody stronger conditions.
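The set-based view described above can be sketched in Python. This is an illustration under assumptions rather than a restatement of the formal semantics: each Match is modeled as a pair of frozensets of `(query_pos, pos)` pairs, the function name `eliminate_redundant` is hypothetical, and the sketch additionally collapses identical Matches into one, which is a choice made here for the example. A Match is dropped when another, different Match carries only a subset of its conditions, since the dropped Match then embodies stronger conditions.

```python
def eliminate_redundant(matches):
    """Drop every Match whose conditions are a superset of another Match's.

    Each Match is (includes, excludes), both frozensets of
    (query_pos, pos) pairs (an assumption of this sketch).
    """
    # Collapse identical Matches first, keeping the original order.
    unique = list(dict.fromkeys(matches))

    def weaker_or_equal(a, b):
        # True if every condition of a also appears in b.
        return a[0] <= b[0] and a[1] <= b[1]

    return [
        m for m in unique
        if not any(o != m and weaker_or_equal(o, m) for o in unique)
    ]


m_a = (frozenset({(1, 2)}), frozenset())
m_b = (frozenset({(1, 2), (2, 5)}), frozenset())   # stronger than m_a
m_c = (frozenset({(1, 28)}), frozenset({(2, 34)}))

normal_form = eliminate_redundant([m_a, m_b, m_c, m_a])
```

Here `m_b` is removed because `m_a` already covers every result that `m_b` describes, and the repeated `m_a` is kept only once.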
normalizeAllMatches function
The helper function fts:normalizeAllMatches() is used to transform an AllMatches object into AllMatches Normal Form. The denotational semantics of FTSelections defined below ensures, as an invariant, that any AllMatches produced as the result of an FTSelection is in AllMatches Normal Form.
The normalization of an AllMatches is conducted by normalizing each contained Match and then eliminating any Match subsumptions.
declare function fts:normalizeAllMatches(
      $allMatches as fts:AllMatches)
   as element(allMatches, fts:AllMatches) {
   (: normalize each Match; a contradictory Match yields the empty sequence :)
   let $mSeq1 := for $m in $allMatches/match
                 return fts:normalizeMatch($m)
   (: drop Matches that are made redundant by another Match :)
   let $mSeq2 := fts:eliminateMatchSubsumption($mSeq1)
   return <allMatches stokenNum="{$allMatches/@stokenNum}">
             {$mSeq2}
          </allMatches>
};
The normalization of a Match is conducted by eliminating duplicate StringMatches and then discarding the Match if it is contradictory.
declare function fts:normalizeMatch(
      $match as fts:Match)
   as element(match, fts:Match)? {
   (: remove duplicate StringMatches first :)
   let $m1 := <match>
                 {fts:eliminateStrMatchDupl($match/*, ())}
              </match>
   (: a contradictory Match is discarded, hence the optional return type :)
   return if (fts:isMatchContradictory($m1)) then ()
          else $m1
};
declare function fts:eliminateStrMatchDupl(
      $smSeq as fts:StringMatch*,
      $resultSoFar as fts:StringMatch*)
   as fts:StringMatch* {
   if (fn:count($smSeq) eq 0) then $resultSoFar
   (: skip $smSeq[1] if an equivalent StringMatch has already been kept :)
   else if (fts:containsStrMatch($resultSoFar, $smSeq[1]))
   then fts:eliminateStrMatchDupl($smSeq[position() ge 2],
                                  $resultSoFar)
   else fts:eliminateStrMatchDupl($smSeq[position() ge 2],
                                  ($resultSoFar, $smSeq[1]))
};
declare function fts:containsStrMatch(
      $smSeq as fts:StringMatch*,
      $strMatch as fts:StringMatch)
   as xs:boolean {
   if (fn:count($smSeq) eq 0) then fn:false()
   (: same kind (include or exclude), same document position,
      and same query position :)
   else if (($smSeq[1] instance of element(stringInclude))
               eq ($strMatch instance of element(stringInclude))
            and $smSeq[1]/tokenInfo/@pos
               eq $strMatch/tokenInfo/@pos
            and $smSeq[1]/@queryPos
               eq $strMatch/@queryPos)
   then fn:true()
   else fts:containsStrMatch($smSeq[position() ge 2],
                             $strMatch)
};
declare function fts:isMatchContradictory(
      $match as fts:Match)
   as xs:boolean {
   some $si in $match/stringInclude
   satisfies
      let $se := <stringExclude queryPos="{$si/@queryPos}"
                                queryString="{$si/@queryString}">
                    {$si/tokenInfo}
                 </stringExclude>
      return fts:containsStrMatch($match/stringExclude,
                                  $se)
};
The elimination of Match subsumption is defined as follows.
declare function fts:eliminateMatchSubsumption(
      $matches as fts:Match*)
   as fts:Match* {
   for $m at $p in $matches
   (: $m is redundant if the StringMatches of another Match
      form a subset of the StringMatches of $m :)
   let $isNotMin :=
      some $m1 in $matches[position() ne $p]
      satisfies fts:isStrMatchSubset($m1/*, $m/*)
   where fn:not($isNotMin)
   return $m
};
declare function fts:isStrMatchSubset(
      $smSeq1 as fts:StringMatch*,
      $smSeq2 as fts:StringMatch*)
   as xs:boolean {
   if (fn:count($smSeq1) eq 0) then fn:true()
   else if (fts:containsStrMatch($smSeq2, $smSeq1[1]))
   then fts:isStrMatchSubset($smSeq1[position() ge 2],
                             $smSeq2)
   else fn:false()
};
FTSelections are fully composable and may be nested arbitrarily under other FTSelections. Each FTSelection may be associated with match options (such as stemming and stop words) and score weights. Since score weights are solely interpreted by the formal semantics scoring function, they do not influence the semantics of FTSelections. Therefore, score weights are not considered in the formal semantics.
The XML representation of the FTSelections used in the fts:evaluate function closely follows the grammar of the language. It can be viewed as an XML representation of an abstract syntax tree (AST) of a parsed full-text query. Every FTSelection is represented as an XML element. Every nested FTSelection is represented as a nested descendant element. For binary FTSelections, e.g., FTAnd, the nested FTSelections are represented in <left> and <right> descendant elements. For unary FTSelections, a <selection> descendant element is used. Additional characteristics of FTSelections, e.g., the distance unit for FTDistance, are stored in attributes.
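As an informal illustration of this nesting, the following Python sketch builds the element tree for the earlier example "Mustang" && ! "rust" with the standard `xml.etree.ElementTree` module. It is a simplified sketch under assumptions: the helper names `ft_words`, `binary`, and `unary` are hypothetical, namespaces are omitted, and attributes that the schema marks as required (such as the FTWords type and the remaining TokenInfo attributes) are left out, so the result is not schema-valid.

```python
import xml.etree.ElementTree as ET


def ft_words(word):
    # FTWords with a single searchToken; required attributes are omitted here.
    el = ET.Element("FTWords")
    ET.SubElement(el, "searchToken", {"word": word})
    return el


def binary(name, left, right):
    # Binary FTSelections nest their operands under <left> and <right>.
    el = ET.Element(name)
    ET.SubElement(el, "left").append(left)
    ET.SubElement(el, "right").append(right)
    return el


def unary(name, selection):
    # Unary FTSelections nest their operand under <selection>.
    el = ET.Element(name)
    ET.SubElement(el, "selection").append(selection)
    return el


# AST for: "Mustang" && ! "rust"
query = binary("FTAnd",
               ft_words("Mustang"),
               unary("FTUnaryNot", ft_words("rust")))

print(ET.tostring(query, encoding="unicode"))
```

The printed tree has FTAnd at the root, the FTWords for "Mustang" under `left`, and an FTUnaryNot wrapping the FTWords for "rust" under `right`, mirroring the shape of the parsed query.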
<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    elementFormDefault="qualified"
    attributeFormDefault="unqualified">
  <xs:include schemaLocation="AllMatches.xsd" />
  <xs:include schemaLocation="MatchOptions.xsd" />
  <xs:complexType name="FTSelection">
    <xs:sequence>
      <xs:choice>
        <xs:element name="FTWords" type="fts:FTWords"/>
        <xs:element name="FTAnd" type="fts:FTAnd"/>
        <xs:element name="FTOr" type="fts:FTOr"/>
        <xs:element name="FTUnaryNot" type="fts:FTUnaryNot"/>
        <xs:element name="FTMildNot" type="fts:FTMildNot"/>
        <xs:element name="FTOrder" type="fts:FTOrder"/>
        <xs:element name="FTScope" type="fts:FTScope"/>
        <xs:element name="FTContent" type="fts:FTContent"/>
        <xs:element name="FTDistance" type="fts:FTDistance"/>
        <xs:element name="FTWindow" type="fts:FTWindow"/>
        <xs:element name="FTTimes" type="fts:FTTimes"/>
      </xs:choice>
      <xs:element name="matchOption"
                  type="fts:FTMatchOption"
                  minOccurs="0"/>
      <xs:element name="weight"
                  type="xs:float"
                  minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTWords">
    <xs:sequence>
      <xs:element name="searchToken"
                  type="fts:TokenInfo"
                  minOccurs="0"
                  maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="type"
                  type="fts:FTWordsType"
                  use="required"/>
  </xs:complexType>
  <xs:complexType name="FTAnd">
    <xs:sequence>
      <xs:element name="left" type="fts:FTSelection"/>
      <xs:element name="right" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTOr">
    <xs:sequence>
      <xs:element name="left" type="fts:FTSelection"/>
      <xs:element name="right" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTUnaryNot">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTMildNot">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTOrder">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTScope">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
    <xs:attribute name="type"
                  type="fts:ScopeType"
                  use="required"/>
    <xs:attribute name="scope"
                  type="fts:ScopeSelector"
                  use="required"/>
  </xs:complexType>
  <xs:complexType name="FTContent">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
<xs:attribute name="type"
type="fts:ContentMatchTyp