XQuery 1.0 and XPath 2.0 Full-Text

WD-xquery-full-text-20050915

W3C Working Draft

15 September 2005
http://www.w3.org/TR/2005/WD-xquery-full-text-20050915/
http://www.w3.org/TR/xquery-full-text/
http://www.w3.org/TR/2005/WD-xquery-full-text-20050404/
http://www.w3.org/TR/2004/WD-xquery-full-text-20040709/
Sihem Amer-Yahia, AT&T Labs - Research, sihem@research.att.com
Chavdar Botev, Invited Expert, cbotev@cs.cornell.edu
Stephen Buxton, Oracle Corporation, stephen.buxton@oracle.com
Pat Case, Library of Congress, pcase@crs.loc.gov
Jochen Doerre, IBM, doerre@de.ibm.com
Darin McBeath, Elsevier, D.McBeath@elsevier.com
Michael Rys, Microsoft, mrys@microsoft.com
Jayavel Shanmugasundaram, Invited Expert, jai@cs.cornell.edu

This document defines the syntax and formal semantics of XQuery 1.0 and XPath 2.0 Full-Text, a language that extends XQuery 1.0 and XPath 2.0 with full-text search capabilities.

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document has been produced following the procedures set out for the W3C Process. This document was produced through the efforts of the XML Query Working Group and the XSL Working Group (both part of the XML Activity). It is designed to be read in conjunction with the following documents: W3C XQuery and XPath Full-Text Requirements and the W3C XQuery Full-Text Use Cases.

This is the third version of this document. Since the last version, technical and editorial changes have been made to all the sections of the document. Among the most significant changes are the introduction of a new, richer scoring syntax and new semantics and syntax for FTTimes, FTIgnore, FTCase, FTDiacritics, and FTWindow. Numerous issues were closed and four new issues were opened. An appendix was added listing the static context components of the full-text extensions. See the new Appendix (Change Log) for more information on these and other changes.

The text of the XQuery functions used to define the semantics has not been completely syntax checked.

This is a public W3C Working Draft for review by W3C members and other interested parties. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

Public comments on this document and its open issues are invited. Comments should be entered into the last-call issue tracking system for this specification (instructions can be found at http://www.w3.org/XML/2005/04/qt-bugzilla). If access to that system is not feasible, you may send your comments to the W3C mailing list, public-qt-comments@w3.org (http://lists.w3.org/Archives/Public/public-qt-comments/), with "[FT]" at the beginning of the subject field of email messages involving such comments.

The patent policy for this document is specified in the 5 February 2004 W3C Patent Policy. Patent disclosures relevant to this specification may be found on the XML Query Working Group's patent disclosure page and the XSL Working Group's patent disclosure page. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy.

SA January 2004: First version of document before Feb F2F

SA 26 February 2004: Second version of document before Feb F2F meetings.

Introduction

This document defines the language and the formal semantics of XQuery 1.0 and XPath 2.0 Full-Text. This language is designed to meet the requirements identified in W3C XQuery and XPath Full-Text Requirements and the W3C XQuery Full-Text Use Cases.

XQuery 1.0 and XPath 2.0 Full-Text extends the syntax and semantics of XQuery 1.0 and XPath 2.0.

Full-Text Search and XML

XML documents may contain highly-structured data (numbers, dates), unstructured data (untagged free-flowing text), and semi-structured data (text with embedded tags). Where a document contains unstructured or semi-structured data, it is important to be able to search that data using Information Retrieval techniques such as full-text search. Full-text search is different from substring search in many ways:

A full-text search searches for phrases (a sequence of words) rather than substrings. A substring search for news items that contain the string "lease" will return a news item that contains "Foobar Corporation releases the 20.9 version ...". A full-text search for the phrase "lease" will not.

There is an expectation that a full-text search will support language-based and token-based searches, which substring search cannot. An example of a language-based search is 'find me all the news items that contain a word with the same linguistic stem as "mouse"' (finds "mouse" and "mice"). An example of a token-based search is 'find me all the news items that contain the word "XML" within 3 words (tokens) of "Query"'.

Full-text search is subject to the vagaries and nuances of language. The results it returns are often of varying usefulness. When you search a web site for all cameras that cost less than $100, this is an exact search. There is a set of cameras that match this search, and a set that do not. Similarly, when you do a string search across news items for "mouse", there is only one expected result set. When you do a full-text search for, say, all the news items that contain the word "mouse", you probably expect to find news items with the word "mice", and possibly "rodents" (or possibly "computers"!). But not all results are equal: some results are more "mousey" than others. Because full-text search can be inexact, we have the notion of score or relevance: we generally expect to see the most relevant results at the top of the results list. Of course, relevance is in the eye of the beholder. Note: as XQuery/XPath evolves, it may apply the notion of score to structured search. For example, when making travel plans or shopping for cameras, it is sometimes more useful to get an ordered list of near-matches. If XQuery/XPath defines a generalized inexact match, we assume that XQuery/XPath can utilize the scoring framework provided by the full-text language.
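The difference between substring search and word-based search can be illustrated with a small Python sketch. It is not part of this specification: the tokenize function is a hypothetical, deliberately naive tokenizer (lowercased runs of letters and digits), and the function names are assumptions made for illustration only.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def substring_search(text, needle):
    # Plain substring containment, ignoring case.
    return needle.lower() in text.lower()

def word_search(text, word):
    # Token-based containment: the query must equal a whole token.
    return word.lower() in tokenize(text)

news = "Foobar Corporation releases the 20.9 version"
print(substring_search(news, "lease"))  # True: "lease" occurs inside "releases"
print(word_search(news, "lease"))       # False: no token equals "lease"
print(word_search(news, "releases"))    # True
```

A real implementation would use a language-aware tokenizer; the sketch only shows why the two kinds of search disagree on the "lease" example.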

The following definitions apply to full-text search:

As XML becomes mainstream, users expect to be able to store and search all their documents in XML. This requires a standard way to do full-text search, as well as structured searches, against XML documents. A similar requirement for full-text search led ISO to define the SQL/MM-FT standard. SQL/MM-FT defines extensions to SQL to express full-text queries, providing functionality similar to that of this full-text language extension to XQuery 1.0/XPath 2.0.

Full-text queries are performed on text which has been tokenized, i.e., broken into a sequence of words, units of punctuation, and spaces.

A word is defined as any character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be queried. Each instance of a word consists of one or more consecutive characters. Beyond that, words are implementation defined. Note that consecutive words need not be separated by either punctuation or space, and words may overlap. A phrase is a sequence of ordered words which can contain any number of words.

Tokenization enables functions and operators which work with the relative positions of words (e.g., proximity operators). It also uniquely identifies sentences and paragraphs in which words appear. Tokenization also enables functions and operators which operate on a part or the root of the word (e.g., wildcards, stemming). Whatever a tokenizer for a particular language chooses to do, it must preserve the containment hierarchy: paragraphs contain sentences which contain words. The tokenizer has to give the same answers for two equal strings, i.e., it should identify the same tokens. Everything else is implementation-defined. Sentences and paragraphs are important concepts in Western languages (which belong to a rather important market for a great many implementors of XQuery). So, we choose to keep the full-text primitives that make use of them. The specification does not want to impose any requirements on cross-language tokenizers.
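As an illustration of the containment hierarchy, the following Python sketch assigns each word a position, a sentence number, and a paragraph number. It is only one possible tokenizer under stated assumptions: paragraphs are separated by blank lines, sentences end with ".", "!", or "?", and words are runs of letters and digits. The Token type and the tokenize function are hypothetical names, not defined by this specification.

```python
import re
from typing import List, NamedTuple

class Token(NamedTuple):
    word: str        # lowercased word
    position: int    # absolute word position in the text
    sentence: int    # number of the containing sentence
    paragraph: int   # number of the containing paragraph

def tokenize(text: str) -> List[Token]:
    tokens: List[Token] = []
    position = 0
    sentence_no = 0
    # Assumption: paragraphs are separated by blank lines.
    for paragraph_no, paragraph in enumerate(re.split(r"\n\s*\n", text.strip())):
        # Assumption: a sentence ends with ".", "!" or "?" followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            words = re.findall(r"[A-Za-z0-9]+", sentence)
            if not words:
                continue
            for word in words:
                tokens.append(Token(word.lower(), position, sentence_no, paragraph_no))
                position += 1
            sentence_no += 1
    return tokens

text = "The usability of a Web site. A Web site should help.\n\nThis book is approved."
for token in tokenize(text):
    print(token)
```

Because sentence numbers are assigned inside the paragraph loop, every sentence belongs to exactly one paragraph, and calling tokenize twice on equal strings yields equal token sequences.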

This specification recognizes that some XML elements represent semantic markup, e.g., <title>. Others represent formatting markup, e.g., to indicate bold. Semantic markup serves well as token boundaries, while formatting markup sometimes does not. Implementations are free to provide ways to differentiate between the markup's effect on token boundaries during tokenization in an implementation-defined or implementation-dependent way.

This specification focuses on functionality that serves all languages. It also selectively includes functionalities useful within specific families of languages. For example, searching within sentences and paragraphs is useful to many western languages and to some non-western languages, so that functionality is incorporated into this specification.

Organization of this document

This document is organized as follows. We first present a high-level syntax for the XQuery 1.0 and XPath 2.0 Full-Text language along with some examples. Then, we present the syntax and examples of the basic primitives in the XQuery 1.0 and XPath 2.0 Full-Text language. This is followed by the semantics of the XQuery 1.0 and XPath 2.0 Full-Text language. The appendix contains a section that provides an EBNF for the XPath 2.0 Grammar with Full-Text extensions, an EBNF for the XQuery 1.0 Grammar with Full-Text extensions, a list of issues, acknowledgements, and a glossary.

A word about namespaces

Certain namespace prefixes are predeclared by XQuery 1.0 and, by implication, by this specification, and bound to fixed namespace URIs. These namespace prefixes are as follows:

xml = http://www.w3.org/XML/1998/namespace

xs = http://www.w3.org/2001/XMLSchema

xsi = http://www.w3.org/2001/XMLSchema-instance

fn = http://www.w3.org/2005/xpath-functions

xdt = http://www.w3.org/2005/xpath-datatypes

local = http://www.w3.org/2005/xquery-local-functions

In addition to the prefixes in the above list, this document uses the prefix err to represent the namespace URI http://www.w3.org/2005/xqt-errors. This namespace prefix is not predeclared and its use in this document is not normative. Error codes that are not defined in this document are defined in other XQuery 1.0 and XPath 2.0 specifications.

Finally, this document uses the prefix fts to represent a namespace containing a number of functions used in this document to describe the semantics of various Full-Text operators. Because those functions are not required to be implemented by Full-Text implementations, there is no URI associated with the prefix.

Full-Text Extensions to XQuery and XPath

The languages of XQuery and XPath are extended in three ways. First, we introduce a new kind of expression, called FTContainsExpr. Second, the syntax of FLWOR expressions in XQuery and of for expressions in XPath is enhanced with optional score variables, which allow a query to refer to the score of an evaluation. Third, static context declarations for full-text match options are added to the query prolog.

Expression FTContainsExpr

The XQuery and XPath languages are extended by adding the expression FTContainsExpr. An FTContainsExpr is in many ways similar to a comparison expression. This is the grammar rule which introduces FTContainsExpr.

An FTContainsExpr can be used anywhere a ComparisonExpr could be used in the original languages of XQuery and XPath. Moreover, an FTContainsExpr has higher precedence than the comparison operators, meaning that you can compare the Boolean result of such an expression without the need to enclose it in parentheses.

FTContainsExpr Description

An expression of the form FTContainsExpr returns a Boolean value. It returns true if there is some node in RangeExpr that matches FTSelection. For the purpose of determining a match, some parts of the structure dominated by nodes in RangeExpr may be ignored, as specified in FTIgnoreOption. The precise semantics of matching is described in the semantics section of this document.

Expressions of the form FTSelection are composed of the following ingredients:

words or combinations of words, that are the search strings to be found as matches

match options, such as case sensitivity or indication to use stop words

Boolean operators, which allow an FTSelection to be composed from simpler FTSelections

constraints on the positions of matches, such as indication of match distance or window, or on the cardinality of matches.

FTContainsExpr Examples

The following example returns the author of each book whose title contains a word with the same root as dog and the word cat.

for $b in /books/book
where $b/title ftcontains ("dog" with stemming) && "cat"
return $b/author

The same example in XPath 2.0:

/books/book[title ftcontains ("dog" with stemming) && "cat"]/author

Score Variables

Besides specifying what constitutes a match of a full-text search as a Boolean condition, full-text search applications typically also require the ability to associate scores with the results. Such scores are meant to express grades of relevance of those results to the full-text search conditions. To this end we introduce score variables as follows.

The XQuery language is extended by adding optional score variables to the for and let clauses of FLWOR expressions. Let us consider the enhanced for clause first.

When a score variable is present in a for clause, the evaluation of the expression following the in keyword must determine not only the result sequence of the expression, i.e., the sequence of items to which the for variable is iteratively bound, but also, for each such item, a relevance "score" of the evaluation. This value is what the score variable gets bound to.

In the following example, book elements are determined that satisfy the condition [content ftcontains "web site" && "usability" and .//chapter/title ftcontains "testing"]. The relevance of each such book element with respect to that query is returned.

for $b score $s in /books/book[content ftcontains "web site" && "usability" and .//chapter/title ftcontains "testing"]
return $s

Scores are typically used as an ordering criterion, as in the following, more complete example.

for $b score $s in /books/book[content ftcontains "web site" && "usability"]
where $s > 0.5
order by $s descending
return <result> <title> {$b//title} </title> <score> {$s} </score> </result>

The score variable always gets bound to a value of type xs:float in the range [0, 1]. The value reflects the relevance of the match criteria in the FTSelections to the nodes in the respective RangeExprs. The way relevance is calculated is left implementation-dependent, but score evaluation must follow these rules:

Score values are of type xs:float in the range [0, 1].

For score values greater than 0, a higher score must imply a higher degree of relevance.

Similar to their use in a for clause, score variables may be specified in a let clause. A score variable in a let clause again gets bound to the score of the expression evaluation, but in this case one score needs to be determined for the complete result. In the case of the let clause, the syntax also allows the let variable to be dropped if the score variable is present, as it is expected to be a common use case to be interested only in the score and not in the value of an expression.

While the score option in a for clause conveniently allows the filtering expression that drives the iteration to serve at the same time as the expression that determines the scores, it is possible to separate the filtering expression from the scoring expression using the let clause syntax. The following is an example of this.

for $b in /books/book[.//chapter/title ftcontains "testing"]
let score $s := $b/content ftcontains "web site" && "usability"
order by $s descending
return <result score="{$s}">{$b}</result>

Here an iteration over book elements is defined, such that the chapter titles of those books satisfy the FTSelection "testing". These books are scored with respect to another condition, namely that their content elements contain "web site" and "usability".

Another aspect of scoring which we want to illuminate with this example is that the score of an FTContainsExpr is not required to be 0 if the expression evaluates to false. In the example, note that a result element is produced even for books that do not satisfy the expression in the let clause. While for such books the score is likely to be 0, this need not be the case. An implementation may want to assign a non-zero score to a book that contains only "web site", but not "usability", as this may be considered more relevant than a book that contains neither.

In XPath 2.0 we extend the for expression in the same way with optional score variables. The first example above is actually also a legal example of the XPath 2.0 extension.

The use of score variables introduces a second-order aspect to the evaluation of expressions which cannot be emulated by (first-order) XQuery functions. Consider replacing the clause let score $s := FTContainsExpr with the following:

let $s := score(FTContainsExpr)

where a function score is applied to some FTContainsExpr. Being a first-order function, score is only applied to the result of the evaluation of its argument, which is one of the Boolean constants true or false in our case. Hence, there can be at most two possible values score will return and no further differentiation is possible.
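The limitation of a first-order scoring function can be illustrated by analogy in Python. The sketch below is not part of this specification; the score function, the sample strings, and the matching condition are all hypothetical. Because the function receives only the Boolean outcome of the match, it cannot tell a strong match from a weak one.

```python
def score(result: bool) -> float:
    # Hypothetical first-order scoring function: it sees only the Boolean
    # result of the match, never the text that produced it.
    return 1.0 if result else 0.0

books = [
    "web site usability and more usability",
    "web site usability",
    "cooking for beginners",
]

# Stand-in for evaluating an FTContainsExpr on each book.
matches = ["usability" in b.split() and "web" in b.split() for b in books]

scores = {score(m) for m in matches}
print(sorted(scores))  # [0.0, 1.0]: at most two distinct values are possible
```

The first two books receive the same score although the first mentions "usability" twice, which is the differentiation that a score variable is meant to make possible.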

Using Weights Within a Scored FTContainsExpr

Scoring can be influenced by adding weight declarations to the individual search terms, as in the following example (detailed syntax is given in Section 3.1).

for $b in /books/book
let score $s := $b/content ftcontains ("web site" weight 0.2) && ("usability" weight 0.8)
return <result score="{$s}">{$b}</result>

The effect of weights on the result score is also implementation-dependent. However, these two rules must be followed.

Only the relative values of the weights in an FTContainsExpr with respect to each other are significant.

When no explicit weight is specified, the default weight is 0.5.

Weight declarations in an FTContainsExpr for which no scores are evaluated are ignored.

Extensions to the Static Context

The XQuery Static Context is extended by a component for each of the match options. Thus, the default of a match option in a query can be adjusted by providing a setting in the static context using the following declaration syntax. Match options are used to control the operational semantics of the actual full-text operations and are described in detail in the section on FTMatchOptions. When a match option is specified in a query directly as described below, that setting overrides the setting of the respective match option in the static context.

FTSelection and FTMatchOptions

This section describes FTSelection, which gives the full-text selection expressions used in the FTContainsExpr, and the match options in FTMatchOptions that are used to adjust the matching semantics of the full-text selection expressions.

FTSelection

The FTSelection production specifies all permitted kinds of full-text search conditions.

In the following we will define the syntax and semantics of the individual full-text selection operators and provide some examples based on an example document presented in Section 3.1.1.

FTSelection Example

We will use the following XML document as an example throughout this section.

<book number="1">
  <title shortTitle="Improving Web Site Usability">Improving the Usability of a Web Site Through Expert Reviews and Usability Testing</title>
  <author>Millicent Marigold</author>
  <author>Montana Marigold</author>
  <editor>Véra Tudor-Medina</editor>
  <content>
    The usability of a Web site is how well the site supports the users in achieving specified goals. A Web site should facilitate learning, and enable efficient and effective task completion, while propagating few errors.
    <note>This book has been approved by the Web Site Users Association.</note>
  </content>
</book>

FTWords

FTWords specifies the words and phrases that are being searched for in the searched text that is provided as the left-hand side argument of FTContainsExpr.

The right-hand side Expr of the above production must evaluate to a sequence of string values or nodes of type "xs:string". The result of the Expr is then atomized into a sequence of strings, which is then tokenized into a sequence of phrases (see section 2.x.x for details). If the atomized sequence is not a subtype of "xs:string*", a type error [err:XPTY0004] is raised.

If the "any" option is specified, then a match occurs, if and only if at least one phrase in the sequence has a match in the searched text.

If the "all" option is specified, then a match occurs, if and only if all of the phrases in the sequence of phrases are matched in the searched text.

If the "phrase" option is specified, then the sequence of phrases is used to create a single phrase by concatenating the phrases and interleaving whitespace. A match occurs, if and only if the resulting phrase is matched in the searched text.

If the "any word" option is specified, then a match occurs, if and only if at least one word in the sequence of phrases is matched in the searched text.

If the "all word" option is specified, then a match occurs, if and only if all words in the sequence of phrases are matched in the searched text.

If no option is specified, then "any" is implied as default.

Note that if Expr results in a single string, the default and "any", "all" and "phrase" are equivalent.

The case in which Expr results in the empty sequence or the tokenization results in a zero-length phrase is discussed in the issue zero-length-phrase (Cluster G, Issue 47).
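The five FTWords options can be summarized with the following Python sketch. It is a simplified illustration, not the formal semantics: the tokenizer is a hypothetical one (lowercased runs of letters and digits, hence case-insensitive matching), the function names are assumptions, and the treatment of empty input is arbitrary because that case is still an open issue.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(tokens, phrase):
    # True if the words of phrase occur consecutively in tokens.
    n = len(phrase)
    return n > 0 and any(tokens[i:i + n] == phrase
                         for i in range(len(tokens) - n + 1))

def ft_words(text, strings, option="any"):
    tokens = tokenize(text)
    phrases = [tokenize(s) for s in strings]      # one phrase per string
    words = [w for p in phrases for w in p]       # all words of all phrases
    if option == "any":
        return any(contains_phrase(tokens, p) for p in phrases)
    if option == "all":
        return all(contains_phrase(tokens, p) for p in phrases)
    if option == "phrase":
        # Concatenate all phrases into a single phrase.
        return contains_phrase(tokens, words)
    if option == "any word":
        return any(w in tokens for w in words)
    if option == "all word":
        return all(w in tokens for w in words)
    raise ValueError("unknown option: " + option)

title = ("Improving the Usability of a Web Site Through "
         "Expert Reviews and Usability Testing")
print(ft_words(title, ["Expert", "Reviews"], "all"))       # True
print(ft_words(title, ["Web Site Usability"]))             # False
print(ft_words(title, ["Web Site Usability"], "all word"))  # True
```

The last two calls show the difference between matching a phrase and matching its individual words: the title contains "web", "site", and "usability", but not consecutively in that order.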

Note: The results assume a case-insensitive match in the following expressions.

/book[@number="1" and ./title ftcontains "Expert"]

returns the book element, because the phrase "Expert" is contained in the title child.

/book[@number="1" and ./title ftcontains "Expert Reviews"]

returns the book element, because the phrase "Expert Reviews" is contained in the title child.

/book[@number="1" and ./title ftcontains {"Expert", "Reviews"} all]

also returns the book element, because the two phrases "Expert" and "Reviews" are both contained in the title child.

/book[@number="1"]//p ftcontains "Web Site Usability"

returns the empty sequence, because the p element in the book element doesn't contain the phrase "Web Site Usability" though it contains all of the words in the phrase.

for $book in /book[.//author ftcontains "Marigold"]
let score $score := $book/title ftcontains "Web Site Usability"
where $score > 0.8
order by $score descending
return $book/@number

returns numbers of the most relevant book elements by Marigold with a title about "Web Site Usability" sorted by descending score.

FTOr

FTOr finds matches that satisfy at least one of the input selection criteria.

Any match should satisfy at least one of the FTSelection criteria.

/book[.//author ftcontains "Millicent" || "Voltaire"]

returns book elements written by "Millicent" or "Voltaire". The book element of our sample document is returned, because it is written by "Millicent".

FTAnd

FTAnd finds matches that satisfy two selection criteria simultaneously.

Any match must satisfy all of the FTSelection criteria which are specified by one or more FTUnaryNot expressions.

/book[@number="1"]/title ftcontains ("usability" && "testing") case insensitive

returns true for our sample document, because the text of the title element contains "usability" and "testing", if we ignore the letter case (see FTCaseOption for more details on case sensitivity).

/book[@number="1"]/author ftcontains "Millicent" && "Montana"

returns false, because "Millicent" and "Montana" are not contained by the same author element of the book element.

FTUnaryNot

FTUnaryNot finds matches that do not contain the words and phrases being searched for in the searched text that is provided as the left-hand side argument of FTContainsExpr.

This is unary negation. Only one operand is required.

/book[. ftcontains "information" && "retrieval" && ! "information retrieval"]

returns book elements containing "information" and "retrieval" but not "information retrieval".

/book[. ftcontains "web site usability" && !"usability testing"]

returns book elements about "web site usability" but not "usability testing".

FTMildNegation

FTMildNegation is a milder form of "&& !". 'a not in b' matches an expression that contains a on its own, and not just as part of b. For example, to find articles that mention Mexico, one might search for '"Mexico" not in "New Mexico"'. '"Mexico" not in "New Mexico"' matches any Expr that contains "Mexico" on its own. An Expr that contains "New Mexico" is not excluded from the result, since it may mention "Mexico" as well. An Expr that contains "Mexico" only as part of the phrase "New Mexico" will not match '"Mexico" not in "New Mexico"'.

A match to FTMildNegation must contain at least one word occurrence that satisfies the first condition and does not satisfy the second condition. If it contains a word occurrence that satisfies both the first and the second condition, the occurrence is not considered as a result.

/book[@number="1" and . ftcontains "usability" not in "usability testing"]

returns the book since occurrences of "usability" appear in the title and the p elements of the book, even if the occurrence within the phrase "Usability Testing" in the title element is not considered.
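The following Python sketch illustrates the idea behind mild negation for simple phrases. It is a simplification under stated assumptions, not the formal semantics: the tokenizer is hypothetical (lowercased runs of letters and digits), and an occurrence of the first phrase is discarded when its word positions are entirely covered by a single occurrence of the second phrase.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def occurrences(tokens, phrase):
    # Each occurrence is the set of word positions it covers.
    words = tokenize(phrase)
    n = len(words)
    if n == 0:
        return []
    return [set(range(i, i + n))
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == words]

def not_in(text, wanted, excluded):
    tokens = tokenize(text)
    excluded_occurrences = occurrences(tokens, excluded)
    for occurrence in occurrences(tokens, wanted):
        # Keep the occurrence unless it lies inside an excluded occurrence.
        if not any(occurrence <= other for other in excluded_occurrences):
            return True
    return False

print(not_in("I flew to New Mexico", "Mexico", "New Mexico"))                # False
print(not_in("From New Mexico we drove to Mexico", "Mexico", "New Mexico"))  # True
```

The second text still matches although it contains "New Mexico", because it also contains an occurrence of "Mexico" that is not part of that phrase.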

The right-hand side of an FTMildNegation cannot contain an FTSelection which evaluates to an AllMatches that contains a StringExclude as defined in the Formal Semantics section. Such FTSelections are FTUnaryNot and FTTimes with at-most, from-to, and exactly occurrence ranges.

FTOrder

FTOrder enforces that the order of word occurrences in the match is the same as their order in the query.

By default, there are no restrictions on the order in which the query words are matched in the document.

FTOrder imposes such an order. A match must satisfy the nested selection condition and the match must contain the words in the order specified in the query.

/book[. ftcontains ("web site" && "usability") ordered]/title

returns titles of book elements that contain "web site" and "usability" in the order in which they appear in the query, i.e., "web site" must precede "usability".

/book[@number="1"]/title ftcontains ("Montana" && "Millicent") ordered

returns false, because although "Montana" and "Millicent" appear in the title element, they do not appear in the order specified in the query.
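The effect of the ordered constraint on two single query words can be sketched in Python as follows. The sketch is illustrative only: the tokenizer and the function names are hypothetical, and it handles just two single words rather than arbitrary FTSelections.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def word_positions(tokens, word):
    return [i for i, t in enumerate(tokens) if t == word.lower()]

def unordered_match(text, first, second):
    # Both words occur somewhere, in any order.
    tokens = tokenize(text)
    return bool(word_positions(tokens, first)) and bool(word_positions(tokens, second))

def ordered_match(text, first, second):
    # Some occurrence of first precedes some occurrence of second.
    tokens = tokenize(text)
    return any(i < j
               for i in word_positions(tokens, first)
               for j in word_positions(tokens, second))

text = "The usability of a Web site is how well the site supports the users"
print(unordered_match(text, "site", "usability"))  # True
print(ordered_match(text, "usability", "site"))    # True
print(ordered_match(text, "site", "usability"))    # False
```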

FTScope

FTScope specifies a condition on the scope of the occurrences of the matched words.

FTScope specifies whether any matched word in FTSelection should be directly contained in the same ('same') or different ('different') scope.

Possible scopes are sentence (e.g., delimited by ".", "!", or "?"), and paragraph (e.g., delimited by blank lines and EOLN/CR characters). Sentences and paragraphs are defined in the introduction.

By default, there is no restriction on the scope of the occurrences, i.e., they may occur in any sentence or paragraph. FTScope is used to restrict this scope.

If two words appear in the same sentence and in different sentences then both 'same sentence' and 'different sentence' return true. The same thing applies to the 'paragraph' scope.
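That 'same sentence' and 'different sentence' can both hold for the same pair of words is shown by the following Python sketch. It assumes a naive sentence splitter (a sentence ends with ".", "!", or "?" followed by whitespace) and a hypothetical word tokenizer; neither is prescribed by this specification.

```python
import re
from itertools import product

def sentence_index(text):
    # Map each lowercased word to the numbers of the sentences it occurs in.
    index = {}
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for number, sentence in enumerate(sentences):
        for word in re.findall(r"[a-z0-9]+", sentence.lower()):
            index.setdefault(word, []).append(number)
    return index

def same_sentence(text, a, b):
    index = sentence_index(text)
    return any(x == y for x, y in product(index.get(a, []), index.get(b, [])))

def different_sentence(text, a, b):
    index = sentence_index(text)
    return any(x != y for x, y in product(index.get(a, []), index.get(b, [])))

content = ("The usability of a Web site is how well the site supports the users "
           "in achieving specified goals. A Web site should facilitate learning, "
           "and enable efficient and effective task completion, while propagating "
           "few errors.")
print(same_sentence(content, "site", "errors"))            # True
print(different_sentence(content, "site", "errors"))       # True
print(same_sentence(content, "usability", "errors"))       # False
```

"site" occurs in both sentences and "errors" only in the second, so one pair of occurrences shares a sentence while another pair does not.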

/book[@number="1" and . ftcontains "usability" && "Marigold" same sentence]

will not return the book element, because the words "usability" and "Marigold" are not contained within the same sentence.

/book[@number="1" and . ftcontains "usability" && "Marigold" different sentence]

will return the book element, because the words "usability" and "Marigold" are contained within different sentences.

/book[. ftcontains "usability" && "testing" same paragraph]

returns book elements mentioning "usability" and "testing" in the same paragraph.

/book[. ftcontains "site" && "errors" same sentence]

returns the book element, because "site" and "errors" appear in the same sentence. Note that the book is returned even though there is another occurrence of "site", namely the one in the title element, which does not appear in the same sentence as the occurrence of "errors".

Some subtle relationships between FTScope and FTDistance will be discussed in the semantics section.

FTDistance

FTDistance limits the distance in number of words, sentences, or paragraphs between consecutive occurrences of the words in FTSelection. These correspond to "word distance", "sentence distance", and "paragraph distance" forms of FTDistance.

FTRange specifies a range of integer values, providing a minimum and maximum value that defines the distance limits. Each UnionExpr in an FTRange must evaluate (after atomization) to a singleton sequence with an atomic value of type "xs:integer". Otherwise, a type error [err:XPTY0004] is raised.

Let the value of the first (or only) UnionExpr be M. If "from" is specified, let the value of the second UnionExpr be N.

FTDistance may cross element boundaries when computing distance:

Zero words means adjacent.

Zero sentences means the same sentence.

Zero paragraphs means the same paragraph.

If "exactly" is specified, then the range is the closed interval [M, M]. If "at least" is specified, then the range is the half-closed interval [M, unbounded). If "at most" is specified, then the range is the closed interval [0, M]. If "from" is specified, then the range is the closed interval [M, N]. For example:

'exactly 0' specifies the range [0, 0].

'at least 1' specifies the range [1, unbounded).

'at most 1' specifies the range [0, 1].

'from 5 to 10' specifies the range [5, 10].
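The mapping from the four FTRange forms to intervals can be written down as a small Python sketch. The function names and the use of None for an unbounded upper limit are assumptions made for this illustration.

```python
from typing import Optional, Tuple

Interval = Tuple[int, Optional[int]]  # (lower bound, upper bound or None)

def ft_range(kind: str, m: int, n: Optional[int] = None) -> Interval:
    # m is the value of the first (or only) operand, n of the second.
    if kind == "exactly":
        return (m, m)
    if kind == "at least":
        return (m, None)   # None stands for "unbounded"
    if kind == "at most":
        return (0, m)
    if kind == "from":
        if n is None:
            raise ValueError("'from' requires a second value")
        return (m, n)
    raise ValueError("unknown range kind: " + kind)

def in_range(value: int, bounds: Interval) -> bool:
    low, high = bounds
    return value >= low and (high is None or value <= high)

print(ft_range("exactly", 0))      # (0, 0)
print(ft_range("at least", 1))     # (1, None)
print(ft_range("at most", 1))      # (0, 1)
print(ft_range("from", 5, 10))     # (5, 10)
```

A distance, window size, or occurrence count can then be tested with in_range, for example in_range(7, ft_range("from", 5, 10)).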

The distances computed by FTDistance are not affected by the presence or absence of element boundaries in the text over which the distances are computed. Stop words are included in those computations.

/book[. ftcontains ("information" && "retrieval") not in ("information" && "retrieval" distance at least 11 words)]

returns book elements containing "information" and "retrieval" and discards those occurrences of the words that are more than 10 words apart.

/book[. ftcontains "web" && "site" && "usability" distance at most 2 words]/title

returns the titles of book elements mentioning "web", "site", and "usability" with at most 2 intervening words between consecutive occurrences of the words.

/book[@number="1" and . ftcontains "web site" && "usability" distance at most 1 words]/title

returns the title element; a similar query for the p element would return the empty sequence when stop words are not ignored, because its occurrences of "web site" and "usability" are only within a word distance of 2.
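Word distance can be illustrated with the following Python sketch, which counts the words strictly between two phrase occurrences, so that adjacent occurrences have distance zero. The tokenizer and the function names are hypothetical, stop words are counted, and only two phrases are handled.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def positions(tokens, phrase):
    # (start, end) word positions of every occurrence of phrase.
    n = len(phrase)
    if n == 0:
        return []
    return [(i, i + n - 1)
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == phrase]

def word_distance(a, b):
    # Number of words strictly between two (start, end) occurrences.
    first, second = sorted([a, b])
    return max(second[0] - first[1] - 1, 0)

def min_distance(text, phrase_a, phrase_b):
    tokens = tokenize(text)
    distances = [word_distance(a, b)
                 for a in positions(tokens, tokenize(phrase_a))
                 for b in positions(tokens, tokenize(phrase_b))]
    return min(distances) if distances else None

sentence = "The usability of a Web site is how well the site supports the users"
print(min_distance(sentence, "usability", "web site"))              # 2
print(min_distance("Web Site Usability", "web site", "usability"))  # 0
```

In the sample sentence the words "of" and "a" lie between "usability" and "Web site", giving a distance of 2, so a constraint of "distance at most 1 words" is not satisfied there.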

FTWindow

FTWindow imposes the constraint that a match must occur within a window of the document of a given size.

FTWindow limits the window size in units of words, sentences, or paragraphs.

FTWindow may cross element boundaries when computing window sizes.

UnionExpr must evaluate to an atom of type "xs:integer".

A match of an FTSelection is considered a match within a window, if there exists a window of the given number of consecutive units (words, sentences, or paragraphs) in the document within which the match lies.
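For a conjunction of single words and a window measured in words, the window condition can be sketched in Python as follows. This is a simplified illustration with a hypothetical tokenizer and hypothetical function names: a match is one position per query word, and it lies within a window of the given size if the span from its first to its last position covers no more words than the window allows.

```python
import re
from itertools import product

def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def within_window(text, words, size):
    if not words:
        return False
    tokens = tokenize(text)
    # Positions of every occurrence of each query word.
    occurrences = [[i for i, t in enumerate(tokens) if t == w.lower()]
                   for w in words]
    # Try every combination that picks one occurrence per query word.
    for combination in product(*occurrences):
        if max(combination) - min(combination) + 1 <= size:
            return True
    return False

title = ("Improving the Usability of a Web Site Through "
         "Expert Reviews and Usability Testing")
print(within_window(title, ["web", "site", "usability"], 5))  # True
print(within_window(title, ["web", "site", "usability"], 4))  # False
```

In the title, "Usability of a Web Site" spans exactly five words, so the match fits a window of 5 words but not a window of 4.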

/book[./title ftcontains "web" && "site" && "usability" window 5 words]/@number

returns the numbers of book elements containing "web", "site", and "usability" in their title within a window of 5 words.

/book[. ftcontains ("web" && "site" ordered) && ("usability" || "testing") window 10 words]

returns book elements that contain "web" and "site" in this order plus either "usability" or "testing" and all the matched words occur within a window of at most 10 words.

/book//*[. ftcontains "web site" && "usability" window 3 words]

returns the title element, because it contains "Web Site Usability"; the p element will not be returned, because its occurrences of "web site" and "usability" are not within a window of 3.

/book[@number="1" and . ftcontains "efficient" && ! "and" window 3 words]

returns the empty sequence, because in the selected book element there is no occurrence of "efficient" in a window of 3 words which would not also contain an occurrence of "and".
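For the positive part of a match, the window condition can be sketched in Python as follows. The sketch covers only the unit "words" and only words that must be present; it does not model the handling of negated words illustrated in the last example. The function name is hypothetical.

```python
def within_window(positions, size):
    """True if all matched word positions fit inside `size` consecutive words."""
    if not positions:
        # Nothing has to be located, so any window trivially suffices.
        return True
    return max(positions) - min(positions) + 1 <= size
```

With this definition, words at positions 1, 2, and 3 lie within a window of 3 words, whereas words at positions 1, 2, and 4 need a window of at least 4 words.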

FTTimes

FTTimes controls the number of times a specified FTSelection must be matched.

FTTimes limits the number of different occurrences of FTSelection, which must be within the specified range.

An occurrence of the criterion is a distinct set of word occurrences that satisfies it.

The FTSelection '("very big")' has one occurrence in the text fragment "very very big": it consists of the second "very" and "big".

The FTSelection '"very" && "big"' has two occurrences in the text fragment "very very big": one consisting of the first "very" and "big", and the other containing the second "very" and "big".

The FTSelection '"very" || "big"' has 3 occurrences in "very very big".

The FTSelection '!"small"' has 1 occurrence in "very very big".
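The occurrence counts given above for the phrase, conjunction, and disjunction can be reproduced with the following Python sketch. It is a simplification under stated assumptions: the text is already tokenized, matching is by exact string equality, and negation is not covered. All function names are hypothetical.

```python
from itertools import product


def positions(tokens, word):
    """1-based positions at which a word occurs."""
    return [i for i, t in enumerate(tokens, start=1) if t == word]


def phrase_occurrences(tokens, phrase):
    """Each occurrence is the tuple of adjacent positions that spell the phrase."""
    n = len(phrase)
    return [
        tuple(range(i + 1, i + n + 1))
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == phrase
    ]


def and_occurrences(tokens, words):
    """Each occurrence picks one position for every word of the conjunction."""
    return list(product(*(positions(tokens, w) for w in words)))


def or_occurrences(tokens, words):
    """Each occurrence is a single position of any word of the disjunction."""
    return [(p,) for w in words for p in positions(tokens, w)]


TEXT = ["very", "very", "big"]
```

Applied to `TEXT`, the phrase "very big" yields one occurrence, the conjunction yields two, and the disjunction yields three.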

/book[. ftcontains "usability" occurs at least 2 times]/@number

returns the numbers of the book elements that contain 2 or more occurrences of "usability".

/book[@number="1" and title ftcontains "usability" || "testing" occurs at most 3 times]

returns false, because "usability" has 3 occurrences and "testing" has 1 occurrence; therefore, there are 4 occurrences of "usability" || "testing".

/book[@number="1" and . ftcontains "usability" occurs at least 2 times]

returns the book element, because its title element contains 3 occurrences of "usability" although its p element contains only one occurrence.

FTContent

FTContent finds matches when the words and phrases are the first, last or all of the words and phrases in the tokenized string value of the element that is being searched.

The "at" "start" option finds matches when the words or phrases are the first words or phrases in the tokenized string value of the element that is being searched.

The "at" "end" option finds matches when the words or phrases are the last words or phrases in the tokenized string value of the element that is being searched.

The "entire" "content" option finds matches when the words or phrases are the entire content of the tokenized string value of the element that is being searched.

/books//title[. ftcontains "improving the usability of a web site" at start]

returns each title element starting with the phrase "improving the usability of a web site".

/books//p[. ftcontains "propagat*" && "few errors" distance at most 2 words at end]

returns each p element ending with the phrase "propagating few errors".

/books//note[. ftcontains "this site has been approved by the web site users association" entire content]

returns each note element where "this site has been approved by the web site users association" is the entire content of the tokenized string of that element.
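For a single phrase, the three options can be sketched in Python over lists of tokens. The tokenizer below (lower-casing and splitting on non-word characters) is an assumption made for illustration; actual tokenization is implementation-dependent. The function names are hypothetical, and combinations of several words or phrases are not modeled.

```python
import re


def tokenize(text):
    """Assumed tokenizer: lower-case the text and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def at_start(tokens, phrase):
    """The phrase is the first sequence of words of the element."""
    return tokens[:len(phrase)] == phrase


def at_end(tokens, phrase):
    """The phrase is the last sequence of words of the element."""
    n = len(phrase)
    return n <= len(tokens) and tokens[len(tokens) - n:] == phrase


def entire_content(tokens, phrase):
    """The phrase is the complete tokenized content of the element."""
    return tokens == phrase
```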

FTMatchOptions

FTMatchOptions modify the operational semantics of the FTSelection they are applied on.

FTMatchOptions set an environment for the matching options of FTSelection. If a match option isn't specified directly in the query, its value is given by its static context component. Details about these context components, including their default values, are given in Appendix .

As a result of these default values of the match options, when no ft-option declarations are present, the query:

/book/title ftcontains "usability"

is equivalent to the query

/book/title ftcontains "usability" case insensitive diacritics insensitive without stemming without thesaurus without stop words language "none" without wildcards

FTMatchOptions are applied in the order in which they are given in the query. More information on their semantics is given in .

We illustrate each match option in more detail in the following sections.

FTCaseOption

FTCaseOption controls the way words are matched with regard to letter case.

The option "lowercase" ("uppercase") specifies that only words in lower-case (upper-case) letters can be matched exactly. The option "case insensitive" specifies that matching word occurrences can have both small and capital letters; their case is ignored. The option "case sensitive" specifies that the case of the letters in the result must match the case of the letters in the word from the query.

The default is "case insensitive".

The following table summarizes the interaction between the case match option and the use of the default collation.

Case Matrix
Default collation options/Case options	UCC (Unicode Codepoint Collation)	CCS (some generic case-sensitive collation)	CCI (some generic case-insensitive collation)
insensitive	compare as if both lower	case-insensitive variant of CCS if it exists, else error	CCI
sensitive	UCC	CCS	case-sensitive variant of CCI if it exists, else error
uppercase	uppercase(Expr) + UCC	uppercase(Expr) + CCS	CCI
lowercase	lowercase(Expr) + UCC	lowercase(Expr) + CCS	CCI

In this table, "else error" means "Otherwise, an error [err:FOCH0002] is raised.". The phrase "if it exists" is used, because the case-sensitive collation CCS does not always have a case-insensitive variant (and, even if one exists, it may not be possible to determine it algorithmically), and because the case-insensitive collation CCI does not always have a case-sensitive variant (and, even if one exists, it may not be possible to determine it algorithmically).

/book[@number="1"]/title ftcontains "Usability" lowercase

returns false, because the title element doesn't contain "usability" (in lower case).

/book[@number="1"]/title ftcontains "usability" case insensitive

returns true, because the case of the letters is not considered.
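The UCC column of the case matrix can be sketched in Python for a single query word and a single document word. This is an illustration, not a normative definition: Python string equality stands in for the Unicode Codepoint Collation, and the function name is hypothetical. The other columns depend on collations and are not modeled.

```python
def case_match(query_word, doc_word, option="case insensitive"):
    """Compare one query word with one document word under a case option."""
    if option == "case insensitive":
        # "compare as if both lower"
        return query_word.lower() == doc_word.lower()
    if option == "case sensitive":
        return query_word == doc_word
    if option == "uppercase":
        # uppercase(Expr) + UCC: only the query side is converted
        return query_word.upper() == doc_word
    if option == "lowercase":
        # lowercase(Expr) + UCC: only the query side is converted
        return query_word.lower() == doc_word
    raise ValueError(f"unknown case option: {option}")
```

Assuming the title contains the word as "Usability", `case_match("Usability", "Usability", "lowercase")` is false, in line with the first example above, while `case_match("usability", "Usability")` is true.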

FTDiacriticsOption

FTDiacriticsOption controls the way words are matched with regard to the use of diacritic symbols.

The option "with" ("without") "diacritics" specifies that only words that contain (do not contain) diacritics can be matched exactly. The option "diacritics insensitive" specifies that there are no restrictions on the matching word occurrences with regards to diacritic symbols: letters containing diacritics can be matched with their non-diacritics counterparts and vice versa. The option "diacritics sensitive" specifies that the diacritic symbols must match the symbols in the word from the query.

The default is "diacritics insensitive".

The following table summarizes the interaction between the diacritics match option and the use of the default collation.

Diacritics Matrix
Default collation options/Diacritics options	UCC (Unicode Codepoint Collation)	CDS (some generic diacritics-sensitive collation)	CDI (some generic diacritics-insensitive collation)
insensitive	compare as if with and without	diacritics-insensitive variant of CDS if it exists, else error	CDI
sensitive	UCC	CDS	diacritics-sensitive variant of CDI if it exists, else error
with diacritics	"resume diacritic insensitive" not in "resume"	"resume diacritic insensitive" not in "resume"	CDI
without diacritics	"resume" not in "resume diacritic sensitive"	"resume" not in "resume diacritic sensitive"	CDI

In this table, "else error" means "Otherwise, an error [err:FOCH0002] is raised.". The phrase "if it exists" is used, because the diacritics-sensitive collation CDS does not always have a diacritics-insensitive variant (and, even if one exists, it may not be possible to determine it algorithmically), and because the diacritics-insensitive collation CDI does not always have a diacritics-sensitive variant (and, even if one exists, it may not be possible to determine it algorithmically).

/book[@number="1"]//editor ftcontains "Vera" with diacritics

returns the editor element.

/book[@number="1"]/editors ftcontains "Véra" without diacritics

returns false.

FTStemOption

FTStemOption controls the use of stemming during string matching.

FTStemOption influences the way FTWords is applied. It produces a disjunction of the query words by expanding the words into the list of words that share the same stem. By definition, the query words are included in that disjunction.

When the "with stemming" option is present, string matches may also contain words that have the same stem as the query string. It is implementation-defined what a stem of a word is.

The clause "without stemming" turns off the use of stemming when words are matched.

It is implementation-defined whether the stemming will be based on an algorithm, a dictionary, or a mixed approach.

The default is "without stemming".

/book[@number="1"]/title ftcontains "improve" with stemming

returns true, because it contains "improving" that has the same stem as "improve".

FTThesaurusOption

FTThesaurusOption controls the use of thesauri during string matching.

FTThesaurusOption influences the way FTWords is applied.

The StringLiteral following the keyword at in FTThesaurusID is of the form of a URI Reference.

The use of thesauri allows for substitutes in FTWords of any search token or sequence of such tokens (a phrase) with related tokens or phrases. The related tokens or phrases can be obtained using a thesaurus, taxonomy, soundex, ontology, or topic map. Thus, the user can narrow, broaden, or otherwise modify the search using synonyms, hypernyms (more generic terms), etc. The search is performed as though the user has specified all related search tokens in a disjunction (FTOr).

The thesauri used in an XQuery 1.0 and XPath 2.0 implementation may be standards-based or locally-defined.

It is implementation-defined how a thesaurus is represented. This includes files in a predefined format, or modules using a common interface.

Relationships include, but are not limited to, the relationship terms and their abbreviations presented in and their equivalents in other languages:

equivalence relationships (synonyms): PREFERRED TERM (USE), NONPREFERRED USED FOR TERM (UF);

hierarchical relationships: BROADER TERM (BT), NARROWER TERM (NT), BROADER TERM GENERIC (BTG), NARROWER TERM GENERIC (NTG), BROADER TERM PARTITIVE (BTP), NARROWER TERM PARTITIVE (NTP), TOP Terms (TT); and

associative relationships: RELATED TERM (RT).

FTThesaurusID allows the number of levels to be queried in hierarchical relationships to be specified by including an FTRange "levels". If no levels are specified, the default is to query all levels in hierarchical relationships.

When the "with thesaurus" match option is specified, string matches also include words that can be found in one of the specified thesauri and that correspond to the query string.

The statement "without thesaurus" instructs the query engine not to use thesauri when matching words.

When the option "with default thesaurus" is specified, a system-defined default thesaurus with a system-defined relationship is used. The default thesaurus can also be used in combination with other explicitly specified thesauri.

The default is "without thesaurus".

doc("http://bstore1.example.com/full-text.xml") /books/book[count(.//introduction ftcontains "quote" with thesaurus at "http://bstore1.example.com/UsabilityThesaurus.xml" relationship "synonyms")>0]

finds all introductions which quote someone.

doc("http://bstore1.example.com/full-text.xml") /books/book[count(./content ftcontains "web site components" with thesaurus at "http://bstore1.example.com/UsabilityThesaurus.xml" relationship "narrower terms" at most 2 levels)>0]

finds all books with text on improving "web site components". Also finds books with the words "navigation" and "layout".

doc("http://bstore1.example.com/full-text.xml") /books/book[count(. ftcontains "Merrygould" with thesaurus at "http://bstore1.example.com/UsabilitySoundex.xml" relationship "sounds like")>0]

finds all books with words which sound like "Merrygould". This includes answers containing "Merigold".

FTStopWordOption

FTStopWordOption controls the use of stop words (frequent functional words such as "a", "an", "the" that are ignored) during string matching.

FTStopWordOption influences the way FTWords is applied.

FTRefOrList specifies the list of stop words either explicitly, as a comma-separated list of string literals, or by a URI following the keyword at. If a URI is used, it must point to a sequence of string atoms or nodes of type "xs:string". In both cases, no tokenization is performed on the strings: they are used as they occur in the sequence.

When the "with stop words" option is used, if a query word is within the specified collection of stop words, it should be ignored. However, when the stop word appears in a query phrase, or other query operation that is sensitive to the distance of query tokens, the position of the stop word is not ignored. In such a case, the stop word will match any word in the document.

When the option "with default stop words" is used, an implementation-defined collection of stop words is used. Stop word lists can be combined using the usual semantics of "except" and "union".

The option "without stop words" turns off stop word processing. This is equivalent to specifying an empty list of stop words.

The default is "without stop words".

/book[@number="1"]//p ftcontains "propagation of errors" with stemming with stop words ("a", "the", "of")

returns true, because the document contains the matching tokens "propagating few errors". Note the asymmetry in the stop word semantics: the property of being a stop word is only relevant to query terms, not to document terms. Hence, it is irrelevant for the above-mentioned match whether "few" is a stop word or not, and on the other hand we do not want the query above to match "propagation" followed by 2 stop words, or even a sequence of 3 stop words in the document.

/book[@number="1"]//p ftcontains "propagation of errors" with stemming without stop words

returns false.

doc("http://bstore1.example.com/full-text.xml") /books/book[count(.//content ftcontains "planning then conducting" with stop words at "http://bstore1.example.com/StopWordList.xml")>0]

uses the stop words list specified at the URL. Assuming that the specified stop word list contains the word "then", this query is reduced to a query on the phrase "planning X conducting", allowing any word as a substitute for X.

doc("http://bstore1.example.com/full-text.xml") /books/book[count(.//content ftcontains "planning then conducting" with stop words at "http://bstore1.example.com/StopWordList.xml" except ("the", "then"))>0]

will find "planning then conducting" in the sample data, but not "planning and conducting", because it is exempting "then" from being a stop word.
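The asymmetric treatment of stop words in phrase matching can be sketched in Python as follows. The sketch assumes pre-tokenized, lower-cased input and exact string comparison (no stemming); the function name is hypothetical. A query token that is a stop word keeps its position in the phrase but matches any document word, whereas document words are never treated as stop words.

```python
def phrase_matches(doc_tokens, query_tokens, stop_words=frozenset()):
    """Return the 1-based start positions at which the query phrase matches."""
    n = len(query_tokens)
    hits = []
    for start in range(len(doc_tokens) - n + 1):
        window = doc_tokens[start:start + n]
        # A stop word in the query matches any document word at that position.
        if all(q in stop_words or q == d for q, d in zip(query_tokens, window)):
            hits.append(start + 1)
    return hits


QUERY = ["planning", "then", "conducting"]
STOP_LIST = frozenset({"the", "then", "of"})      # hypothetical list contents
REDUCED = STOP_LIST - frozenset({"the", "then"})  # effect of "except"
```

With `STOP_LIST`, the query matches both "planning then conducting" and "planning and conducting". With `REDUCED`, "then" is an ordinary query word again, so only the first text matches.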

FTLanguageOption

FTLanguageOption specifies the language of the query words.

FTLanguageOption influences the way FTWords is applied.

The StringLiteral following the keyword language can only designate one language. It must either be castable to "xs:language", or be the value "none". Otherwise, a type error [err:XPTY0004] is raised.

Language can have implications for various aspects of string matching. These include how tokenization into words is performed, how stemming is performed, and which words can be considered to be stop words. In particular, the language option may determine the default thesaurus and stop word sets.

If language "none" is specified, this means that there is no language selected; otherwise, it should be a valid identifier of a language. The set of valid language identifiers is implementation-defined.

By default, there is no language selected.

/book[@number="1"]//editor ftcontains "salon de the" with default stop words language "fr"

This is an example where the language option is used to select the appropriate stop word list.

FTIgnoreOption

FTIgnoreOption specifies a set of element nodes whose content should be ignored. The set of nodes is identified by the XQuery expression Expr that should evaluate to a sequence of element nodes. This "ignore" is done recursively which means that ignored elements will also be searched and within those elements, other elements may be ignored.

FTIgnoreOption changes the semantics of phrase matching. It does not have an impact on a single word search.

If FTIgnoreOption is specified, the entire subtree contained by each of the selected elements is ignored for the purpose of searching a phrase at a given level. For example, "Web <b>Site</b> Usability" is matched by "Web Usability" if the option is "without content .//b". However, "Web Site" will not be matched. If the XQuery sub-expression evaluates to an empty sequence, no words from element content are ignored.

FTIgnoreOption is applied recursively. For example, if the option is "without content .//b", "Web <b>This is my Web Site</b> Site Usability" is matched twice by "Web Site". Ignoring an element does not mean that it will not be searched; it means that it is ignored when searching its parent element.

More generally, if .//annotation is ignored, "Web Usability" will be found 5 times in the following fragment:

<book> <title>Web Usability and Practice</title> <author>Montana <annotation> this author is an expert in Web Usability</annotation> Marigold</author> <editor>Véra Tudor-Medina on Web <annotation> best editor on Web Usability</annotation>Usability</editor> <content> Web Usability is defined as how well the site supports the users in achieving specified goals. </content> </book>

By default element content is not ignored.

FTWildCardOption

FTWildCardOption controls the use of wildcards, which append or insert a character or sequence of characters to a word (or part of a word). It influences the way words in FTWords are interpreted.

In addition to specifying the "with wildcards" option, indicators (represented by periods (.)) and qualifiers are appended to or inserted into words being searched. Zero or more characters replace each indicator and qualifier.

Indicators are mandatory. When the "with wildcards" option is present, one or more periods (.) must be appended at the beginning or end of words or inserted into words. If the period is at the beginning of a word, the wildcard is a prefix wildcard. If the period is at the end of a word, it is a suffix wildcard. If the period is inserted into a word, it is an infix wildcard.

When the "with wildcards" option and one or more periods (.) appended to or inserted into words are present, characters are appended or inserted at each of the periods. Any characters may be appended or inserted except newline characters (#xA), return characters (#xD), and tab characters (#x9). The number of characters depends on the qualifier. Qualifiers available are none, question mark, asterisk, plus sign, and two numbers separated by a comma, both enclosed by curly braces.

If a period is present, but no qualifiers, any one character is appended or inserted.

If a period is followed by a question mark (.?), zero or one characters are appended or inserted.

If a period is followed by an asterisk (.*), zero or more characters are appended or inserted.

If a period is followed by a plus sign (.+), one or more characters are appended or inserted.

If a period is followed by two numbers separated by a comma, both enclosed by curly braces (.{n,m}), a specified range of characters is appended or inserted.

The option "without wildcards" finds words without recognizing wildcard indicators and qualifiers. Periods, question marks, asterisks, plus signs, and two numbers separated by a comma, both enclosed by curly braces, are recognized as regular characters.

The default is "without wildcards".

/book[@number="1"]/title ftcontains "improv.*" with wildcards

returns true, because it contains "improving".

/book[@number="1"]/title ftcontains ".?site" with wildcards

returns true, because it contains "site".

/book[@number="1"]/p ftcontains "w.ll" with wildcards

returns true, because it contains "well".
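Because the indicators and qualifiers resemble regular expression syntax, a query word with wildcards can be translated into a regular expression. The Python sketch below shows one possible translation. It assumes case-insensitive matching of whole words and does not enforce the rule that at least one indicator must be present; the function name is hypothetical.

```python
import re

# Any character except newline (#xA), return (#xD), and tab (#x9).
_ANY = r"[^\n\r\t]"
# Qualifiers that may follow an indicator: ?, *, +, or {n,m}.
_QUALIFIER = re.compile(r"[?*+]|\{\d+,\d+\}")


def wildcard_to_regex(word):
    """Translate a query word with wildcard indicators into a compiled regex."""
    parts = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch == ".":
            i += 1
            m = _QUALIFIER.match(word, i)
            qualifier = m.group(0) if m else ""
            parts.append(_ANY + qualifier)
            i += len(qualifier)
        else:
            # Every other character stands for itself.
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts), re.IGNORECASE)


def wildcard_match(query_word, doc_word):
    """True if the whole document word matches the wildcard query word."""
    return wildcard_to_regex(query_word).fullmatch(doc_word) is not None
```

Under these assumptions, "improv.*" matches "improving", ".?site" matches "site", and "w.ll" matches "well".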

Semantics Introduction

This section describes the formal semantics of XQuery 1.0 and XPath 2.0 Full-Text. The figure below shows how XQuery 1.0 and XPath 2.0 Full-Text integrates with XQuery and XPath.

The arrow (1) represents the composability of the XQuery and XPath expressions. It is described in the XQuery language specification. Regular XQuery expressions can be nested inside FTSelections (arrow (2)) by evaluating them to a sequence of items and then converting that sequence to tokenized text, depending on the role they play in an XQuery 1.0 and XPath 2.0 Full-Text expression. The process is described in Nested XQuery and XPath Expressions. As with arrow (1), FTSelections are fully composable (arrow (3)). The composability is achieved by evaluating FTSelections to AllMatches. Each FTSelection operates on zero or more AllMatches and returns AllMatches. The process is described in the Evaluation of FTSelections section. Finally, the result of the evaluation of XQuery 1.0 and XPath 2.0 Full-Text and scoring expressions needs to be integrated in the XPath and XQuery model (arrow (4)). The section XQuery 1.0 and XPath 2.0 Full-Text and scoring expressions describes how this is achieved.

All functions and schemata defined in this section are considered to be within the fts: namespace. These functions and schemata are used only for the purpose of describing the semantics. They need not be available directly to users, and there is no requirement that implementations should actually provide these functions and schemata. For this reason, no URI is associated with the fts: prefix.

Nested XQuery and XPath Expressions

The following section discusses the nesting of XQuery and XPath expressions inside FTContainsExpr.

The general rule is that the nested XQuery and XPath expressions are evaluated to a sequence of items before the evaluation of FTContainsExpr. The sequence of items must satisfy certain constraints depending on the context in which it is used. These constraints are described below.

Left-hand Side of a FTContainsExpr

Full-text queries are performed on text which has been tokenized, i.e., broken into a sequence of words, units of punctuation, and spaces. The tokenization is applied on the string value of the evaluation of the left-hand side of the FTContainsExpr expression.

FTWords

The XQuery expression nested inside an FTWords must evaluate to a sequence of string values after applying atomization (otherwise the entire FTSelection causes a type error [err:XPTY0004] to be raised). Then, FTWords performs tokenization on the string values from the sequence.

FTRangeSpec

The XQuery expression (or expressions, in the case of a "from-to" range) must evaluate to a singleton sequence of integers after applying atomization (otherwise the entire FTSelection causes a type error [err:XPTY0004] to be raised). The resulting integer values are treated as boundaries for the corresponding range.

FTStopWordOption

The XQuery expression must evaluate to a sequence of string values after applying atomization (otherwise the entire FTSelection causes a type error [err:XPTY0004] to be raised). The resulting string values are treated as stop words that must be ignored during string matching.

FTThesaurusOption

The XQuery sub-expression must evaluate to a sequence of string values after applying atomization (otherwise, the entire FTSelection causes a type error [err:XPTY0004] to be raised). The resulting string values are treated as names of thesauri to use during string matching.

FTLanguageOption

The XQuery sub-expression must evaluate to either an empty sequence or a singleton sequence of a string value after applying atomization (otherwise the entire FTSelection causes a type error [err:XPTY0004] to be raised). The resulting string value is treated as a language identifier specifying the language of the matched document/documents.

Tokenization

Tokenization is the process of converting a string to a sequence of TokenInfos.

A TokenInfo is the identity of a word occurrence inside an XML document. Each TokenInfo is associated with:

the word it identifies: word

a unique identifier that captures the relative position of the word in the document order: pos

the relative position of the sentence containing the word: sentence

the relative position of the paragraph containing the word: para

The tokenization is performed by the formal semantics functions:

function fts:getTokenInfo( $searchContext as node(), $matchOptions as fts:FTMatchOptions, $searchToken as fts:TokenInfo) as fts:TokenInfo*

The above function returns all the TokenInfos in nodes in $searchContext that match the search string in $searchToken when using the match options in $matchOptions. The match options that occur at the beginning of the list should be applied before match options that occur later in the list.

function fts:getSearchTokenInfo( $searchString as xs:string, $matchOptions as fts:FTMatchOptions) as fts:TokenInfo*

The above function tokenizes the search string $searchString and returns a sequence of TokenInfo that describe the sequence of tokens in the search string.

A compliant implementation should provide implementations of the above functions.

As an illustration, consider the following XML fragment:

<offers> <offer id="1000" price="10000"> Ford Mustang 2000, 65K, excellent condition, runs great, AC, CC, power all </offer> <offer id="1001" price="8000"> Honda Accord 1999, 78K, A/C, cruise control, runs and looks great, excellent condition </offer> <offer id="1005" price="5500"> Ford Mustang, 1995, 150K highway mileage, no rust, excellent condition </offer> </offers>

If we assume that words are delimited by punctuation and whitespace symbols (as in English), the first word "Ford" from the first element content will be assigned a TokenInfo with relative position of 1, the word "Mustang" will be assigned a TokenInfo with relative position of 2, the word "2000" will be assigned a TokenInfo with a relative position of 3, and so on. The relative positions of the TokenInfos are shown below in parentheses.

<offers> <offer id="1000" price="10000"> Ford(1) Mustang(2) 2000(3), 65K(4), excellent(5) condition(6), runs(7) great(8), AC(9), CC(10), power(11) all(12) </offer> <offer id="1001" price="8000"> Honda(13) Accord(14) 1999(15), 78K(16), A(17)/C(18), cruise(19) control(20), runs(21) and(22) looks(23) great(24), excellent(25) condition(26) </offer> <offer id="1005" price="5500"> Ford(27) Mustang(28), 1995(29), 150K(30) highway(31) mileage(32), no(33) rust(34), excellent(35) condition(36) </offer> </offers>

The relative positions of paragraphs are determined similarly. Assuming that the paragraph delimiters are start tag, end tag, and end of line characters, the words in the first element's content will be assigned a paragraph relative number 1, the words from the following element content will be assigned a relative number 2, and so on.

The relative positions of sentences are also determined similarly using sentence delimiters such as ".", "!", and "?".
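The numbering described above can be reproduced with a small Python sketch. It is one possible tokenizer under the assumptions stated in this section (words delimited by punctuation and whitespace, sentences delimited by ".", "!", and "?", and one paragraph per element content); real tokenization is implementation-dependent. The class and function names are illustrative only and are not the fts: functions of this specification.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenInfo:
    word: str
    pos: int       # relative position of the word in document order
    sentence: int  # relative position of the containing sentence
    para: int      # relative position of the containing paragraph


def tokenize(paragraphs):
    """Assign word, sentence, and paragraph positions to a list of paragraph texts."""
    tokens = []
    pos = 0
    sentence = 0
    for para_no, text in enumerate(paragraphs, start=1):
        # Split the paragraph into sentences at ".", "!" and "?".
        for sentence_text in re.split(r"[.!?]", text):
            words = re.findall(r"\w+", sentence_text)
            if not words:
                continue
            sentence += 1
            for w in words:
                pos += 1
                tokens.append(TokenInfo(w, pos, sentence, para_no))
    return tokens


OFFERS = [
    "Ford Mustang 2000, 65K, excellent condition, runs great, AC, CC, power all",
    "Honda Accord 1999, 78K, A/C, cruise control, runs and looks great, excellent condition",
    "Ford Mustang, 1995, 150K highway mileage, no rust, excellent condition",
]
TOKENS = tokenize(OFFERS)
```

Here `TOKENS` contains 36 TokenInfos; "Mustang" receives positions 2 and 28, and "A/C" is split into two words at positions 17 and 18.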

Evaluation of FTSelections

The XQuery/XPath data model of a "sequence of nodes" is inadequate for fully composable FTSelections. The main reason is that full-text operations (such as FTSelections) operate on linguistic units, such as positions of words, and such information is not captured in the XQuery/XPath data model. We thus define AllMatches that allows for fully compositional FTSelections.

AllMatches Formal Model

An AllMatches object describes all the possible results of an FTSelection. The UML Static Class diagram of AllMatches is shown on the diagram.

The AllMatches object contains zero or more Matches. Each Match describes one result to the FTSelection. The result is described in terms of zero or more StringIncludes and zero or more StringExcludes, which describe the TokenInfos that must be contained and respectively, those that must not be contained. Both StringInclude and StringExclude are of type StringMatch, which describes a possible match of a query search token with a document word. The queryString attribute of StringMatch contains the query search token that has been matched. The queryPos attribute specifies the position of this search token in the query (this attribute is needed for FTOrders). The TokenInfo associated with the StringMatch describes the word in the document that matches the query search token.

Intuitively, AllMatches specifies the TokenInfos that a node should contain, and the TokenInfos that a node should not contain, in order to satisfy an FTSelection.

The AllMatches structure resembles the Disjunctive Normal Form (DNF) in propositional and first-order logic. The AllMatches is a disjunction of Matches. Each Match is a conjunction of positive "atoms", the StringIncludes, and negative "atoms", the StringExcludes.

Examples

Consider the FTWords "Mustang" evaluated over the sample document fragment in the previous section. The AllMatches corresponding to this FTWords is shown in figure below.

As shown, the AllMatches consists of two Matches. Each Match represents one possible result of the FTWords "Mustang". The result represented by the first Match contains (represented as StringInclude) the word "Mustang" at position 2. The result described by the second Match contains the word "Mustang" at position 28.

Let us now consider a more complex example. Consider the FTWords "Ford Mustang" evaluated over the XML fragment used above. The AllMatches for this FTWords is shown on the figure below.

There are two possible results of this FTWords, and these are represented by the two Matches. Each of the Matches requires two words to be matched. The result corresponding to the first Match is obtained by matching "Ford" at position 1 and matching "Mustang" at position 2. Similarly, the result described by the second Match is obtained by matching "Ford" at position 27 and "Mustang" at position 28.

Let us now consider a more sophisticated example of an AllMatches. Consider the FTSelection "Mustang" && ! "rust" that searches for nodes that contain "Mustang" but not "rust". The AllMatches for this FTSelection is shown in the figure below.

Observe the use of StringExclude. This is the component that corresponds to negation. It specifies that the result described by the corresponding Match should not match the word at the specified position. For instance, the first Match specifies the solution that "Mustang" should be matched at position 2, and "rust" should not be matched at position 34.
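The structure of the last example can be modeled with a few Python data classes. This is a simplified sketch and not the formal semantics: the helper functions handle only single query words, the negation helper applies only to a single word, and the sentence and paragraph numbers in the sample data are assumed values. All names are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List


@dataclass(frozen=True)
class TokenInfo:
    word: str
    pos: int
    sentence: int
    para: int


@dataclass(frozen=True)
class StringMatch:
    query_string: str
    query_pos: int
    token_info: TokenInfo


@dataclass
class Match:
    string_includes: List[StringMatch] = field(default_factory=list)
    string_excludes: List[StringMatch] = field(default_factory=list)


@dataclass
class AllMatches:
    matches: List[Match] = field(default_factory=list)


def occurrences(tokens, word, query_pos):
    """StringMatches for every document occurrence of a single query word."""
    return [StringMatch(word, query_pos, t) for t in tokens if t.word.lower() == word.lower()]


def word_included(tokens, word, query_pos):
    """One Match per occurrence: a disjunction of positive atoms."""
    return AllMatches([Match([sm]) for sm in occurrences(tokens, word, query_pos)])


def word_excluded(tokens, word, query_pos):
    """A single Match that excludes every occurrence of the word."""
    return AllMatches([Match([], occurrences(tokens, word, query_pos))])


def conjunction(left, right):
    """Combine every Match on the left with every Match on the right."""
    return AllMatches([
        Match(a.string_includes + b.string_includes,
              a.string_excludes + b.string_excludes)
        for a, b in product(left.matches, right.matches)
    ])


# Only the relevant tokens of the sample fragment; sentence/para are assumed.
TOKENS = [
    TokenInfo("Mustang", 2, 1, 1),
    TokenInfo("Mustang", 28, 3, 3),
    TokenInfo("rust", 34, 3, 3),
]
RESULT = conjunction(word_included(TOKENS, "Mustang", 1),
                     word_excluded(TOKENS, "rust", 2))
```

`RESULT` holds two Matches. Each includes one occurrence of "Mustang" (position 2 or position 28) and excludes "rust" at position 34.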

XML representation

AllMatches has a well-defined hierarchical structure. Therefore, the AllMatches can be easily modeled in XML. In subsequent sections, we will use this XML representation to formally describe the semantics of FTSelections. In particular, we will use the XML representation of AllMatches to formally specify how an FTSelection operates on zero or more AllMatches to produce a resulting AllMatches. We will also use the XML representation to specify the formal semantics of the FTContainsExpr.

The XML schema for representing AllMatches is given below:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
  <xs:complexType name="AllMatches">
    <xs:sequence>
      <xs:element name="match" type="fts:Match" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="stokenNum" type="xs:string" use="required" />
  </xs:complexType>
  <xs:complexType name="Match">
    <xs:sequence>
      <xs:element name="stringInclude" type="fts:StringMatch" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element name="stringExclude" type="fts:StringMatch" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="StringMatch">
    <xs:sequence>
      <xs:element name="tokenInfo" type="fts:TokenInfo"/>
    </xs:sequence>
    <xs:attribute name="queryString" type="xs:string" use="required"/>
    <xs:attribute name="queryPos" type="xs:integer" use="required"/>
  </xs:complexType>
  <xs:complexType name="TokenInfo">
    <xs:attribute name="word" type="xs:string" use="required"/>
    <xs:attribute name="pos" type="xs:integer" use="required"/>
    <xs:attribute name="para" type="xs:integer" use="required"/>
    <xs:attribute name="sentence" type="xs:integer" use="required"/>
  </xs:complexType>
</xs:schema>

Notice the use of the stokenNum attribute in AllMatches. This attribute was not previously discussed because it is related to the representation of the semantics as XQuery functions. Therefore, it is not considered part of the AllMatches model. Intuitively, the stokenNum attribute is used for keeping the number of search tokens used when evaluating the AllMatches. This value is used to compute the correct value for the queryPos attribute in new StringMatches.

FTSelections

In this section, we define the semantics of FTSelections. FTSelections are fully composable, and can be arbitrarily nested under other FTSelections. Also, each FTSelection can be associated with match options (such as stemming, stop words, etc.) and score weights. Since score weights are solely interpreted by the formal semantics scoring function, score weights do not influence the semantics of FTSelections in any way. We will thus not consider score weights when defining the formal semantics.

XML Representation

Here, we define the XML representation of the FTSelections as used in the fts:evaluate function. The XML representation closely follows the grammar of the language. It can be viewed as an XML representation of an abstract syntax tree (AST) of a parsed full-text search query. In general, every FTSelection is represented as an XML element. Every nested FTSelection is represented as a nested sub-element of that XML element. For binary FTSelections (e.g. FTAnd) the nested FTSelections are represented in <left> and <right> sub-elements. For unary FTSelections, a <selection> sub-element is used. Additional characteristics of FTSelections (e.g. the distance unit for FTDistance) are stored in attributes.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
  <xs:include schemaLocation="AllMatches.xsd" />
  <xs:include schemaLocation="MatchOptions.xsd" />
  <xs:complexType name="FTSelection">
    <xs:sequence>
      <xs:choice>
        <xs:element name="FTWords" type="fts:FTWords"/>
        <xs:element name="FTAnd" type="fts:FTAnd"/>
        <xs:element name="FTOr" type="fts:FTOr"/>
        <xs:element name="FTUnaryNot" type="fts:FTUnaryNot"/>
        <xs:element name="FTMildNot" type="fts:FTMildNot"/>
        <xs:element name="FTOrder" type="fts:FTOrder"/>
        <xs:element name="FTScope" type="fts:FTScope"/>
        <xs:element name="FTContent" type="fts:FTContent"/>
        <xs:element name="FTDistance" type="fts:FTDistance"/>
        <xs:element name="FTWindow" type="fts:FTWindow"/>
        <xs:element name="FTTimes" type="fts:FTTimes"/>
      </xs:choice>
      <xs:element name="matchOption" type="fts:FTMatchOption" minOccurs="0"/>
      <xs:element name="weight" type="xs:float" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTWords">
    <xs:sequence>
      <xs:element name="searchToken" type="fts:TokenInfo" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="type" type="fts:FTWordsType" use="required"/>
  </xs:complexType>
  <xs:complexType name="FTAnd">
    <xs:sequence>
      <xs:element name="left" type="fts:FTSelection"/>
      <xs:element name="right" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTOr">
    <xs:sequence>
      <xs:element name="left" type="fts:FTSelection"/>
      <xs:element name="right" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTUnaryNot">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <!-- FTMildNot is binary: fts:evaluate reads its left and right operands -->
  <xs:complexType name="FTMildNot">
    <xs:sequence>
      <xs:element name="left" type="fts:FTSelection"/>
      <xs:element name="right" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTOrder">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTScope">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
    <xs:attribute name="type" type="fts:ScopeType" use="required"/>
    <xs:attribute name="scope" type="fts:ScopeSelector" use="required"/>
  </xs:complexType>
  <xs:complexType name="FTContent">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
    <xs:attribute name="type" type="fts:ContentMatchType" use="required"/>
  </xs:complexType>
  <xs:complexType name="FTDistance">
    <xs:sequence>
      <xs:element name="range" type="fts:FTRangeSpec"/>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
    <xs:attribute name="type" type="fts:DistanceType" use="required"/>
  </xs:complexType>
  <xs:complexType name="FTWindow">
    <xs:sequence>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
    <xs:attribute name="size" type="xs:integer" use="required"/>
    <xs:attribute name="type" type="fts:DistanceType" use="required"/>
  </xs:complexType>
  <xs:complexType name="FTTimes">
    <xs:sequence>
      <xs:element name="range" type="fts:FTRangeSpec"/>
      <xs:element name="selection" type="fts:FTSelection"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="FTCaseOption">
    <xs:attribute name="value" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="lowercase"/>
          <xs:enumeration value="uppercase"/>
          <xs:enumeration value="case insensitive"/>
          <xs:enumeration value="case sensitive"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
  <xs:complexType name="FTRangeSpec">
    <xs:attribute name="type" type="fts:RangeSpecType" use="required"/>
    <xs:attribute name="m" type="xs:integer"/>
    <xs:attribute name="n" type="xs:integer" use="required"/>
  </xs:complexType>
  <xs:simpleType name="FTWordsType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="any"/>
      <xs:enumeration value="all"/>
      <xs:enumeration value="phrase"/>
      <xs:enumeration value="any word"/>
      <xs:enumeration value="all word"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="ScopeType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="same"/>
      <xs:enumeration value="different"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="ScopeSelector">
    <xs:restriction base="xs:string">
      <xs:enumeration value="paragraph"/>
      <xs:enumeration value="sentence"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="RangeSpecType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="exactly"/>
      <xs:enumeration value="at least"/>
      <xs:enumeration value="at most"/>
      <xs:enumeration value="from to"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="DistanceType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="paragraph"/>
      <xs:enumeration value="sentence"/>
      <xs:enumeration value="word"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="ContentMatchType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="at start"/>
      <xs:enumeration value="at end"/>
      <xs:enumeration value="entire content"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

The XML representation of the match options is discussed in the match options section.

The evaluate function

We present denotational semantics for the evaluation of FTSelections. Specifically, we define a function fts:evaluate that takes four parameters: (1) an FTSelection, (2) a search context node, (3) the default set of match options that apply to the evaluation of the FTSelection, and (4) the number of search tokens processed so far. The fts:evaluate function returns the AllMatches that is the result of evaluating the FTSelection. When fts:evaluate is applied to some FTSelection X, it calls the function fts:applyX to build the resulting AllMatches. If X is applied on nested FTSelections, the fts:evaluate function is recursively called on these nested FTSelections and the returned AllMatches are used in the evaluation of fts:applyX.

See the section Match Options Semantics for the semantics of the full-text match options.

We first present a high-level description of the fts:evaluate function, and then describe the details.

The fts:evaluate function is given below.

declare function fts:evaluate(
    $ftSelect as element(*, fts:FTSelection),
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchTokenNum as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($ftSelect/matchOption) > 0) then
    (: First we deal with all match options that the    :)
    (: FTSelection might bear: we add the match options :)
    (: in front of the current match options sequence   :)
    (: and pass the new sequence to the recursive call  :)
    let $newFTSelection := $ftSelect/*[fn:not(. instance of element(matchOption))]
    return fts:evaluate($newFTSelection, $searchContext,
                        ($ftSelect/matchOption, $matchOptions), $searchTokenNum)
  else if (fn:count($ftSelect/weight) > 0) then
    (: Weight has no bearing on semantics: just :)
    (: call "evaluate" on nested FTSelection    :)
    let $newFTSelection := $ftSelect/*[fn:not(. instance of element(weight))]
    return fts:evaluate($newFTSelection, $searchContext, $matchOptions, $searchTokenNum)
  else
    typeswitch ($ftSelect)
      case $nftSelection as element(FTWords) return
        (: Apply the FTWords in the search context :)
        applyFTWords($searchContext, $matchOptions,
                     $nftSelection/searchToken, $searchTokenNum + 1)
      case $nftSelection as element(FTAnd) return
        let $left := fts:evaluate($nftSelection/left, $searchContext, $matchOptions, $searchTokenNum)
        let $newSearchTokenNum := $left/@stokenNum
        let $right := fts:evaluate($nftSelection/right, $searchContext, $matchOptions, $newSearchTokenNum)
        return applyFTAnd($left, $right)
      case $nftSelection as element(FTOr) return
        let $left := fts:evaluate($nftSelection/left, $searchContext, $matchOptions, $searchTokenNum)
        let $newSearchTokenNum := $left/@stokenNum
        let $right := fts:evaluate($nftSelection/right, $searchContext, $matchOptions, $newSearchTokenNum)
        return applyFTOr($left, $right)
      case $nftSelection as element(FTUnaryNot) return
        (: The nested FTSelection is evaluated first, then negated :)
        let $nested := fts:evaluate($nftSelection/selection, $searchContext, $matchOptions, $searchTokenNum)
        return applyFTUnaryNot($nested)
      case $nftSelection as element(FTMildNot) return
        let $left := fts:evaluate($nftSelection/left, $searchContext, $matchOptions, $searchTokenNum)
        let $newSearchTokenNum := $left/@stokenNum
        let $right := fts:evaluate($nftSelection/right, $searchContext, $matchOptions, $newSearchTokenNum)
        return applyFTMildNot($left, $right)
      case $nftSelection as element(FTOrder) return
        let $nested := fts:evaluate($nftSelection/selection, $searchContext, $matchOptions, $searchTokenNum)
        return applyFTOrder($nested)
      case $nftSelection as element(FTScope) return
        let $nested := fts:evaluate($nftSelection/selection, $searchContext, $matchOptions, $searchTokenNum)
        return applyFTScope($nftSelection/@type, $nftSelection/@scope, $nested)
      case $nftSelection as element(FTContent) return
        let $nested := fts:evaluate($nftSelection/selection, $searchContext, $matchOptions, $searchTokenNum)
        return applyFTContent($searchContext, $nftSelection/@type, $nested)
      case $nftSelection as element(FTDistance) return
        let $nested := fts:evaluate($nftSelection/selection, $searchContext, $matchOptions, $searchTokenNum)
        return applyFTDistance($matchOptions, $nftSelection/@type, $nftSelection/range, $nested)
      case $nftSelection as element(FTWindow) return
        let $nested := fts:evaluate($nftSelection/selection, $searchContext, $matchOptions, $searchTokenNum)
        return applyFTWindow($matchOptions, $nftSelection/@type, $nftSelection/@size, $nested)
      case $nftSelection as element(FTTimes) return
        let $nested := fts:evaluate($nftSelection/selection, $searchContext, $matchOptions, $searchTokenNum)
        return applyFTTimes($nftSelection/range, $nested)
}

Let us now walk through the above pseudo-code to understand the semantics of the function. For concreteness, let us assume that the FTSelection was invoked inside an ftcontains expression such as searchContext ftcontains ftselection. In order to determine the AllMatches result of ftselection, the fts:evaluate function is invoked as follows: fts:evaluate($ftselection, $searchContext, $matchOptions, 0), where $ftselection is the XML representation of the ftselection and $searchContext is bound to the result of the evaluation of the XQuery expression searchContext.

Initially, $searchTokenNum is 0, i.e. no search tokens have been processed yet.

The $matchOptions above is the default, implementation-defined list of match options that apply to the evaluation of ftselection (such as stemming but not thesaurus). Match options embedded in ftselection can change the match options collection as evaluation proceeds. In order to express the order in which match options are applied to an FTSelection, the match options are organized in a stack. The top match option in the stack is to be applied first, the next match option is to be applied second, and so on. The ordering among match options is necessary because match options are not always commutative. For example, synonym(stem(word)) is not always the same as stem(synonym(word)). Of course, match options can be reordered when they commute, but this is an optimization issue and is beyond the scope of this semantics document.

Given the invocation fts:evaluate($ftselection, $searchContext, $matchOptions, 0), evaluation proceeds as follows. First, $ftselection is checked to see whether it is a match option applied on a nested FTSelection (case 1), a weight specification (case 2), an FTWords (case 3), or some other FTSelection (case 4). Let us consider these four cases in turn.

Case 1: If $ftselection contains a match option, then it modifies the context for the nested FTSelection. Consequently, a new match option element is created and pushed onto the top of the stack of match options. The createOptionElement function used to create a stack element corresponding to the match option simply creates a data structure that stores the type of match option (such as stemming, thesaurus, synonyms, ignore, etc.) and the details relating to the match option (such as the name of the thesaurus, the words to ignore, etc.). The context match option created is added to the top of the stack because, in the FTSelection, it was applied before the other match options in the current match options stack. The evaluate function is then invoked on the nested FTSelection with the new match options stack. When the function returns, the match option is popped from the stack, and the result of the nested evaluate function is returned. The match option is popped because the match options should not apply to FTSelections outside its scope.

Case 2: If $ftselection contains a weight specification, then the specification is simply ignored (because it does not alter semantics). The evaluate function is recursively called on the nested FTSelection and the resulting AllMatches is directly returned.

Case 3: If $ftselection is an FTWords, then it does not have any nested FTSelections. Consequently, this is the base of the recursive call, and the AllMatches result of the FTWords is computed and returned. The AllMatches is computed by invoking the applyFTWords function with the current search context and other necessary information. The semantics of how exactly the corresponding applyFTWords creates AllMatches for FTWords will be specified in the next section.

Case 4: If $ftselection contains neither a match option nor a weight specification and is not an FTWords, the FTSelection performs some form of full-text operation such as &&, ||, window, etc. Note that these operations are fully compositional and can be invoked on nested FTSelections. Consequently, evaluation proceeds as follows. First, the evaluate function is recursively invoked on each nested FTSelection. The result of evaluating each nested FTSelection is an AllMatches. These AllMatches are transformed into a result AllMatches by applying the full-text operation corresponding to the FTSelection being evaluated, which we call FTSelection1 (generically named applyX for some type of FTSelection X in the pseudo-code). As an example, let FTSelection1 be FTSelection2 && FTSelection3. Here FTSelection2 and FTSelection3 can themselves be arbitrarily nested FTSelections. Thus, evaluate is invoked on FTSelection2 and FTSelection3, and the resulting AllMatches are transformed to the output AllMatches using the applyFTAnd function corresponding to &&.

Note that specifying the semantics of the applyX function for each FTSelection is key to specifying the semantics of the FTSelection itself. In the subsequent sections, we define the semantics of the applyX function for each FTSelection kind X.
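The recursive shape of cases 3 and 4 can be illustrated with a small Python sketch. This is a simplification under stated assumptions: FTSelections are modelled as nested tuples, the document as a list of words with 1-based positions, and match options, weights and the search token counter are ignored. All names are hypothetical.

```python
from itertools import product


def evaluate(selection, tokens):
    """Return a list of Matches; each Match is a list of (word, position)."""
    kind = selection[0]
    if kind == "words":
        # Base case (case 3): one Match per occurrence of the search token.
        word = selection[1]
        return [[(word, i)] for i, t in enumerate(tokens, start=1) if t == word]
    # Case 4: evaluate the nested selections first, then combine the results.
    left = evaluate(selection[1], tokens)
    right = evaluate(selection[2], tokens)
    if kind == "or":
        return left + right
    if kind == "and":
        return [a + b for a, b in product(left, right)]
    raise ValueError("unsupported FTSelection kind: " + kind)


tokens = ["Ford", "Mustang", "is", "a", "Ford"]
query = ("and", ("words", "Ford"), ("words", "Mustang"))
result = evaluate(query, tokens)
```

Because every operator consumes and produces the same kind of value, selections nest to any depth without special cases.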

Formal semantics functions

The formal semantics of the ApplyX functions for each FTSelection kind X is specified in terms of the helper functions described below. How these functions are computed is implementation-defined, but the functions have to satisfy some well-defined properties. We first present the properties of the formal semantics functions, and then present the semantics of the family of functions applyX in terms of these functions.

The first function, getTokenInfo, has been described in the tokenization section.

The function wordDistance returns the number of words that occur between the positions of the TokenInfos $tokenInfo1 and $tokenInfo2. For example, two consecutive words have a distance of 0.

function fts:wordDistance( $tokenInfo1 as fts:TokenInfo, $tokenInfo2 as fts:TokenInfo, $matchOptions as fts:FTMatchOptions) as xs:integer
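As an illustration only, the property stated for wordDistance can be met by a function such as the following Python sketch. It assumes that word positions are consecutive integers and that no match option (for example, ignored content) alters the count. The real computation is implementation-defined.

```python
def word_distance(pos1: int, pos2: int) -> int:
    """Number of words strictly between two word positions."""
    # Consecutive words (positions p and p + 1) have distance 0.
    return max(abs(pos1 - pos2) - 1, 0)
```

The argument order does not matter, and a position compared with itself is treated as distance 0.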

Similarly, the function paraDistance returns the number of paragraphs that occur between the TokenInfos $tokenInfo1 and $tokenInfo2.

function fts:paraDistance( $tokenInfo1 as fts:TokenInfo, $tokenInfo2 as fts:TokenInfo, $matchOptions as fts:FTMatchOptions) as xs:integer

The function sentenceDistance returns the number of sentences that occur between the TokenInfos $tokenInfo1 and $tokenInfo2 .

function fts:sentenceDistance( $tokenInfo1 as fts:TokenInfo, $tokenInfo2 as fts:TokenInfo, $matchOptions as fts:FTMatchOptions) as xs:integer

The function isStartToken checks if the TokenInfo $tokenInfo describes the first token of the node $searchContext.

function fts:isStartToken( $searchContext as node(), $tokenInfo as fts:TokenInfo) as xs:boolean

The function isEndToken checks if the TokenInfo $tokenInfo describes the last token of the node $searchContext.

function fts:isEndToken( $searchContext as node(), $tokenInfo as fts:TokenInfo) as xs:boolean

FTWords

We first consider the case where FTWords consists of a single search string. The parameters of the applySingleSearchToken function are the search context, the list of match options, the search TokenInfo, and the position where the latter occurs in the query.

In general, for all cases of FTWords, if the sequence of search tokens is empty after the application of all FTMatchOptions, an empty AllMatches is returned.

declare function fts:applySingleSearchToken(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchToken as fts:TokenInfo,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  <allMatches stokenNum="{$queryPos}">
  {
    (: One Match per position at which the search token occurs :)
    let $token_pos := fts:getTokenInfo($searchContext, $matchOptions, $searchToken)
    for $pos in $token_pos
    return
      <match>
        <stringInclude queryPos="{$queryPos}" queryString="{$searchToken/@word}">
          {$pos}
        </stringInclude>
      </match>
  }
  </allMatches>
}

Intuitively, the AllMatches corresponding to an FTWords corresponds to a set of Matches, each of which is associated with a position where the corresponding search token was found. For example, the AllMatches result for the FTWords "Mustang" evaluated in the context of the sample document will be (in graphical terms):

The other cases can be rewritten as complex FTSelections that operate on single-string FTWords.

In the case of an FTWords with any word specified, the semantics is given below. Since FTWords does not have nested FTSelections, the ApplyFTWords function does not take in any AllMatches parameters corresponding to nested FTSelection results.

declare function fts:MakeDisjunction(
    $curRes as element(allMatches, fts:AllMatches),
    $rest as element(allMatches, fts:AllMatches)*)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($rest) = 0) then $curRes
  else
    let $firstAllMatches := $rest[1]
    let $restAllMatches := fn:subsequence($rest, 2)
    let $newCurRes := fts:ApplyFTOr($curRes, $firstAllMatches)
    return fts:MakeDisjunction($newCurRes, $restAllMatches)
}

declare function fts:ApplyFTWordsAnyWord(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  (: Tokenize every search string into one flat sequence of tokens :)
  let $searchTokens :=
    for $searchString in $searchStrings
    return fts:getSearchTokenInfo($searchString, $matchOptions)
  return
    if (fn:count($searchTokens) eq 0) then <allMatches stokenNum="0" />
    else
      let $allAllMatches :=
        for $searchToken at $pos in $searchTokens
        return fts:applySingleSearchToken($searchContext, $matchOptions,
                                          $searchToken, $queryPos + $pos - 1)
      let $firstAllMatches := $allAllMatches[1]
      let $restAllMatches := fn:subsequence($allAllMatches, 2)
      return fts:MakeDisjunction($firstAllMatches, $restAllMatches)
}

Intuitively, all search strings are tokenized and a single sequence that consists of all TokenInfos is constructed. For each of these, the result of FTWords is computed using applySingleSearchToken. Finally, the disjunction of all resulting AllMatches is computed.

Similarly, in the case of an FTWords with all word specified, the semantics is given below.

declare function fts:MakeConjunction(
    $curRes as element(allMatches, fts:AllMatches),
    $rest as element(allMatches, fts:AllMatches)*)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($rest) = 0) then $curRes
  else
    let $firstAllMatches := $rest[1]
    let $restAllMatches := fn:subsequence($rest, 2)
    let $newCurRes := fts:ApplyFTAnd($curRes, $firstAllMatches)
    return fts:MakeConjunction($newCurRes, $restAllMatches)
}

declare function fts:ApplyFTWordsAllWord(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  let $searchTokens :=
    for $searchString in $searchStrings
    return fts:getSearchTokenInfo($searchString, $matchOptions)
  return
    if (fn:count($searchTokens) eq 0) then <allMatches stokenNum="0" />
    else
      let $allAllMatches :=
        for $searchToken at $pos in $searchTokens
        return fts:applySingleSearchToken($searchContext, $matchOptions,
                                          $searchToken, $queryPos + $pos - 1)
      let $firstAllMatches := $allAllMatches[1]
      let $restAllMatches := fn:subsequence($allAllMatches, 2)
      return fts:MakeConjunction($firstAllMatches, $restAllMatches)
}

In the case of an FTWords with phrase specified, the semantics is given below.

declare function fts:ApplyFTWordsPhrase(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  let $conj := fts:ApplyFTWordsAllWord($searchContext, $matchOptions, $searchStrings, $queryPos)
  let $ordered := fts:ApplyFTOrder($conj)
  let $distance1 := fts:ApplyFTDistance($matchOptions, $ordered, <fts:range type="exactly" n="0"/>)
  return $distance1
}

The semantics of this function differs from the semantics of the functions

The above function is similar to the one in the case of all word. The only difference is that the additional FTSelections ordered and word distance exactly 0 (that is, consecutive words) are applied.

The semantics for the case of FTWords with any specified is given below.

declare function fts:ApplyFTWordsAny(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($searchStrings) eq 0) then <allMatches stokenNum="0" />
  else
    let $firstSearchString := $searchStrings[1]
    let $restSearchString := fn:subsequence($searchStrings, 2)
    let $firstAllMatches := fts:ApplyFTWordsPhrase($searchContext, $matchOptions,
                                                   $firstSearchString, $queryPos)
    let $newQueryPos := fn:max($firstAllMatches//@queryPos) + 1
    let $restAllMatches := fts:ApplyFTWordsAny($searchContext, $matchOptions,
                                               $restSearchString, $newQueryPos)
    return fts:ApplyFTOr($firstAllMatches, $restAllMatches)
}

Intuitively, the FTWords with any specified forms the disjunction of the AllMatches that are the result of the matching of each separate search string as a phrase.

Analogously, the semantics for the case of an FTWords with all specified is:

declare function fts:ApplyFTWordsAll(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($searchStrings) = 0) then <allMatches stokenNum="0" />
  else
    let $firstSearchString := $searchStrings[1]
    let $restSearchString := fn:subsequence($searchStrings, 2)
    let $firstAllMatches := fts:ApplyFTWordsPhrase($searchContext, $matchOptions,
                                                   $firstSearchString, $queryPos)
    return
      (: Stop at the last string: a conjunction with an empty :)
      (: AllMatches would discard every Match                :)
      if (fn:count($restSearchString) = 0) then $firstAllMatches
      else
        let $newQueryPos := fn:max($firstAllMatches//@queryPos) + 1
        let $restAllMatches := fts:ApplyFTWordsAll($searchContext, $matchOptions,
                                                   $restSearchString, $newQueryPos)
        return fts:ApplyFTAnd($firstAllMatches, $restAllMatches)
}

As before, the difference from the case of any is the use of conjunction instead of disjunction.

Finally, we define the function that combines all of the above cases.

declare function fts:ApplyFTWords(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $type as element(type, fts:FTWordsType),
    $searchTokens as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  if ($type eq "any word") then
    fts:ApplyFTWordsAnyWord($searchContext, $matchOptions, $searchTokens, $queryPos)
  else if ($type eq "all word") then
    fts:ApplyFTWordsAllWord($searchContext, $matchOptions, $searchTokens, $queryPos)
  else if ($type eq "phrase") then
    fts:ApplyFTWordsPhrase($searchContext, $matchOptions, $searchTokens, $queryPos)
  else if ($type eq "any") then
    fts:ApplyFTWordsAny($searchContext, $matchOptions, $searchTokens, $queryPos)
  else
    fts:ApplyFTWordsAll($searchContext, $matchOptions, $searchTokens, $queryPos)
}

FTOr

The parameters of the ApplyFTOr function are the two AllMatches parameters corresponding to the results of the two nested FTSelections. The search context and the match options stack are not used in this case. The function definition is given below.

declare function fts:ApplyFTOr(
    $allMatches1 as element(allMatches, fts:AllMatches),
    $allMatches2 as element(allMatches, fts:AllMatches))
  as element(allMatches, fts:AllMatches)
{
  <allMatches stokenNum="{fn:max(($allMatches1/@stokenNum, $allMatches2/@stokenNum))}">
    {($allMatches1/match, $allMatches2/match)}
  </allMatches>
}

The function creates a new AllMatches whose Matches are simply the union of those found in the input AllMatches. The rationale for this semantics is that each Match represents one possible "solution" to the corresponding FTSelection. Thus, if we "or" two AllMatches, a Match from either of the AllMatches should also be a solution.
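The union can be sketched in Python as follows. The dictionary layout is an assumption made for illustration. The positions 2 and 28 for "Mustang" come from the sample document, while the position 10 for "Honda" is a made-up value.

```python
def apply_ft_or(all1, all2):
    """Union of the Matches of both operands."""
    return {
        "stokenNum": max(all1["stokenNum"], all2["stokenNum"]),
        # Each Match of either operand is a solution of the disjunction.
        "matches": all1["matches"] + all2["matches"],
    }


mustang = {
    "stokenNum": 1,
    "matches": [[("include", "Mustang", 2)], [("include", "Mustang", 28)]],
}
honda = {
    "stokenNum": 2,
    "matches": [[("include", "Honda", 10)]],  # hypothetical position
}
result = apply_ft_or(mustang, honda)
```

No Match is modified or merged, so the result has exactly as many Matches as the two operands together.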

As an example, consider the FTSelection "Mustang" || "Honda" in the context of the sample document. The AllMatches corresponding to "Mustang" and "Honda" are:

The AllMatches produced by ApplyFTOr is:

FTAnd

The parameters of the ApplyFTAnd function are the two AllMatches parameters corresponding to the results of the two nested FTSelections. The search context and the match options are not used in this case. The function definition is given below.

declare function fts:ApplyFTAnd(
    $allMatches1 as element(allMatches, fts:AllMatches),
    $allMatches2 as element(allMatches, fts:AllMatches))
  as element(allMatches, fts:AllMatches)
{
  <allMatches stokenNum="{fn:max(($allMatches1/@stokenNum, $allMatches2/@stokenNum))}">
  {
    for $sm1 in $allMatches1/match
    for $sm2 in $allMatches2/match
    return <match> {$sm1/*, $sm2/*} </match>
  }
  </allMatches>
}

Intuitively, the result of a conjunction is a new AllMatches that contains the "Cartesian product" of the simple matches of the participating FTSelections. Every resulting Match is formed from the combination of the stringInclude components and stringExclude components from each of the AllMatches of the nested FTSelection conditions. Thus every simple match will contain the positions to satisfy a Match from both original FTSelections and will exclude the positions that will violate the same Matches.
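A minimal Python sketch of this Cartesian product follows. Matches are modelled as lists of (kind, word, position) tuples, which is an assumed layout. The occurrence of "rust" at position 34 is likewise an assumption, suggested by the earlier StringExclude example.

```python
from itertools import product


def apply_ft_and(matches1, matches2):
    """Pair every Match of the left operand with every Match of the right."""
    return [m1 + m2 for m1, m2 in product(matches1, matches2)]


mustang = [[("include", "Mustang", 2)], [("include", "Mustang", 28)]]
rust = [[("include", "rust", 34)]]  # assumed position
result = apply_ft_and(mustang, rust)
```

If either operand has no Matches, the product is empty, which reflects that a conjunction cannot be satisfied when one side has no solution.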

As an example let us consider the FTSelection "Mustang" && "rust" in the context of the sample document. The source AllMatches are:

The AllMatches produced by ApplyFTAnd is:

FTUnaryNot

The ApplyFTUnaryNot function takes a single AllMatches parameter corresponding to the result of the nested FTSelection to be negated. The search context and the match options are not used in this case. The function definition is given below.

declare function fts:InvertStringMatch($strm as element()) as element()
{
  if ($strm instance of element(stringExclude)) then
    <stringInclude queryPos="{$strm/@queryPos}" queryString="{$strm/@queryString}">
      {$strm/tokenInfo}
    </stringInclude>
  else
    <stringExclude queryPos="{$strm/@queryPos}" queryString="{$strm/@queryString}">
      {$strm/tokenInfo}
    </stringExclude>
}

declare function fts:UnaryNotHelper($matches as element(match)*) as element(match)*
{
  (: No source Match left: a single empty Match, which is always satisfied :)
  if (fn:empty($matches)) then <match />
  else
    for $sm in $matches[1]/child::element()
    for $rest in fts:UnaryNotHelper(fn:subsequence($matches, 2))
    return <match> {fts:InvertStringMatch($sm), $rest/*} </match>
}

declare function fts:ApplyFTUnaryNot(
    $allMatches as element(allMatches, fts:AllMatches))
  as element(allMatches, fts:AllMatches)
{
  <allMatches stokenNum="{$allMatches/@stokenNum}">
    {fts:UnaryNotHelper($allMatches/match)}
  </allMatches>
}

The generation of the resulting AllMatches of an FTUnaryNot resembles the transformation of the negation of a propositional formula in DNF back to DNF. The intuition is that negation of AllMatches requires the inversion of all the conditions on the nodes encoded by the AllMatches.

In the implementation above, this inversion is implemented as follows. The function fts:InvertStringMatch inverts a stringInclude into a stringExclude and vice versa. The function fts:UnaryNotHelper transforms the source Matches into the resulting Matches by combining the inversion of one stringInclude or stringExclude component from every source Match into a new Match.
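The DNF-to-DNF negation can be sketched in Python as below. It picks one component from every source Match, inverts it, and collects the picks into a new Match. The tuple layout and the position 10 for "Honda" are assumptions for illustration.

```python
from itertools import product


def invert(string_match):
    """Turn an include into an exclude and vice versa."""
    kind, word, pos = string_match
    return ("exclude" if kind == "include" else "include", word, pos)


def apply_unary_not(matches):
    """Negate a disjunction of conjunctions, yielding the same form."""
    # product(*matches) enumerates every way to pick one component per Match.
    # With no source Matches it yields one empty pick, i.e. one empty Match.
    return [[invert(sm) for sm in pick] for pick in product(*matches)]


source = [
    [("include", "Mustang", 2)],
    [("include", "Mustang", 28)],
    [("include", "Honda", 10)],  # hypothetical position
]
negated = apply_unary_not(source)
```

For ! ("Mustang" || "Honda") every source Match has a single component, so the result is one Match that excludes all three positions.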

As an example, let us consider the FTSelection ! ("Mustang" || "Honda") in the context of the sample document. The source AllMatches is:

The FTUnaryNot will transform it to:

FTMildNot

The parameters of the ApplyFTMildNot function are the two AllMatches parameters corresponding to the results of the two nested FTSelections. The search context and the match options stack are not used in this case. The function definition is given below.

declare function fts:ApplyFTMildNot(
    $allMatches1 as element(allMatches, fts:AllMatches),
    $allMatches2 as element(allMatches, fts:AllMatches))
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($allMatches2//stringExclude) gt 0) then
    fn:error("Invalid expression on the right-hand side of a not-in")
  else
    <allMatches stokenNum="{$allMatches1/@stokenNum}">
    {
      let $posSet2 := $allMatches2/match/stringInclude/tokenInfo/@pos
      return $allMatches1/match[
        every $pos1 in ./stringInclude/tokenInfo/@pos, $pos2 in $posSet2
        satisfies $pos1 ne $pos2]
    }
    </allMatches>
}

The resulting AllMatches consists of those Matches of the first operand that do not mention in their stringInclude components positions mentioned in a stringInclude component in the AllMatches of the second operand.
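As a rough illustration of this filtering, the following Python sketch models a Match as a hypothetical dictionary with "include" and "exclude" lists of token positions. The model and the function name ft_mild_not are assumptions made for this example only.

```python
def ft_mild_not(all_matches1, all_matches2):
    """Keep the Matches of the first operand that share no included
    position with any Match of the second operand."""
    if any(m["exclude"] for m in all_matches2):
        # Mirrors the error raised when the right-hand side has exclusions.
        raise ValueError("right-hand side of mild not must not contain exclusions")
    banned = {pos for m in all_matches2 for pos in m["include"]}
    return [m for m in all_matches1 if banned.isdisjoint(m["include"])]

# "Ford" at positions 1 and 27; positions 2 and 28 for "Mustang" are assumed.
ford = [{"include": [1], "exclude": []}, {"include": [27], "exclude": []}]
ford_mustang = [{"include": [1, 2], "exclude": []},
                {"include": [27, 28], "exclude": []}]
print(ft_mild_not(ford, ford_mustang))
```

Because both occurrences of "Ford" are part of an occurrence of "Ford Mustang", the sketch returns an empty list.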

As an example, consider the FTSelection ("Ford" mildnot "Ford Mustang") in the context of the sample document. The source AllMatches are:

The FTMildNot will transform these to empty AllMatches because both position 1 and position 27 from the first AllMatches contain only TokenInfos from stringInclude components of the second AllMatches.

FTOrder

The parameters of the ApplyFTOrder function are the search context, the list of match options, and one AllMatches parameter corresponding to the result of the nested FTSelections. The search context and the match options are not used in this case. The function definition is given below.

declare function fts:ApplyFTOrder($allMatches as element(allMatches, fts:AllMatches)) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match where every $stringInclude1 in $match/stringInclude, $stringInclude2 in $match/stringInclude satisfies (($stringInclude1/tokenInfo/@pos <= $stringInclude2/tokenInfo/@pos) and ($stringInclude1/@queryPos <= $stringInclude2/@queryPos)) or (($stringInclude1/tokenInfo/@pos >= $stringInclude2/tokenInfo/@pos) and ($stringInclude1/@queryPos >= $stringInclude2/@queryPos)) return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where every $stringIncl in $match/stringInclude satisfies (($stringExcl/tokenInfo/@pos <= $stringIncl/tokenInfo/@pos) and ($stringExcl/@queryPos <= $stringIncl/@queryPos)) or (($stringExcl/tokenInfo/@pos >= $stringIncl/tokenInfo/@pos) and ($stringExcl/@queryPos >= $stringIncl/@queryPos)) return $stringExcl } </match> } </allMatches> }

The resulting AllMatches contains all Matches of the parameter whose positions in the stringInclude elements are in the order of the query positions of their query strings. Only those stringExcludes are retained that preserve the order.
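The ordering test can be sketched in Python as follows. The sketch assumes a simplified model in which every string match is a hypothetical pair (query_pos, token_pos); the helper names in_order and ft_order are illustrative only.

```python
def in_order(a, b):
    """True if two string matches appear in the document in the same
    relative order as their query strings appear in the query."""
    (query_a, pos_a), (query_b, pos_b) = a, b
    return (pos_a <= pos_b and query_a <= query_b) or \
           (pos_a >= pos_b and query_a >= query_b)

def ft_order(all_matches):
    result = []
    for match in all_matches:
        includes = match["include"]
        # Every pair of includes must respect the query order.
        if all(in_order(a, b) for a in includes for b in includes):
            # An exclude survives only if it is ordered w.r.t. every include.
            excludes = [e for e in match["exclude"]
                        if all(in_order(e, i) for i in includes)]
            result.append({"include": includes, "exclude": excludes})
    return result

# Hypothetical positions: query string 1 at token 10, query string 2 at token 12.
ordered = {"include": [(1, 10), (2, 12)], "exclude": []}
reversed_match = {"include": [(1, 20), (2, 15)], "exclude": []}
print(ft_order([ordered, reversed_match]))
```

Only the first Match is kept, because in the second one the token for the second query string precedes the token for the first.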

As an example, consider the FTSelection ("great" && "condition") ordered in the context of the sample document. The source AllMatches is:

The FTOrder will return:

FTScope

The parameters of the ApplyFTScope function are the search context, the list of match options, the type of the scope (same or different), the linguistic unit (sentence or paragraph) and one AllMatches parameter corresponding to the result of the nested FTSelections. The search context and the match options are not used in this case. The function definitions, which depend on the type of the scope (paragraph, sentence) and the scope predicate (same, different), are given below.

In case of same sentence, the semantics is given by:

declare function fts:ApplyFTScopeSameSentence( $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match where every $stringInclude1 in $match/stringInclude, $stringInclude2 in $match/stringInclude satisfies $stringInclude1/tokenInfo/@sentence = $stringInclude2/tokenInfo/@sentence return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where every $stringIncl in $match/stringInclude satisfies $stringIncl/tokenInfo/@sentence = $stringExcl/tokenInfo/@sentence return $stringExcl } </match> } </allMatches> }

Similarly, the semantics for different sentence is given by:

declare function fts:ApplyFTScopeDifferentSentence( $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match where every $stringInclude1 in $match/stringInclude, $stringInclude2 in $match/stringInclude satisfies $stringInclude1 is $stringInclude2 or $stringInclude1/tokenInfo/@sentence != $stringInclude2/tokenInfo/@sentence return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where every $stringIncl in $match/stringInclude satisfies $stringIncl/tokenInfo/@sentence != $stringExcl/tokenInfo/@sentence return $stringExcl } </match> } </allMatches> }

In case of same paragraph, the semantics is given by:

declare function fts:ApplyFTScopeSameParagraph( $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match where every $stringInclude1 in $match/stringInclude, $stringInclude2 in $match/stringInclude satisfies $stringInclude1/tokenInfo/@para = $stringInclude2/tokenInfo/@para return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where every $stringIncl in $match/stringInclude satisfies $stringIncl/tokenInfo/@para = $stringExcl/tokenInfo/@para return $stringExcl } </match> } </allMatches> }

Finally, the semantics for different paragraph is given by:

declare function fts:ApplyFTScopeDifferentParagraph( $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match where every $stringInclude1 in $match/stringInclude, $stringInclude2 in $match/stringInclude satisfies $stringInclude1 is $stringInclude2 or $stringInclude1/tokenInfo/@para != $stringInclude2/tokenInfo/@para return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where every $stringIncl in $match/stringInclude satisfies $stringIncl/tokenInfo/@para != $stringExcl/tokenInfo/@para return $stringExcl } </match> } </allMatches> }

If, for instance, the type of the scope is "sentence", the semantics is straightforward. From the Matches in the AllMatches of the operand, the function retains only those whose stringInclude components all lie in the same sentence (or, for different, in pairwise different sentences). Of the stringExcludes of a retained Match, only those that satisfy the same condition with respect to every stringInclude are retained. The case for scope type paragraph is analogous.
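The four scope functions differ only in the unit and the predicate, which the following Python sketch makes explicit. It assumes a hypothetical model in which a string match is a triple (token_pos, sentence, paragraph); the names UNIT_INDEX and ft_scope are not taken from this specification.

```python
# Index of the unit number inside the assumed (token_pos, sentence, paragraph) triple.
UNIT_INDEX = {"sentence": 1, "paragraph": 2}

def ft_scope(all_matches, kind, unit):
    """kind is "same" or "different"; unit is "sentence" or "paragraph"."""
    k = UNIT_INDEX[unit]
    same = (kind == "same")
    result = []
    for match in all_matches:
        includes = match["include"]
        # Every pair of distinct includes must agree (same) or differ (different).
        pairs_ok = all(
            i == j or (a[k] == b[k]) == same
            for i, a in enumerate(includes)
            for j, b in enumerate(includes)
        )
        if pairs_ok:
            excludes = [e for e in match["exclude"]
                        if all((s[k] == e[k]) == same for s in includes)]
            result.append({"include": includes, "exclude": excludes})
    return result

# Two tokens in different sentences and different paragraphs (assumed positions).
match = {"include": [(5, 1, 1), (40, 3, 2)], "exclude": []}
print(ft_scope([match], "same", "paragraph"))
print(ft_scope([match], "different", "paragraph"))
```

The first call yields an empty list and the second call keeps the Match, since its two tokens lie in different paragraphs.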

The semantics for the general case is given by:

declare function fts:ApplyFTScope( $type as fts:ScopeType, $selector as fts:ScopeSelector, $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { if ($type eq "same" and $selector eq "sentence") then fts:ApplyFTScopeSameSentence($allMatches) else if ($type eq "different" and $selector eq "sentence") then fts:ApplyFTScopeDifferentSentence($allMatches) else if ($type eq "same" and $selector eq "paragraph") then fts:ApplyFTScopeSameParagraph($allMatches) else fts:ApplyFTScopeDifferentParagraph($allMatches) }

As an example, consider the FTSelection ("Mustang" && "Honda") same paragraph in the context of the sample document. The source AllMatches is:

The FTScope will convert this to an empty AllMatches because neither Match contains TokenInfos from a single element.

FTContent

The parameters of the ApplyFTContent function are the search context, the match options, the type of the content match (at the start of the current node, at the end of it, or its entire content), and one AllMatches parameter corresponding to the result of the nested FTSelections. The semantics is given below.

declare function fts:ApplyFTContent( $searchContext as node(), $matchOptions as element(matchOptions, fts:FTMatchOptions), $type as fts:ContentMatchType, $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { if ($type eq "entire content") then let $temp1 := fts:ApplyFTWordDistanceExactly( $matchOptions, $allMatches, 1) let $temp2 := fts:ApplyFTContent( $searchContext, $matchOptions, "at start", $temp1) let $temp3 := fts:ApplyFTContent( $searchContext, $matchOptions, "at end", $temp2) return $temp3 else <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match where if ($type eq "at start") then some $si in $match/stringInclude satisfies fts:isStartToken($searchContext, $si/tokenInfo) else (: $type eq "at end" :) some $si in $match/stringInclude satisfies fts:isEndToken($searchContext, $si/tokenInfo) return $match } </allMatches> }

The above function considers three cases depending on the type of the content match. The case of entire content is evaluated as distance exactly 1 word at start at end, i.e., all the StringIncludes should match every token in the content of the current search context node. The case at start retains only those Matches that contain a StringInclude that matches the first token. This is checked using the semantic function fts:isStartToken. Similarly, the case at end retains only those Matches that contain a StringInclude that matches the last token. This is checked using the semantic function fts:isEndToken.
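The decomposition of entire content into three simpler filters can be sketched in Python. The sketch makes several simplifying assumptions: a Match is a dictionary with an "include" list of token positions, the search context is described only by the positions of its first and last token, the word distance between two tokens is taken to be the difference of their positions, and stringExcludes are ignored. All function names are hypothetical.

```python
def at_start(all_matches, first_pos):
    """Keep Matches in which some include matches the first token."""
    return [m for m in all_matches if first_pos in m["include"]]

def at_end(all_matches, last_pos):
    """Keep Matches in which some include matches the last token."""
    return [m for m in all_matches if last_pos in m["include"]]

def adjacent(all_matches):
    """Keep Matches whose sorted includes are each exactly one word apart."""
    kept = []
    for m in all_matches:
        ordered = sorted(m["include"])
        if all(b - a == 1 for a, b in zip(ordered, ordered[1:])):
            kept.append(m)
    return kept

def entire_content(all_matches, first_pos, last_pos):
    # distance exactly 1 word, then at start, then at end
    return at_end(at_start(adjacent(all_matches), first_pos), last_pos)

# A context node with tokens at positions 1 to 3 (assumed).
candidates = [{"include": [1, 2, 3]}, {"include": [1, 3]}, {"include": [2, 3]}]
print(entire_content(candidates, 1, 3))
```

Only the Match covering positions 1, 2, and 3 survives all three filters; the second candidate has a gap and the third does not start at the first token.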

FTDistance

The parameters of the ApplyFTDistance function are the search context, the list of match options, one AllMatches parameter corresponding to the result of the nested FTSelections, the unit of the distance (words, sentences, paragraphs) and the range specification used. The search context is not used in this case. The semantics for the different cases depending on the distance units and the range specification are given below.

The function for the case word distance exactly N is presented below:

declare function fts:ApplyFTWordDistanceExactly( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $idx in (1 to fn:count($sorted) - 1) satisfies fts:wordDistance( $sorted[$idx]/tokenInfo, $sorted[$idx+1]/tokenInfo, $matchOptions) = $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:wordDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) = $n return $stringExcl } </match> } </allMatches> }

Similarly, the semantics for the case of word distance at least N is presented below:

declare function fts:ApplyFTWordDistanceAtLeast( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:wordDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) >= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:wordDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) >= $n return $stringExcl } </match> } </allMatches> }

The semantics for the case of word distance at most N is given by:

declare function fts:ApplyFTWordDistanceAtMost( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:wordDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) <= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:wordDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) <= $n return $stringExcl } </match> } </allMatches> }

The semantics for the final case of word distance from M to N is given by:

declare function fts:ApplyFTWordDistanceFromTo( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $m as xs:integer, $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:wordDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) >= $m and fts:wordDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) <= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:wordDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) >= $m and fts:wordDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) <= $n return $stringExcl } </match> } </allMatches> }

The function for the case sentence distance exactly N is presented below:

declare function fts:ApplyFTSentenceDistanceExactly( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:sentenceDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) = $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:sentenceDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) = $n return $stringExcl } </match> } </allMatches> }

Similarly, the semantics for the case of sentence distance at least N is presented below:

declare function fts:ApplyFTSentenceDistanceAtLeast( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:sentenceDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) >= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:sentenceDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) >= $n return $stringExcl } </match> } </allMatches> }

The semantics for the case of sentence distance at most N is given by:

declare function fts:ApplyFTSentenceDistanceAtMost( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:sentenceDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) <= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:sentenceDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) <= $n return $stringExcl } </match> } </allMatches> }

The semantics for the final case of sentence distance from M to N is given by:

declare function fts:ApplyFTSentenceDistanceFromTo( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $m as xs:integer, $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:sentenceDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) >= $m and fts:sentenceDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) <= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:sentenceDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) >= $m and fts:sentenceDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) <= $n return $stringExcl } </match> } </allMatches> }

The function for the case paragraph distance exactly N is presented below:

declare function fts:ApplyFTParagraphDistanceExactly( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:paraDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) = $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:paraDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) = $n return $stringExcl } </match> } </allMatches> }

Similarly, the semantics for the case of paragraph distance at least N is presented below:

declare function fts:ApplyFTParagraphDistanceAtLeast( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:paraDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) >= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:paraDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) >= $n return $stringExcl } </match> } </allMatches> }

The semantics for the case of paragraph distance at most N is given by:

declare function fts:ApplyFTParagraphDistanceAtMost( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:paraDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) <= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:paraDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) <= $n return $stringExcl } </match> } </allMatches> }

The semantics for the final case of paragraph distance from M to N is given by:

declare function fts:ApplyFTParagraphDistanceFromTo( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $m as xs:integer, $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> { for $match in $allMatches/match let $sorted := for $si in $match/stringInclude order by $si/tokenInfo/@pos ascending return $si where every $index in (1 to fn:count($sorted) - 1) satisfies fts:paraDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) >= $m and fts:paraDistance( $sorted[$index]/tokenInfo, $sorted[$index+1]/tokenInfo, $matchOptions) <= $n return <match> {$match/stringInclude} { for $stringExcl in $match/stringExclude where some $stringIncl in $match/stringInclude satisfies fts:paraDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) >= $m and fts:paraDistance( $stringIncl/tokenInfo, $stringExcl/tokenInfo, $matchOptions) <= $n return $stringExcl } </match> } </allMatches> }

Intuitively, the resulting AllMatches contains those Matches of the operand that satisfy the condition that the distance (measured in words, sentences, or paragraphs) for every pair of consecutive valid positions in stringInclude elements is in the specified interval. Here, by consecutive we mean with no other valid positions from the same stringInclude element between them.
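All twelve distance functions share one shape and differ only in the distance measure and the interval. The Python sketch below shows that shape for word distance. It assumes that a Match is a dictionary with "include" and "exclude" lists of token positions and that the word distance between two tokens is the absolute difference of their positions; the function name ft_word_distance is hypothetical.

```python
INF = float("inf")

def ft_word_distance(all_matches, low, high):
    """Keep Matches whose consecutive includes are between low and high
    words apart (both bounds inclusive)."""
    result = []
    for match in all_matches:
        ordered = sorted(match["include"])
        if all(low <= b - a <= high for a, b in zip(ordered, ordered[1:])):
            # Keep an exclude if it is in range of at least one include.
            excludes = [e for e in match["exclude"]
                        if any(low <= abs(e - i) <= high for i in ordered)]
            result.append({"include": match["include"], "exclude": excludes})
    return result

# exactly n   -> ft_word_distance(am, n, n)
# at least n  -> ft_word_distance(am, n, INF)
# at most n   -> ft_word_distance(am, 0, n)
# from m to n -> ft_word_distance(am, m, n)

near = {"include": [1, 2, 5], "exclude": []}   # distances 1 and 3 (assumed positions)
far = {"include": [1, 2, 25], "exclude": []}   # distances 1 and 23
print(ft_word_distance([near, far], 0, 3))
```

With the interval "at most 3", only the first Match is kept, because the second has a gap of 23 words between two consecutive positions.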

In the general case, the semantics is given by:

declare function fts:ApplyFTDistance( $matchOptions as element(matchOptions, fts:FTMatchOptions), $type as fts:DistanceType, $range as element(range, fts:FTRangeSpec), $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { if ($type eq "word") then if ($range/@type eq "exactly") then fts:ApplyFTWordDistanceExactly($matchOptions, $allMatches, $range/@n) else if ($range/@type eq "at least") then fts:ApplyFTWordDistanceAtLeast($matchOptions, $allMatches, $range/@n) else if ($range/@type eq "at most") then fts:ApplyFTWordDistanceAtMost($matchOptions, $allMatches, $range/@n) else fts:ApplyFTWordDistanceFromTo($matchOptions, $allMatches, $range/@m, $range/@n) else if ($type eq "sentence") then if ($range/@type eq "exactly") then fts:ApplyFTSentenceDistanceExactly($matchOptions, $allMatches, $range/@n) else if ($range/@type eq "at least") then fts:ApplyFTSentenceDistanceAtLeast($matchOptions, $allMatches, $range/@n) else if ($range/@type eq "at most") then fts:ApplyFTSentenceDistanceAtMost($matchOptions, $allMatches, $range/@n) else fts:ApplyFTSentenceDistanceFromTo($matchOptions, $allMatches, $range/@m, $range/@n) else if ($range/@type eq "exactly") then fts:ApplyFTParagraphDistanceExactly($matchOptions, $allMatches, $range/@n) else if ($range/@type eq "at least") then fts:ApplyFTParagraphDistanceAtLeast($matchOptions, $allMatches, $range/@n) else if ($range/@type eq "at most") then fts:ApplyFTParagraphDistanceAtMost($matchOptions, $allMatches, $range/@n) else fts:ApplyFTParagraphDistanceFromTo($matchOptions, $allMatches, $range/@m, $range/@n) }

As an example, consider the FTDistance selection ("Ford Mustang" && "excellent") word distance at most 3 over the sample document. The six Matches of the source AllMatches for ("Ford Mustang" && "excellent") are given below:

The result for the above FTDistance selection will consist of only the first Match because only its distances between consecutive TokenInfos (distance 1 and distance 3 in this case) are less than or equal to 3.

FTWindow

The parameters of the ApplyFTWindow function are the search context, the list of match options, the unit of type fts:DistanceType, a size, and one AllMatches parameter corresponding to the result of the nested FTSelections. The search context is not used in this case. For each of the different unit types we define an individual function as follows.

The function for the case window N words is presented below:

declare function fts:ApplyFTWordWindow( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches> { for $match in $allMatches/match let $minpos := fn:min($match/stringInclude/tokenInfo/@pos), $maxpos := fn:max($match/stringInclude/tokenInfo/@pos) for $windowStartPos in ($maxpos - $n + 1 to $minpos) let $windowEndPos := $windowStartPos + $n - 1 where fn:min($match/stringInclude/tokenInfo/@pos) >= $windowStartPos and fn:max($match/stringInclude/tokenInfo/@pos) <= $windowEndPos return <match> {$match/stringInclude} { for $stringExclude in $match/stringExclude where $stringExclude/tokenInfo/@pos >= $windowStartPos and $stringExclude/tokenInfo/@pos <= $windowEndPos return $stringExclude } </match> } </allMatches> }

The function for the case window N sentences is presented below:

declare function fts:ApplyFTSentenceWindow( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches> { for $match in $allMatches/match let $minpos := fn:min($match/stringInclude/tokenInfo/@sentence), $maxpos := fn:max($match/stringInclude/tokenInfo/@sentence) for $windowStartPos in ($maxpos - $n + 1 to $minpos) let $windowEndPos := $windowStartPos + $n - 1 where fn:min($match/stringInclude/tokenInfo/@sentence) >= $windowStartPos and fn:max($match/stringInclude/tokenInfo/@sentence) <= $windowEndPos return <match> {$match/stringInclude} { for $stringExclude in $match/stringExclude where $stringExclude/tokenInfo/@sentence >= $windowStartPos and $stringExclude/tokenInfo/@sentence <= $windowEndPos return $stringExclude } </match> } </allMatches> }

The function for the case window N paragraphs is presented below:

declare function fts:ApplyFTParagraphWindow( $matchOptions as element(matchOptions, fts:FTMatchOptions), $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches> { for $match in $allMatches/match let $minpos := fn:min($match/stringInclude/tokenInfo/@para), $maxpos := fn:max($match/stringInclude/tokenInfo/@para) for $windowStartPos in ($maxpos - $n + 1 to $minpos) let $windowEndPos := $windowStartPos + $n - 1 where fn:min($match/stringInclude/tokenInfo/@para) >= $windowStartPos and fn:max($match/stringInclude/tokenInfo/@para) <= $windowEndPos return <match> {$match/stringInclude} { for $stringExclude in $match/stringExclude where $stringExclude/tokenInfo/@para >= $windowStartPos and $stringExclude/tokenInfo/@para <= $windowEndPos return $stringExclude } </match> } </allMatches> }

Intuitively, the resulting AllMatches contains those Matches of the operand for which there exists a sequence of the specified number of consecutive (word, sentence, or paragraph) positions such that all StringIncludes are within that window. Only those StringExcludes that also lie within that window are retained.
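The window condition can be sketched in Python for the word case. The sketch assumes that a Match is a dictionary with "include" and "exclude" lists of token positions, and it enumerates every window of n consecutive positions that covers all included positions, emitting one result Match per such window. The function name ft_word_window is hypothetical.

```python
def ft_word_window(all_matches, n):
    """Emit one Match per window of n consecutive positions that contains
    every included position; keep only the excludes inside that window."""
    result = []
    for match in all_matches:
        if not match["include"]:
            continue  # assumption: a Match without includes is skipped here
        lowest, highest = min(match["include"]), max(match["include"])
        # A covering window starts no later than the lowest include and no
        # earlier than highest - n + 1; the range is empty if the span exceeds n.
        for start in range(highest - n + 1, lowest + 1):
            end = start + n - 1
            excludes = [e for e in match["exclude"] if start <= e <= end]
            result.append({"include": match["include"], "exclude": excludes})
    return result

tight = {"include": [1, 2, 5], "exclude": [3, 8]}   # spans 5 positions (assumed)
wide = {"include": [1, 2, 25], "exclude": []}       # spans 25 positions
print(ft_word_window([tight, wide], 5))
```

For a window of 5 words, exactly one window (positions 1 to 5) covers the first Match, so it is kept with only the exclude at position 3; the second Match spans 25 positions and is dropped.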

In the general case, the semantics is given by:

declare function fts:ApplyFTWindow( $matchOptions as element(matchOptions, fts:FTMatchOptions), $type as fts:DistanceType, $size as xs:integer, $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { if ($type eq "word") then fts:ApplyFTWordWindow($matchOptions, $allMatches, $size) else if ($type eq "sentence") then fts:ApplyFTSentenceWindow($matchOptions, $allMatches, $size) else fts:ApplyFTParagraphWindow($matchOptions, $allMatches, $size) }

As an example, consider the FTWindow selection ("Ford Mustang" && "excellent") window 10 words over the sample document. The six Matches of the source AllMatches for ("Ford Mustang" && "excellent") are given below:

The result for the above FTWindow selection will consist of only the first, the fifth, and the sixth Matches because their respective window sizes are 5, 4, and 9.

FTTimes

The parameters of the ApplyFTTimes function are the search context, the list of match options, a range specification, and one AllMatches parameter corresponding to the result of the nested FTSelection. The search context and the match options stack are not used in this case.

The function definitions, depending on the range specification FTRange limiting the number of occurrences, follow.

declare function fts:FormCombinations($sms, $times) { if (fn:count($sms) lt $times) then () else if (fn:count($sms) eq $times) then <match> {$sms/*} </match> else ( fts:FormCombinations(fn:subsequence($sms, 2), $times), <match> {$sms[1]/*} {fts:FormCombinations(fn:subsequence($sms, 2), $times - 1)/*} </match> ) } declare function fts:FormRange($sms, $l, $u, $stokenNum) { let $lower_match := <allMatches stokenNum="{$stokenNum}"> {fts:FormCombinations($sms, $l) } </allMatches> return if ($l > $u) then () else fts:ApplyFTAnd($lower_match, fts:ApplyFTUnaryNot( <allMatches stokenNum="{$stokenNum}"> {fts:FormCombinations($sms, $u + 1)} </allMatches>) ) }

We now define the semantics for the case exactly N occurrences:

declare function fts:ApplyFTTimesExactly( $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { fts:FormRange($allMatches/match, $n, $n, $allMatches/@stokenNum) }

We next define the semantics for the case at least N occurrences:

declare function fts:ApplyFTTimesAtLeast( $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { <allMatches stokenNum="{$allMatches/@stokenNum}"> {fts:FormCombinations($allMatches/match, $n)} </allMatches> }

We next define the semantics for the case at most N occurrences:

declare function fts:ApplyFTTimesAtMost( $allMatches as element(allMatches, fts:AllMatches), $n as xs:integer ) as element(allMatches, fts:AllMatches) { fts:FormRange($allMatches/match, 0, $n, $allMatches/@stokenNum) }

Finally, we define the semantics for the case from M to N occurrences:

declare function fts:ApplyFTTimesFromTo( $allMatches as element(allMatches, fts:AllMatches), $m as xs:integer, $n as xs:integer ) as element(allMatches, fts:AllMatches) { fts:FormRange($allMatches/match, $m, $n, $allMatches/@stokenNum) }

The intuition is as follows. The way to ensure that there are at least N different matches of an FTSelection is to ensure that at least N of its Matches occur simultaneously. This is similar to forming their conjunction: combine N distinct Matches into one simple match. Therefore, the full match for the selection condition involving the range specifier at least N is to form all possible combinations of N simple matches of the operand and form one simple match for each combination negating the rest of the simple matches. This operation is performed in the function fts:FormCombinations.

In the case of the range [l, u], it is treated as the condition at least l and not at least u + 1. This transformation is performed in the function fts:FormRange.
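The combination step and the range rewriting can be sketched in Python. This is an illustration under assumptions, not a transcription of the functions above: a string match is a hypothetical tuple (kind, query_pos, token_pos), a Match is a tuple of string matches, an AllMatches is a list of Matches, and the sketch combines Matches without negating the remaining ones.

```python
from itertools import combinations, product

def invert(string_match):
    kind, query_pos, token_pos = string_match
    return ("exclude" if kind == "include" else "include", query_pos, token_pos)

def unary_not(all_matches):
    # Negation of a DNF: invert one string match picked from every Match.
    return [tuple(invert(sm) for sm in pick) for pick in product(*all_matches)]

def ft_and(left, right):
    # Conjunction of two DNFs: pairwise union of their Matches.
    return [a + b for a in left for b in right]

def at_least(matches, n):
    """One merged Match for every way of choosing n distinct Matches."""
    return [tuple(sm for m in combo for sm in m)
            for combo in combinations(matches, n)]

def from_to(matches, lower, upper):
    """Range [lower, upper] as: at least lower AND NOT at least upper + 1."""
    if lower > upper:
        return []
    return ft_and(at_least(matches, lower),
                  unary_not(at_least(matches, upper + 1)))

# Three occurrences of one query string at hypothetical positions.
a, b, c = ("include", 1, 5), ("include", 1, 20), ("include", 1, 30)
print(at_least([(a,), (b,), (c,)], 2))
```

For three source Matches and "at least 2 occurrences", the sketch yields the three possible pairs, in line with the example for "Mustang" at least 2 occurrences.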

The semantics in the general case is given by:

declare function fts:ApplyFTTimes( $range as element(range, fts:FTRangeSpec), $allMatches as element(allMatches, fts:AllMatches) ) as element(allMatches, fts:AllMatches) { if ($range/@type eq "exactly") then fts:ApplyFTTimesExactly($allMatches, $range/@n) else if ($range/@type eq "at least") then fts:ApplyFTTimesAtLeast($allMatches, $range/@n) else if ($range/@type eq "at most") then fts:ApplyFTTimesAtMost($allMatches, $range/@n) else fts:ApplyFTTimesFromTo($allMatches, $range/@m, $range/@n) }

As an example, consider the FTTimes selection "Mustang" at least 2 occurrences over the sample document. The source AllMatches of the FTWords selection "Mustang" is:

The result will consist of all pairs of Matches from above:

Match Options Semantics

Types

We take an approach similar to the one used for defining the semantics of FTSelections. We will use XQuery functions to define the semantics of FTMatchOptions. These functions operate on an XML representation of the FTMatchOptions. The representation closely follows the syntax. As in the case of the XML representation of FTSelections, the XML representation of FTMatchOptions is essentially an AST. Each FTMatchOption is represented by an XML element. Additional characteristics of the option are generally represented as attributes. The schema is given below.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">

  <xs:complexType name="FTMatchOptions">
    <xs:sequence>
      <xs:element name="matchOption" type="fts:FTMatchOption"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="FTMatchOption">
    <xs:choice>
      <xs:element name="case" type="fts:FTCaseOption"/>
      <xs:element name="diacritics" type="fts:FTDiacriticsOption"/>
      <xs:element name="thesaurus" type="fts:FTThesaurusOption"/>
      <xs:element name="stem" type="fts:FTStemOption"/>
      <xs:element name="wildcard" type="fts:FTWildCardOption"/>
      <xs:element name="language" type="fts:FTLanguageOption"/>
      <xs:element name="stopWord" type="fts:FTStopwordOption"/>
    </xs:choice>
  </xs:complexType>

  <xs:complexType name="FTCaseOption">
    <xs:attribute name="caseIndicator">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="insensitive"/>
          <xs:enumeration value="sensitive"/>
          <xs:enumeration value="lowercase"/>
          <xs:enumeration value="uppercase"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="caseLanguage" type="xs:string"/>
  </xs:complexType>

  <xs:complexType name="FTDiacriticsOption">
    <xs:attribute name="diacriticsIndicator">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="insensitive"/>
          <xs:enumeration value="sensitive"/>
          <xs:enumeration value="with"/>
          <xs:enumeration value="without"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="language" type="xs:string"/>
  </xs:complexType>

  <xs:complexType name="FTThesaurusOption">
    <xs:sequence>
      <xs:element name="thesaurusName" type="xs:string" minOccurs="0" maxOccurs="1"/>
      <xs:element name="relationship" type="xs:string" minOccurs="0" maxOccurs="1"/>
      <xs:element name="range" type="fts:FTRangeSpec" minOccurs="0" maxOccurs="1"/>
    </xs:sequence>
    <xs:attribute name="thesaurusIndicator">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="with"/>
          <xs:enumeration value="without"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="language" type="xs:string"/>
  </xs:complexType>

  <xs:complexType name="FTStemOption">
    <xs:attribute name="stemIndicator">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="with"/>
          <xs:enumeration value="without"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="language" type="xs:string"/>
  </xs:complexType>

  <xs:complexType name="FTWildCardOption">
    <xs:attribute name="wildcardIndicator">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="with"/>
          <xs:enumeration value="without"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="language" type="xs:string"/>
  </xs:complexType>

  <xs:complexType name="FTLanguageOption">
    <xs:attribute name="languageName" type="xs:string"/>
  </xs:complexType>

  <xs:complexType name="FTStopwordOption">
    <xs:sequence>
      <xs:choice>
        <xs:element name="default-stopwords">
          <xs:complexType/>
        </xs:element>
        <xs:element name="stop-word" type="xs:string"/>
        <xs:element name="uri" type="xs:anyURI"/>
      </xs:choice>
      <xs:element name="oper" minOccurs="0" maxOccurs="unbounded">
        <xs:complexType>
          <xs:choice>
            <xs:element name="stop-word" type="xs:string"/>
            <xs:element name="uri" type="xs:anyURI"/>
          </xs:choice>
          <xs:attribute name="type">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:enumeration value="union"/>
                <xs:enumeration value="except"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>

</xs:schema>

In addition, we need the explicit representation of the concept of a phrase. We need this representation to support thesauri lookups. Each lookup produces a sequence of such phrases. Each phrase is one possible alternative for the search string.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">

  <xs:complexType name="TokenPhrase">
    <xs:sequence>
      <xs:element name="token" type="xs:string" minOccurs="1" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

</xs:schema>

High-Level Semantics

The section on the semantics of FTSelections covered the case in which no FTMatchOptions are present in the language. Next, we describe how this model is extended to support FTMatchOptions.

The extension is achieved by (a) modifying the existing semantic functions of FTSelections and (b) adding semantic functions that are specific to the FTMatchOptions.

With regard to point (a), the semantics of most of the FTSelections remains unchanged. The modifications are to the method for matching search tokens. The changes in point (b) are more significant and will be discussed later.

We start with the changes pertaining to (a). These changes are in the semantics of FTWords, because it is the FTSelection most influenced by the FTMatchOptions. Under the extended semantics, the search tokens are modified (search token expansion) depending on the applied FTMatchOptions. For example, in the presence of an FTThesaurusOption, search tokens may be replaced with related tokens based on a thesaurus lookup.

declare function fts:applySingleSearchToken(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchToken as fts:TokenInfo,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  let $withDiacriticsOption :=
    $matchOptions[(fn:local-name(.) eq "diacritics") and (./@type eq "with")][1]
  return
    if ($withDiacriticsOption) then
      (: "with diacritics" is evaluated as: insensitive mildnot without :)
      let $newOption1 := <diacritics type="insensitive" />
      let $newOption2 := <diacritics type="without" />
      let $lhs := fts:applySingleSearchToken(
                    $searchContext, ($newOption1, $matchOptions), $searchToken, $queryPos)
      let $rhs := fts:applySingleSearchToken(
                    $searchContext, ($newOption2, $matchOptions), $searchToken, $queryPos)
      return fts:ApplyMildNot($lhs, $rhs)
    else
      let $thesaurusOption :=
        $matchOptions[(fn:local-name(.) eq "thesaurus") and (./@type eq "with")][1]
      return
        if ($thesaurusOption) then
          (: Match the thesaurus alternatives with the thesaurus turned off. :)
          let $noThesaurusOptions := (<thesaurus thesaurusIndicator="without" />, $matchOptions)
          let $lookupRes := fts:applyThesaurusOption($thesaurusOption, $searchToken)
          return fts:ApplyPhraseAlternatives(
                   $searchContext, $noThesaurusOptions, $lookupRes, $queryPos)
        else
          <allMatches stokenNum="{$queryPos}">
          {
            let $searchTokens :=
              if ($matchOptions//wildcard)
              then fts:applyWildCardOption($searchContext, $matchOptions, $searchToken)
              else $searchToken
            let $effectiveOptions := $matchOptions except $matchOptions[self::wildcard]
            let $token_pos := fts:matchStr($searchContext, $effectiveOptions, $searchTokens)
            for $pos in $token_pos
            return
              <match>
                <stringInclude queryPos="{$queryPos}" queryString="{$searchToken/@word}">
                  <tokenInfo>{$pos}</tokenInfo>
                </stringInclude>
              </match>
          }
          </allMatches>
};

declare function fts:ApplyPhraseAlternatives(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchPhrases as fts:TokenPhrase*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  if (fn:count($searchPhrases) eq 0) then
    <allMatches stokenNum="0" />
  else
    let $firstSearchPhrase := $searchPhrases[1]
    let $restSearchPhrases := fn:subsequence($searchPhrases, 2)
    let $firstAllMatches := fts:ApplyFTWordsPhrase(
                              $searchContext, $matchOptions, $firstSearchPhrase/word, $queryPos)
    let $newQueryPos := fn:max($firstAllMatches//@queryPos) + 1
    let $restAllMatches := fts:ApplyPhraseAlternatives(
                             $searchContext, $matchOptions, $restSearchPhrases, $newQueryPos)
    return fts:ApplyFTOr($firstAllMatches, $restAllMatches)
};

The semantics of the single-token search differs in several major ways from the one described in the previous section. All differences are related to processing match options. Three FTMatchOptions are processed differently from the rest. The FTDiacriticsOption with type "with" is processed as though the query were $searchToken/@word diacritics insensitive mildnot $searchToken/@word without diacritics. Intuitively, the desired Matches are those that contain any version of the search token except the one without any diacritics. The reason for this seemingly unnatural semantics is to avoid having to guess which letters should be replaced with their diacritics equivalents.
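The rewriting of "with diacritics" can be illustrated outside XQuery. The following non-normative Python sketch assumes that tokens are plain strings, that a Match is simply a token position, and that for single tokens the mild not reduces to a set difference of positions; the helper names are hypothetical and not part of this specification.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_insensitive(tokens, search):
    """Positions whose token equals the search token once diacritics are ignored."""
    target = strip_diacritics(search)
    return {i for i, t in enumerate(tokens) if strip_diacritics(t) == target}


def match_without(tokens, search):
    """Positions whose token is exactly the diacritics-free form of the search token."""
    target = strip_diacritics(search)
    return {i for i, t in enumerate(tokens) if t == target}


def match_with_diacritics(tokens, search):
    """'with diacritics' modeled as: diacritics insensitive mildnot without diacritics."""
    return match_insensitive(tokens, search) - match_without(tokens, search)


tokens = ["resume", "résumé", "resumé", "report"]
positions = match_with_diacritics(tokens, "resume")
```

Under these assumptions the plain form "resume" is excluded, while every accented variant is kept, without the processor having to guess where the accents belong.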

The second FTMatchOption that is processed differently is FTThesaurusOption, because its semantics cannot be represented simply in terms of search token expansion. Since the result of a thesaurus lookup can be a sequence of alternatives, higher-level processing is needed. Intuitively, all returned alternatives are connected in a disjunction using fts:ApplyPhraseAlternatives. That function is almost identical to fts:ApplyFTWordsAny but takes into consideration the specific representation of the search tokens in $searchPhrases. Note that the matching of the alternatives is performed with FTThesaurusOption turned off. This avoids double expansions, i.e. expansions of an already expanded token.
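As a non-normative illustration of this disjunction, the Python sketch below assumes a toy in-memory thesaurus (the THESAURUS dictionary and all function names are hypothetical) and represents a match by the start position of a phrase. Each alternative is matched literally, so an already expanded phrase is never expanded again.

```python
def match_phrase(tokens, phrase):
    """Start positions at which the phrase occurs as consecutive tokens."""
    n = len(phrase)
    return {i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase}


def apply_phrase_alternatives(tokens, alternatives):
    """Disjunction: union of the matches of every alternative, with no re-expansion."""
    matches = set()
    for phrase in alternatives:
        matches |= match_phrase(tokens, phrase)
    return matches


# Hypothetical thesaurus: maps a phrase (tuple of tokens) to its alternative phrases.
THESAURUS = {("car",): [["car"], ["automobile"], ["motor", "vehicle"]]}


def search_with_thesaurus(tokens, phrase):
    """Expand the phrase once, then match the alternatives with the thesaurus off."""
    alternatives = THESAURUS.get(tuple(phrase), [phrase])
    return apply_phrase_alternatives(tokens, alternatives)


doc = "the motor vehicle passed an automobile and a car".split()
hits = search_with_thesaurus(doc, ["car"])
```

Here the single search token "car" matches at the positions of "motor vehicle", "automobile", and "car".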

The last FTMatchOption that is processed differently is FTWildCardOption. It commutes with all other options and therefore it is possible to ignore its position within the FTMatchOptions stack.

All other FTMatchOptions are processed in the fts:matchStr function.

declare function fts:matchStr(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchToken as fts:TokenInfo)
  as element(tokenInfo, fts:TokenInfo)*
{
  (: Options that do not rewrite the search tokens are passed on unchanged. :)
  let $nonexpOptions := $matchOptions[self::language or self::ignore]
  let $expOptions := $matchOptions except $nonexpOptions
  (: Rewrite the search token using the expanding options. :)
  let $searchTokens := fts:applyMatchOptions($expOptions, $searchToken)
  return getTokenInfo($searchContext, $nonexpOptions, $searchTokens)
};

Intuitively, the above function rewrites the search tokens based on the applied FTMatchOptions and obtains the resulting sequence of TokenInfos that match the search tokens. FTThesaurusOption is treated differently from the others. The reason is that we want to avoid repeated thesaurus expansion if it has already been applied to the search tokens. In this case, all FTThesaurusOptions are removed.

Each of the other FTMatchOptions transforms the search tokens by means of the fts:applyMatchOption function. Its structure is very similar to that of the fts:evaluate function: it inspects the supplied FTMatchOptions and applies them using per-FTMatchOption functions, much like the per-FTSelection functions. These will be discussed later.

One last change to search token matching concerns matching phrases. The change allows multi-token thesaurus lookups, that is, it allows entire search phrases to be modified using a thesaurus.

declare function fts:ApplyFTWordsPhrase(
    $searchContext as node(),
    $matchOptions as fts:FTMatchOptions,
    $searchStrings as xs:string*,
    $queryPos as xs:integer)
  as element(allMatches, fts:AllMatches)
{
  let $thesaurusOption := $matchOptions[fn:local-name(.) eq "thesaurus"][1]
  return
    if ($thesaurusOption and $thesaurusOption/@type eq "with") then
      (: Look up the whole phrase, then match the alternatives without a thesaurus. :)
      let $noThesaurusOptions := $matchOptions[fn:local-name(.) ne "thesaurus"]
      let $lookupRes := fts:applyThesaurusOption($thesaurusOption, $searchStrings)
      return fts:ApplyPhraseAlternatives(
               $searchContext, $noThesaurusOptions, $lookupRes, $queryPos)
    else
      (: A phrase is an ordered conjunction of its words at distance exactly 0. :)
      let $conj := fts:ApplyFTWordsAllWord(
                     $searchContext, $matchOptions, $searchStrings, $queryPos)
      let $ordered := fts:ApplyFTOrder($conj)
      let $distance1 := fts:ApplyFTDistance(
                          $matchOptions, $ordered, <fts:range type="exactly" n="0"/>)
      return $distance1
};

The difference in the case of the fts:ApplyFTWordsPhrase function is that an explicit check for the presence of an FTThesaurusOption is performed. This allows phrase lookups to be done in the thesaurus. If an FTThesaurusOption is present, it is processed as in fts:applySingleSearchToken.

Now, we move to the second type of modifications to the semantics, namely the functions that implement the semantics of the FTMatchOptions.

declare function fts:applyMatchOptions(
    $matchOptions as fts:FTMatchOption*,
    $searchTokens as fts:TokenInfo*)
  as element(tokenInfo, fts:TokenInfo)*
{
  if ($matchOptions) then
    let $firstOption := $matchOptions[1]
    let $firstOptionType := fn:local-name($firstOption)
    (: Drop every remaining option of the type that has just been applied. :)
    let $restOptions := $matchOptions[fn:local-name(.) ne $firstOptionType]
    let $applyFirst := fts:applyMatchOption($firstOption, $searchTokens)
    return fts:applyMatchOptions($restOptions, $applyFirst)
  else
    $searchTokens
};

declare function fts:applyMatchOption(
    $matchOption as fts:FTMatchOption,
    $searchTokens as fts:TokenInfo*)
  as element(tokenInfo, fts:TokenInfo)*
{
  if (fn:local-name($matchOption) eq "stopWord") then
    fts:applyStopwordOption($matchOption, $searchTokens)
  else if (fn:local-name($matchOption) eq "case") then
    fts:applyCaseOption($matchOption, $searchTokens)
  else if (fn:local-name($matchOption) eq "diacritics") then
    fts:applyDiacriticsOption($matchOption, $searchTokens)
  else if (fn:local-name($matchOption) eq "stem") then
    fts:applyStemOption($matchOption, $searchTokens)
  else
    (: No token rewriting is defined here for any other option. :)
    $searchTokens
};

Intuitively, the fts:applyMatchOptions function expands the search tokens by the consecutive application of each specified FTMatchOption. Once an FTMatchOption of a particular type has been applied, all other options of the same type are ignored, since the former overrides them.

The application of each FTMatchOption is performed by the dispatcher function fts:applyMatchOption, which invokes the respective function implementing the semantics of the option.
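The override rule can be sketched in a few lines of Python. This is a non-normative illustration that assumes an option is a (kind, value) pair and a search token is a plain string; the dispatcher covers only two hypothetical option kinds.

```python
def apply_match_option(option, tokens):
    """Dispatcher: rewrite the tokens according to a single (kind, value) option."""
    kind, value = option
    if kind == "case" and value == "lowercase":
        return [t.lower() for t in tokens]
    if kind == "case" and value == "uppercase":
        return [t.upper() for t in tokens]
    if kind == "stopWord":
        return [t for t in tokens if t not in value]
    return tokens  # any other option leaves the tokens unchanged in this sketch


def apply_match_options(options, tokens):
    """Apply the first option of each kind; later options of that kind are dropped."""
    if not options:
        return tokens
    first = options[0]
    rest = [o for o in options if o[0] != first[0]]
    return apply_match_options(rest, apply_match_option(first, tokens))


options = [("case", "lowercase"), ("stopWord", {"the"}), ("case", "uppercase")]
result = apply_match_options(options, ["The", "Mustang"])
```

The trailing "uppercase" option has no effect, because the earlier case option overrides it.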

The semantic functions for each option are described in subsequent sections.

Formal Semantics Functions

Before describing the functions implementing the semantics of each FTMatchOption, we will present a list of formal semantics functions that must be provided by each XQuery 1.0 and XPath 2.0 Full-Text implementation.

function fts:lowerCase($token as fts:TokenInfo, $caseLanguage as xs:string) as fts:TokenInfo

function fts:upperCase($token as fts:TokenInfo, $caseLanguage as xs:string) as fts:TokenInfo

function fts:insensitiveCase($token as fts:TokenInfo, $caseLanguage as xs:string) as fts:TokenInfo

Intuitively, the three functions above convert the token in a TokenInfo object to lower-case, upper-case, or case-insensitive form.

function fts:removeDiacritics($token as fts:TokenInfo, $diacriticsLanguage as xs:string) as fts:TokenInfo

function fts:insensitiveDiacritics($token as fts:TokenInfo, $diacriticsLanguage as xs:string) as fts:TokenInfo

Intuitively, the two functions above convert the token in a TokenInfo object to a form without diacritics or to a diacritics-insensitive form.

function fts:lookupThesaurus(
    $tokens as fts:TokenInfo*,
    $thesaurusName as xs:string,
    $thesaurusLanguage as xs:string,
    $relationship as xs:string,
    $range as fts:FTRangeSpec?)
  as element(tokenPhrase, fts:TokenPhrase)*

The above function finds all words related to $tokens in the thesaurus $thesaurusName for the language $thesaurusLanguage using the relationship $relationship within the optional number of levels $range. If $tokens consists of more than one TokenInfo, it is regarded as a phrase.

The function returns a sequence of expansion alternatives. Each alternative is regarded as a new search phrase and is represented as a tokenized phrase. All the alternatives are treated as though they are connected with a disjunction (FTOr).

function fts:stemmedForm($word as fts:TokenInfo, $stemLanguage as xs:string) as fts:TokenInfo

The above function converts the token in a TokenInfo object to a form that represents its stem.

function fts:wildcardForm($word as fts:TokenInfo, $wildcardLanguage as xs:string) as fts:TokenInfo*

The above function converts the token in a TokenInfo object to a sequence of forms that can be used by the tokenizer to match document tokens.

FTCaseOption

declare function fts:applyCaseOption(
    $matchOption as fts:FTMatchOption,
    $searchTokens as fts:TokenInfo*)
  as fts:TokenInfo*
{
  (: An empty token sequence ends the recursion. :)
  if (fn:empty($searchTokens)) then ()
  else
    let $searchToken := $searchTokens[1]
    let $nextTokens := $searchTokens[position() ge 2]
    return
      if ($matchOption/@caseIndicator = "lowercase") then
        (fts:lowerCase($searchToken, $matchOption/@language),
         fts:applyCaseOption($matchOption, $nextTokens))
      else if ($matchOption/@caseIndicator = "uppercase") then
        (fts:upperCase($searchToken, $matchOption/@language),
         fts:applyCaseOption($matchOption, $nextTokens))
      else if ($matchOption/@caseIndicator = "insensitive") then
        (fts:insensitiveCase($searchToken, $matchOption/@language),
         fts:applyCaseOption($matchOption, $nextTokens))
      else
        $searchTokens
};

FTDiacriticsOption

declare function fts:applyDiacriticsOption(
    $matchOption as fts:FTMatchOption,
    $searchTokens as fts:TokenInfo*)
  as fts:TokenInfo*
{
  (: An empty token sequence ends the recursion. :)
  if (fn:empty($searchTokens)) then ()
  else
    let $searchToken := $searchTokens[1]
    let $nextTokens := $searchTokens[position() ge 2]
    let $indicator := $matchOption/@diacriticsIndicator
    return
      if ($indicator eq "with") then
        (fts:addDiacritics($searchToken, $matchOption/@language),
         fts:applyDiacriticsOption($matchOption, $nextTokens))
      else if ($indicator eq "without") then
        (fts:removeDiacritics($searchToken, $matchOption/@language),
         fts:applyDiacriticsOption($matchOption, $nextTokens))
      else if ($indicator eq "insensitive") then
        (fts:insensitiveDiacritics($searchToken, $matchOption/@language),
         fts:applyDiacriticsOption($matchOption, $nextTokens))
      else (: $indicator eq "sensitive" :)
        $searchTokens
};

FTStemOption

declare function fts:applyStemOption(
    $matchOption as fts:FTMatchOption,
    $searchTokens as fts:TokenInfo*)
  as fts:TokenInfo*
{
  (: An empty token sequence ends the recursion. :)
  if (fn:empty($searchTokens)) then ()
  else if ($matchOption/@stemIndicator = "with") then
    let $searchToken := $searchTokens[1]
    let $nextTokens := $searchTokens[position() ge 2]
    return
      (fts:stemmedForm($searchToken, $matchOption/@language),
       fts:applyStemOption($matchOption, $nextTokens))
  else if ($matchOption/@stemIndicator = "without") then
    $searchTokens
  else ()
};

FTStopWordOption

Stop-Words interact with FTDistanceSelection and FTWindowSelection.

declare function fts:applyStopwordOption(
    $matchOption as fts:FTMatchOption,
    $searchTokens as fts:TokenInfo*)
  as fts:TokenInfo*
{
  let $rootElem := fn:local-name($matchOption/element()[1])
  let $rootWords := $matchOption/element()[1]/text()
  let $swords :=
    if ($rootElem eq "stop-word") then $rootWords
    else fts:resolveStopwordsUri($rootWords)
  let $tokenizedSwords :=
    for $sw in $swords
    return <tokenInfo word="{$sw}" pos="0" sentence="0" para="0" />
  let $restOpers := $matchOption/element()[position() ge 2]
  let $effectiveStopwords := fts:calcStopwords($tokenizedSwords, $restOpers)
  (: Remove the effective stop words from the search tokens. :)
  return fts:remStopwords($searchTokens, $effectiveStopwords)
};

declare function fts:addStopwords(
    $stopWords as fts:TokenInfo*,
    $newStopwords as fts:TokenInfo*)
  as fts:TokenInfo*
{
  if ($newStopwords) then
    let $firstStopword := $newStopwords[1]
    let $restStopwords := $newStopwords[position() ge 2]
    let $temp :=
      if ($stopWords[@word eq $firstStopword/@word]) then $stopWords
      else ($stopWords, $firstStopword)
    return fts:addStopwords($temp, $restStopwords)
  else
    $stopWords
};

declare function fts:remStopwords(
    $stopWords as fts:TokenInfo*,
    $remStopwords as fts:TokenInfo*)
  as fts:TokenInfo*
{
  if ($remStopwords) then
    let $firstStopword := $remStopwords[1]
    let $restStopwords := $remStopwords[position() ge 2]
    let $temp :=
      if ($stopWords[@word eq $firstStopword/@word])
      then $stopWords[@word ne $firstStopword/@word]
      else $stopWords
    return fts:remStopwords($temp, $restStopwords)
  else
    $stopWords
};

declare function fts:calcStopwords(
    $stopWords as fts:TokenInfo*,
    $opers)
  as fts:TokenInfo*
{
  if ($opers) then
    let $firstOper := $opers[1]
    let $restOpers := $opers[position() ge 2]
    let $operType := $firstOper/@type
    let $operElem := fn:local-name($firstOper/element())
    let $operWords := $firstOper/element()/text()
    let $swords :=
      if ($operElem eq "stop-word") then $operWords
      else fts:resolveStopwordsUri($operWords)
    return
      if ($operType eq "union") then
        fts:calcStopwords(fts:addStopwords($stopWords, $swords), $restOpers)
      else
        fts:calcStopwords(fts:remStopwords($stopWords, $swords), $restOpers)
  else
    $stopWords
};

Intuitively, the semantics of the option is as follows. First, the set of effective stop words is computed using the fts:calcStopwords function, which applies the specified set operations (union and except) to the initial set of stop words and uses the function fts:resolveStopwordsUri to resolve any URI to a sequence of strings. Then, the effective stop words are removed from the sequence of search tokens.
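The two phases described above can be sketched as follows. This non-normative Python illustration assumes that stop words and search tokens are plain strings and that any URI has already been resolved to a word list; the function names are hypothetical.

```python
def calc_stopwords(initial, opers):
    """Fold 'union' / 'except' operations over an initial stop-word list."""
    effective = list(initial)
    for op_type, words in opers:
        if op_type == "union":
            # Add only the words that are not already stop words.
            effective += [w for w in words if w not in effective]
        elif op_type == "except":
            effective = [w for w in effective if w not in words]
        else:
            raise ValueError(f"unknown operation: {op_type}")
    return effective


def remove_stopwords(search_tokens, stopwords):
    """Drop every search token that is an effective stop word."""
    return [t for t in search_tokens if t not in stopwords]


effective = calc_stopwords(["a", "the"], [("union", ["of", "a"]), ("except", ["the"])])
remaining = remove_stopwords(["the", "history", "of", "a", "mustang"], effective)
```

After the union and the except, "the" is no longer a stop word, so it survives in the search tokens while "of" and "a" are removed.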

FTLanguageOption

FTWildCardOption

declare function fts:applyWildCardOption(
    $matchOption as fts:FTMatchOption,
    $searchTokens as fts:TokenInfo*)
  as fts:TokenInfo*
{
  (: An empty token sequence ends the recursion. :)
  if (fn:empty($searchTokens)) then ()
  else if ($matchOption/@wildcardIndicator = "with") then
    let $searchToken := $searchTokens[1]
    let $nextTokens := $searchTokens[position() ge 2]
    return
      (fts:wildcardForm($searchToken, $matchOption/@language),
       fts:applyWildCardOption($matchOption, $nextTokens))
  else if ($matchOption/@wildcardIndicator eq "without") then
    $searchTokens
  else ()
};

XQuery 1.0 and XPath 2.0 Full-Text and Scoring Expressions

FTContainsExpr

We now present the formal semantics of the FTContainsExpr expression. It takes (1) a search context consisting of a sequence of nodes (which is the result of a regular XQuery/XPath expression), and (2) the AllMatches corresponding to an FTSelection, and returns a sequence of nodes. Since FTContainsExpr returns results in the XQuery data model (a sequence of nodes), it can be treated like a regular XQuery expression and can be fully composed with other XQuery expressions. In addition, since FTContainsExpr maps AllMatches to a sequence of nodes, it provides the "glue" and well-defined semantics for mapping from AllMatches to the XQuery data model.

The formal semantics of FTContainsExpr is specified in terms of a formal semantics function. The function has to comply with the prototype defined below. The semantics of FTContainsExpr will be presented based on this function.

Semantics of FTContainsExpr

Consider an FTContainsExpr expression of the form EvaluationContext ftcontains FTSelection, where EvaluationContext is an XQuery expression that returns a sequence of nodes, and FTSelection is an FTSelection that returns AllMatches. Intuitively, the FTContainsExpr returns true if and only if some node in the result of EvaluationContext satisfies the AllMatches returned by FTSelection.

If the FTContainsExpr is of the form EvaluationContext ftcontains FTSelection without content IgnoreExpr for some XQuery expression IgnoreExpr, then that FTContainsExpr is evaluated as:

declare function reconstruct($n as node(), $ignore as node()*) as node()?
{
  (: Ignored nodes are dropped; elements are rebuilt from their remaining content. :)
  if (some $i in $ignore satisfies $n is $i) then ()
  else if ($n instance of element()) then
    let $nodeName := fn:node-name($n)
    let $nodeContent := for $nn in $n/node() return reconstruct($nn, $ignore)
    return element {$nodeName} {$nodeContent}
  else $n
};

let $newEvalContext :=
  let $ignoreNodes := EvaluationContext/IgnoreExpr/text()
  return
    for $n in EvaluationContext
    return reconstruct($n, $ignoreNodes)
return $newEvalContext ftcontains FTSelection

Intuitively, we rewrite EvaluationContext so it does not include any text node children from nodes that should be ignored.

We now formally define the semantics of FTContainsExpr. The semantics is defined in terms of a regular XQuery function (without any XQuery 1.0 and XPath 2.0 Full-Text extensions). The XQuery function takes three parameters: the first is the sequence of nodes returned by EvaluationContext; the second is the XML node representation of FTSelection; the third is the XML representation of the set of default values for each of the FTMatchOptions, as given by the static context. The XQuery function (by definition) returns true if and only if the corresponding FTContainsExpr returns true, and thus specifies the semantics of FTContainsExpr. Note that by using regular XQuery to specify the formal semantics, we avoid the need to introduce a new formalism. We simply reuse the formal semantics of XQuery.

declare function fts:FTContainsExpr(
    $searchContext as node()*,
    $ftSelection as fts:FTSelection,
    $defOptions as fts:FTMatchOptions)
  as xs:boolean
{
  some $node in $searchContext satisfies
    let $allMatches := fts:evaluate($ftSelection, $node, $defOptions, 0)
    return
      (: True if at least one Match carries no StringExclude. :)
      some $match in $allMatches/match satisfies
        fn:count($match/stringExclude) eq 0
};

Intuitively, the above function returns true if and only if the AllMatches that results from applying the FTSelection to some node in the search context contains a Match with no StringExcludes. This means that there is a set of TokenInfos in that node that satisfy the condition of the FTSelection.
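The test performed by this function can be sketched in Python. This non-normative illustration assumes that the AllMatches of each node has already been computed and is modeled as a list of Matches, each a dictionary holding lists of StringInclude and StringExclude positions; the representation is hypothetical.

```python
def ft_contains(all_matches_per_node):
    """True if, for some node, some Match carries no StringExclude."""
    return any(
        any(len(match["excludes"]) == 0 for match in all_matches)
        for all_matches in all_matches_per_node
    )


# One AllMatches (a list of Matches) per node of the search context.
node1 = [{"includes": [3, 7], "excludes": [12]}]
node2 = [
    {"includes": [5], "excludes": [9]},
    {"includes": [5], "excludes": []},
]

only_excluded = ft_contains([node1])
one_clean_match = ft_contains([node1, node2])
```

With node1 alone the result is false, since its only Match has a StringExclude; adding node2 makes it true because of its second Match.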

Example

We will now show the evaluation of a more elaborate example of FTContainsExpr. We use the same sample document. For convenience, we present it again here.

Let the above document be assigned to $doc. We will walk through the evaluation of the following FTContainsExpr

$doc ftcontains ( ( "mustang" && (("great" || "excellent") at least 2 occurrences) ) window 30 words && ! "rust" ) same node

We first evaluate the FTSelection to AllMatches

( ( "mustang" && (("great" || "excellent") at least 2 occurrences) ) window 30 words && ! "rust" ) same node

Step 1: Evaluate the FTWords "mustang"

Step 2: Evaluate the FTWords "great"

Step 3: Evaluate the FTWords "excellent"

Step 4: Apply the FTOr ("great" || "excellent"): form the union of the Matches

Step 5: Apply the FTTimes ("great" || "excellent") at least 2 occurrences: form 2-tuples (couples) of Matches

Step 6: Apply the FTAnd "mustang" && (("great" || "excellent") at least 2 occurrences): form the "Cartesian product" of Matches

Step 7: Apply the FTWindow ("mustang" && (("great" || "excellent") at least 2 occurrences)) window 30 words: filter out Matches whose window is larger than 30 words

Step 8: Evaluate the FTWords "rust"

Step 9: Apply the FTUnaryNot ! "rust": transform the stringInclude into stringExclude

Step 10: Apply the FTAnd (("mustang" && (("great" || "excellent") at least 2 occurrences)) window 30 words) && ! "rust": form the "Cartesian product" of the Matches

Step 11: Apply the final FTScope: filter out Matches whose TokenInfos are not within the same node

This is the final AllMatches from the evaluation of the FTSelection.

Every Match in the resulting AllMatches contains a StringExclude. Therefore, the sample FTContainsExpr returns false.

Scoring

In this section, we discuss the semantics of the use of scoring variables in for and let clauses, or in XPath 2.0 for expressions. The semantics of these constructs cannot be expressed in terms of XQuery, because they require second-order functions (i.e. functions that do not evaluate their argument(s) as regular XQuery expression(s) but use them uninterpreted).

Nevertheless, for the sake of exposition, we will assume that such functions are present. In particular, we assume that there are two semantic second-order functions, fts:score and fts:scoreSequence, each taking one argument (an expression). fts:score returns the score value of this expression; fts:scoreSequence returns a sequence of score values, one for each item that the expression evaluates to. The scores must satisfy the scoring properties.

A for clause involving a score variable:

for $result score $score in Expr ...

is evaluated as though it is replaced with the set of clauses

let $scoreSeq := fts:scoreSequence(Expr)
for $result at $i in Expr
let $score := $scoreSeq[$i]
...

Here, $scoreSeq and $i are new variables, not appearing elsewhere, and fts:scoreSequence is the second-order function described in the previous paragraph.

Similarly, a let clause involving a score variable:

let $result score $score := Expr ...

is evaluated as though it is replaced with

let $result := Expr
let $score := fts:score(Expr)
...
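The positional pairing in the rewritten for clause can be illustrated in Python. This is a non-normative sketch: because Python has no second-order functions in the sense used above, the scoring algorithm is passed in as an ordinary callable, and toy_scorer is a hypothetical stand-in for an implementation-defined scoring algorithm.

```python
def score_sequence(items, scorer):
    """One score per item, in item order (stand-in for fts:scoreSequence)."""
    return [scorer(item) for item in items]


def for_with_score(items, scorer):
    """Yield (result, score) pairs, as 'for $result score $score in Expr' binds them."""
    score_seq = score_sequence(items, scorer)
    for i, result in enumerate(items):
        yield result, score_seq[i]


def toy_scorer(text):
    """Hypothetical scoring algorithm: share of words equal to 'usability'."""
    words = text.lower().split()
    return words.count("usability") / len(words) if words else 0.0


pairs = list(for_with_score(["usability testing", "web design"], toy_scorer))
```

Each item is bound together with the score found at the same position of the score sequence, mirroring $scoreSeq[$i].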

EBNF for XQuery 1.0 Grammar with Full-Text extensions

The EBNF in this document and in this section is aligned with the current XML Query 1.0 grammar (see http://www.w3.org/TR/2005/WD-xquery-20050915/).

Terminal Symbols

The following symbols are used only in the definition of terminal symbols; they are not terminal symbols in the grammar of .

EBNF for XPath 2.0 Grammar with Full-Text extensions

The EBNF in this document and in this section is aligned with the current XPath 2.0 grammar (see http://www.w3.org/TR/2005/WD-xpath20-20050915/).

Terminal Symbols

The following symbols are used only in the definition of terminal symbols; they are not terminal symbols in the grammar of .

References

Normative References

XQuery and XPath Full-Text Requirements, Stephen Buxton, Michael Rys, Editors. World Wide Web Consortium, 02 May 2003. This version is http://www.w3.org/TR/2003/WD-xquery-full-text-requirements-20030502/. The latest version is available at http://www.w3.org/TR/xquery-full-text-requirements/.

Non-normative References

Documentation Guidelines for the Establishment and Development of Monolingual Thesauri, Geneva: International Organization for Standardization, 2nd edition, 1986.

ISO/IEC 13249-2 Information technology --- Database languages --- SQL Multimedia and Application Packages --- Part 2: Full-Text. Geneva: International Organization for Standardization, 2nd edition, 2003.

Issues List

This section contains the current issues related to this document.

This list of issues is classified in clusters. Each cluster has a unique name that reflects its topic. Each issue has a unique number. Some issues are labelled VNext. The clusters are:

Cluster A: Scoring and Weighting

Cluster B: IgnoreOption, Markup vs. Structure

Cluster C: Wildcards, Regex, Match Anchoring

Cluster D: Thesaurus, Match Option Defaults and Policies

Cluster E: Other MatchOptions Details

Cluster F: Grammar Integration, Syntax Details, and Naming

Cluster G: Semantics Details

Cluster H: Extensions

Cluster I: Simplifications and Variations of Language Constructs

Cluster J: IgnoreOption, Markup vs. Structure

Cluster K: Issue closed before we started clustering

Scoring Properties (Cluster A, Issue 1)

Is it possible to specify anything other than the range? Examples: do we want to define scoring rules for efficient scoring, or rules to guarantee score monotonicity?

CLOSED.

No changes required. Closed at FTTF Meeting 62:

Scoring Values (Cluster A, Issue 2)

Answers that do not contain a match (in the Boolean sense) are assigned a score value that depends on the scoring algorithm and that might be greater than 0.

The following implications should hold:

score = 0 implies ftcontains is false.

score <> 0 does not imply anything for ftcontains.

ftcontains is true implies score > 0.

ftcontains is false does not imply anything for score.

This interpretation enables the use of query relaxation in the ftcontains expression, and thus allows a score value greater than 0 to be returned for nodes that do not match the ftcontains expression (in a Boolean sense).

For example, given the query:

for $b in //books score $score as $b//content ftcontains "usability && testing" where $score > 0 return {$b}

The scoring algorithm could rewrite it to:

for $b in //books score $score as $b//content ftcontains "usability || testing with stemming" where $score > 0 return {$b}

and thus, some of the books that are not returned by the first query will be returned by the second query.
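The relationship between the Boolean result and a relaxed score can be checked with a small Python sketch. It is non-normative and rests on stated assumptions: the strict query requires both terms, the relaxed scoring algorithm is the hypothetical fraction of query terms present, and stemming is left out.

```python
def ft_contains_all(tokens, terms):
    """Boolean result of the strict query: every term must occur."""
    return all(t in tokens for t in terms)


def relaxed_score(tokens, terms):
    """Hypothetical relaxed scoring: fraction of the query terms that occur."""
    return sum(t in tokens for t in terms) / len(terms)


def consistent(score, contains):
    """score = 0 implies ftcontains is false; ftcontains true implies score > 0."""
    return (score != 0 or not contains) and (not contains or score > 0)


terms = ["usability", "testing"]
books = {
    "b1": "usability testing handbook".split(),
    "b2": "software testing basics".split(),
    "b3": "garden design".split(),
}
report = {
    name: (ft_contains_all(tokens, terms), relaxed_score(tokens, terms))
    for name, tokens in books.items()
}
```

Book b2 fails the Boolean test yet receives a score greater than 0, which is exactly the case the implications leave open.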

CLOSED.

We discussed several alternatives in and we would like to adopt the one described above.

However, this issue is still under discussion.

See resolution in Cluster A, Issue 60.

Semantics Data Model (Cluster K, Issue 3)

Data model incorporates new names - TokenInfo, Match, AllMatches.

CLOSED.

All occurrences of FullMatch, SimpleMatch, and Position in the text, in the schemas, and in the XQuery implementations of the semantics have been replaced with AllMatches, Match, and TokenInfo respectively.

FTContains Grammar (Cluster K, Issue 4)

Expr "ftcontains" FTSelection FTIgnoreCtxMod?. One production for FTSelection which includes FTIgnoreCtxMod?

CLOSED.

We replaced the previous grammar production Expr "ftcontains" FTSelection that allowed FTIgnoreCtxMod to be combined with any FTSelection with the new one that restricts the application of FTIgnoreCtxMod to the highest level.

FTContextModifiers (Cluster K, Issue 5)

Paul C.: Change the name of the FTContextModifier productions, which modify the operational semantics of the FTSelections they are applied to. Abandon the use of "ContextModifier" as in FTCaseCtxMod, FTStemCtxMod, FTIgnoreCtxMod. Issue raised at the FTTF meeting of February 5-6, 2004. Find in the minutes at: (Ctrl-F on FTContextModifiers)

CLOSED.

Replaced FTContextModifiers with FTMatchOptions, as in FTCaseOption, FTStemOption, FTIgnoreOption, in the February 26, 2004 Editor's Draft.

CLOSED February 26, 2004.

Grammar (Cluster K, Issue 6)

Grammar: Where does the ftcontains expression belong in the XQuery grammar: Boolean expression or comparison expression?

CLOSED.

The ftcontains expression plugs in to the XQuery grammar in the "FTComparisonExpr" production. This seems to give ftcontains the correct precedence among other XQuery operations, and it makes intuitive sense.

Wildcards (Cluster C, Issue 7)

Pat Case: There are a few inconsistencies between this document and the Use Cases Working Draft.

This document and the Use Cases Working Draft present different syntax in regex examples. I can find no syntax provided in this document for the starts-with and exact match functionality. Should we rename the Wildcard section in the Use Cases to Regex Section and possibly rethink the use cases?

CLOSED.

We dropped regular expression support in favor of wildcard support. Closed at Meeting 67:

Thesaurus (Cluster D, Issue 8)

Thesaurus names: "synonyms", "narrower terms", "soundex", "spellcheck" and "wordnet". We need to define Thesaurus operators. We need more options when specifying a thesaurus: Name, URI, Depth, Dimension. Standards: ISO 2788/ANSI Z39.19.

We need to discuss what the grammar of ThesaurusMatchOption is. Current grammar is:

FTThesaurusOption ::= ("with"? "thesaurus" Expr) | "without thesaurus".

Proposed grammar is:

FTThesaurusOption ::= ("with"? "thesaurus" Expr "operation" Expr) | "without thesaurus".

CLOSED.

Changed the syntax and semantics of thesaurus according to

Window (VNext, Cluster H, Issue 9)

Currently, FTDistanceSpec only permits a single distance specification for all of the terms specified by an FTSelection.

For example:

("dog" && "cat" && "bird") with word distance at most 10

In this scenario above, the terms "dog", "cat", and "bird" must all occur within 10 words of one another.

However, returning documents where "dog" occurs within 10 words of "cat" and this SAME "cat" term occurs within 5 words of "bird" is not possible with the current language specification. The best that can be done is the following:

(("dog" && "cat") with word distance at most 10) and (("cat" && "bird") with word distance at most 5)

But this will not lead to the exact desired result, because the "cat" and "bird" comparison will not use only those "cat" terms which occurred within 10 positions of "dog" ... it can use any "cat" term within the search context.
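The difference between the two readings can be illustrated with a minimal Python sketch. It is not part of the language specification: the helper names are hypothetical, and "within N words" is simplified here to "token positions differ by at most N", which is an assumption made only for illustration.

```python
def positions(tokens, word):
    """Return every index at which `word` occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == word]


def pairs_within(tokens, a, b, limit):
    """Return (pos_a, pos_b) pairs whose positions differ by at most `limit`.

    Assumption: distance is the plain difference of token positions.
    """
    return [(i, j)
            for i in positions(tokens, a)
            for j in positions(tokens, b)
            if abs(i - j) <= limit]


def independent_match(tokens):
    """Conjunction of two distance tests, each free to pick its own "cat"."""
    return (bool(pairs_within(tokens, "dog", "cat", 10))
            and bool(pairs_within(tokens, "cat", "bird", 5)))


def same_cat_match(tokens):
    """Desired reading: one single "cat" occurrence satisfies both tests."""
    cats_near_dog = {j for _, j in pairs_within(tokens, "dog", "cat", 10)}
    cats_near_bird = {i for i, _ in pairs_within(tokens, "cat", "bird", 5)}
    return bool(cats_near_dog & cats_near_bird)


# One "cat" is next to "dog"; a different "cat" is next to "bird".
far_apart = ["dog", "cat"] + ["filler"] * 20 + ["cat", "bird"]
# A single "cat" sits close to both "dog" and "bird".
close_by = ["dog", "filler", "cat", "filler", "bird"]

print(independent_match(far_apart), same_cat_match(far_apart))  # True False
print(independent_match(close_by), same_cat_match(close_by))    # True True
```

The first token list is accepted by the conjunction of two independent distance tests even though no single "cat" is close to both other terms, which is the imprecision this issue describes.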

CLOSED.

The issue was closed on April 25, 2005. No changes were made to the language. Although the current language can express many of the query types in question, the group recognizes that the query expressions are clumsy and difficult to write. Therefore, this issue will be considered again for VNext.

MildNot (Cluster I, Issue 10)

Andrew E.: Should we remove the mild not? It has never been included in a query language before.

Pat Case has provided use cases to justify its inclusion at:

Discussion followed. Michael Rys' reply:

Pat Case's reply:

Use case paraphrase (for non-members): Consider a collection of 3 documents:

The Delights of Mexico - a document that includes "Mexico" several times.

The Perils of New Mexico - a document that includes "New Mexico" several times.

Travel in North America - a document that includes both "Mexico" and "New Mexico" several times.

Suppose you are planning a trip to Mexico. You want documents 1 and 3, but not 2. You could search for "Mexico" and get documents 1, 2 and 3. Or you could search for "Mexico AND NOT 'New Mexico'" and get just document 1. But the "strong not" has ruled out document 3 - even though it contained the thing you were looking for - just because it contained the thing you were not looking for.

The "mild not" operator allows you to say "Mexico MILD NOT 'New Mexico'", which means "find me all the documents that contain 'Mexico'. Do not take any notice of occurrences of 'New Mexico', but do not rule out a document just because it contains 'New Mexico'".

There are many cases where you may want to search for a word, but NOT get documents just because they contain a common phrase that includes that word. e.g. "security" mildnot "social security", "house" mildnot "house of representatives", "estate tax" mildnot "real estate tax"
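The three-document use case above can be replayed with a small Python sketch. This is an illustration under stated assumptions, not the formal semantics: the document texts are invented stand-ins for the three titles, tokens are lowercased and split on whitespace, and the function names are hypothetical.

```python
def occurrences(tokens, phrase):
    """Return the set of token positions covered by each match of `phrase`."""
    n = len(phrase)
    return [set(range(i, i + n))
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == phrase]


def strong_not(tokens, wanted, unwanted):
    """"wanted AND NOT unwanted": any unwanted match rejects the document."""
    return bool(occurrences(tokens, wanted)) and not occurrences(tokens, unwanted)


def mild_not(tokens, wanted, unwanted):
    """Keep the document if some wanted match lies outside every unwanted match."""
    excluded = set().union(*occurrences(tokens, unwanted))
    return any(not (occ & excluded) for occ in occurrences(tokens, wanted))


# Invented sample texts standing in for the three documents.
docs = {
    "The Delights of Mexico": "the delights of mexico mexico is sunny".split(),
    "The Perils of New Mexico": "the perils of new mexico new mexico is dry".split(),
    "Travel in North America": "travel covers mexico and new mexico".split(),
}

wanted, unwanted = ["mexico"], ["new", "mexico"]
strong_hits = [t for t, toks in docs.items() if strong_not(toks, wanted, unwanted)]
mild_hits = [t for t, toks in docs.items() if mild_not(toks, wanted, unwanted)]

print(strong_hits)  # ['The Delights of Mexico']
print(mild_hits)    # ['The Delights of Mexico', 'Travel in North America']
```

In this sketch the strong not returns only document 1, while the mild not returns documents 1 and 3, matching the outcome described in the use case.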

CLOSED.

Issues 10 and 41 are now closed. We add the mildnot functionality and FTMildNot is spelled as "not in". Closed at FTTF Meeting 80:

Markup vs Structure (Cluster J, Issue 11)

Some tags are "markup" - e.g. b - some are "structure" - e.g. title. We generally want to treat structure tags as word boundaries, but not markup tags. How do we distinguish between markup and structure?

Michael to provide reformulation.

CLOSED.

Closed on April 29, 2005 and updated Section 1.1 as in http://lists.w3.org/Archives/Member/member-query-fttf/2005Apr/0091.html.

MatchOption Policy (Cluster D, Issue 12)

We need some indirection to specify match context and defaults. "Thesaurus name" gives us a way to define a thesaurus and then specify it in the query: an indirection. Steve Buxton proposes that there are many classes of things needed for context-match (stoplist, special characters, etc.) that need an indirection. So we need an extra level of indirection: a named policy that refers to a set of named things.

Loose Grammar (Cluster I, Issue 13)

The grammar allows lots of queries that do not make sense, e.g. "(dog || cat) within word distance N", "dog within word distance N", "(dog || cat) ordered", "!dog 5 times". If the grammar does not provide a way of identifying these "nonsense queries", then the implementation still has to identify them, i.e. implementors will have to augment the grammar to identify nonsense queries, and augment the semantics to do something with them.

J. Doerre asks if we should allow nested FTNegations in the RHS of a FTMildNegation. From point 3 of his email: "The ApplyFTSelection ignores all StringExcludes in the arguments of the FTMildNegation. I think, if we don't want to deal with StringExcludes in that function, we should explicitly forbid them to appear, i.e. require arguments of FTMildNegation to not include any FTNegation."

CLOSED.

Leave the grammar as it is for a couple of reasons. 1. We cannot solve this problem with a (context-free) grammar without complicating it unnecessarily. For example, apart from "(dog || cat) word distance N", the "no-op" rule can also be applied to "(dog with diacritics || cat) case-insensitive without stop words word distance N".

2. It is hard if not impossible to enumerate all "no-ops". Here are some additional ones: "a" && !"a", (dog && cat) distance at most 5 words distance at most 6 words, "To be or not to be" distance at least 10 words, etc. It should be left to the application to determine what constitutes a no-op and optimize if possible.

See F2F minutes in

FTTimesSelection (Cluster G, Issue 14)

How do I count occurrences where the query is NOT a single term? How many occurrences of "!dog" are there in "very very big"? Zero or very many?

CLOSED.

RegExp Escape (Cluster C, Issue 15)

Need to define some escaping mechanism for regexp characters, and for (||, ...).

CLOSED.

Closed on Feb. 14, 2005 because regular expressions are not part of the language anymore.

FTScopeSelection (Cluster I, Issue 16)

Is there a need for both FTScopeSelection and FTDistance? For example, how is 'same sentence' or 'same paragraph' really different from an FTDistance of 'with sentence exactly 1' or 'with paragraph exactly 1'?

CLOSED.

We decided to keep both FTScopeSelection and FTDistance.

Weighting (Cluster A, Issue 17)

Michael R.: What syntactic form should scoring take? How do we describe the constraints on the types of expressions that are allowed? Should scoring be expressed using a second-order function, a stand-alone operator, or as a clause in a FLWOR expression? Consider moving weighting to ftContains, something like the following: TreatExpr ("ftcontains" FTSelection ("weight" Expr)? )?

Options in presentation of full-text language proposal and some discussion at XQuery January meeting, Tampa at: (Cntl-F on Report of Full-Text Task Force)

CLOSED.

Added weight to FTSelection inside a scoring expression.

Weight Values (Cluster A, Issue 18)

Valid values for weights must be defined.

CLOSED.

Weight values in scoring expressions are in the interval [0,1].

FTScopeSelection on structure (VNext, Cluster H, Issue 19)

Scoping based on structure (e.g. same node and different node) should be considered. Support for queries where distance is measured in terms of "number of intervening elements" where elements can be any markup including chapter, paragraph and sentence. Consider sentence/paragraph/node distance.

CLOSED.

Postponed to VNext.

LanguageMatchOption (Cluster E, Issue 20)

What is the default language? SA: Dana F.: Does the language have to be a literal or an Expr that returns xs:string? Is there an implementation-defined list of valid languages?

CLOSED.

1. Default language is "None".

The Working Draft states explicitly in Section 3.2.7 the possibility to have no language selected. I think this is a good choice for the default (and it is specified as the default in the Working Draft). A typical application that uses XQuery-FT will probably have logic in place to override the default by the language setting from the locale of the client, so the default is really unimportant.

2. The language is given by a UnionExpr that must return an xs:string, or an empty sequence. This is what the Working Draft specifies. Let us keep it like that.

3. Yes, there is an implementation-defined list of valid languages. We added a statement on this to Section 3.2.7. See

CaseMatchOption and SpecialCharMatchOption (Cluster E, Issue 21)

Paul C. asked whether "lowercase", "uppercase", "case sensitive" and "case insensitive" should be defined in the context of Unicode. J. Doerre provided a link to the Unicode standard. The current version is 4.0.0. Case folding is described in Chapter 3.13. Please note that the case folding operations, like toUppercase(X), depend only on the characters to be folded, not on additional information, like language.

CLOSED.

There will be no syntax for special character handling in the current draft. Issues to consider for v. next are in this list of issues.

DiacriticsMatchOption (Cluster E, Issue 22)

Paul C.: We need to define what a diacritic is. Steve B. asked whether "with diacritics" and "without diacritics" are needed or not.

CLOSED.

We removed the special character match option as instructed in

Tokenizers (Cluster J, Issue 23)

Darin/Paul C.: What is the most general behavior for tokenizers?

Michael Kay: Can we define a set of rules that apply regardless of which tokenizer we are using, in the same manner as the rules we defined for scoring? For example, we could impose constraints on words, sentences and paragraphs.

CLOSED.

Modified item 7 in Section 1.1 to reflect conditions on tokenizers.

SpecialCharMatchOption (Cluster E, Issue 24)

We need to say more about special characters, what kind of special characters do we want to consider, what is their impact on the ability to use a given index, their impact on tokenization.

CLOSED.

We decided to remove this match option from the current WD and create new issues to be considered for v. next.

MatchOption Syntax (Cluster E, Issue 25)

Paul C.: It may be that we should reconsider the syntax and allow modifiers to be applied to individual words.

CLOSED.

StopWordsMatchOption (Cluster E, Issue 26)

We need to say more about stop words: what kind of stop words do we want to consider, what is their impact on the ability to use a given index, and what is their impact on tokenization? Should we allow the URI of a StopWords list to be specified? Paul C.: What would a single search with a stop word return?

CLOSED.

We changed the syntax of stop words specification to allow for using a URI as a stop word list. The new syntax is given in:

MatchOption and Tokenization (Cluster C, Issue 27)

Does the language document clearly state the impact of match options on tokenization? Consider regex * when does it get applied? What effect does it have on word breaks? Example: expr ftcontains "brown .ox" with regex, expr ftcontains "brown .*ox" with regex.

CLOSED.

Closed on Feb. 17, 2005, because this is no longer an issue.

The only impact of match options on tokenization that needs to be addressed in the specification is the impact of the wildcard match option. Other match options, like "language", are allowed to impact tokenization in an implementation-dependent way.

For the wildcard match option, its implication for tokenization is now clearly stated in its description: wildcards, i.e., the character sequences ".", ".*", ".+", etc., are to be interpreted as token-internal character sequences when within an FTWords that is inside the scope of the wildcard match option.

IGNORE Syntax (Cluster B, Issue 28)

Do we need special syntax for IGNORE in case of level by level search?

CLOSED.

We already have a syntax for this.

Scoping (Cluster I, Issue 29)

Do we need same sentence, same paragraph search? * in semantics, not in requirements.

CLOSED.

Closed by Pat Case in http://lists.w3.org/Archives/Member/member-query-fttf/2005Mar/0230.html

This recommendation should focus on functionality which serves all languages. It should also selectively include functionalities useful within families of languages. Searching within sentences and paragraphs is useful to many western languages and some non-western languages. They should remain in the recommendation.

Precedence of XQuery and full-text (Cluster F, Issue 30)

We need to distinguish between XQuery expressions embedded in full-text expressions and FTSelections themselves. S. Buxton suggests that we use different kinds of parentheses to distinguish between these two expressions. See his message in and subsequent messages. A simple example is to distinguish between ("cat") as an XQuery expression that builds an XQuery sequence and ("cat") as an FTSelection.

In the current draft of the document, we are using lookahead

Other possibilities include the use of "{}" to switch from full-text to XQuery when XQuery expressions are embedded in full-text expressions. This is similar to element construction in XQuery and has been pointed out by Mary H in her email at

CLOSED.

We decided to use {} to delimit XQuery expressions inside XQuery Full-Text ones according to the discussion in

Optional Keyword "with" in FTDistance (Cluster F, Issue 31)

In 3.1.9 FTDistance: Do we need "with" in FTDistance?