XQuery and XPath Full-Text Use Cases

W3C Working Draft 14 February 2003

This version: http://www.w3.org/TR/2003/WD-xmlquery-full-text-use-cases-20030214/
Latest version: http://www.w3.org/TR/xmlquery-full-text-use-cases
Editors: Sihem Amer-Yahia, AT&T Labs; Pat Case, Library of Congress

This document is also available in these non-normative formats: XML.

Abstract

This document specifies usage scenarios for full-text queries as part of XML Query [XQuery] and XPath [XPath].

Status of this Document

This is a public W3C Working Draft for review by W3C Members and other interested parties. This section describes the status of this document at the time of its publication. It is a draft document and may be updated, replaced, or made obsolete by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress." A list of current public W3C technical reports can be found at http://www.w3.org/TR/.

The Full-Text Use Cases have been defined jointly by the XML Query Working Group and the XSL Working Group (both parts of the XML Activity).

The Full-Text Use Cases are published in conjunction with the XQuery and XPath Full-Text Requirements.

This is the first version of this document.

This document is a work in progress. It contains many open issues, and should not be considered to be fully stable. Vendors who wish to create preview implementations based on this document do so at their own risk. While this document reflects the general consensus of the working groups, there are still controversial areas that may be subject to change.

Public comments on this document and its open issues are welcome. Comments should be sent to the W3C XPath/XQuery mailing list, public-qt-comments@w3.org (archived at http://lists.w3.org/Archives/Public/public-qt-comments/).

Patent disclosures relevant to this specification may be found on the XML Query Working Group's patent disclosure page at http://www.w3.org/2002/08/xmlquery-IPR-statements and on the XSL Working Group's patent disclosure page at http://www.w3.org/Style/XSL/Disclosures.

A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR/.

Appendices

A Acknowledgements
B References
B.1 References (Primary)
B.2 References (Background)

1 Full-Text Use Cases: Preliminaries

1.1 Proper Display of This Unicode Document

(1) Use a current operating system and browser.

(2) Set the character encoding in the browser to Unicode or UTF-8. Often this setting is changed from the View menu.

1.2 Introduction

The use cases listed below were created by the XML Query and XSL Working Groups to illustrate important applications of full-text querying within an XML query language. Each use case exercises a specific functionality relevant to full-text querying. A schema and sample input data are provided. Each use case specifies a set of queries that might be applied to the input data, and the expected results for each query. In a future version, the use cases will be republished with solutions in XQuery and/or XPath.

This document supplements the W3C XML Query Use Cases [XQuery-UseCases]. Use cases for character string querying are included in the XML Query Use Cases, not in this document.

The full-text queries in the following use cases are performed on text which has been tokenized, i.e., broken into a sequence of words, units of punctuation, and spaces. A word is defined as any character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be queried. Each instance of a word consists of 0 or more consecutive characters. Beyond that, words are implementation-defined. Note that consecutive words need not be separated by either punctuation or space, and words may overlap. Tokenization enables functions and operators which work with the relative positioning of words (e.g., proximity operators). Tokenization also enables functions and operators which operate on a part or the root of the word (e.g., wildcards, stemming).
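As a rough illustration of why tokenization matters, the following Python sketch assigns a position to each word and then answers a proximity query and a trailing-wildcard query from those positions. It is not a proposed XQuery or XPath solution. The tokenizer rule (a word is a run of letters, digits, or underscores, compared case-insensitively) and the function names `tokenize`, `within_distance`, and `starts_with` are assumptions made for this example only; the use cases leave the definition of a word to the implementation.

```python
import re


def tokenize(text):
    """Return (position, word) pairs for the words found in `text`.

    Assumption: a word is a maximal run of letters, digits, or underscores,
    lowercased. Real tokenizers are implementation-defined and may differ.
    """
    return [(pos, match.group().lower())
            for pos, match in enumerate(re.finditer(r"\w+", text))]


def within_distance(tokens, first, second, max_intervening):
    """True if `first` and `second` occur, in either order, with at most
    `max_intervening` words between them."""
    first_positions = [p for p, w in tokens if w == first]
    second_positions = [p for p, w in tokens if w == second]
    return any(abs(a - b) - 1 <= max_intervening
               for a in first_positions
               for b in second_positions
               if a != b)


def starts_with(tokens, leading_part):
    """Words beginning with `leading_part`, i.e. a wildcard at the word end."""
    return [w for _, w in tokens if w.startswith(leading_part)]


tokens = tokenize("Usability testing improves the usability of software products.")

# "usability" (position 4) and "software" (position 6) have one word between them.
print(within_distance(tokens, "usability", "software", 1))
print(starts_with(tokens, "test"))
```

Because the positions come from the tokenizer, a different tokenizer (for example, one that emits overlapping n-grams) would change what counts as "one word apart" without changing the query functions.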

A phrase is a sequence of ordered words. A sequence can contain any number of words.
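The definition of a phrase can be sketched in Python as a search for a run of consecutive words in the given order. This is an illustration under assumptions, not a proposed syntax or semantics: the helper names `words` and `phrase_positions` are hypothetical, the tokenizer rule is the simple one shown in the code, and treating an empty phrase as matching nowhere is a choice made only for this example.

```python
import re


def words(text):
    """Assumed tokenizer: lowercased runs of letters, digits, or underscores."""
    return [w.lower() for w in re.findall(r"\w+", text)]


def phrase_positions(text, phrase):
    """Return the word positions at which every word of `phrase` occurs
    consecutively and in order within `text`."""
    haystack = words(text)
    needle = words(phrase)
    if not needle:
        # Choice for this sketch: an empty phrase matches nowhere.
        return []
    n = len(needle)
    return [i for i in range(len(haystack) - n + 1)
            if haystack[i:i + n] == needle]


sample = "Software usability testing and usability testing tools"
print(phrase_positions(sample, "usability testing"))
print(phrase_positions(sample, "testing usability"))
```

The second call finds nothing because word order is part of the phrase; an unordered proximity query with no intervening words would accept both orders.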

These use cases:

(1) Present some possible functions and features for tokenized text support in XQuery and XPath. None are yet available in XQuery or XPath. Please comment on these use cases and recommend others.

(2) Illustrate simple and complex queries. The more complex queries would normally only be constructed by programmers, librarians, and other expert users, or provided for novice users via saved queries or graphical user interfaces. Each query is intended to illustrate a single functionality, although queries might overlap in their functionalities (e.g., phrases and ordered proximity queries allowing no intervening words). Overlapping and similar functionalities are noted in the comments on query behavior.

(3) Draw from sample data which are almost entirely in English. Use cases in other languages are solicited, especially where they illustrate language-specific implementations of functions and features. Among the most sought after are use cases for queries using prefix and infix wildcards, proximity queries, and operators and queries requiring functionality which may not have Western language equivalents.

(4) Include queries which in most instances can be written with pure Boolean full-text predicates or with scoring (e.g., scoring on the number of occurrences of a word or phrase, scoring on how close words are to one another within a proximity query, scoring on how similar a word is to the one being stemmed) [BYR99] [HTK00]. A few, in Section 12 (SCORE), cannot be written with Boolean full-text predicates. Scoring methodologies will not be defined in this standard. Scoring will be implementation-defined. Results are provided in document order, except those in Section 12 (SCORE). Results could be returned ordered differently, such as by relevance (based on implementation-defined scoring) or explicitly by an element.

(5) Include queries on element content and attribute values.

(6) Include queries which are case-insensitive. When returning a paragraph, the text is returned as it occurs in the data model. This approach was chosen to keep the sample data short and the expected results meaningful. It would have been equally valid to return only the character queried. A variation is found in Section 5 (CHARACTER-MANIPULATION).

(7) Include queries which, when they target XML elements, are understood, unless otherwise stated, to query the text within any text node descendant of the element.

(8) Include queries which return only elements and attributes which meet all the conditions specified in the query. In particular, Boolean queries return results where the Boolean conditions in the query are satisfied, i.e., are used to select what is being returned to users.

Query results may be returned in different ways. From a query for books containing the word "usability", users might be interested in returning, for each book containing the word "usability", its number and its entire content. In another situation for the same query, users might be interested in returning, for each book containing the word "usability", its number and only the elements and attributes in the content which contain the word "usability". As in this second situation, the queries in these use cases return only elements and attributes which meet all the conditions specified in the query.

The Return clause may also include additional or different elements and attributes if specified, and may construct new elements.

(9) Include queries which provide some of the basic functionality of fuzzy match querying (e.g., wildcards, stemming, dictionary and thesaurus support, proximity).

(10) Provide highlighting of found words and phrases in the expected results of queries as an aid to users. The presence of highlighting says nothing about whether highlighting will be a feature of XQuery or XPath full-text querying.

(11) Display no Solutions in XQuery because no decisions have been taken on syntax. They will be added in a future version.
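Several of the points above (Boolean predicates versus scoring, case-insensitive matching, querying every text node descendant of an element, and returning only the elements that satisfy the query) can be illustrated with a short Python sketch. It is not a proposed XQuery or XPath solution. The sample document, its element names, and the helper names `element_words`, `contains_word`, and `occurrence_score` are all hypothetical, and counting occurrences is only one of many possible scoring methods, which remain implementation-defined.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from the use case schemas.
SAMPLE = """
<book number="1">
  <title>Improving <b>Usability</b></title>
  <chapter><p>Usability testing finds usability problems.</p></chapter>
  <chapter><p>Budgets and schedules.</p></chapter>
</book>
"""


def element_words(element):
    """Lowercased words from every text node below `element`.

    Assumption: text nodes are joined with a space, so a word split by
    inline markup would be treated as two words in this sketch.
    """
    text = " ".join(element.itertext())
    return [w.lower() for w in re.findall(r"\w+", text)]


def contains_word(element, word):
    """Boolean full-text predicate: case-insensitive word containment."""
    return word.lower() in element_words(element)


def occurrence_score(element, word):
    """One possible score: the number of occurrences of `word`."""
    return element_words(element).count(word.lower())


book = ET.fromstring(SAMPLE.strip())
chapters = book.findall("chapter")

# Return only the chapters that satisfy the condition, not the whole book.
matching = [c for c in chapters if contains_word(c, "usability")]

print(contains_word(book.find("title"), "USABILITY"))
print(occurrence_score(book, "usability"))
print(len(matching))
```

The Boolean predicate decides which elements are returned, while the score could be used to order them; the title matches even though the word sits inside a nested `b` element, because all descendant text nodes are searched.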

Examples of full-text querying functionalities for XML query languages can be found in [FGR01], [HTK00], [MJK98], [SCH01] and [TWE00].

To make the output more readable, the output of queries has been formatted using whitespace which may not be returned by a query processor. This whitespace should not be considered normative for the correctness of results.

These use cases represent a snapshot of ongoing work. Some important operators and features are not yet adequately covered by a use case. The XML Query and XSL Working Groups reserve the right to add, delete or modify individual queries or whole use cases as the work progresses. The presence of a query in this set of use cases does not necessarily indicate that the query will be expressible in XQuery [XQuery] and/or XPath [XPath] to be created by the XML Query and XSL Working Groups.

1.3 Explanation of Query Statements

The queries in these use cases are presented in the following format:

Query number Query title

User statement of query

Statement of functionality illustrated by query

Operands: Parts of words, words, phrases
Functionality: Operators, functions, collations, other functionality
Context: One or more XPath expressions locating the elements and attributes to be queried
Return: One or more XPath expressions which are returned only if they meet all the conditions specified in the query, and additional or different XPath expressions if specified. These may include constructed elements.
Comments: Comments on query behavior in general and against the sample data in particular, plus the rationale for including this query in the use cases.
Version: Each query is marked as "For consideration in v.1" or "For consideration after v.1".