W3C

XQuery and XPath Full-Text Requirements

W3C Working Draft 14 February 2003

This version:
http://www.w3.org/TR/2003/WD-xmlquery-full-text-requirements-20030214/
Latest version:
http://www.w3.org/TR/xmlquery-full-text-requirements/
Editors:
Stephen Buxton, Oracle Corp <stephen.buxton@oracle.com>
Michael Rys, Microsoft <mrys@microsoft.com>

Abstract

This document specifies requirements for Full-Text Search for use in XQuery [XQuery] and XPath [XPath].

Status of this Document

This is a public W3C Working Draft for review by W3C Members and other interested parties. This section describes the status of this document at the time of its publication. It is a draft document and may be updated, replaced, or made obsolete by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress." A list of current public W3C technical reports can be found at http://www.w3.org/TR/.

The Full-Text Requirements have been defined jointly by the XQuery Working Group and the XSL Working Group (both part of the XML Activity).

This is the first version of this document.

This document is a work in progress. It contains many open issues, and should not be considered to be fully stable. Vendors who wish to create preview implementations based on this document do so at their own risk. While this document reflects the general consensus of the working groups, there are still controversial areas that may be subject to change.

Public comments on this document and its open issues are welcome. Comments should be sent to the W3C XPath/XQuery mailing list, public-qt-comments@w3.org (archived at http://lists.w3.org/Archives/Public/public-qt-comments/).

Patent disclosures relevant to this specification may be found on the XML Query Working Group's patent disclosure page at http://www.w3.org/2002/08/xmlquery-IPR-statements and on the XSL Working Group's patent disclosure page at http://www.w3.org/Style/XSL/Disclosures.

A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR/.

Table of Contents

1 Introduction
2 Terminology
2.1 MUST
2.2 MAY
2.3 SHOULD
2.4 SCORE
2.5 Full-Text Search
3 Language Design
3.1 The Data Model
3.2 Side-effects on the data
3.3 Score Function and Full-Text predicates
3.3.1 Predicate and Score Independence
3.3.2 Score language
3.4 Score algorithm
3.4.1 Return Score
3.4.2 Sort by Score
3.4.3 Type, Range of Score
3.4.4 Score Statistics
3.4.5 Semantics of Score
3.5 Combined score
3.5.1 Score Combination
3.5.2 Score algorithm vendor-provided
3.5.3 Score algorithm overridable
3.5.4 Score influence
3.6 Extensibility
3.6.1 Extensible by vendors
3.6.2 Extensible by users
3.7 First, Future Versions
3.8 End user language
3.9 Searchable query
3.10 Universality
4 Integration
4.1 XPath
4.2 Extensibility Mechanisms
4.2.1 Integration into XQuery/XPath
4.2.2 XQuery/XPath Full-Text Extensibility
4.3 Composability
4.4 Human-readable
4.5 XML syntax
5 Implementation
5.1 Declarativity
6 Functionality and Scope
6.1 Functionality
6.2 Search Scope
6.2.1 Search within arbitrary structure
6.2.2 Constructed Structures
6.2.3 Return Arbitrary Nodes
6.2.4 Parts of Search Tree
6.3 Attributes
6.3.1 Search within attributes
6.3.2 Search across attributes and content
6.4 Markup
6.5 Element Boundaries
6.5.1 Search across element boundaries
6.5.2 Element as a token boundary
6.6 Score
6.6.1 Score accessible
6.6.2 Implicit ordering
6.6.3 Score extendable

Appendix

A References
A.1 Non-Normative


1 Introduction

"Full-Text Search" (FTS) is a large field which covers a vast array of functionality. In addition, there are many different ways one could combine FTS capabilities with XQuery and XPath.

This paper describes a set of requirements for FTS in XQuery/XPath (XQuery/XPath Full-Text). At this stage in the life of the document, these requirements should be read as suggestions only: the issues associated with the requirements are to be discussed and resolved by the relevant Working Groups. This format provides a firm basis for the Working Groups to set the direction of the work on XQuery/XPath Full-Text, and to compare existing proposals. Once the issues are resolved and this Requirements document is finalized, it will be easier to define the functionality of XQuery/XPath Full-Text and its integration with XQuery and/or XPath.

Note that we will attempt to define requirements for the language without reference to any particular solution.

2 Terminology

We use the terms MUST, SHOULD and MAY throughout the document to specify the extent to which an item is a requirement for the work of XQuery/XPath Full-Text. We use the same definitions of MUST, SHOULD and MAY as the XQuery Requirements [XQuery Requirements].

2.1 MUST

[Definition: MUST means that the item is an absolute requirement.]

2.2 MAY

[Definition: MAY means that there may exist valid reasons not to treat this item as a requirement, but the full implications should be understood and the case carefully weighed before discarding this item.]

2.3 SHOULD

[Definition: SHOULD means that an item deserves attention, but further study is needed to determine whether the item should be treated as a requirement.]

When the words MUST, SHOULD, or MAY are used in this technical sense, they occur as a hyperlink to these definitions. These words will also be used with their conventional English meaning, in which case there is no hyperlink. For instance, the phrase "the full implications should be understood" uses the word "should" in its conventional English sense, and therefore occurs without the hyperlink.

Other terminology used in this document:

2.4 SCORE

[Definition: SCORE reflects relevance of matched material.]

2.5 Full-Text Search

[Definition: Full-Text Search in this document is an extension to the XQuery/XPath language. It provides a way to query text which has been tokenized, i.e. broken into a sequence of words, units of punctuation, and spaces. Tokenization enables functions and operators which work with the relative positioning of words (e.g., proximity operators). Tokenization also enables functions and operators which operate on a part or the root of the word (e.g., wildcards, stemming).]
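
As an illustration only, the tokenization described in this definition might be sketched in Python as follows. The token pattern, the function names (tokenize, word_positions, near) and the word-distance rule are assumptions made for this sketch; this document does not prescribe any tokenization algorithm or proximity operator.

```python
import re

# Hypothetical token pattern: a run of word characters, a run of whitespace,
# or a single punctuation character. Real tokenizers are language-dependent.
TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(text):
    """Break text into a sequence of words, punctuation units, and spaces."""
    return TOKEN_RE.findall(text)


def word_positions(text):
    """Map each lower-cased word to the list of its word-level positions."""
    positions = {}
    index = 0
    for token in tokenize(text):
        # Only word tokens advance the position; punctuation and spaces do not.
        if re.fullmatch(r"\w+", token):
            positions.setdefault(token.lower(), []).append(index)
            index += 1
    return positions


def near(text, first, second, max_distance):
    """Return True if the two words occur within max_distance words of each other."""
    positions = word_positions(text)
    return any(
        abs(a - b) <= max_distance
        for a in positions.get(first.lower(), [])
        for b in positions.get(second.lower(), [])
    )
```

Because positions are counted in words, a proximity test such as near(text, "full", "search", 2) depends only on the token sequence and not on the amount of punctuation or whitespace between the words.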

3 Language Design

This section covers requirements for XQuery/XPath Full-Text language design that are independent from, but related to, integration and scoping requirements.

3.1 The Data Model

XQuery/XPath Full-Text functions MUST operate on instances of the XQuery/XPath Data Model.

3.2 Side-effects on the data

XQuery/XPath Full-Text MUST NOT introduce or rely on side-effects.

3.3 Score Function and Full-Text predicates

3.3.1 Predicate and Score Independence

XQuery/XPath Full-Text MUST allow Full-Text predicates and SCORE functions independently.

3.3.2 Score language

XQuery/XPath Full-Text MUST either
  • use the same language for Full-Text predicates and SCORE functions

Or
  • use a language for Full-Text predicates that is a proper subset of the language for SCORE functions

3.4 Score algorithm

3.4.1 Return Score

XQuery/XPath Full-Text MUST allow the user to return SCORE.

3.4.2 Sort by Score

XQuery/XPath Full-Text MUST allow the user to sort by SCORE.

3.4.3 Type, Range of Score

XQuery/XPath Full-Text MUST define the type and range of SCORE values. The SCORE SHOULD be a float, in the range 0-1.

3.4.4 Score Statistics

XQuery/XPath Full-Text MUST NOT require an explicit definition of the global corpus statistics (statistics, such as word frequency, used in calculating SCORE).

3.4.5 Semantics of Score

XQuery/XPath Full-Text MAY partially define the semantics of SCORE.

3.5 Combined score

3.5.1 Score Combination

XQuery/XPath Full-Text MUST be able to generate a SCORE for a combination of Full-Text predicates.

3.5.2 Score algorithm vendor-provided

The algorithm to produce combined SCOREs MUST be vendor-provided.

3.5.3 Score algorithm overridable

The algorithm to produce combined SCOREs SHOULD be overridable by users.

3.5.4 Score influence

Users MUST be able to influence individual components of complex score expressions.
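
To make the idea of a combined SCORE concrete, the following Python sketch shows one possible combination rule: a weighted arithmetic mean of component scores. This is an assumption chosen for illustration, not a rule taken from this document; the algorithm itself is required to be vendor-provided. The function name combine_scores and the weights parameter are hypothetical, and the sketch assumes that each component SCORE is a float in the range 0-1.

```python
def combine_scores(scores, weights=None):
    """Combine component scores with a weighted arithmetic mean.

    Hypothetical rule for illustration only. Each score is assumed to lie
    in the range 0-1, so the result also lies in that range.
    """
    if not scores:
        raise ValueError("at least one component score is required")
    if weights is None:
        # Default: every component contributes equally.
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("scores and weights must have the same length")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("each score must lie in the range 0-1")
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
```

Under this sketch, a user influences an individual component by changing its weight: combine_scores([0.8, 0.4]) weighs both predicates equally, while combine_scores([0.8, 0.4], [3, 1]) gives the first predicate three times the influence of the second.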

3.6 Extensibility

3.6.1 Extensible by vendors

XQuery/XPath Full-Text MUST be extensible by vendors.

3.6.2 Extensible by users

XQuery/XPath Full-Text MAY be extensible by users.

3.7 First, Future Versions

The first version of XQuery/XPath Full-Text MUST provide a robust framework for future versions.

3.8 End user language

It is not a requirement that XQuery/XPath Full-Text be designed as an end-user UI language.

3.9 Searchable query

It SHOULD be possible to search XQuery/XPath Full-Text queries.

3.10 Universality

XQuery/XPath Full-Text SHOULD be universal. As a minimum, XQuery/XPath Full-Text MUST allow Full-Text search in any Unicode character-set and in all common written natural languages.

4 Integration

This section specifies requirements for the integration of XQuery/XPath Full-Text with XQuery and XPath.

4.1 XPath

Part, but not necessarily all, of XQuery/XPath Full-Text MUST be usable as part of an XPath expression.

4.2 Extensibility Mechanisms

4.2.1 Integration into XQuery/XPath

XQuery/XPath Full-Text SHOULD use the extensibility mechanisms that exist in XQuery and XPath for integration into XQuery and XPath.

4.2.2 XQuery/XPath Full-Text Extensibility

XQuery/XPath Full-Text MUST use the extensibility mechanisms that exist in XQuery and XPath for its own extensibility.

4.3 Composability

XQuery/XPath Full-Text MUST be composable with XQuery, and SHOULD be composable with itself.

4.4 Human-readable

XQuery/XPath Full-Text may have more than one syntax binding. One query language syntax must be convenient for humans to read and write. See [XQuery Requirements].

4.5 XML syntax

XQuery/XPath Full-Text MAY have more than one syntax binding. One query language syntax MUST be expressed in XML in a way that reflects the underlying structure of the query. See [XQuery Requirements].

5 Implementation

5.1 Declarativity

XQuery/XPath Full-Text MUST be declarative. Notably, it MUST NOT enforce a particular evaluation strategy.

6 Functionality and Scope

This section defines requirements for the functionality in XQuery/XPath Full-Text, and the scope of XQuery/XPath Full-Text queries.

6.1 Functionality

XQuery/XPath Full-Text MUST provide, in the first release, the minimum set of Full-Text functionality that is useful:

  1. single-word search

  2. phrase search

  3. support for stopwords

  4. single character suffix

  5. 0 or more character suffix

  6. 0 or more character prefix

  7. 0 or more character infix

  8. proximity searching (unit: words)

  9. specification of order in proximity searching

  10. combination using AND

  11. combination using OR

  12. combination using NOT

  13. word normalization, diacritics

  14. ranking, relevance

Additional functionality represented in the [XQuery and XPath Full-Text Use Cases] MUST be considered, but may be left to a future release.

Additional functionality from other Full-Text search contexts such as [SQL/MM Full-Text] MUST be considered, but SHOULD be left to a future release.
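
The wildcard items in the list above (a single character suffix, and 0 or more characters as a suffix, prefix or infix) can be illustrated with a short Python sketch. The wildcard characters "?" (exactly one character) and "*" (zero or more characters) are assumptions made for this sketch, as are the function names; this document does not define any wildcard syntax.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a hypothetical wildcard pattern into a compiled regular expression.

    "?" stands for exactly one word character, "*" for zero or more word
    characters; every other character matches itself, ignoring case.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def matches(pattern, word):
    """Return True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None
```

In this sketch, "run?" corresponds to a single character suffix, "run*" to a suffix of 0 or more characters, "*ing" to a prefix of 0 or more characters, and "col*r" to an infix of 0 or more characters. Matching is applied to one token at a time, so a wildcard never spans a word boundary.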

6.2 Search Scope

6.2.1 Search within arbitrary structure

XQuery/XPath Full-Text MUST allow search within an arbitrary structure (an arbitrary XPath expression).

6.2.2 Constructed Structures

XQuery/XPath Full-Text MUST NOT preclude Full-Text search within structures constructed during a query.

6.2.3 Return Arbitrary Nodes

XQuery/XPath Full-Text MUST allow a query to return arbitrary nodes.

6.2.4 Parts of Search Tree

XQuery/XPath Full-Text MUST allow the combination of predicates on different parts of the searched document 'tree'.

6.3 Attributes

6.3.1 Search within attributes

XQuery/XPath Full-Text MUST support Full-Text search within attributes.

6.3.2 Search across attributes and content

XQuery/XPath Full-Text MAY support Full-Text search within attributes in conjunction with Full-Text search within element content.

6.4 Markup

If XQuery/XPath Full-Text supports search within names of elements and attributes, then it MUST distinguish between
  • element content and attribute values

and
  • names of elements and attributes

in any search.

6.5 Element Boundaries

6.5.1 Search across element boundaries

XQuery/XPath Full-Text MUST support search across element boundaries, at least for NEAR.

6.5.2 Element as a token boundary

XQuery/XPath Full-Text MUST treat an element as a token boundary. This MAY be user-defined.
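
The two requirements in this section can be illustrated together with a minimal Python sketch, assuming the standard library xml.etree.ElementTree parser and a simple word pattern. The function names are hypothetical. Each text node is tokenized separately, so an element start or end tag always ends the current token, while the resulting word sequence still runs across element boundaries in document order.

```python
import re
import xml.etree.ElementTree as ET


def words_in_document_order(xml_source):
    """Return lower-cased words in document order, one text node at a time.

    Tokenizing each text node on its own means that an element boundary
    always acts as a token boundary.
    """
    root = ET.fromstring(xml_source)
    words = []
    for text in root.itertext():
        words.extend(w.lower() for w in re.findall(r"\w+", text))
    return words


def contains_phrase(xml_source, phrase):
    """Return True if the phrase occurs as consecutive words, ignoring markup."""
    words = words_in_document_order(xml_source)
    target = [w.lower() for w in re.findall(r"\w+", phrase)]
    n = len(target)
    if n == 0:
        return False
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))
```

For the input "<p>full<b>text</b>search</p>" this sketch yields three separate words rather than the single token "fulltextsearch", and a phrase such as "query language" is found in "<p>XML <i>query</i> language</p>" even though its two words lie in different elements.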

6.6 Score

6.6.1 Score accessible

SCORE MUST be accessible anywhere in the scope of the query.

6.6.2 Implicit ordering

SCORE SHOULD NOT be used for implicit ordering.

6.6.3 Score extendable

SCORE MAY be extendable to a general distance-measure.

A References

A.1 Non-Normative

XQuery
XQuery 1.0: An XML Query Language. W3C Working Draft. (See http://www.w3.org/TR/xquery/.)
XPath
XML Path Language (XPath) 2.0. W3C Working Draft. (See http://www.w3.org/TR/xpath20.)
XQuery Requirements
XQuery Requirements (See http://www.w3.org/XML/Group/2001/02/xmlquery-req20010203.html.)
XQuery and XPath Full-Text Use Cases
XQuery 1.0 and XPath 2.0 Full-Text Use Cases (See http://www.w3.org/TR/2003/WD-xmlquery-full-text-use-cases-20030214/.)
SQL/MM Full-Text
ISO/IEC 13249-2:2000, Information technology - Database languages - SQL Multimedia and Application Packages - Part 2: Full-Text, International Organization For Standardization, 2000, referenced in e.g. "SQL Multimedia and Application Packages (SQL/MM)" (See http://www.acm.org/sigmod/record/issues/0112/standards.pdf)