
Bug 4697 - [FT] editorial: 1.1 Full-Text Search and XML
Summary: [FT] editorial: 1.1 Full-Text Search and XML
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Full Text 1.0
Version: Last Call drafts
Hardware: All
OS: All
Importance: P2 minor
Target Milestone: ---
Assignee: Pat Case
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2007-06-23 09:51 UTC by Michael Dyck
Modified: 2007-10-26 20:30 UTC

Description Michael Dyck 2007-06-23 09:51:12 UTC
1.1 Full-Text Search and XML

[1]
"The following definitions apply to full-text search:"
    Note that 4 and 5 aren't actually definitions.
    (5 belongs with a definition of tokenization.)

[2]
"A token is defined as a character, n-gram, or sequence of characters"
    [2a]
    Usually, definitions don't say "is defined as". Change to just "is"?

    [2b]
    There is no definition of "n-gram", and no other use of it in the
    document. Delete?

[3]
"Each instance of a token consists of one or more consecutive characters."
    [3a]
    The phrase "Each instance of a token" is undefined, and suggests that
    a token is an abstract thing that has to be instantiated. Just say
    "Each token".

    [3b]
    I think you'll get a better definition if you combine the two
    sentences, e.g.:
        A token is a sequence of one or more consecutive characters,
        returned by a tokenizer as a basic unit to be searched.

    [3c]
    You might need to clarify what you mean by "consecutive".  Point 5
    appears to give implementations the freedom to treat
        i<i>tal</i>ic
    as a single (6-character) token, but its characters are not
    consecutive characters in the XML document.
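
    For illustration only, here is a rough Python sketch of the reading
    in which markup is stripped before tokenizing (the helper name and
    the regex-based stripping are inventions of mine, not anything the
    draft defines):

        import re

        def tokens_ignoring_markup(xml_fragment):
            # Strip element tags first, then tokenize what remains.
            text = re.sub(r'<[^>]*>', '', xml_fragment)
            return re.findall(r'\w+', text)

        # i<i>tal</i>ic -> ['italic']: a single 6-character token whose
        # characters are not consecutive in the XML document.
        print(tokens_ignoring_markup('i<i>tal</i>ic'))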

[4]
"a phrase/sentence/paragraph is an ordered sequence of any number of
tokens."
    Must/should the order of the sequence reflect document order? (e.g.,
    tokens are ordered according to the document order of their first
    character) Either way, it might be good to say.
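
    For what it's worth, the rule I have in mind is trivial to state in
    code; a tiny Python sketch (the offsets are hypothetical character
    positions, not spec terms):

        # Order a phrase's tokens by the document position of their
        # first character.
        tokens = [('search', 23), ('full-text', 10), ('and', 33)]
        phrase = [tok for tok, start in sorted(tokens, key=lambda p: p[1])]
        # -> ['full-text', 'search', 'and']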

[5]
"... the containment hierarchy: paragraphs contain sentences, which
contain tokens"
    [5a]
    Phrases don't (aren't required to) participate in the containment
    hierarchy? (Can a phrase match across a sentence boundary?)

    [5b]
    It's not clear what it means for one ordered sequence of tokens (A) to
    "contain" another (B). Presumably all the tokens in B must be in A,
    and I'm guessing they have to be in the same order. Do the tokens of
    B also have to be consecutive in A?
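
    The two readings give different answers; here is a small Python
    sketch of each, purely to illustrate the ambiguity (neither is
    claimed to be what the draft intends):

        def contains_in_order(a, b):
            # B's tokens appear in A in the same order, but not
            # necessarily adjacent (order-preserving subsequence).
            it = iter(a)
            return all(tok in it for tok in b)

        def contains_consecutively(a, b):
            # B's tokens appear in A in the same order and adjacent
            # to one another (contiguous subsequence).
            return any(a[i:i + len(b)] == b
                       for i in range(len(a) - len(b) + 1))

        a = ['the', 'quick', 'brown', 'fox']
        b = ['quick', 'fox']
        print(contains_in_order(a, b))         # True
        print(contains_consecutively(a, b))    # False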

[6]
"The tokenizer has to process two codepoint equal strings in the same way,
i.e., it should identify the same tokens."
    [6a]
    Change "has to" to "must"?

    [6b]
    Change "should" to "must"?

    [6c]
    These constraints on tokenization are stated in four places (1.1,
    2.1, 4.1, and appx I) in slightly different ways. Surely we can
    delete/merge a few of them.
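
    However the constraint ends up being phrased, it amounts to a simple
    determinism property, which a hypothetical test could state directly
    (tokenize here stands for whatever an implementation provides):

        def check_tokenizer_determinism(tokenize, s1, s2):
            # Codepoint-equal strings must yield the same tokens.
            if s1 == s2:
                assert tokenize(s1) == tokenize(s2)

        # e.g. with a trivial whitespace tokenizer:
        check_tokenizer_determinism(str.split, 'foo bar', 'foo bar')
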
Comment 1 Michael Dyck 2007-08-27 06:15:14 UTC
A specific suggestion for point [6c], which also takes care of [6a] and [6b]:

In section 1.1, second list, item 3:

--- Extract the definitions of "sentence" and "paragraph" and put them between items 2 and 3. (Append them to item 2, or make a new item, whichever you prefer.)

--- Delete the three sentences at the end of the item:
        Whatever a tokenizer for a particular language chooses to do,
        it must preserve the containment hierarchy: paragraphs contain
        sentences, which contain tokens.

        The tokenizer has to process two codepoint equal strings in the
        same way, i.e., it should identify the same tokens. Everything
        else about the behavior of the tokenizer is implementation-defined.

--- Move the definition of tokenization (and the subsequent constraints, and the Note re overlapping tokens) from 4.1 to replace the sentences deleted above.

    But instead of the 4.1 phrasing:
        paragraphs contain sentences contain words
    use the 1.1 phrasing:
        paragraphs contain sentences, which contain tokens

--- As for the three sentences at the start of the item, delete or reposition or leave them, as you please. (It might be more stylistically consistent to put them after the definition.) 

In section 2.1, delete the repeated paragraph and list:
   "Tokenization, including .. same tokens in each."
Comment 2 Jim Melton 2007-09-13 22:36:33 UTC
As decided in meeting #152 (the minutes of which are at the member-only URI http://lists.w3.org/Archives/Member/member-query-fttf/2007Sep/0005.html), items [2a], [3a], [3b], and [3c] (that is, all of item [3]), [6a], and [6b] have been resolved. 

That leaves items [1], [2b], [4], [5a], [5b], and [6c] to be resolved. 
Comment 3 Jim Melton 2007-09-13 23:10:35 UTC
Additionally, as decided in meeting #152 (the minutes of which are at the member-only URI
http://lists.w3.org/Archives/Member/member-query-fttf/2007Sep/0005.html), item
[4] was resolved with no action. 

That leaves items [1], [2b], [5a], [5b], and [6c] to be resolved. 
Comment 4 Pat Case 2007-10-15 20:14:09 UTC
[1] The FTTF agreed. We removed the line "The following definitions apply to full-text search:" and broke the items out of the list, added a Note to the "As XQuery and XPath evolve" paragraph, and reversed the last two sentences.
[2b] The FTTF agreed.  We removed "n-gram."
[5a] Phrases are not part of the containment hierarchy. A phrase can cross sentence boundaries. No change made.
[5b] The FTTF agreed. We removed these sentences: 
Whatever a tokenizer for a particular language chooses to do, it must preserve the containment hierarchy: paragraphs contain sentences, which contain tokens.
The tokenizer must process two codepoint equal strings in the same way, i.e., it must identify the same tokens. Everything else about the behavior of the tokenizer is implementation-defined.
[6c] The FTTF agreed. We consolidated the early introductions to tokenization into one place in 1.1 and removed the duplicate from 2.1. We deleted some of the sentences in favor of a forward pointer to 4.1.

These changes will appear in the next internal build of the Full-Text specification after the October 11 build, and in the next public version. They close the last items in this bug. If you approve of the changes, please mark the bug closed.