This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 3739 - [FT] Description of tokenization (Editorial/Technical)
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Full Text 1.0
Version: Working drafts
Hardware: PC Windows XP
Importance: P2 normal
Target Milestone: ---
Assignee: Jochen Doerre
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2006-09-18 19:22 UTC by Mary Holstege
Modified: 2007-01-08 20:00 UTC

Description Mary Holstege 2006-09-18 19:22:58 UTC
== Section 1.1 (Full-Text Search and XML)
Bullet 3, final paragraph: 
"The tokenizer has to evaluate two equal strings..."
(1) Suggest replacing "evaluate" with a word that doesn't carry the same
implications in the XQuery context, perhaps "process".
(2) "equal" is troubling as well: equal in the XQuery sense, which depends on
a collation, or codepoint-by-codepoint equal?  I believe we mean the latter.
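
To illustrate the distinction in point (2), here is a minimal Python sketch,
not XQuery. The function names are hypothetical, and loosely_equal is only a
stand-in for a collation-style comparison (it folds case and canonical
equivalence); it does not implement any actual XQuery collation.

```python
import unicodedata


def codepoint_equal(a: str, b: str) -> bool:
    """True when both strings are the same sequence of Unicode codepoints."""
    return [ord(c) for c in a] == [ord(c) for c in b]


def loosely_equal(a: str, b: str) -> bool:
    """Hypothetical stand-in for a collation-based comparison.

    Ignores case and canonical-equivalence differences. This is an
    assumption made for illustration, not an XQuery collation.
    """
    def fold(s: str) -> str:
        return unicodedata.normalize("NFC", s).casefold()

    return fold(a) == fold(b)


precomposed = "caf\u00e9"   # e-acute as a single codepoint
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

print(codepoint_equal(precomposed, precomposed))  # same codepoints
print(codepoint_equal(precomposed, decomposed))   # different codepoints
print(loosely_equal(precomposed, decomposed))     # equivalent after folding
```

Under the codepoint-by-codepoint reading, only the first pair counts as
"equal", so only for such pairs would the tokenizer be required to behave
identically.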

Bullets 4 and 5:
These should mention the relationship of markup to tokenization, particularly
paragraph identification.  I expect that for most XML markup it will be the
markup, not white space, that identifies paragraph boundaries.
Comment 1 Mary Holstege 2006-10-02 19:03:40 UTC
WG agreed with this comment on 2006-10-02.

Change "evaluate" to "process"
Change "equal" to "codepoint equal"

Modify bullet 5 by adding the sentences:

"Semantic markup serves well as token boundaries. Some formatting markup serves
well as token boundaries, for example, paragraphs are most commonly delimited
by formatting markup. Other formatting markup may not serve well as token
boundaries."
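
The agreed wording can be illustrated with a rough Python sketch. Everything
in it is an assumption made for illustration: the element names, the choice
of "b" and "i" as formatting markup that should not split a word, and the
simple word pattern. It is not the tokenizer of any particular
implementation.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical: inline formatting elements that must NOT act as token boundaries.
TRANSPARENT_ELEMENTS = {"b", "i"}


def flatten(elem):
    """Concatenate an element's text, padding every non-transparent
    element with spaces so that it acts as a token boundary."""
    sep = "" if elem.tag in TRANSPARENT_ELEMENTS else " "
    parts = [sep, elem.text or ""]
    for child in elem:
        parts.append(flatten(child))
        parts.append(child.tail or "")
    parts.append(sep)
    return "".join(parts)


def tokenize_paragraphs(xml_text):
    """Return one token list per <p> element (markup delimits paragraphs)."""
    root = ET.fromstring(xml_text)
    return [re.findall(r"\w+", flatten(p)) for p in root.iter("p")]


sample = (
    "<doc>"
    "<p><b>S</b>tart here</p>"
    "<p>Buy <title>Dune</title><author>Herbert</author></p>"
    "</doc>"
)
print(tokenize_paragraphs(sample))
# [['Start', 'here'], ['Buy', 'Dune', 'Herbert']]
```

Here the "p" elements delimit paragraphs, the semantic "title" and "author"
elements separate "Dune" from "Herbert" even though no white space lies
between them, and the "b" element does not split "Start" into two tokens.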
 
Comment 2 Jochen Doerre 2006-10-13 10:12:22 UTC
DONE as agreed.