This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Bug 4667 - [Full text LC draft sec. 2.1] Status of text (nodes)
Summary: [Full text LC draft sec. 2.1] Status of text (nodes)
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Full Text 1.0
Version: Last Call drafts
Hardware: PC
OS: Windows XP
Importance: P2 normal
Target Milestone: ---
Assignee: Jim Melton
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-06-20 07:04 UTC by Felix Sasaki
Modified: 2007-07-26 00:21 UTC
CC List: 0 users

See Also:


Attachments

Description Felix Sasaki 2007-06-20 07:04:45 UTC
The second paragraph contains:

"Full-text operators typically work on sequences of token occurrences found in the target text (nodes) of a search."

Could you clarify the role of markup? That is, does the number of token occurrences found in the target differ depending on whether the target text (e.g., the content of a <p> element) is contained in one node (i.e., no embedded markup in <p>) or in several nodes (i.e., embedded markup such as <em> inside <p>)?

I assume that this behavior can be parametrized with the ignore option.
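
To make the two cases concrete, here is a plain XQuery sketch (the element names and content are just for illustration):

    let $one-node := <p>foobar</p>            (: a single text node :)
    let $split    := <p>foo<em>bar</em></p>   (: text split by inline markup :)
    return (fn:string($one-node), fn:string($split))

Both string values are "foobar"; the question is whether a tokenizer may nevertheless find two tokens in $split because of the <em> boundary.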
Comment 1 Jim Melton 2007-06-26 16:59:43 UTC
If we ignore questions about the effect of element boundaries on tokenization, it is correct to say that the tokenization algorithm is applied to the string value of the search context.  See section 4.1, definition of tokenization.  Therefore, the number of tokens (token occurrences) does not vary depending on embedded markup. 
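
For instance (our own illustration, runnable in any standard XQuery processor; the example element is not from the spec), taking the string value discards all markup before tokenization begins:

    let $p := <p>Emphasize a <em>syl</em>lable, as well as a <em>word</em> within text.</p>
    return fn:string($p)
    (: yields "Emphasize a syllable, as well as a word within text." :)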

The specification currently says, in Section 1.1, second list, item 5, "Some formatting markup serves well as token boundaries, for example, paragraphs are most commonly delimited by formatting markup. Other formatting markup may not serve well as token boundaries.  Implementations are free to provide implementation-defined ways to differentiate between the markup's effect on token boundaries during tokenization."  If all element tags in the search context fall adjacent to locations that would be tokenization boundaries in any case (e.g., space characters in our default tokenizer), then the answer does not change based on whether an implementation chooses to use or to ignore the presence of certain elements to determine token boundaries. 
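
As a sketch of that first case (an illustration of ours, not text from the draft), consider markup whose boundaries coincide with whitespace:

    (: Every <em> tag here is adjacent to a space, which is already a token
       boundary for the default tokenizer, so the count is seven tokens
       whether or not the implementation honors element boundaries. :)
    let $p := <p>an <em>emphasized</em> word inside a plain sentence</p>
    return fn:string($p)
    (: "an emphasized word inside a plain sentence" :)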

When an element tag occurs at a location other than an "ordinary" token boundary, the answer might change when the tokenizer chooses to create a token boundary based on the presence and position of the element tag. 

We believe that this answers your question.  However, it occurs to us that an example might help readers.  Would an example such as the following satisfy your comment? "<p>Emphasize a <em>syl</em>lable, as well as a <em>word</em> within text.</p>"  There might be ten or eleven tokens in that text, depending on whether the implementation chooses to create a token boundary at the first </em> closing tag.  The second <em>...</em> element would not (with our default tokenizer) affect the number of tokens in the search context. 
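
In XQuery Full Text terms, the difference could be probed with queries like these (a sketch only; the outcome of the first query is implementation-defined):

    let $p := <p>Emphasize a <em>syl</em>lable, as well as a <em>word</em> within text.</p>
    return (
      $p contains text "syllable",  (: true only if the first </em> introduces
                                       no token boundary, i.e. ten tokens :)
      $p contains text "word"       (: true either way; the second <em> is
                                       flanked by spaces :)
    )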

Regarding the use of the FTIgnoreOption, the document currently states (see section 4.3.1) that the process of ignoring depends on implementation decisions about whether various element boundaries are or are not ignored during tokenization.  We anticipate a proposal (from Michael Rys) that would further relax these limitations, allowing the behavior to be even more implementation-defined.
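
For reference, a hypothetical query using the FTIgnoreOption syntax (not an example taken from the draft):

    let $p := <p>Emphasize a <em>syl</em>lable.</p>
    return $p contains text "syllable" without content $p//em
    (: With the content of <em> excluded from the search context, only
       "Emphasize a lable." remains, so the search for "syllable" finds
       nothing and the result is false. :)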

I have marked this bug FIXED in the belief that we have answered your question, and with our agreement that we will happily add an example such as the one we suggested, at your request.  If you agree that this resolves your concern, please mark this bug CLOSED. 
Comment 2 Jim Melton 2007-07-26 00:21:47 UTC
In the absence of a response to my comment in http://www.w3.org/Bugs/Public/show_bug.cgi?id=4667#c1 for over a month, I am marking this bug CLOSED.  If this is not acceptable to you, please reopen it and explain your reasons.