This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 3783 - [FT] Tokenization: When to flow-through/flow-around markup?
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Full Text 1.0
Version: Working drafts
Hardware: PC Windows XP
Importance: P2 minor
Target Milestone: ---
Assignee: Joaquin Delgado
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2006-10-02 18:14 UTC by Joaquin Delgado
Modified: 2006-11-13 21:40 UTC

Description Joaquin Delgado 2006-10-02 18:14:55 UTC
>Issue: paragraphs and sentences (Test, mostly)
>Sentence boundary detection is highly language-dependent and
>relies on specific language and perhaps even vocabulary knowledge.
>Paragraph boundaries likewise, although in practice folks
>put paragraph structure into their markup, so then the issue is
>which markup counts as breaking paragraphs and which doesn't?
>
>Issue: flow-through/flow-around markup (Test, mostly)
>Similarly: which markup indicates word breaks and which doesn't?
>Which markup is flowed-around (e.g. footnotes) for phrase and
>proximity matching?
>
>I call these two spec issues also only because it is weird that
>we have query options for ignoring some nodes, but not for
>specifying any of these other important facts.  For the record,
>I think it is correct not to have them in the query, but I also
>think putting ignored nodes into the query is a big mistake as
>well.  I also think we need to acknowledge them in some way in
>testing and the spec.
>  
>
Here we have a testing issue as well as a spec problem, and we should discuss it in the task force right away. I would put these two issues under the same umbrella: when to flow-through/flow-around markup. In other words, some nodes should be considered or ignored for tokenization and querying, and that choice might alter the semantics of some of the operators defined in the spec. You have a valid point about FTIgnoreOption. For example, can bold markup, which is not a word breaker and is therefore ignored by the tokenizer, be considered part of the search context (i.e. allowing the search to be restricted to bolded nodes only)?

I propose adding the capability to:

    * Ignore tags in a particular namespace (e.g. the XHTML namespace)
    * Declare tags as delimiters for words, sentences and paragraphs.
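
To make the proposal concrete, here is a rough sketch in Python of a tokenizer whose treatment of markup is configurable. Nothing in it comes from the spec: the function and parameter names (`tokenize`, `flow_through`, `flow_around`, `transparent_ns`) and the `note` element are hypothetical. It also assumes that "ignoring tags in a namespace" means treating them as transparent (not word breakers), that flowed-around elements have their content dropped, and that every other element breaks words.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def split_tag(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        ns, local = tag[1:].split("}", 1)
        return ns, local
    return "", tag


def flatten(elem, flow_through, flow_around, transparent_ns, out):
    """Append the text of elem's subtree to out, honoring the markup policy."""
    if elem.text:
        out.append(elem.text)
    for child in elem:
        ns, local = split_tag(child.tag)
        if local in flow_around:
            pass  # flowed around: content is skipped (e.g. a footnote)
        elif local in flow_through or ns in transparent_ns:
            # flowed through: the tags are invisible to the tokenizer
            flatten(child, flow_through, flow_around, transparent_ns, out)
        else:
            # assumed default: any other element is a word delimiter
            out.append(" ")
            flatten(child, flow_through, flow_around, transparent_ns, out)
            out.append(" ")
        if child.tail:
            out.append(child.tail)


def tokenize(xml_text, flow_through=(), flow_around=(), transparent_ns=()):
    """Return lowercase word tokens for an XML string under the given policy."""
    out = []
    flatten(ET.fromstring(xml_text), set(flow_through), set(flow_around),
            set(transparent_ns), out)
    return re.findall(r"\w+", "".join(out).lower())


doc = "<p>un<b>believ</b>able claim<note>see page 3</note> here<br/>now</p>"

print(tokenize(doc))
# ['un', 'believ', 'able', 'claim', 'see', 'page', '3', 'here', 'now']

print(tokenize(doc, flow_through={"b"}, flow_around={"note"}))
# ['unbelievable', 'claim', 'here', 'now']
```

With the second policy a phrase query for "claim here" would match across the footnote, while with the default policy it would not. Sentence and paragraph delimiters could be declared in the same way, but they are left out to keep the sketch short.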
Comment 1 Joaquin Delgado 2006-11-13 21:37:26 UTC
We agreed at the F2F to leave this completely implementation-defined.