2299 – Distance constraints do not work on phrases (formerly Cluster G, Issue 63)

This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Full Text 1.0
Version:	Working drafts
Hardware:	PC Windows XP

Importance:	P2 normal
Target Milestone:	---
Assignee:	Jochen Doerre
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Depends on:	2599 2600
Blocks:

Reported:	2005-09-25 19:14 UTC by Jim Melton
Modified:	2006-04-12 22:40 UTC
CC List:	0 users

See Also:

Description Jim Melton 2005-09-25 19:14:04 UTC

It is not possible to combine the distance operation with searching for phrases.

Example:

[. ftcontains "Redmond-based" && "company" distance at least 2]

The problem is that a phrase is itself resolved internally into a distance
operation, which can impose a requirement that contradicts the explicit distance
operation used in the query. Here is the expansion (with some assumptions about
tokenization):

[. ftcontains ("Redmond" && "-" && "based" ordered with distance 0) && "company"
distance at least 2]

The second distance constraint is then imposed on all individual tokens
(including those from the phrase) and hence cannot be satisfied. The query will
therefore return false.
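
The contradiction can be reproduced with a small brute-force check. This is only a sketch: it assumes the tokenization used above ("Redmond", "-", "based" as three tokens), and the helper names `words_between`, `match_ok` and `satisfiable` are made up for illustration and are not part of the specification.

```python
from itertools import permutations


def words_between(p, q):
    """Number of token positions strictly between positions p and q."""
    return abs(q - p) - 1


def match_ok(phrase_positions, other_position, min_gap):
    """Check one candidate match against both distance constraints."""
    # Implicit constraint from the phrase expansion: ordered, distance 0.
    phrase_pairs = list(zip(phrase_positions, phrase_positions[1:]))
    implicit = all(q > p and words_between(p, q) == 0 for p, q in phrase_pairs)
    # Explicit constraint, applied to every token, including the phrase's own.
    ordered = sorted(list(phrase_positions) + [other_position])
    explicit = all(words_between(p, q) >= min_gap
                   for p, q in zip(ordered, ordered[1:]))
    return implicit and explicit


def satisfiable(doc_length, min_gap):
    """Try every placement of the four tokens in a doc_length-token document."""
    return any(match_ok(perm[:3], perm[3], min_gap)
               for perm in permutations(range(1, doc_length + 1), 4))


# "distance at least 2" clashes with the phrase's own "distance 0".
print(satisfiable(10, 2))
```

Under these assumptions no placement of the tokens satisfies both constraints once the explicit minimum distance is greater than 0, because the adjacent phrase tokens always violate it.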

Comment 1 Jochen Doerre 2005-11-28 18:41:30 UTC

To fix this we decided to model native phrases within matches as StringMatches
that span multiple tokens (intervals). In order to do so, the TokenInfo model
has to be extended to also model token intervals. At the same time, this change
lets us allow for tokenizers that produce overlapping tokens.

Summary of the discussion/decision:
 - We need to allow for overlapping tokens for multiple reasons.
 - A phrase can be modeled as a "token" spanning multiple positions. This
allows it to be treated as a unit in constraints like FTDistance.
 - FTDistance constraints always disallow overlapping of tokens. A
distance of 0 words (sentences/paragraphs) means adjacent words
(sentences/paragraphs).


Summary of changes to the semantics:
In 4.3.1 AllMatches
Change the type TokenInfo to now include the attributes
+startPos: integer
+endPos: integer
+startSent: integer
+endSent: integer
+startPara: integer
+endPara: integer

(as an aside: we also drop the "queryString", because it is not needed in the
semantics.)
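
As an illustration only, the interval attributes listed above could be modeled as follows. The Python names (`TokenInfo` fields in snake case, `single`, `span`) are assumptions of this sketch, not the XML representation from 4.3.1.3, and only the six position attributes are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenInfo:
    """Interval-based token description (startPos/endPos etc. from the list above)."""
    start_pos: int
    end_pos: int
    start_sent: int
    end_sent: int
    start_para: int
    end_para: int


def single(pos, sent, para):
    """Hypothetical helper: a one-token TokenInfo, whose start and end coincide."""
    return TokenInfo(pos, pos, sent, sent, para, para)


def span(tokens):
    """Hypothetical helper: the smallest interval covering all given tokens."""
    return TokenInfo(
        start_pos=min(t.start_pos for t in tokens),
        end_pos=max(t.end_pos for t in tokens),
        start_sent=min(t.start_sent for t in tokens),
        end_sent=max(t.end_sent for t in tokens),
        start_para=min(t.start_para for t in tokens),
        end_para=max(t.end_para for t in tokens),
    )


# A three-token phrase at positions 3..5 becomes one interval "token".
phrase = span([single(3, 1, 1), single(4, 1, 1), single(5, 1, 1)])
print(phrase)
```

A single-position token is then just the special case in which each start attribute equals the corresponding end attribute.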

4.3.1.3 XML representation (of AllMatches) adapted to the model above.

4.3.1.4 and 4.3.1.5 (Normalization). To be adapted, but not yet done.

4.3.2.9 FTOrder

Throughout the function, instead of testing for "tokenInfo/@pos", we 
should test for "tokenInfo/@startPos", i.e. the order constraint is only 
sensitive to the starting positions of matched tokens.
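
A minimal sketch of an order check that looks only at start positions. The `query_pos` field (the position of the search term within the query) and the function name `is_ordered` are assumptions made for this example; the sketch also assumes distinct `query_pos` values and a non-strict comparison.

```python
from collections import namedtuple

# query_pos: assumed position of the search term within the query.
StringInclude = namedtuple("StringInclude", "query_pos start_pos end_pos")


def is_ordered(includes):
    """True if start positions are non-decreasing in query order."""
    by_query = sorted(includes, key=lambda si: si.query_pos)
    return all(a.start_pos <= b.start_pos
               for a, b in zip(by_query, by_query[1:]))


phrase = StringInclude(query_pos=1, start_pos=3, end_pos=5)
company = StringInclude(query_pos=2, start_pos=8, end_pos=8)
print(is_ordered([phrase, company]))
```

Because end positions are ignored, a match whose second item starts inside the first (for example at position 4 here) would still count as ordered in this sketch.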

4.3.2.10 FTScope

Same sentence: for each match in the input AllMatches, all sentence positions
covered by the StringIncludes must be the same. Retain only those
StringExcludes that cover that same sentence (or, if there are no
StringIncludes, at most one sentence).

Different sentence: for each match the StringIncludes cover disjoint
sentences. Keep StringExcludes that cover sentences not covered by any
StringInclude (drop a StringExclude if some sentence is covered by both it
and a StringInclude).

Same/different paragraph is analogous.
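
The StringInclude side of these two scope checks could look roughly like this. It is a sketch under simplifying assumptions: each StringInclude is reduced to a hypothetical `(start_sent, end_sent)` pair, and the handling of StringExcludes is left out.

```python
def covered(start_sent, end_sent):
    """All sentence positions covered by one interval."""
    return set(range(start_sent, end_sent + 1))


def same_sentence(includes):
    """True if the includes together cover at most one sentence position."""
    sentences = set()
    for start_sent, end_sent in includes:
        sentences |= covered(start_sent, end_sent)
    return len(sentences) <= 1


def different_sentence(includes):
    """True if no sentence position is covered by two different includes."""
    seen = set()
    for start_sent, end_sent in includes:
        current = covered(start_sent, end_sent)
        if seen & current:
            return False
        seen |= current
    return True


print(same_sentence([(2, 2), (2, 2)]))       # both includes inside sentence 2
print(different_sentence([(1, 2), (3, 3)]))  # sentences {1, 2} versus {3}
```

Note that a phrase spanning a sentence boundary, such as `(1, 2)`, can never pass the same-sentence check in this model. The paragraph variants would use `start_para`/`end_para` in the same way.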


4.3.2.12 FTDistance

Distance constraints are never satisfied for a match that contains two
StringIncludes which overlap. Check for each match that, in the list of
StringIncludes sorted by startPos, for each pair of consecutive
StringIncludes SI1, SI2 the end position (sentence/paragraph) of SI1 (the
preceding) is within the required distance from the start position
(sentence/paragraph) of SI2 (the succeeding). Keep only those
StringExcludes that are within the required distance from one of the
StringIncludes.

(changed all 12 functions).
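
For the word-distance case, the check described above might be sketched as follows. The function name `distance_ok` and its `at_least`/`at_most` parameters are invented for this example, StringExcludes are not handled, and the sentence/paragraph variants are omitted.

```python
from collections import namedtuple

StringInclude = namedtuple("StringInclude", "start_pos end_pos")


def distance_ok(includes, at_least=0, at_most=None):
    """Check consecutive includes (sorted by start_pos) against a distance range."""
    ordered = sorted(includes, key=lambda si: si.start_pos)
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt.start_pos <= prev.end_pos:
            return False  # overlapping includes never satisfy a distance
        # Words between the end of prev and the start of nxt; 0 means adjacent.
        gap = nxt.start_pos - prev.end_pos - 1
        if gap < at_least:
            return False
        if at_most is not None and gap > at_most:
            return False
    return True


# "Redmond-based" as one interval at positions 3..5, "company" at position 8.
phrase = StringInclude(3, 5)
company = StringInclude(8, 8)
print(distance_ok([phrase, company], at_least=2))
```

With the phrase treated as a single interval, the query from the description becomes satisfiable: two words lie between the end of the phrase (position 5) and "company" (position 8).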

4.3.2.13 FTWindow

For each match the minimal startPos and the maximal endPos of the 
StringIncludes must fit into a window of N positions. Drop 
StringExcludes that may not be completely covered by any window covering 
the StringIncludes.
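
The window test for the StringIncludes reduces to one comparison. A sketch, assuming each StringInclude is given as a hypothetical `(start_pos, end_pos)` pair, that a match has at least one StringInclude, and ignoring StringExcludes:

```python
def fits_window(includes, n):
    """True if all includes lie inside some window of n token positions."""
    min_start = min(start for start, _ in includes)
    max_end = max(end for _, end in includes)
    # Positions min_start..max_end inclusive must number at most n.
    return max_end - min_start + 1 <= n


# Phrase at positions 3..5 and a single token at position 8 span 6 positions.
print(fits_window([(3, 5), (8, 8)], 6))
print(fits_window([(3, 5), (8, 8)], 5))
```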

Comment 2 Jochen Doerre 2006-01-25 10:39:05 UTC

Update re. change of type TokenInfo: the word attribute of TokenInfo is required
for the handling of diacritics and special characters (pointed out by Chavdar in
our FTTF-093 F2F). Hence, we need to keep this attribute.

Comment 3 Jim Melton 2006-04-12 22:40:30 UTC

With the closure of the two bugs on which this one depends, this bug can be marked CLOSED.