This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
It is not possible to combine the distance operation with a search for phrases. Example:
[. ftcontains "Redmond-based" && "company" distance at least 2]
The problem is that a phrase is itself resolved internally into a distance operation, which can contradict the explicit distance operation used in the query. With some assumptions about tokenization, the query above is resolved into:
[. ftcontains ("Redmond" && "-" && "based" ordered with distance 0) && "company" distance at least 2]
The second distance constraint is then imposed on all individual tokens, including those from the phrase, and hence cannot be satisfied: the phrase requires its tokens to be adjacent, while the explicit constraint requires every pair of consecutive matched tokens to be at least 2 words apart. The query will therefore always return false.
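The contradiction can be checked mechanically. The following is only a rough sketch in Python, not the semantics functions from the specification: the helper names (gaps, is_phrase, distance_at_least, satisfiable) are made up for illustration, and it assumes the tokenization given above, with one position per token and the distance measured as the number of tokens between two consecutive matched tokens.

```python
def gaps(positions):
    """Count the tokens lying strictly between consecutive matched positions."""
    ordered = sorted(positions)
    return [b - a - 1 for a, b in zip(ordered, ordered[1:])]


def is_phrase(positions):
    """Phrase reading: tokens in query order, each adjacent to the next (distance 0)."""
    return positions == sorted(positions) and all(g == 0 for g in gaps(positions))


def distance_at_least(positions, n):
    """Explicit constraint, applied to every individual matched token."""
    return all(g >= n for g in gaps(positions))


def satisfiable(max_pos=12, n=2):
    """Brute-force search for any placement that satisfies both constraints."""
    for start in range(1, max_pos - 1):
        phrase = [start, start + 1, start + 2]  # "Redmond", "-", "based"
        for company in range(1, max_pos + 1):
            if company in phrase:
                continue
            if is_phrase(phrase) and distance_at_least(phrase + [company], n):
                return True
    return False


print(gaps([1, 2, 3, 6]))  # [0, 0, 2]
print(satisfiable())       # False
```

Whatever position "company" takes, the phrase tokens themselves are 0 words apart, so a per-token "at least 2" check fails before the phrase-to-"company" gap is even considered.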
To fix this we decided to model phrases natively within matches as StringMatches that span multiple tokens (intervals). In order to do so, the TokenInfo model has to be extended to also model token intervals. At the same time, this change lets us support tokenizers that produce overlapping tokens.

Summary of the discussion/decision:
- We need to allow for overlapping tokens for multiple reasons.
- A phrase can be modeled as a "token" spanning multiple positions. This allows it to be treated as a unit in constraints like FTDistance.
- FTDistance constraints always disallow overlapping of tokens. A distance of 0 words (sentences/paragraphs) means adjacent words (sentences/paragraphs).

Summary of changes to the semantics:

In 4.3.1 AllMatches
Change the type TokenInfo to now include the attributes
+startPos: integer
+endPos: integer
+startSent: integer
+endSent: integer
+startPara: integer
+endPara: integer
(As an aside: we also drop the "queryString", because it is not needed in the semantics.)

4.3.1.3 XML representation (of AllMatches)
Adapted to the model above.

4.3.1.4 and 4.3.1.5 (Normalization)
To be adapted, but not yet done.

4.3.2.9 FTOrder
Throughout the function, instead of testing for "tokenInfo/@pos", we should test for "tokenInfo/@startPos", i.e. the order constraint is only sensitive to the starting positions of matched tokens.

4.3.2.10 FTScope
Same sentence: for each match of the input AllMatches, all sentence positions covered by each of the StringIncludes must be the same. Retain only those StringExcludes that cover that same sentence (or, if there are no StringIncludes, at most one sentence).
Different sentence: for each match, the StringIncludes cover disjoint sentences. Keep StringExcludes that cover sentences not covered by any StringInclude (drop a StringExclude if some sentence is covered by both).
Same/different paragraph is analogous.

4.3.2.12 FTDistance
Distance constraints are never satisfied for a match that contains two StringIncludes which overlap.
Check for each match that the list of StringIncludes, sorted by startPos, is such that for each pair of consecutive StringIncludes SI1, SI2, the end position (sentence/paragraph) of SI1 (the preceding one) is within the required distance from the start position (sentence/paragraph) of SI2 (the succeeding one). Keep only those StringExcludes that are within the required distance from one of the StringIncludes. (Changed all 12 functions.)

4.3.2.13 FTWindow
For each match, the minimal startPos and the maximal endPos of the StringIncludes must fit into a window of N positions. Drop StringExcludes that cannot be completely covered by any window covering the StringIncludes.
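To make the interval model concrete, here is a minimal sketch in Python of the word-level checks described for FTDistance and FTWindow. It is an illustration under stated assumptions, not the XQuery functions from the semantics: the class and function names are hypothetical, only startPos/endPos are modeled (sentences and paragraphs would be analogous), StringExcludes are ignored, and a distance of 0 is taken to mean adjacent words as decided above.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class TokenInfo:
    """A matched token or phrase covering positions start_pos..end_pos (inclusive)."""
    word: str
    start_pos: int
    end_pos: int


def overlaps(a, b):
    """True if the two position intervals share at least one position."""
    return a.start_pos <= b.end_pos and b.start_pos <= a.end_pos


def distance_ok(includes, at_least=0, at_most=None):
    """Word-distance check over the StringIncludes of one match."""
    # A match containing two overlapping StringIncludes never satisfies a distance.
    if any(overlaps(a, b) for a, b in combinations(includes, 2)):
        return False
    ordered = sorted(includes, key=lambda t: t.start_pos)
    for si1, si2 in zip(ordered, ordered[1:]):
        # Words between the end of SI1 and the start of SI2; 0 means adjacent.
        gap = si2.start_pos - si1.end_pos - 1
        if gap < at_least or (at_most is not None and gap > at_most):
            return False
    return True


def window_ok(includes, size):
    """Minimal start_pos and maximal end_pos must fit into a window of `size` positions."""
    first = min(t.start_pos for t in includes)
    last = max(t.end_pos for t in includes)
    return last - first + 1 <= size


# The phrase from the original report, now a single unit spanning three positions.
phrase = TokenInfo("Redmond-based", 1, 3)
company = TokenInfo("company", 6, 6)

print(distance_ok([phrase, company], at_least=2))  # True
print(window_ok([phrase, company], size=6))        # True
print(window_ok([phrase, company], size=5))        # False
```

Because the phrase is a single StringInclude, the explicit distance is measured only from its end position to the start of "company", so the query from the original report becomes satisfiable.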
Update re. change of type TokenInfo: the word attribute of TokenInfo is required for handling diacritics and special characters (pointed out by Chavdar in our FTTF-093 F2F). Hence, we need to keep this attribute.
With the closure of the two bugs on which this one depends, this bug can be marked CLOSED.