This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
According to the last published draft: "A word is defined as a character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be searched. Each instance of a word consists of one or more consecutive characters. Beyond that, words are implementation-defined. Note that consecutive words need not be separated by either punctuation or space, and words may overlap. A phrase is a sequence of ordered words which may contain any number of words." I'm not convinced we should use "word" in this definition. In plain English, "word" carries its own semantics and is associated with a concept, so readers may confuse the two meanings. An n-gram or an arbitrary sequence of characters has no such connotation. I think the definition above is closer to "token". In fact, we later refer to words as tokens: "Whatever a tokenizer for a particular language chooses to do, it must preserve the containment hierarchy: paragraphs contain sentences which contain words. The tokenizer has to evaluate two equal strings in the same way, i.e., it should identify the same tokens." We also use a data structure called TokenInfo. I think it's better to use "token" throughout the document, or to state clearly that words and tokens mean the same thing.
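To make the quoted definition concrete, here is a minimal sketch of a character n-gram tokenizer. It is only an illustration under stated assumptions: the `Token` record and the `ngram_tokens` function are hypothetical names, not the draft's TokenInfo or any implementation's API. It shows units that overlap, that are not separated by spaces or punctuation, and that come out the same for equal input strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token record: the matched text and its start offset."""
    text: str
    start: int


def ngram_tokens(s: str, n: int = 2) -> list[Token]:
    """Return every run of n consecutive characters in s as a token."""
    return [Token(s[i:i + n], i) for i in range(len(s) - n + 1)]


tokens = ngram_tokens("東京都")
# Consecutive tokens overlap and are not delimited by spaces or punctuation.
print([t.text for t in tokens])  # ['東京', '京都']

# Equal strings are tokenized identically.
assert ngram_tokens("東京都") == ngram_tokens("東京都")
```

Neither unit returned here needs to be a "word" in the plain-English sense, which is the distinction at issue.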
Thanks for raising this discrepancy, Joaquin. Early on, in consultation with the I18n, we decided to use "words", not "tokens". "Word" holds more meaning and is less obscure. I still feel strongly that we were right in that decision. We struggled, again with assistance from the I18n, to produce what I think is an excellent definition of the word "word". In Section 4 we use "TokenInfo" (and define it) and variables such as "$searchToken". I would prefer they be "WordInfo" and "$searchWord", but can live with these. In Section 4 I also see occurrences of "token" and "search token". I would like to see these changed to "word", "$searchToken", or another variable name where appropriate. If we decide we need to keep the word "token" in Section 4, I agree it should be defined: as a word returned by a tokenizer and used as a search operand. I think I am correct that a token is always a word? When we treat phrases, sentences, and paragraphs as single units we call them intervals, right?
This comment amends my previous one, which said: "If we decide we need to keep the word "token" in Section 4, I agree it should be defined, defined as a word returned by a tokenizer used as a search operand." Full Text is part of XQuery and XPath and may someday be folded into those specs, and XQuery uses the word "token" for items other than words (without defining it). We therefore probably should not use "token" in a more restrictive way within Full Text, and so should not define tokens in terms of words there. I recommend always using "word" instead of "token" because it is more specific to full-text querying.
If "word" is an issue, what about "term"? It is used heavily in information retrieval to mean "word".
The problem is that the implementation community always uses the term "token", not "word". Since this is primarily an implementation spec, I strongly urge us to use a term that the implementers can understand!
Changed occurrences of "word" to "token" wherever it makes sense, since "word" has a special meaning in English. Also added a note that "word" and "token" refer to the same concept in some natural languages.