Bug 2766: Word or Token (need clarification)

Created: 2006-01-25 01:21:32 +0000
Last modified: 2006-08-16 17:55:38 +0000
Classification: Unclassified
Product: XPath / XQuery / XSLT
Component: Full Text 1.0
Version: Working drafts
Platform: All
OS: Windows XP
Status: CLOSED FIXED
Priority: P2
Severity: normal
Reporter: joaquin.delgado
Assigned to: sihem
QA contact: public-qt-comments

Comment 0 (id 7962), joaquin.delgado, 2006-01-25 01:21:33 +0000

According to the last published draft:

"A word is defined as a character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be searched. Each instance of a word consists of one or more consecutive characters. Beyond that, words are implementation-defined. Note that consecutive words need not be separated by either punctuation or space, and words may overlap. A phrase is a sequence of ordered words which may contain any number of words."

I'm not convinced we should use "word", which has its own semantics in plain English, in the above definition. The problem I have with "word" is that it may be confused with the plain-English meaning of "word", which is associated with a concept. Notice that an n-gram or an arbitrary sequence of characters does not have such a connotation. I think the definition above relates more to "token". In fact, we later refer to words as tokens:

"Whatever a tokenizer for a particular language chooses to do, it must preserve the containment hierarchy: paragraphs contain sentences which contain words. The tokenizer has to evaluate two equal strings in the same way, i.e., it should identify the same tokens."

and also use the data structure called TokenInfo. I think it's better to use tokens throughout the document or clearly state that words and tokens mean the same thing.

Comment 1 (id 7971), pcase, 2006-01-25 14:51:29 +0000

Thanks for raising this discrepancy, Joaquin. Early on, in consultation with the I18n, we decided to use "words" not "tokens". It holds more meaning and is less obscure. I still feel strongly that we were right in that decision.
We struggled, again with assistance from the I18n, to produce what I think is an excellent definition of the word "word". In Section 4 we use "TokenInfo" (and define it) and variables such as "$searchToken". I would prefer they be "WordInfo" and "$searchWord", but can live with these. In Section 4 I see occurrences of token and search token. I would like to see these changed to word, $searchToken, or another variable name where appropriate.

If we decide we need to keep the word "token" in Section 4, I agree it should be defined, defined as a word returned by a tokenizer used as a search operand. I think I am correct that a token is always a word? When we treat phrases, sentences, and paragraphs as single units we call them intervals, right?

Comment 2 (id 7973), pcase, 2006-01-25 17:44:11 +0000

This comment amends my previous one, which said:

> If we decide we need to keep the word "token" in Section 4, I agree it should
> be defined, defined as a word returned by a tokenizer used as a search operand.

Remembering that Full Text is part of XQuery and XPath and may someday fold into those specs, and knowing that XQuery uses the word "token" for items other than words (without defining it), we probably should not use tokens in a more restrictive way within Full Text, so we probably shouldn't define tokens in terms of words within Full Text. I recommend always using "word" instead of "token" because it is more specific to full-text querying.

Comment 3 (id 7974), sihem, 2006-01-25 17:55:31 +0000

If "word" is an issue, what about "term"? It is used heavily in information retrieval to mean "word".

Comment 4 (id 7975), mrys, 2006-01-25 18:00:19 +0000

The problem is that the implementation community always uses the term "token" and not "word". Since this is primarily an implementation spec, I strongly urge us to use a term that the implementers can understand!

Comment 5 (id 8046), sihem, 2006-01-30 18:33:30 +0000

Changed occurrences of "word" into "token" wherever it makes sense, since "word" has a special meaning in English.
Also added a statement that, in some natural languages, "word" and "token" refer to the same concept.
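
For readers following the thread, the draft definition quoted in comment 0 may be easier to picture with a small sketch. The code below is only an illustration under stated assumptions: the `Token` tuple and the functions `word_tokens`, `ngram_tokens`, and `overlaps` are hypothetical names made up here, and `Token` is not the spec's TokenInfo structure. It shows two tokenizers that would both fit the quoted definition, one of which returns overlapping units that are not words in the plain-English sense.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical search unit; not the TokenInfo structure from the draft."""
    text: str   # the consecutive characters that make up the unit
    start: int  # offset of the first character in the input
    end: int    # offset one past the last character


def word_tokens(s: str) -> List[Token]:
    """Return runs of word characters; these units never overlap."""
    return [Token(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", s)]


def ngram_tokens(s: str, n: int = 2) -> List[Token]:
    """Return every run of n consecutive characters; neighbours overlap."""
    return [Token(s[i:i + n], i, i + n) for i in range(len(s) - n + 1)]


def overlaps(a: Token, b: Token) -> bool:
    """True when the two units share at least one character position."""
    return a.start < b.end and b.start < a.end


if __name__ == "__main__":
    print([t.text for t in word_tokens("full text")])
    grams = ngram_tokens("text")
    print([t.text for t in grams])
    print(overlaps(grams[0], grams[1]))
```

Both functions are deterministic, so two equal strings always yield the same units, which is the requirement quoted in comment 0. The n-gram output is the kind of result that "token" describes more naturally than "word".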