2766 – Word or Token (need clarification)

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 2766 - Word or Token (need clarification)

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Full Text 1.0
Version:	Working drafts
Hardware:	All (OS: Windows XP)

Importance:	P2 normal
Target Milestone:	---
Assignee:	Sihem Amer-Yahia
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2006-01-25 01:21 UTC by Joaquin Delgado
Modified:	2006-08-16 17:55 UTC
CC List:	0 users

Description Joaquin Delgado 2006-01-25 01:21:33 UTC

According to the last published draft:

"A word is defined as a character, n-gram, or sequence of characters returned 
by a tokenizer as a basic unit to be searched. Each instance of a word 
consists of one or more consecutive characters. Beyond that, words are 
implementation-defined. Note that consecutive words need not be separated by 
either punctuation or space, and words may overlap. A phrase is a sequence of 
ordered words which may contain any number of words."

I'm not convinced we should use "word", which has its own semantics in plain
English, in the above definition. The problem I have with "word" is that it
may be confused with the meaning of "word" in plain English, which is
associated with a concept. Notice that an n-gram or an arbitrary sequence of
characters has no such connotation. I think the definition above relates
more to "token". In fact, we later refer to words as
tokens: "Whatever a tokenizer for a particular language chooses to do, it must 
preserve the containment hierarchy: paragraphs contain sentences which contain 
words. The tokenizer has to evaluate two equal strings in the same way, i.e., 
it should identify the same tokens." We also use the data structure called
TokenInfo. I think it's better to use "token" throughout the document, or to
state clearly that words and tokens mean the same thing.
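
A minimal sketch of the distinction, under stated assumptions: the two
tokenizers below, the Token record, and the overlaps helper are all
hypothetical names made up for illustration. Token is not the draft's
TokenInfo, and nothing here is taken from the draft. The sketch only shows
that one tokenizer can return units that look like English words while
another returns overlapping bigrams, and that both return the same tokens
for equal strings.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    """Hypothetical record for illustration; not the draft's TokenInfo."""
    text: str
    start: int  # offset of the first character
    end: int    # offset one past the last character


def whitespace_tokens(s: str) -> List[Token]:
    """Split on whitespace; the results resemble plain-English words."""
    tokens = []
    pos = 0
    for piece in s.split():
        start = s.index(piece, pos)  # locate this piece at or after pos
        end = start + len(piece)
        tokens.append(Token(piece, start, end))
        pos = end
    return tokens


def bigram_tokens(s: str) -> List[Token]:
    """Return every two-character window; adjacent tokens share a character."""
    return [Token(s[i:i + 2], i, i + 2) for i in range(len(s) - 1)]


def overlaps(a: Token, b: Token) -> bool:
    """True when the two character ranges share at least one position."""
    return a.start < b.end and b.start < a.end


words = whitespace_tokens("full text search")
grams = bigram_tokens("word")

print([t.text for t in words])       # ['full', 'text', 'search']
print([t.text for t in grams])       # ['wo', 'or', 'rd']
print(overlaps(grams[0], grams[1]))  # True
print(overlaps(words[0], words[1]))  # False
```

Both functions fit the quoted definition (one or more consecutive
characters returned by a tokenizer), but only the first returns units a
reader would call words, which is the ambiguity described above.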

Comment 1 Pat Case 2006-01-25 14:51:29 UTC

Thanks for raising this discrepancy, Joaquin.

Early on, in consultation with the I18n, we decided to use "words",
not "tokens". "Word" holds more meaning and is less obscure. I still feel
strongly that we were right in that decision.

We struggled, again with assistance from the I18n, to produce what I think is 
an excellent definition of the word "word". 

In Section 4 we use "TokenInfo" (and define it) and variables such
as "$searchToken". I would prefer they be "WordInfo" and "$searchWord", but
can live with these.

In Section 4 I see occurrences of token and search token. I would like to see 
these changed to word, $searchToken, or another variable name where appropriate.

If we decide we need to keep the word "token" in Section 4, I agree it should 
be defined, defined as a word returned by a tokenizer used as a search operand.

I think I am correct that a token is always a word? When we treat phrases, 
sentences, and paragraphs as single units we call them intervals, right?

Comment 2 Pat Case 2006-01-25 17:44:11 UTC

This comment amends my previous one which said:

>If we decide we need to keep the word "token" in Section 4, I agree it should 
>be defined, defined as a word returned by a tokenizer used as a search operand.

Remembering that Full Text is part of XQuery and XPath and may someday fold
into those specs, and knowing that XQuery uses the word "token" for items
other than words (without defining it), we probably should not use "token" in
a more restrictive way within Full Text, and so should not define tokens in
terms of words there.

I recommend always using "word" instead of "token" because it is more specific 
to full-text querying.

Comment 3 Sihem Amer-Yahia 2006-01-25 17:55:31 UTC

If "word" is an issue, what about "term"? It is used heavily in information
retrieval to mean "word".

Comment 4 Michael Rys 2006-01-25 18:00:19 UTC

The problem is that the implementation community always uses the term "token" 
and not word. Since this is primarily an implementation spec, I strongly urge 
us to use a term that the implementers can understand!

Comment 5 Sihem Amer-Yahia 2006-01-30 18:33:30 UTC

Changed occurrences of "word" into "token" wherever it makes sense, since
"word" has a special meaning in English. Also added that "word" and "token"
refer to the same concept in some natural languages.