<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>2766</bug_id>
          
          <creation_ts>2006-01-25 01:21:32 +0000</creation_ts>
          <short_desc>Word or Token (need clarification)</short_desc>
          <delta_ts>2006-08-16 17:55:38 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>XPath / XQuery / XSLT</product>
          <component>Full Text 1.0</component>
          <version>Working drafts</version>
          <rep_platform>All</rep_platform>
          <op_sys>Windows XP</op_sys>
          <bug_status>CLOSED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Joaquin Delgado">joaquin.delgado</reporter>
          <assigned_to name="Sihem Amer-Yahia">sihem</assigned_to>
          
          
          <qa_contact name="Mailing list for public feedback on specs from XSL and XML Query WGs">public-qt-comments</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>7962</commentid>
    <comment_count>0</comment_count>
    <who name="Joaquin Delgado">joaquin.delgado</who>
    <bug_when>2006-01-25 01:21:33 +0000</bug_when>
    <thetext>According to the last published draft:

&quot;A word is defined as a character, n-gram, or sequence of characters returned 
by a tokenizer as a basic unit to be searched. Each instance of a word 
consists of one or more consecutive characters. Beyond that, words are 
implementation-defined. Note that consecutive words need not be separated by 
either punctuation or space, and words may overlap. A phrase is a sequence of 
ordered words which may contain any number of words.&quot;

I&apos;m not convinced we should use &quot;word&quot;, which has its own semantics in plain 
English, in the above definition. The problem I have with &quot;word&quot; is that it 
may get confused with the meaning of &quot;word&quot; in plain English, which is 
associated with a concept. Notice that an N-gram or an arbitrary sequence of 
characters does not have such connotation. I think the definition above 
relates more to &quot;token&quot;. In fact, we later refer to words as 
tokens: &quot;Whatever a tokenizer for a particular language chooses to do, it must 
preserve the containment hierarchy: paragraphs contain sentences which contain 
words. The tokenizer has to evaluate two equal strings in the same way, i.e., 
it should identify the same tokens.&quot; We also use the data structure called 
TokenInfo. I think it&apos;s better to use tokens throughout the document or 
to clearly state that words and tokens mean the same thing.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>7971</commentid>
    <comment_count>1</comment_count>
    <who name="Pat Case">pcase</who>
    <bug_when>2006-01-25 14:51:29 +0000</bug_when>
    <thetext>Thanks for raising this discrepancy, Joaquin.

Early on, in consultation with the I18n, we decided to use &quot;words&quot; 
not &quot;tokens&quot;. It holds more meaning and is less obscure. I still feel strongly 
that we were right in that decision.

We struggled, again with assistance from the I18n, to produce what I think is 
an excellent definition of the word &quot;word&quot;. 

In Section 4 we use &quot;TokenInfo&quot; (and define it) and variables such 
as &quot;$searchToken&quot;. I would prefer they be &quot;WordInfo&quot; and &quot;$searchWord&quot;, but can 
live with these. 

In Section 4 I see occurrences of token and search token. I would like to see 
these changed to word, $searchToken, or another variable name where appropriate.

If we decide we need to keep the word &quot;token&quot; in Section 4, I agree it should 
be defined, defined as a word returned by a tokenizer used as a search operand.

I think I am correct that a token is always a word? When we treat phrases, 
sentences, and paragraphs as single units we call them intervals, right?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>7973</commentid>
    <comment_count>2</comment_count>
    <who name="Pat Case">pcase</who>
    <bug_when>2006-01-25 17:44:11 +0000</bug_when>
    <thetext>This comment amends my previous one which said:

&gt;If we decide we need to keep the word &quot;token&quot; in Section 4, I agree it should 
&gt;be defined, defined as a word returned by a tokenizer used as a search operand.

Remembering that Full Text is part of XQuery and XPath and may someday fold 
into those specs, and knowing that XQuery uses the word &quot;token&quot; for items other 
than words (without defining it), we probably should not use tokens in a more 
restrictive way within Full Text, so we probably shouldn&apos;t define tokens in 
terms of words within Full Text. 

I recommend always using &quot;word&quot; instead of &quot;token&quot; because it is more specific 
to full-text querying.
</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>7974</commentid>
    <comment_count>3</comment_count>
    <who name="Sihem Amer-Yahia">sihem</who>
    <bug_when>2006-01-25 17:55:31 +0000</bug_when>
    <thetext>If &quot;word&quot; is an issue, what about &quot;term&quot;? It is used heavily in information
retrieval to mean &quot;word&quot;.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>7975</commentid>
    <comment_count>4</comment_count>
    <who name="Michael Rys">mrys</who>
    <bug_when>2006-01-25 18:00:19 +0000</bug_when>
    <thetext>The problem is that the implementation community always uses the term &quot;token&quot; 
and not word. Since this is primarily an implementation spec, I strongly urge 
us to use a term that the implementers can understand!
</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>8046</commentid>
    <comment_count>5</comment_count>
    <who name="Sihem Amer-Yahia">sihem</who>
    <bug_when>2006-01-30 18:33:30 +0000</bug_when>
    <thetext>Changed occurrences of &quot;word&quot; into &quot;token&quot; wherever it makes sense, since &quot;word&quot; has a
special meaning in English. Also, added that &quot;word&quot; and &quot;token&quot; in some natural
languages refer to the same concept.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>