Re: more on attribute proposal from Alan Kent on 2003-07-21 (www-zig@w3.org from July 2003)

From: Alan Kent <ajk@mds.rmit.edu.au>
Date: Mon, 21 Jul 2003 10:06:35 +1000
To: www-zig@w3.org
Message-ID: <20030721100635.B10849@io.mds.rmit.edu.au>

Hi Ray,

I have replied to your email below.

Sorry to be a broken record, but I think it's *critical* to get scanning
to work. Once scanning is fixed, the query stuff falls out in the wash.
The attribute architecture is not only for querying - it's for everything
that uses attributes. This includes scanning.

If you step away from doing searches for a moment, and just look at
scanning indexes, you immediately and clearly hit the problem (in my
opinion anyway! ;-). If I have term-lists that contain words from titles
and the complete values of titles, then how do I express an attribute
list for scanning?

If I read the textual descriptions of the various attribute types, then
format/structure sounds ideal. The Bib-2 attribute values make complete
sense for scanning. It's the Util attribute values that are strange.
Why specify 'any of these words' when doing a scan, just to identify that
I want to scan the title as words? It's semantically wrong. You want to
say that you want words, independently of the search-oriented operator
that says how to handle multiple terms in a search.

Once you reach this point, you realise the descriptions for the attribute
types are good. The overall architecture is good. It's just that the utility
attribute set has not defined words vs strings, and that some comparison
operators (any/all/adj) have slipped into format/structure by mistake.

So I strongly recommend forgetting searches for a moment, and thinking
about SCAN requests. What are the attribute lists for scanning title
as keywords and title as complete values?
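
To make the scanning problem concrete, here is a minimal sketch of a
server holding two term-lists built from the same titles, one of words
and one of complete values. Everything in it is an assumption made for
illustration: the labels "word" and "string" stand in for the proposed
format/structure values, and scan() is a hypothetical helper, not the
Z39.50 Scan service itself.

```python
import bisect

# Sample data, invented for this sketch.
titles = ["The Wind in the Willows", "Gone with the Wind"]

# Two term-lists over the same access point (title), keyed by a
# hypothetical format/structure label.
term_lists = {
    "word": sorted({w for t in titles for w in t.lower().split()}),
    "string": sorted({t.lower() for t in titles}),
}

def scan(structure, start, count):
    """Return up to `count` terms from the chosen list, starting at `start`."""
    terms = term_lists[structure]
    i = bisect.bisect_left(terms, start.lower())
    return terms[i:i + count]

print(scan("word", "wi", 3))    # ['willows', 'wind', 'with']
print(scan("string", "t", 5))   # ['the wind in the willows']
```

The only thing that distinguishes the two scans is the structure label.
No any/all/adj operator is involved at any point, which is why those
values read oddly in a scan request.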



On Fri, Jul 18, 2003 at 05:46:57PM -0400, Ray Denenberg wrote:
> There's consensus (among those who have participated in this
> discussion) that allTheseWords, anyOfTheseWords, adjacentWords should
> be changed from Structure/format  to Comparison attributes.
> 
> There's less consensus about adding two new Structure/format
> attributes, (1) word(s), and (2) string (or 'completeValue').  Mike
> feels strongly that they should be added, and I don't feel strongly but
> am somewhat uncomfortable about adding them (without clarifying certain
> other parts of the proposal). I don't know how strongly Alan feels.  And
> I'd like to get other opinions.

Actually, I think the discussion has been the opposite. I think there is
strong consensus that word and string should be in format/structure.
This is because they should be talking about the format or interpretation
of the structure of the value supplied. This is ideal for doing index
scanning too as the attribute is also for what is returned by a SCAN
request - it describes the format/structure of the returned scan terms.
It is not purely a query attribute.

As a *result* of this consensus, it was realised and agreed that the
all/any/adj words attributes should move out - they are in the wrong spot.
Comparison is a more correct place. It makes sense with scanning too.
Comparisons are query operators, not scanning stuff (this is a little
hand-wavy here, which is always dangerous as I know I can come up with
example applications where this is not true).

But I am happy to get other people's opinions. I think I have finally
managed to express what I meant clearly so that Mike and Rob understand
and agree with what I am saying (they may have actually reached where
I am at before me as a result of the CQL work).

> This is how I see it: if the query term is a set of words, and the
> comparison attribute is one of the above three, then clearly a
> structure/attribute to indicate  "words" is not necessary.

To keep arguments simpler, I have tried to avoid the different
term extraction rules side of things. But I want to support different
definitions of what a 'word' is. To me, allWords, anyWords etc really
should be allTerms, anyTerms, etc. They define what to do if there
are multiple terms extracted from the query. The terms can be words.
But the terms could also be floating point numbers defining a line
segment, or coordinate pairs, or special things in chemical formulas etc.
I think it's better if comparison operators, whenever possible, define
how to *compare* values, not how to extract values to be
compared. Orthogonality is good. It allows new term extraction rules
to be added orthogonally.
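
A rough sketch of that separation, under stated assumptions: the
extract_* functions stand in for format/structure values and the
*_terms functions for comparison attributes. All names are hypothetical
and the matching is deliberately naive.

```python
import re

# Term extraction rules (the format/structure side).
def extract_words(value):
    """Split a value into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", value.lower())

def extract_string(value):
    """Treat the complete, whitespace-normalised value as a single term."""
    return [" ".join(value.lower().split())]

# Comparison operators (the comparison side). They only see lists of
# terms and never need to know how those terms were produced.
def all_terms(query_terms, record_terms):
    return all(t in record_terms for t in query_terms)

def any_terms(query_terms, record_terms):
    return any(t in record_terms for t in query_terms)

def adj_terms(query_terms, record_terms):
    n = len(query_terms)
    return any(record_terms[i:i + n] == query_terms
               for i in range(len(record_terms) - n + 1))

def matches(extract, compare, query, record_value):
    """Combine any extraction rule with any comparison operator."""
    return compare(extract(query), extract(record_value))

title = "The Wind in the Willows"
print(matches(extract_words, all_terms, "wind willows", title))   # True
print(matches(extract_words, adj_terms, "wind willows", title))   # False
print(matches(extract_string, any_terms, "wind", title))          # False
```

Adding a new extraction rule (coordinates, chemical formulas) would mean
adding one more extract_* function and leaving the comparison functions
untouched.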

For example, Bib-2 already defines additional format/structure attributes.
So it's not a possibility, it's a current reality. It's not just words
and strings we are talking about - it's the ability to define multiple
ways to structure terms extracted from records (and queries), while
keeping this independent of comparison operators.

I would love to change any/all/adj from 'words' to 'terms' in
general. I think they will make sense when people define other
concepts of how to extract terms from record content. However, this
seemed too hard for people to swallow, so I backed off and am trying to
get at least the major problem fixed.

> Conversely, if the desire is to search for words (as opposed to a
> complete string) then can the comparison attribute be anything but one
> of these three? 

Probably not. However, I believe a goal of the AA is to be extensible,
and I can see cases where different projects may want different concepts
of what a 'word' is. I am thinking more of chemical formulas, geographic
coordinates, other rich and complex data types etc.

This would be done by defining a new 'chemical' attribute set with a set
of chemistry-specific access point names and new format/structure attributes
related to chemical formulas. (Note: I know almost nothing about chemistry.
I am using it as an example only.)

> However, what if the term is a single word? If the intent is to
> search for it as a word (not a string), I don't think Alan's proposal
> addresses whether this should fit within the three attributes proposed
> - all three would mean the same thing, and so there may be  sentiment
> for separating out the single-word case. If so, then I can see a
> stronger argument for having  'word' and 'string' format/structure
> values.

I think what you are saying is a good example of why comparison operators
should not be used to define what terms are. It's a good example of one
of the many little nasty side effects that come up. That is why I
strongly believe any/all/adj words should not imply term structure.

> So I see two possibilities:
>
> 1.  A single-word search would be handled by one of the
> word-comparison attributes (one of these would be "singled-out" for
> this use),  no format/structure attribute included. If the term is a
> single-word but is to be searched as a string, then another comparison
> would be used. [aside: I'm not sure which one though. "Equal" seems to
> be precluded, since the Utility set prose says that it cannot be used
> with expansion/interpretation. On the other hand, Bath uses it.  This
> may be another defect that we should address.]
>
> 2. When the term is a single-word, the comparison attribute may not
> be one of the above three (they can only be used for multiple words)
> and the format/structure 'word' or 'string' is supplied.
> 
> I think we need to nail down one of these two, and I don't really care which. 

I think the above assumes there is only one definition of what a 'word'
is, and I think the goal of the AA is to be a framework for expansion,
not restrictive. So I don't think you can ever preclude using a
format/structure attribute in a query. I (personally) think it makes
sense leaving format/structure open to identify different ways to
extract multiple terms from a record (even different term extraction
rules).

I understand where you are coming from, but I don't think either (1) or (2)
above should be mandated. The problem is both options assume the client
*knows* whether the search term contains one or more words. But is
'book-case' one word or two? Clients do not know the word extraction
rules used by a server (there is no formal agreed interpretation of
what a word is anywhere), so clients cannot know if a search string
entered by a user is a single word or not.
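
As a small illustration, here are two plausible word extraction rules
applied to the same query. Both rules are assumptions invented for this
sketch; neither is defined by any attribute set.

```python
import re

def split_on_whitespace(text):
    """Rule A: a word is anything between whitespace."""
    return text.split()

def split_on_non_alphanumerics(text):
    """Rule B: a word is a run of letters or digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

query = "book-case"
print(len(split_on_whitespace(query)))         # 1
print(len(split_on_non_alphanumerics(query)))  # 2
```

A client that assumes rule A would classify this as a single-word query,
while a server using rule B would see two words.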

So I think
* Clients must be allowed to send all/any/adj for single or multiple
  word queries.
* Clients must always be allowed to send a format/structure attribute.
  If omitted, it's the server's choice what to do.

I don't think its necessary to define the preferred way to do single
word queries as distinct from multi-word queries - as it implies the
client has to understand how to extract words from strings using
the same rules as a server. If this is considered important, then
I would look at adding a new comparison operator to the any/all/adj list
of 'exactly one', which aborts the query with an error if there is not
exactly one word (term) supplied. The responsibility is then given to
the server rather than being on the client.
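
A minimal sketch of how such an 'exactly one' operator might behave on
the server side. The operator name, the exception and the extraction
rule are all hypothetical; nothing like this exists in the current
utility set.

```python
import re

class NotExactlyOneTerm(Exception):
    """Raised when the server extracts zero or several terms from a query."""

def server_extract_words(text):
    # Assumed server-side rule: a word is a run of letters or digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def exactly_one(query, record_value):
    """Match a single-term query, failing loudly if it is not single-term."""
    terms = server_extract_words(query)
    if len(terms) != 1:
        raise NotExactlyOneTerm(
            f"expected exactly one term, got {len(terms)}: {terms}")
    return terms[0] in server_extract_words(record_value)

print(exactly_one("wind", "Gone with the Wind"))   # True
try:
    exactly_one("book-case", "The Book Case")
except NotExactlyOneTerm as err:
    print("query rejected:", err)
```

The client sends whatever the user typed, and the server, which is the
only party that knows its own extraction rules, decides whether the
query really was a single term.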

Alan
Received on Sunday, 20 July 2003 20:06:42 UTC