Resource Grouping

Skeleton document for discussion in Boston, January 2007

This document is for discussion only and has no official status. It is one of three discussion documents for use at the meeting. The others are Boston 1 and, creatively, Boston 3.

Inputs

This is to be a separate, stand-alone Recommendation. A scope definition should, I think, therefore also stand alone.

Jo's unfinished paper sets out a detailed, flexible solution to grouping resources by their URIs, using XML.

Please (re)read it and also take a look at a similar procedure, defined by an OASIS working group, that has been brought to my attention.

Andrea Trasatti also recommends we look at sitemap.xml, a de facto standard along the lines of robots.txt.

ERT

ERT has a particular interest in this. I'm hoping Shadi will be able to join us for this section.

Questions:

Should/can we stick with XML for the scope definition? The feeling in December's meeting was that we probably should. It's possible to encode the same thing in RDF but is it desirable so to do?

As Jo's paper notes, the issue of URI canonicalisation is unresolved. However, this doesn't seem insurmountable, and a good canonicalisation scheme could well find uses elsewhere. At its simplest, I'd suggest that scheme and authority should be treated as case-insensitive and matched in lower case, while path, query and fragment are case-sensitive.
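To make the simplest scheme concrete, here is a minimal Python sketch. It assumes the standard library's urllib.parse is an acceptable way to split a URI; the function name simple_canonicalise is invented for illustration and is not taken from Jo's paper.

```python
from urllib.parse import urlsplit, urlunsplit


def simple_canonicalise(uri: str) -> str:
    """Lowercase scheme and authority; keep path, query and fragment as given."""
    parts = urlsplit(uri)
    return urlunsplit((
        parts.scheme.lower(),   # scheme: case-insensitive
        parts.netloc.lower(),   # authority: case-insensitive
        parts.path,             # case-sensitive, left untouched
        parts.query,            # case-sensitive, left untouched
        parts.fragment,         # case-sensitive, left untouched
    ))


print(simple_canonicalise("HTTP://WWW.Example.COM/Path/Page?Q=1#Frag"))
# prints http://www.example.com/Path/Page?Q=1#Frag
```

Note that lowercasing the whole authority also lowercases any user information in it, which follows the suggestion as worded but may not be what every application wants.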

But... if needed, I have a more sophisticated canonicalisation algorithm up my sleeve (written by Clive Feather of Demon/Thus):

In the real world it is not always possible to say whether two URIs address the 
same resource, because this will depend on the actual implementation deployed on 
network servers. Therefore the algorithm has two variants - "N" and 
"P" - which respectively tend towards false negatives (the URIs are 
assumed to be different) and false positives (they are assumed to be the same). 
The choice of variant will depend on the application domain.

(1) A relative URI is converted to the equivalent absolute URI, and the scheme name 
is converted to lowercase; any other changes depend on the specific scheme used. 
Except where done as part of a Escaped characters are decoded. 

(2) The URI is decomposed into scheme, authority, and a list of path_segments. If 
the scheme uses user/host/port triples as authorities, the authority is further 
split into the three parts. Any of these parts may be empty.

(3) If the port is the default port for the scheme, it is deleted.

(4) Variant N: the scheme and host (if a domain name) are lowercased.
    Variant P: all the components are lowercased.

(5) The host (if an IP address) and port have all leading zeroes removed.

(6) Variant P only: if the last path_segment ends with one of the strings in the first 
column, that string is replaced by the one in the second column:

        .html      .htm
        .jpeg      .jpg
        .text      .txt
        .ram       .ra

(7) Scheme "http" only: if the last path_segment is "index.htm" or 
"index.html", it is removed.

(8) The URI is recomposed, including only those components that remain and are not empty 
strings. That is, it is the concatenation of:
[always]                          scheme ":"
[if there is an authority]        "//" authority
[if there is a user and host]     "//" user "@" host
[if there is only a host]         "//" host
[if there is a port]              ":" port
[for each path_segment in order]  "/" path_segment

Examples:

  Original URI: "http://Fred@WWW.Thus.net:0112/Test/index.html"
  Variant N:    "http://Fred@www.thus.net:112/Test/"
  Variant P:    "http://fred@www.thus.net:112/test/"

  Original URI: "http://111.011.101.001/test/Image.jpeg"
  Variant N:    "http://111.11.101.1/test/Image.jpeg"
  Variant P:    "http://111.11.101.1/test/image.jpg"
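The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not Clive's implementation:

- The input is already an absolute, hierarchical URI with no IPv6 literal, so the relative-URI resolution and escape handling of step 1 are left out.
- The default-port table is my own.
- Query and fragment are discarded, because the steps do not mention them.
- Removing index.htm or index.html leaves an empty final segment, so that the trailing slash shown in the worked examples is reproduced.

```python
from urllib.parse import urlsplit

# Assumed default ports; the algorithm text does not list them.
DEFAULT_PORTS = {"http": "80", "https": "443", "ftp": "21"}
# Step 6 suffix table, copied from the text above.
SUFFIXES = [(".html", ".htm"), (".jpeg", ".jpg"), (".text", ".txt"), (".ram", ".ra")]


def strip_zeroes(number: str) -> str:
    """Remove leading zeroes from a decimal string; leave anything else alone."""
    return str(int(number)) if number.isdigit() else number


def canonicalise(uri: str, variant: str = "N") -> str:
    if variant not in ("N", "P"):
        raise ValueError("variant must be 'N' or 'P'")

    # Steps 1-2 (partial): lowercase the scheme and decompose the URI.
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    user, _, hostport = parts.netloc.rpartition("@")
    host, _, port = hostport.partition(":")
    segments = parts.path.split("/")[1:]

    # Step 3: drop the port if it is the scheme's default.
    if port == DEFAULT_PORTS.get(scheme):
        port = ""

    # Step 4: N lowercases scheme and host only; P lowercases everything.
    host = host.lower()
    if variant == "P":
        user = user.lower()
        segments = [s.lower() for s in segments]

    # Step 5: strip leading zeroes from a dotted-decimal host and from the port.
    octets = host.split(".")
    if len(octets) == 4 and all(o.isdigit() for o in octets):
        host = ".".join(strip_zeroes(o) for o in octets)
    port = strip_zeroes(port)

    # Step 6 (variant P only): normalise the suffix of the last segment.
    if variant == "P" and segments:
        for long_form, short_form in SUFFIXES:
            if segments[-1].endswith(long_form):
                segments[-1] = segments[-1][: -len(long_form)] + short_form
                break

    # Step 7 (http only): remove a trailing index page, keeping the slash.
    if scheme == "http" and segments and segments[-1] in ("index.htm", "index.html"):
        segments[-1] = ""

    # Step 8: recompose from the components that remain.
    result = scheme + ":"
    if host:
        result += "//" + (user + "@" if user else "") + host
        if port:
            result += ":" + port
    return result + "".join("/" + s for s in segments)


print(canonicalise("http://Fred@WWW.Thus.net:0112/Test/index.html", "N"))
print(canonicalise("http://111.011.101.001/test/Image.jpeg", "P"))
```

Because step 3 runs before step 5, a port written as "0080" is not recognised as the default and survives as "80"; whether that ordering is intended is worth checking.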

To take an example from Jo's paper, a simple scope statement would be like this:

<scope>
  <host>
    <match name="example.com"/>
  </host>
</scope>

Not much room for confusion: if it's on the example.com domain (any scheme, any subdomain), it's in scope. Jo gives more complex examples too.
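As a rough illustration of how a processor might evaluate that statement, here is a small Python sketch. The function name in_scope is invented, and the matching rule (the host equals the name or ends with "." plus the name, with any one match element being sufficient) is my reading of "any scheme, any subdomain" rather than anything defined in Jo's paper.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SCOPE_XML = """<scope>
  <host>
    <match name="example.com"/>
  </host>
</scope>"""


def in_scope(uri: str, scope_xml: str) -> bool:
    """Return True if the URI's host is the named domain or one of its subdomains."""
    host = (urlsplit(uri).hostname or "").lower()
    root = ET.fromstring(scope_xml)
    for match in root.findall("./host/match"):
        name = match.get("name", "").lower()
        # The "." guard stops notexample.com from matching example.com.
        if name and (host == name or host.endswith("." + name)):
            return True
    return False


for uri in ("https://www.example.com/a", "ftp://example.com/",
            "http://example.org/", "http://notexample.com/"):
    print(uri, in_scope(uri, SCOPE_XML))
```

The scheme is never inspected, which is what makes the statement apply to any scheme.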

What isn't covered in Jo's paper is defining group membership by property. If we want to say "everything on example.com that is red and square" then we might get:

<scope>
  <host>
    <match name="example.com"/>
    <property>
      <ex:colour>red</ex:colour>
      <ex:shape>square</ex:shape>
    </property>
  </host>
</scope>

OK, but where do we find out that a given resource is red and square?

From the resource itself? From a lookup table? Either? Both? To make a generic POWDER parser there needs to be a very limited number of ways of accessing the property data, preferably just one.

We discussed in the XG, although it didn't make it into the report, that we should also be able to simply list URIs that are and are not in scope. So we could have:

<scope>
  <host>
    <match name="example.com"/>
    <property>
      <ex:colour>red</ex:colour>
      <ex:shape>square</ex:shape>
      <exclude name="http://www.example.com/item7" matchType="beginsWith" />
    </property>
  </host>
</scope>

So here, although any resource that is red and square on the example.com domain is in scope, we're specifically excluding anything with a URL that begins with http://www.example.com/item7.
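A sketch of how that combined statement might be evaluated, in Python. Several things here are assumptions for illustration only: the namespace URI bound to the ex: prefix (the snippet above needs some declaration before an XML parser will accept it), the PROPERTIES lookup table (just one of the possible answers to the question of where property data comes from), and the reading that every listed property must match and any exclude overrides them.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Hypothetical namespace URI for the ex: prefix.
EX = "http://example.org/ns#"
SCOPE_XML = f"""<scope xmlns:ex="{EX}">
  <host>
    <match name="example.com"/>
    <property>
      <ex:colour>red</ex:colour>
      <ex:shape>square</ex:shape>
      <exclude name="http://www.example.com/item7" matchType="beginsWith" />
    </property>
  </host>
</scope>"""

# Hypothetical lookup table standing in for whatever property source is agreed.
PROPERTIES = {
    "http://www.example.com/item1": {"colour": "red", "shape": "square"},
    "http://www.example.com/item2": {"colour": "blue", "shape": "square"},
    "http://www.example.com/item72": {"colour": "red", "shape": "square"},
}


def in_scope(uri: str, scope_xml: str, properties: dict) -> bool:
    host_el = ET.fromstring(scope_xml).find("host")
    if host_el is None:
        return False

    # Host test: the named domain or any subdomain of it.
    host = urlsplit(uri).hostname or ""
    names = [m.get("name", "").lower() for m in host_el.findall("match")]
    if not any(n and (host == n or host.endswith("." + n)) for n in names):
        return False

    prop_el = host_el.find("property")
    known = properties.get(uri, {})
    for child in (prop_el if prop_el is not None else []):
        if child.tag == "exclude":
            prefix = child.get("name")
            if prefix and child.get("matchType") == "beginsWith" and uri.startswith(prefix):
                return False  # explicitly excluded
        else:
            local_name = child.tag.split("}", 1)[-1]  # drop the "{namespace}" part
            if known.get(local_name) != child.text:
                return False  # required property missing or different
    return True


for uri in PROPERTIES:
    print(uri, in_scope(uri, SCOPE_XML, PROPERTIES))
```

Note that item72 is excluded along with item7, because beginsWith is a plain string prefix test; if that is not the intent, the exclusion would need a path-aware match type.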