URI Pattern Matching for Groups of Resources

Draft 0.1 17 June 2006

Author:: Jo Rabin, Segala

Introduction

Content Labels [in the sense meant by WCL-XG] can refer to groups of resources as well as individual resources. It is frequently either impractical or impossible to list individual URIs that compose the group. There is consequently the need to provide patterns that URIs can be matched with in order to determine whether the label applies to them or not. The WCL-XG report [0] contains more information about Content Labels and what they can refer to.

The following may have more general applicability.

Earlier Work

The URISpace proposal [URISpace], abandoned work on "aboutEach" as part of RDF [aboutEach] and discontinued work on robots.txt [robots] are relevant to this note. The work on Rules-based Resource Property Sets in RDF [Rules] is extremely relevant to this note. Dan Brickley's note on RDF-CL expressivity [DanBri] provides useful background.

Design Goals and Constraints

Patterns MUST be unambiguous in yielding a match or not.
The ease of creation accurate and useful labels is of paramount importance. The ease of creation pattern matching software and its resource consumption is of much less priority. [Simplicity for the user is of much greater importance than simplicity for the implementer]. [Design costs should not be reduced at the expense of usage costs, even though the designer does not pay the usage costs].
It SHOULD be possible to write concise patterns and avoid undue repetition of shared information. [The complexity of the pattern representation should be commensurate with the inherent complexity of the pattern being represented]
Patterns MUST be easy to write and comprehend by humans and as far as is possible SHOULD avoid including or excluding cases unintentionally. The motivation specifically includes not making regular expressions the primary means of expressing patterns.
It MUST be possible to create software that implements pattern matching primarily using standard and commonly available components and specifically MUST not involve creation of custom parsing components. [i.e. it can't involve the creation of a new syntax, though clearly it will involve inventing a meta-syntax, if that makes sense.]
So far as is possible, use of processing resources SHOULD be minimised, especially by early detection of a match or failure to match.

Definitions

The result of a match of a URI to a pattern is true or false.

According to RFC3986 RFC3986 URIs (of the type we are primarily interested in) have the form:

scheme authority path query fragment

Example:

http://www.example.com/example1/example2?query=help#fragment

where authority is composed of

userInfo  port

We further define path to be composed of

leadingsegments finalsegment

And we define each of the above as a component. Components are ordered {so that we can refer to before and after left and right meaningfully.}

path and host have sub-components. The sub-components are separated by dots and slashes respectively in the lexical representation of these components.

A domain is composed of a contiguous selection of the rightmost sub-components of a host. A sub-domain is a contiguous selection of sub-components of host immediately before a domain.

Note: For our purposes, scheme and host are mandatory in the URIs we wish to match, everything else is optional.

Requirements

Non-Requirements

It is not required that patterns allow matching to any conceivable pattern. For example it is not required to be able to match all URIs where the character "b" is the second character of the host component.

High Level Requirements - Use Cases

A list of the kinds of URIs we need to match.

[need to insert]

Detailed / Inferred Requirements

It MUST be possible to build patterns that meet the following requirements and to build patterns out of arbitrary combinations of those patterns, except where this would result in a pattern that is not self consistent.

Scheme Patterns

Match all schemes.
Match one of an enumerated list of schemes.

Host (Domain and Sub-Domain) Patterns

Match a (sub-)domain and all sub-domains, except for those sub-domain patterns given by a list.
Match a (sub-)domain and only those sub-domain patterns given in a list.

Path and Sub-Path (leadingSegment and finalSegment) Patterns

Match only when the (sub-)path is empty.
Match the (sub-)path specified and all sub-paths, except those sub-path patterns given in a list.
Match the (sub-)path specified and only those sub-paths patterns given in a list.
Match when the finalSegment ends with the specified value.

Port Patterns

Match all ports including unspecified.
Match one of an enumerated list of ports.
[10a. Match a range of ports]

Query and Fragment Patterns

Match only when the component is empty.
Match when the component starts with the specified value.
Match when the component exactly matches the specified value.

All patterns

Must be available as either case sensitive or not, except protocol and host which are always case insensitive.
Must be available in negative as well as affirmative forms.

Outline Design

The overall pattern consists of pattern expressions about components which compose candidate URIs.

At least one host pattern is required. In its absence the pattern is invalid. [or can this be inferred in some contexts?]
The matching a pattern consists of an optionally case sensitive leftmost, rightmost or exact [or regex ???] match of the normalised text of the relevant component from the candidate URI. The specification of what to match for a component is called the pattern condition. The pattern condition is expressed using only literal characters (i.e. no wild cards etc.)
Pattern conditions may be positive or negative.
Component Patterns of the same order may be grouped together. If the pattern condition is positive the implied grouping operation is OR. If it is negative the implied grouping operation is AND. Negative and positive conditions cannot be mixed in a group.
Component patterns may be refined by adding inclusions or exclusions to the match.
A pattern consists of a tree structure containing component patterns in precedence order.
Sub-trees can apply either to the whole group or to the individual components in a group.
Other than for host patterns, if there is no pattern for a component then any value for that component of the candidate URI, including nothing, matches.
The tree is walked for each branch that has matching conditions until a leaf node positive match is made or no matching branches are available.

Example XML Syntax

[Note for avoidance of doubt and in view of the group's decision to adopt RDF that it is not intended that this XML representation necessary be adopted]

Pattern

Element pattern contains a pattern. Its children may be scheme and host.

Component Elements

Elements scheme, host, path, leadingsegments, finalsegment, query and fragment contain groups of component patterns for the relevant component.

Component elements contain a sequence of one or more match elements which form a group, followed by an optional component element of lower precedence.

Match

Element match defines a component pattern for its parent element. It has the attributes:

Name: the string to be matched for this component. Required.

Type: with values exact, startsin, endsin. The default type for component elements is given in the table:

scheme	exact
host	endsin
all others	startsin

Negate: true if the match is to be negated. Default false.

Case: true if the match is to be carried out with case sensitivity. Default false.

The match element contains a sequence consisting of one choice of 0 or more include or exclude elements, which form a group and which must not be mixed, followed by a component element of lower precedence than its parent.

Include and Exclude

Elements include and exclude contain the name attribute which refines the match by adding or subtracting the value of the name to the match. The name element identifies directly a match on the string remaining after matching the pattern defined by the containing match element. So if the match element specifies any subdomain of example.com then exclude with the value www. Would remove www.example.com from the match. Elements include and exclude inherit the value of type from their parent match element.

Worked Examples - Showing how to Implement Use Cases

Example 1 - The label applies to www.example.com and all sub-domains and any protocol.

<pattern>
  <host>
    <match name="www.example.com"/>
  </host>
</pattern>

Example 2 - The label applies to www.example.com/children and any sub-domain and any sub-path.

<pattern>
  <host>
    <match name="www.example.com"/>
    <path>
       <match name="/children/"/>
    </path>
  </host>
</pattern>

Example 3 - The label applies to sport and fashion for the hosts example.mobi example.com and example.net

<pattern>
  <host>
    <match name="example.com"/>
    <match name="example.mobi"/>
    <match name="example.net"/>
    <path>
       <match name="/sport/"/>
       <match name="/fashion/"/>
    </path>
  </host>
</pattern>

Example 4 - The label applies to sport and fashion for the hosts example.mobi example.com and example.net and also to weather but only at example.mobi, and rules out test.example.mobi.

<pattern>
  <host>
    <match name="example.com"/>
    <match name="example.net"/>
    <match name="example.mobi">
      <exclude name="test."/>
      <path>
        <match name="/weather/"/>
      <path>
    </match>
    <path>
       <match name="/sport/"/>
       <match name="/fashion/"/>
    </path>
  </host>
</pattern>

Further examples in Appendix.

Limitations

As expressed this does not provide a short hand notation for

"Hosts a and b match if the path is c, and hosts d and e match if the path is f"

It would have to be expressed as

"Host a matches if the path is c, host b matches if the path is c, host d matches if the path is f host e matches if the path is f."

This would be solved by introducing an explicit group element.

Processing Model for URI Dereferencing

The pattern model is intended for use both prior to and after dereferencing a URI. That is to say that if an application wishes to recover a label for a URI prior to dereferencing it and has access to a repository of such labels then it MAY choose to act on the basis of the labels discovered.

It is possible that in the course of dereferencing a URI a resource with a different URI results. This might happen as a result of redirection or of the server returning results having a different URI to that requested. In this case the evaluation SHOULD be repeated with the new URI.

[tbd] It is also possible that that dereferencing a URI may provide references to one or more content labels. Applications SHOULD evaluate such labels as part of determining the content labels that apply to the content. [we need to think more about this and it should be in [XGR-REPORT]].

URI Normalization

This is tricky. [Extract some stuff from [RFC3986]].

Acknowledgements

The author (Jo Rabin, Segala) is grateful to Phil Archer (ICRA) and Henry S. Thompson (Markup Systems / Delphix / W3C) for their insightful comments.

References

XGR-REPORT: http://www.w3.org/2005/Incubator/wcl/XGR-report/
URISpace: http://www.w3.org/TR/urispace
aboutEach: http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/#URIPrefix
robots: http://www.robotstxt.org/wc/norobots-rfc.html
Rules: http://www.w3.org/2004/12/q/doc/rdf-rulesets.html
DanBri: http://lists.w3.org/Archives/Public/public-xg-wcl/2006Jun/0000.html
RFC3986: http://www.gbiv.com/protocols/uri/rfc/rfc3986.html

Appendix - Parsing URIs

[Insert Discussion ...]

RFC 3986 [RFC3986] gives the following regex:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

By changing this to

^(([^:/?#]+):)?(//((([^/?#]*)@)?([^/?#:]*)(:([^/?#]*))?))?([^?#]*)(\?([^#]*))?(#(.*))?

You get

scheme: $2, authority: $4 (userinfo: $6, host:$7, port: $9), path: $10, query: $12, fragment: $14

Appendix - Example Processing Model for Processing Patterns

Prior to using component patterns they are normalised. It is not important exactly when. The candidate URI is normalised at the start of processing. See below for discussion of URI normalisation.

The pattern must contain a host pattern, which because of precedence order must either be at the root or the child of a scheme pattern, itself at the root of the tree.

For each component pattern:

If the candidate URI's component matches the pattern and any refinements defined for that pattern then if the pattern contains a subtree, recursively process the component pattern in the sub-tree until the [... run out of time]

Appendix - More Examples

Example 5 - The label applies to ftp://ftp.example.com/lurid and http://www.example.com/lurid all subdomains any sub-path etc.

<pattern>
  <scheme>
    <match name="ftp">
      <host>
        <match name="ftp.example.com"/>
      </host>
    </match>
    <match name="http">
      <host>
        <match name="www.example.com"/>
      </host>
    </match>
    <path>
       <match name="/lurid/"/>
    </path>
  </scheme>
</pattern>

Example 6 - the label applies to example.com/silly and example.mobi/billy

<pattern>
  <host>
    <match name="example.com">
      <path>
        <match name="/silly/"/>
      <path>
    </match>
    <match name="example.mobi">
      <path>
        <match name="/billy/"/>
      <path>
    </match>
    <!-- slight issue here - don't want to match anything so match anything including nothing-->
    <path>
       <match name="" negate="true"/>
    </path>
  </host>
</pattern>

Example 7 - Label applies to exactly www.example.com only

<pattern>
  <host>
    <match name="www.example.com"/>
    <!-- take care of both trailing / and not -->
    <path>
       <match name="" type="exact" />
       <match name="/" type="exact" />
    </path>
  </host>
</pattern>

Example 8 - As explained in the text

<pattern>
  <scheme>
    <match name="http"/>
    <host>
       <match name="www.example.com">                     <!-- host must be www.example.com and any subdomains thereof -->
         <exclude name="451"/>                            <!-- except 451 and any subdomains -->
         <path>                                           <!-- for this host -->
            <match name="/images/">                       <!-- for path starts with /image/ -->
              <match name=".jpg" type="endsin"/>          <!-- all files must end in jpg -->
            </match>
         </path>
       </match>   
       <match name = "jorabin.com" match="exact">         <!-- host must be jorabin.com -->
         <include name = "x" type = "exact" />            <!-- and subdomains x and y -->
         <include name = "y" type = "exact" />
         <leadingsegments>
           <match name= "/news/" type= "exact">           <!-- if directory path is exactly /news/ -->
             <finalsegment>                               <!-- then must not be html -->
               <match name=".html" type="exact" negate = "true" case="true"/>
             </finalsegment>
           </match>
         </leadingsegments>
       </match>
       <path>
         <match name="/sport/"/>                           <!-- matches sport and fashion for  example.com and jorabin.com -->
         <match name="/fashion/"/>
       </path>  
    </host>
  </scheme>
</pattern>