Content Labels [in the sense meant by WCL-XG] can refer to groups of resources as well as individual resources. It is frequently either impractical or impossible to list the individual URIs that compose the group. There is consequently a need for patterns against which URIs can be matched in order to determine whether the label applies to them or not. The WCL-XG report [0] contains more information about Content Labels and what they can refer to.
The following may have more general applicability.
The URISpace proposal [URISpace], abandoned work on "aboutEach" as part of RDF [aboutEach] and discontinued work on robots.txt [robots] are relevant to this note. The work on Rules-based Resource Property Sets in RDF [Rules] is extremely relevant to this note. Dan Brickley's note on RDF-CL expressivity [DanBri] provides useful background.
The result of a match of a URI to a pattern is true or false.
According to RFC 3986 [RFC3986], URIs (of the type we are primarily interested in) have the form:
scheme authority path query fragment
Example:
http://www.example.com/example1/example2?query=help#fragment
where authority is composed of
userInfo host port
We further define path to be composed of
leadingsegments finalsegment
And we define each of the above as a component. Components are ordered, so that we can refer meaningfully to before and after, and to left and right.
host and path have sub-components. The sub-components are separated by dots and slashes respectively in the lexical representation of these components.
A domain is composed of a contiguous selection of the rightmost sub-components of a host. A sub-domain is a contiguous selection of sub-components of host immediately before a domain.
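The definitions of domain and sub-domain can be illustrated with a short Python sketch. The function names host_subcomponents, is_domain_of and subdomain_part are illustrative assumptions and are not defined by this note. The sketch compares whole sub-components, which is one reading of "a contiguous selection of the rightmost sub-components".

```python
def host_subcomponents(host):
    """Split a host into its dot-separated sub-components (lower-cased)."""
    return host.lower().split(".")


def is_domain_of(domain, host):
    """True if `domain` is a contiguous run of the rightmost sub-components of `host`."""
    d = host_subcomponents(domain)
    h = host_subcomponents(host)
    return len(d) <= len(h) and h[len(h) - len(d):] == d


def subdomain_part(domain, host):
    """Return the sub-components of `host` to the left of `domain`.

    Any sub-domain is a contiguous run taken from the right-hand end of
    this list. Returns None when `domain` is not a domain of `host`.
    """
    if not is_domain_of(domain, host):
        return None
    d = host_subcomponents(domain)
    h = host_subcomponents(host)
    return h[:len(h) - len(d)]
```

Under this reading example.com is a domain of www.example.com, whereas ample.com is not, because the comparison is made on sub-components rather than on characters.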
Note: For our purposes, scheme and host are mandatory in the URIs we wish to match, everything else is optional.
It is not required that patterns be able to express every conceivable set of URIs. For example it is not required to be able to match all URIs where the character "b" is the second character of the host component.
A list of the kinds of URIs we need to match.
[need to insert]
It MUST be possible to build patterns that meet the following requirements and to build patterns out of arbitrary combinations of those patterns, except where this would result in a pattern that is not self-consistent.
[10a. Match a range of ports]
The overall pattern consists of pattern expressions about components which compose candidate URIs.
[Note, for the avoidance of doubt and in view of the group's decision to adopt RDF, that it is not intended that this XML representation necessarily be adopted.]
Element pattern contains a pattern. Its children may be scheme and host.
Elements scheme, host, path, leadingsegments, finalsegment, query and fragment contain groups of component patterns for the relevant component.
Component elements contain a sequence of one or more match elements which form a group, followed by an optional component element of lower precedence.
Element match defines a component pattern for its parent element. It has the attributes:
Name: the string to be matched for this component. Required.
Type: one of exact, startsin or endsin. The default type depends on the component element that contains the match, as given in the table:
Component element | Default type
scheme | exact
host | endsin
all others | startsin
Negate: true if the match is to be negated. Default false.
Case: true if the match is to be carried out with case sensitivity. Default false.
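The attributes of match can be read as a simple string test. The following Python sketch is one possible interpretation and not a normative algorithm; the names DEFAULT_TYPE, match_component and match_type are assumptions introduced for illustration.

```python
# Default match type per component element, as given in the table above.
# Components not listed here default to "startsin".
DEFAULT_TYPE = {"scheme": "exact", "host": "endsin"}


def match_component(component, value, name, match_type=None, negate=False, case=False):
    """Test the string `value` of a URI component against one match element.

    `component` is the name of the enclosing component element and is used
    only to pick the default type. `name`, `match_type`, `negate` and `case`
    correspond to the attributes of match.
    """
    if match_type is None:
        match_type = DEFAULT_TYPE.get(component, "startsin")
    if not case:
        # Case-insensitive comparison is the default.
        value, name = value.lower(), name.lower()
    if match_type == "exact":
        result = value == name
    elif match_type == "startsin":
        result = value.startswith(name)
    elif match_type == "endsin":
        result = value.endswith(name)
    else:
        raise ValueError("unknown match type: " + match_type)
    # Negation inverts the outcome of the test.
    return result != negate
```

Note that this sketch is purely string based, so an endsin test for example.com on the host component would also accept badexample.com. An implementation might prefer to compare whole sub-components.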
The match element contains either zero or more include elements or zero or more exclude elements, which form a group and which must not be mixed, followed by an optional component element of lower precedence than its parent.
Elements include and exclude carry a name attribute, which refines the match by adding the value of name to the match (include) or subtracting it from the match (exclude). The name attribute identifies directly a match on the string remaining after matching the pattern defined by the containing match element. So if the match element specifies any sub-domain of example.com, then exclude with the value "www." would remove www.example.com from the match. Elements include and exclude inherit the value of type from their parent match element.
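The text above leaves some room for interpretation. The following Python sketch shows one possible reading, in which the refinement is tested against the string that remains once the name of the containing match has been removed, using the inherited type. The function names and the treatment of include as "the remainder must match at least one include" are assumptions, not statements of this note.

```python
def _matches(value, name, match_type):
    """Plain string test for the three match types."""
    if match_type == "exact":
        return value == name
    if match_type == "startsin":
        return value.startswith(name)
    if match_type == "endsin":
        return value.endswith(name)
    raise ValueError("unknown match type: " + match_type)


def refined_match(value, name, match_type, includes=(), excludes=()):
    """Apply a match and then its include or exclude refinements (assumed semantics)."""
    if includes and excludes:
        raise ValueError("include and exclude must not be mixed")
    if not _matches(value, name, match_type):
        return False
    # The string remaining after the part matched by `name` is removed.
    if match_type == "endsin":
        remainder = value[:len(value) - len(name)]
    elif match_type == "startsin":
        remainder = value[len(name):]
    else:
        remainder = ""
    if excludes:
        return not any(_matches(remainder, e, match_type) for e in excludes)
    if includes:
        return any(_matches(remainder, i, match_type) for i in includes)
    return True
```

With this reading, a host match on example.com with exclude "www." rejects www.example.com but still accepts example.com itself and shop.example.com.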
Example 1 - The label applies to www.example.com and all sub-domains and any protocol.
<pattern> <host> <match name="www.example.com"/> </host> </pattern>
Example 2 - The label applies to www.example.com/children and any sub-domain and any sub-path.
<pattern> <host> <match name="www.example.com"/> <path> <match name="/children/"/> </path> </host> </pattern>
Example 3 - The label applies to sport and fashion for the hosts example.mobi example.com and example.net
<pattern> <host> <match name="example.com"/> <match name="example.mobi"/> <match name="example.net"/> <path> <match name="/sport/"/> <match name="/fashion/"/> </path> </host> </pattern>
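As an illustration of how groups combine, the following Python sketch evaluates a pattern of the shape used in Example 3: a group of host matches followed by a group of path matches. It is a minimal sketch under stated assumptions and not a full evaluator: it treats the matches in a group as alternatives, uses the default types (endsin for host, startsin for path), compares case-insensitively, and ignores include, exclude, negate and nested component elements. The function name applies is hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

EXAMPLE_3 = (
    '<pattern> <host> <match name="example.com"/> <match name="example.mobi"/> '
    '<match name="example.net"/> <path> <match name="/sport/"/> '
    '<match name="/fashion/"/> </path> </host> </pattern>'
)


def applies(pattern_xml, uri):
    """Return True if `uri` matches a host-then-path pattern (Example 3 shape only)."""
    host_el = ET.fromstring(pattern_xml).find("host")
    parts = urlsplit(uri)
    host = parts.hostname or ""  # hostname is already lower-cased by urlsplit
    # Any one match in the host group is sufficient (default type endsin).
    host_names = [m.get("name").lower() for m in host_el.findall("match")]
    if not any(host.endswith(n) for n in host_names):
        return False
    path_el = host_el.find("path")
    if path_el is None:
        return True
    # Any one match in the path group is sufficient (default type startsin).
    path_names = [m.get("name").lower() for m in path_el.findall("match")]
    return any(parts.path.lower().startswith(n) for n in path_names)
```

For instance, applies(EXAMPLE_3, "http://www.example.mobi/sport/football") holds under these assumptions, while a URI on example.org or a path under /weather/ does not match.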
Example 4 - The label applies to sport and fashion for the hosts example.mobi example.com and example.net and also to weather but only at example.mobi, and rules out test.example.mobi.
<pattern> <host> <match name="example.com"/> <match name="example.net"/> <match name="example.mobi"> <exclude name="test."/> <path> <match name="/weather/"/> </path> </match> <path> <match name="/sport/"/> <match name="/fashion/"/> </path> </host> </pattern>
Further examples in Appendix.
As expressed, this does not provide a shorthand notation for
"Hosts a and b match if the path is c, and hosts d and e match if the path is f"
It would have to be expressed as
"Host a matches if the path is c, host b matches if the path is c, host d matches if the path is f, host e matches if the path is f."
This would be solved by introducing an explicit group element.
The pattern model is intended for use both prior to and after dereferencing a URI. That is to say that if an application wishes to recover a label for a URI prior to dereferencing it and has access to a repository of such labels then it MAY choose to act on the basis of the labels discovered.
It is possible that in the course of dereferencing a URI a resource with a different URI results. This might happen as a result of redirection or of the server returning results having a different URI to that requested. In this case the evaluation SHOULD be repeated with the new URI.
[tbd] It is also possible that dereferencing a URI may provide references to one or more content labels. Applications SHOULD evaluate such labels as part of determining the content labels that apply to the content. [we need to think more about this and it should be in [XGR-REPORT]].
This is tricky. [Extract some stuff from [RFC3986]].
The author (Jo Rabin, Segala) is grateful to Phil Archer (ICRA) and Henry S. Thompson (Markup Systems / Delphix / W3C) for their insightful comments.
[Insert Discussion ...]
RFC 3986 [RFC3986] gives the following regex:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
By changing this to
^(([^:/?#]+):)?(//((([^/?#]*)@)?([^/?#:]*)(:([^/?#]*))?))?([^?#]*)(\?([^#]*))?(#(.*))?
you get:
scheme: $2, authority: $4 (userinfo: $6, host: $7, port: $9), path: $10, query: $12, fragment: $14
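The extended expression can be tried out directly. The following Python sketch applies it to a URI and returns the components under the group numbers listed above; the names URI_RE and split_uri are illustrative assumptions. Note that the host part of the expression stops at the first colon, so it is not suited to IPv6 literals.

```python
import re

# The extended expression from the text, unchanged.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//((([^/?#]*)@)?([^/?#:]*)(:([^/?#]*))?))?"
    r"([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Split `uri` into components; absent components are returned as None."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "userinfo": m.group(6),
        "host": m.group(7),
        "port": m.group(9),
        "path": m.group(10),
        "query": m.group(12),
        "fragment": m.group(14),
    }
```

Applied to the example URI given earlier, http://www.example.com/example1/example2?query=help#fragment, this yields the scheme http, the host www.example.com, the path /example1/example2, the query query=help and the fragment fragment, with userinfo and port absent.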
Prior to using component patterns they are normalised. It is not important exactly when. The candidate URI is normalised at the start of processing. See below for discussion of URI normalisation.
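This note does not yet say which normalisations are intended. As a minimal sketch, the following Python function applies only the case normalisation of scheme and host, which RFC 3986 treats as case-insensitive; anything further (percent-encoding, default ports, dot segments) is deliberately left out. The function name normalise is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise(uri):
    """Lower-case the scheme and the host (and port) of `uri`; leave the rest untouched."""
    parts = urlsplit(uri)
    # Keep any userinfo as it is: only host and port are lower-cased.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )
```

The same function could be applied to the name attribute of scheme and host patterns so that both sides of the comparison are in the same form.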
The pattern must contain a host pattern, which because of precedence order must either be at the root or the child of a scheme pattern, itself at the root of the tree.
For each component pattern:
If the candidate URI's component matches the pattern and any refinements defined for that pattern then if the pattern contains a subtree, recursively process the component pattern in the sub-tree until the [... run out of time]
Example 5 - The label applies to ftp://ftp.example.com/lurid and http://www.example.com/lurid, all sub-domains, any sub-path, etc.
<pattern> <scheme> <match name="ftp"> <host> <match name="ftp.example.com"/> </host> </match> <match name="http"> <host> <match name="www.example.com"/> </host> </match> <path> <match name="/lurid/"/> </path> </scheme> </pattern>
Example 6 - the label applies to example.com/silly and example.mobi/billy
<pattern> <host> <match name="example.com"> <path> <match name="/silly/"/> </path> </match> <match name="example.mobi"> <path> <match name="/billy/"/> </path> </match> <!-- slight issue here - don't want to match anything so match anything including nothing--> <path> <match name="" negate="true"/> </path> </host> </pattern>
Example 7 - Label applies to exactly www.example.com only
<pattern> <host> <match name="www.example.com"/> <!-- take care of both trailing / and not --> <path> <match name="" type="exact" /> <match name="/" type="exact" /> </path> </host> </pattern>
Example 8 - As explained in the text
<pattern>
  <scheme>
    <match name="http"/>
    <host>
      <match name="www.example.com"> <!-- host must be www.example.com and any subdomains thereof -->
        <exclude name="451"/> <!-- except 451 and any subdomains -->
        <path> <!-- for this host -->
          <match name="/images/"> <!-- for path starts with /images/ -->
            <match name=".jpg" type="endsin"/> <!-- all files must end in jpg -->
          </match>
        </path>
      </match>
      <match name="jorabin.com" type="exact"> <!-- host must be jorabin.com -->
        <include name="x" type="exact"/> <!-- and subdomains x and y -->
        <include name="y" type="exact"/>
        <leadingsegments>
          <match name="/news/" type="exact"> <!-- if directory path is exactly /news/ -->
            <finalsegment> <!-- then must not be html -->
              <match name=".html" type="exact" negate="true" case="true"/>
            </finalsegment>
          </match>
        </leadingsegments>
      </match>
      <path>
        <match name="/sport/"/> <!-- matches sport and fashion for example.com and jorabin.com -->
        <match name="/fashion/"/>
      </path>
    </host>
  </scheme>
</pattern>