W3C

Protocol for Web Description Resources (POWDER): Grouping of Resources

W3C Working Draft 24 March 2008

This version
http://www.w3.org/TR/2008/WD-powder-grouping-20080324/
Latest version
http://www.w3.org/TR/powder-grouping/
Previous version
http://www.w3.org/TR/2007/WD-powder-grouping-20071031/
Editors:
Andrea Perego, Università degli Studi dell'Insubria
Phil Archer, Family Online Safety Institute
Kevin Smith, Vodafone Group R & D

Abstract

The Protocol for Web Description Resources (POWDER) facilitates the publication of descriptions of multiple resources such as all those available from a Web site. This document describes how sets of IRIs can be defined such that descriptions or other data can be applied to the resources obtained by dereferencing IRIs that are elements of the set. IRI sets are defined as XML elements with relatively loose operational semantics. This is underpinned by the formal semantics of POWDER which include a semantic extension defined in this document. A GRDDL transform is associated with the POWDER namespace that maps the operational to the formal semantics.

Status of this document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is a Public Working Draft, designed to aid discussion. The POWDER Use Cases and Requirements document [USECASES] details the motivation for the creation of this and companion documents. Changes since earlier versions of this document are recorded in the change log. This draft includes substantial changes, introducing the operational/formal semantics model that is also developed in the Description Resources document [DR].

This document was developed by the POWDER Working Group. The Working Group expects to advance this Working Draft to Recommendation Status.

Please send comments about this document to public-powderwg@w3.org (with public archive); please include the text "comment" in the subject line.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

1 Introduction
1.1 Design Goals and Constraints
1.2 Outline Methodology
1.3 Operational Semantics
1.4 Formal Semantics
2 Defining a Resource Set
2.1 Constraints on IRI Components
2.1.1 IRI Constraints Referring to Ports
2.1.2 IRI Constraints Referring to Query Strings
2.1.3 IRI/URI Canonicalization
2.1.4 Data encoding
2.2 Grouping using Wildcards: The includeiripattern and excludeiripattern Constraints
2.3 Grouping by Regular Expression: The includeregex and excluderegex Constraints
2.3.1 Safe Use of includeregex
2.4 Grouping by IP Address
2.4.1 Safe Usage of the includeipranges Constraint
2.5 Enumerating Elements of an IRI Set: the includeresources and excluderesources Constraints
2.6 Redirection: the includeredirection Constraint
2.7 Complex Sets: Conjunction and Disjunction
3 Extension Mechanism
3.1 Extension Example 1: Custom IRI Patterns
3.2 Extension Example 2: Custom Site Structure
3.3 Extension Example 3: ISAN
References
Acknowledgments
Change Log

1 Introduction

The Protocol for Web Description Resources (POWDER) facilitates the publication of descriptions of multiple resources such as all those available from a Web site. These descriptions are attributable to a named individual, organization or entity that may or may not be the creator of the described resources. This contrasts with more usual metadata that typically applies to a single resource, such as a specific document's title, which is usually provided by its author.

Description Resources (DRs) are described separately [DR]. This document sets out how groups (i.e. sets) of resources may be defined, either for use in DRs or in other contexts. Set theory has been used throughout as it provides a well-defined framework that leads to unambiguous definitions. However, it is used solely to provide a formal version of what is written in the natural language text. Companion documents describe the POWDER vocabulary [VOC] and XML data types [WDRD] that are derived from this and the Description Resources document, setting out each term's domain, range and constraints. As each term is introduced in this document, it is linked to its description in the vocabulary document. The use cases, a primer and test suite complete the document set.

The POWDER vocabulary namespace is http://www.w3.org/2007/05/powder# for which we use the prefix wdr. Where appropriate, we use the entity &wdr; to refer to the same namespace URI. All namespaces and prefixes used in this document are shown in the table below.

Prefix  Namespace
wdr     http://www.w3.org/2007/05/powder#
rdf     http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs    http://www.w3.org/2000/01/rdf-schema#
owl     http://www.w3.org/2002/07/owl#
xsd     http://www.w3.org/TR/xmlschema-2/

In this document, the words MUST, MUST NOT, SHOULD, SHOULD NOT and MAY are to be interpreted as described in RFC2119.

White space is any of U+0009, U+000A, U+000D and U+0020. A space-separated list is a string of which the items are separated by one or more space characters (in any order). The string may also be prefixed or suffixed with zero or more of those characters. To obtain the values from a space-separated list user agents MUST replace any sequence of space characters with a single U+0020 character, dropping any leading or trailing U+0020 character, and then chopping the resulting string at each occurrence of a U+0020 character, dropping that character in the process.
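The list-splitting steps above can be sketched in Python. This is a minimal illustration rather than normative text; the function name split_space_separated is hypothetical and not part of the POWDER vocabulary.

```python
import re

# The four white space characters named in the text: TAB, LF, CR and SPACE.
_SPACE_RUN = re.compile(r"[\u0009\u000A\u000D\u0020]+")


def split_space_separated(value: str) -> list[str]:
    """Return the items of a space-separated list, following the steps in the text."""
    collapsed = _SPACE_RUN.sub(" ", value)  # replace each run with a single U+0020
    trimmed = collapsed.strip(" ")          # drop any leading or trailing U+0020
    # Chop at each remaining U+0020; an empty string yields no items.
    return trimmed.split(" ") if trimmed else []
```

For instance, a value such as "  http \t https\n" would yield the two items http and https.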

The (unqualified) terms POWDER, POWDER Document and Description Resource (DR) refer to operational representations and semantics. The terms POWDER-S and DR-S refer to documents and data that express the formal semantics of POWDER.

1.1 Design Goals and Constraints

In designing a system to define sets of resources we have drawn on earlier work [Rabin] carried out in the Web Content Label Incubator Activity [WCL-XG], and taken into account the following considerations.

  1. It must be possible to define a set of resources, either by describing the characteristics of the resources in the set, or by simply listing its elements.
  2. It must be possible to determine with certainty whether a given resource is or is not an element of the Resource Set
  3. The ease of creation of accurate and useful Resource Sets is important.
  4. It should be possible to write concise Resource Set definitions.
  5. Resource Set definitions must be easy to write, be comprehensible by humans and, as far as is possible, should avoid including or excluding resources unintentionally.
  6. It must be possible to create software that implements Resource Set definitions primarily using standard and commonly available components and specifically must not require the creation of custom parsing components.
  7. So far as is possible, use of processing resources should be minimized, especially by early detection of a match or failure to match.

1.2 Outline Methodology

Defining a resource set by specifying the characteristics that the resources in the set share is clearly an indirect approach, albeit a very useful one in the real world. In a logical sense, the definition must be interpreted to arrive at the full set.

Operationally, POWDER does not define resource sets; rather, it facilitates the definition of sets of IRIs (Internationalized Resource Identifiers) [IRIS], which can be used to denote resources in terms of their identity. We use the notion of IRI rather than that of URI, since IRIs are a superset of URIs. Therefore, an IRI set definition may denote a set of IRIs as well as a set of URIs.

IRI Set definitions must be unambiguous so that an application can always determine with certainty whether a specific IRI is or is not within the defined set.

More formally, an IRI Set definition D denotes a set of IRIs IS = D^I, where D^I is the interpretation of D, i.e., the set of IRIs sharing the characteristics denoted by D.

We take this further and allow an IRI set definition to be built up in stages.

An IRI set IS is denoted by an IRI set definition DIS in terms of one or more characteristics that the elements of the set have in common. Each characteristic is expressed by an IRI constraint C, and IRI constraints C1, C2, …, Cn give rise to IRI set definitions D1, D2, …, Dn, so that the complete IRI set definition DIS comprises D1, D2, …, Dn.

The IRI set IS is the intersection of the IRI sets denoted by the IRI set definitions in DIS.

Formally, IS = DIS^I = D1^I ∩ D2^I ∩ … ∩ Dn^I = (D1 ∧ D2 ∧ … ∧ Dn)^I, where the superscript I denotes interpretation.

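For example, suppose that an IRI set IS is denoted by the following definitions:

D1: the top level components of the host component of the IRI match example.org
D2: the path component of the IRI begins with /foo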

Then, DIS will be defined as follows: “the top level components of the host component of the IRI match example.org” AND “the path component of the IRI begins with /foo.”

Whether the IRI of a specific resource R, known as the candidate resource, is a member of IRI Set IS or not is determined by comparing its characteristics with those denoted by the set definitions used in DIS. It must be an element of the intersection of the sets defined by the interpretation of D1, D2, …, Dn to be an element of IS.

If an IRI set definition is empty, then the set is undefined and IS MUST be considered as the empty set ∅. Formally:

Let IS be an IRI Set, and let DIS be the set of IRI Set definitions denoting the IRIs in IS: if DIS = ∅, then IS = ∅.
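The intersection rule and the empty-definition rule can be sketched as follows. This is only an illustration under stated assumptions: each IRI set definition is modelled as a Python predicate, and the names is_member, host_is_example_org and path_starts_with_foo are hypothetical.

```python
from typing import Callable, Iterable
from urllib.parse import urlsplit

# An IRI set definition D is modelled as a predicate over IRI strings.
Constraint = Callable[[str], bool]


def is_member(iri: str, definitions: Iterable[Constraint]) -> bool:
    """True if the IRI lies in the intersection of all the given definitions."""
    definitions = list(definitions)
    if not definitions:
        return False  # an empty definition denotes the empty set
    return all(d(iri) for d in definitions)


def host_is_example_org(iri: str) -> bool:
    # D1: the top level components of the host match example.org
    host = urlsplit(iri).hostname or ""
    return host == "example.org" or host.endswith(".example.org")


def path_starts_with_foo(iri: str) -> bool:
    # D2: the path component begins with /foo
    return urlsplit(iri).path.startswith("/foo")
```

A candidate IRI is in IS only if every predicate accepts it, which mirrors the requirement that it be an element of each of D1, D2, …, Dn.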

1.3 Operational Semantics

In POWDER, an XML schema defines the set of XML elements and attributes to be used for enforcing the operational semantics of an IRI set definition.

More precisely, we define an XML element iriset to take the place of the IRI set, and its child elements denote the set of IRI constraints C1, C2, …, Cn. The example reported in the previous section can therefore be written as follows:

<wdr:iriset>
  <wdr:includehosts>example.org</wdr:includehosts>
  <wdr:includepathstartswith>/foo</wdr:includepathstartswith>
</wdr:iriset>

1.4 Formal Semantics

The operational semantics described above are underpinned by formal semantics. A GRDDL transform is associated with the POWDER namespace that allows the XML data to be rendered and processed as RDF/OWL. In order for string values of RDF properties to be interpreted as IRIs and their components, we define the following semantic extension which relates to the hasIRI property.

<x, uuu> is in IEXT(I(wdr:hasIRI)) iff uuu is in the value space of xsd:anyURI, uuu is in the domain of I, with I(uuu)=x.

Based on this, the RDF/OWL representation of the XML encoding of an IRI set in the example above will give:

<wdr:iriset rdf:nodeID="IRISet1">
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;includehosts" />
      <owl:hasValue>example.org</owl:hasValue>
    </owl:Restriction>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;includepathstartswith" />
      <owl:hasValue>/foo</owl:hasValue>
    </owl:Restriction>
  </owl:intersectionOf>
</wdr:iriset>

whereas the set of resources having one of the IRIs denoted by the IRI set definition above will be denoted by:

<owl:Restriction>
  <owl:onProperty rdf:resource="&wdr;hasIRI" />
  <owl:someValuesFrom rdf:nodeID="IRISet1" />
</owl:Restriction>

Note that here iriset is an OWL class, whereas includehosts and includepathstartswith are RDF properties.

As is described below, the POWDER vocabulary provides many terms that refer to IRI components, such as includehosts, excludepathstartswith etc. Rather than create a series of similar semantic extensions for each of these terms we refer the reader to the above extension as an indicative example of the intended meaning. For clarity however, we offer the following as an example of how vocabulary terms should be interpreted, based on the semantic extension given above.

<uuu,sss> is in IEXT(I(wdr:includehosts)) iff uuu is in the value space of xsd:anyURI, sss is a string. sss is understood as a space-separated list of strings lll, including an element hhh, such that when uuu is normalized as specified in section 2.1.3 of this document, the final host components of uuu match hhh.

2 Defining a Resource Set

A Resource Set is defined in terms of the IRIs of resources that are its members. Determining whether a candidate resource is, or is not, a member of the set, can therefore be done by comparing its IRI with the data in the set definition. Importantly, defining the Resource Set in terms of IRIs allows us to verify whether the candidate resource is in the set without having to fetch and parse it, or perform a DNS lookup, thus maximizing processing efficiency in many environments.

We define a range of methods to support set definition by IRI, and provide support for methods defined in other Recommendations.

2.1 Constraints on IRI components

The syntax of an IRI, as defined in RFC3987 [IRIS], provides a generic framework for identification schemes that goes beyond what is demanded by the POWDER use cases [USECASES]. We therefore limit our work to IRIs with the syntax: scheme://ihost:port/ipath?iquery, as shown below:

http://www.example.com:1234/example1/example2?query=help
\   /  \             / \  /\                / \        /
 ---    -------------   --  ----------------   --------  
  |           |          |          |              |  
scheme      ihost      port       ipath          iquery

The iuserinfo and ifragment components are not supported, as it is felt that these are not useful in defining IRI sets, and may add a layer of unnecessary complexity. That said, it is noteworthy that IRI sets may be defined using additional vocabularies as set out in Section 3. That extension method, or the use of the includeregex and excluderegex properties, means that user info and fragments can be used in IRI set definitions if required.

The following Regular Expression, elaborated from that offered in RFC 3986 [URIS], provides a means of splitting both URIs and IRIs of this type into their component parts.

(([^:/?#]+):)?(//([^:/?#@]*)(:([0-9]+))?)?([^?#]*)(\?([^#]*))?

This yields the components as shown in Table 1.

Table 1: Mapping between regular expression variables and IRI components
Component  RE variable
scheme     $2
ihost      $4
port       $6
ipath      $7
iquery     $9
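As a rough check of Table 1, the regular expression can be applied with Python's re module. This is a sketch only: Python's RE dialect is used here as a stand-in, and the function name split_iri is hypothetical.

```python
import re

# The regular expression given above, copied unchanged.
IRI_RE = re.compile(
    r"(([^:/?#]+):)?(//([^:/?#@]*)(:([0-9]+))?)?([^?#]*)(\?([^#]*))?"
)


def split_iri(iri: str) -> dict:
    """Split an IRI into the five components listed in Table 1."""
    m = IRI_RE.match(iri)  # every part is optional, so a match always exists
    return {
        "scheme": m.group(2),
        "ihost": m.group(4),
        "port": m.group(6),
        "ipath": m.group(7),
        "iquery": m.group(9),
    }
```

Components that are absent from the input, such as the port in https://example.org/foo, come back as None.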

For the scheme, ihost, port, ipath, and iquery IRI components we define corresponding IRI constraints, in the form of both XML elements and RDF properties, the value of which is a white space-separated list of strings, any one of which must match the relevant portion of the IRI of the candidate resource.

Formally, an IRI set definition D is expressed by one or more IRI constraints of the form C = IRI_component_matches(?x, {string1 | string2 | … | stringn}), where ?x is a variable denoting the IRI component under consideration, and {string1 | string2 | … | stringn} denotes a set consisting either of string string1 OR string2 OR … OR stringn.

Any number of IRI constraints C1, C2, …, Cn can be declared, and, as stated in Section 1.2, the overall IRI set is the intersection of the sets that can be interpreted from the IRI set definitions corresponding to C1, C2, …, Cn. However, with some exceptions, each particular IRI constraint can only appear 0 or 1 times, and some are mutually exclusive. Greater detail on this is provided as terms are introduced and in Section 2.7.

Strings are matched according to one of four rules, named in the matching rule column of Table 2: exact, startsWith, endsWith and contains.

Recognizing the great diversity of potential uses and set definition requirements, multiple IRI constraints are defined relating to the path and query components. Furthermore, for each constraint there is a ‘negative’ constraint, that is, a constraint whose value is a list of strings that must not be present in the relevant IRI component.

Table 2: IRI constraints used to define resource sets by IRI components. The annotations refer to notes in the following text.
IRI constraint         IRI component  Matching rule  Negative constraint
includeschemes         scheme         exact          excludeschemes
includehosts           ihost          endsWith       excludehosts
includeportranges      port           exact          excludeportranges
includeexactpaths      ipath          exact          excludeexactpaths
includepathcontains    ipath          contains       excludepathcontains
includepathstartswith  ipath          startsWith     excludepathstartswith
includepathendswith    ipath          endsWith       excludepathendswith
includequerycontains   iquery         contains       excludequerycontains
includeexactqueries    iquery         exact          excludeexactqueries

includepathcontains and includequerycontains may appear any number of times within an IRI set definition, so that it is easy to create one in which multiple strings must be present in paths and/or queries. This is in contrast to all other terms in Table 2, which can only occur 0 or 1 times, since the IRI of a candidate resource can only have one scheme, one host etc.

As a quick example, the set of all resources on example.org that are fetched using either http or https, where the path component of their IRIs starts with /foo and does not end with .png, is defined thus:

Example 2-1: An IRI Set definition using four IRI constraints

XML:

<wdr:iriset>
  <wdr:includeschemes>http https</wdr:includeschemes>
  <wdr:includehosts>example.org</wdr:includehosts>
  <wdr:includepathstartswith>/foo</wdr:includepathstartswith>
  <wdr:excludepathendswith>.png</wdr:excludepathendswith>
</wdr:iriset>

RDF:

<wdr:iriset>
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;includeschemes" />
      <owl:hasValue>http https</owl:hasValue>
    </owl:Restriction>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;includehosts" />
      <owl:hasValue>example.org</owl:hasValue>
    </owl:Restriction>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;includepathstartswith" />
      <owl:hasValue>/foo</owl:hasValue>
    </owl:Restriction>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;excludepathendswith" />
      <owl:hasValue>.png</owl:hasValue>
    </owl:Restriction>
  </owl:intersectionOf>
</wdr:iriset>
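To illustrate how the four constraints of Example 2-1 combine, the following Python sketch tests candidate IRIs against them. It rests on assumptions: the matching rules are read from their names in Table 2, host matching is done on whole host components, and the function names any_match, host_matches and in_example_2_1 are hypothetical.

```python
from urllib.parse import urlsplit


def any_match(component: str, values: str, rule: str) -> bool:
    """True if any item of the white space-separated list matches the component."""
    items = values.split()
    if rule == "exact":
        return component in items
    if rule == "startsWith":
        return any(component.startswith(v) for v in items)
    if rule == "endsWith":
        return any(component.endswith(v) for v in items)
    if rule == "contains":
        return any(v in component for v in items)
    raise ValueError(f"unknown matching rule: {rule}")


def host_matches(host: str, values: str) -> bool:
    # Assumption: endsWith on hosts compares whole components, so that
    # badexample.org does not match example.org.
    return any(host == v or host.endswith("." + v) for v in values.split())


def in_example_2_1(iri: str) -> bool:
    """Apply the constraints of Example 2-1; all of them must hold."""
    parts = urlsplit(iri)
    return (
        any_match(parts.scheme, "http https", "exact")
        and host_matches(parts.hostname or "", "example.org")
        and any_match(parts.path, "/foo", "startsWith")
        and not any_match(parts.path, ".png", "endsWith")
    )
```

Under these assumptions, https://www.example.org/foo/index.html is in the set, whereas http://example.org/foo/logo.png is excluded by the negative constraint.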

The semantics and constraints of each of the terms in Table 2 are further defined in the POWDER Vocabulary document [VOC]. Precise details of how values for each term are combined are given in Section 2.7 below. However, it is worth noting the points made in the following sub-sections.

2.1.1 IRI Constraints Referring to Ports

Ranges of ports are defined as x-y, where x ≤ y, that is, the lower and upper values in the range are separated by a hyphen. For a port range consisting of a single port (i.e., x=y), the short notation x may be used instead of x-x. Multiple ranges can be listed using white space as the separator. Specific ports or port ranges can be excluded using the excludeportranges constraint, so that the set of all resources on example.org via ports 3125 to 5236 excluding ports 4345 and 5000 can be expressed as in Example 2-2.

Example 2-2: An IRI set definition using port ranges

XML

<wdr:iriset>
  <wdr:includehosts>example.org</wdr:includehosts>
  <wdr:includeportranges>3125-5236</wdr:includeportranges>
  <wdr:excludeportranges>4345-4345 5000-5000</wdr:excludeportranges>
</wdr:iriset>

or:

<wdr:iriset>
  <wdr:includehosts>example.org</wdr:includehosts>
  <wdr:includeportranges>3125-5236</wdr:includeportranges>
  <wdr:excludeportranges>4345 5000</wdr:excludeportranges>
</wdr:iriset>

RDF:

<wdr:iriset>
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;includehosts" />
      <owl:hasValue>example.org</owl:hasValue>
    </owl:Restriction>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;includeportranges" />
      <owl:hasValue>3125-5236</owl:hasValue>
    </owl:Restriction>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;excludeportranges" />
      <owl:hasValue>4345-4345 5000-5000</owl:hasValue>
    </owl:Restriction>
  </owl:intersectionOf>
</wdr:iriset>

or:

<wdr:iriset>
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;includehosts" />
      <owl:hasValue>example.org</owl:hasValue>
    </owl:Restriction>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;includeportranges" />
      <owl:hasValue>3125-5236</owl:hasValue>
    </owl:Restriction>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;excludeportranges" />
      <owl:hasValue>4345 5000</owl:hasValue>
    </owl:Restriction>
  </owl:intersectionOf>
</wdr:iriset>
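The port range notation used in Example 2-2 can be parsed with a few lines of Python. The sketch below is illustrative only; the function names parse_port_ranges, port_in and port_allowed are hypothetical.

```python
def parse_port_ranges(value: str) -> list[tuple[int, int]]:
    """Parse a white space-separated list of x-y ranges (or single ports x)."""
    ranges = []
    for item in value.split():
        low, _, high = item.partition("-")
        lo = int(low)
        hi = int(high) if high else lo  # the short notation x means x-x
        if lo > hi:
            raise ValueError(f"lower bound exceeds upper bound in {item!r}")
        ranges.append((lo, hi))
    return ranges


def port_in(port: int, value: str) -> bool:
    """True if the port falls within any of the listed ranges."""
    return any(lo <= port <= hi for lo, hi in parse_port_ranges(value))


def port_allowed(port: int) -> bool:
    # The port constraints of Example 2-2: included range minus excluded ports.
    return port_in(port, "3125-5236") and not port_in(port, "4345 5000")
```

Both spellings of the excluded ports, 4345-4345 5000-5000 and 4345 5000, parse to the same list of ranges.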

2.1.2 IRI Constraints Referring to Query Strings

Query strings typically contain a series of name-value pairs separated by ampersands thus:

?name1=value1&name2=value2

These are usually acted on by the server to generate content in real time and the order of the name-value pairs is unimportant. For practical purposes ?name1=value1&name2=value2 is equivalent to ?name2=value2&name1=value1. Therefore, if the candidate resource's IRI includes a query string, and if the IRI set definition refers to the query string then:

* If a server is known to use a different delimiter, then a different IRI constraint must be defined, see Section 3.

N.B. If using the IRI constraint relating to the query string of an IRI, then the real-time generation of content should be taken into account. It may be difficult, if not impossible, to predict with certainty what the content of the resource will be and therefore the IRI Set may not be fully defined. It follows that query string-based IRI constraints should be used with caution.
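The order-independence of name-value pairs noted above can be illustrated in Python. This is a sketch under assumptions, not the matching procedure of this document: it assumes the ampersand delimiter, relies on the standard library's parse_qsl (which also percent-decodes names and values), and uses the hypothetical function names query_pairs and same_query.

```python
from urllib.parse import parse_qsl


def query_pairs(query: str) -> list[tuple[str, str]]:
    """Return the name-value pairs of a query string in a canonical order."""
    pairs = parse_qsl(query.lstrip("?"), keep_blank_values=True)
    return sorted(pairs)  # sorting removes any dependence on the original order


def same_query(a: str, b: str) -> bool:
    """True if two query strings carry the same name-value pairs."""
    return query_pairs(a) == query_pairs(b)
```

With this comparison, ?name1=value1&name2=value2 and ?name2=value2&name1=value1 are treated as equivalent.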

2.1.3 IRI Canonicalization

This section and section 2.1.4 are to be updated to be consistent with talking solely about IRIs and to take account of comments received from Thomas Roessler, Eric Prud'hommeaux et al.

Before any IRI or URI matching can take place the following canonicalization steps should be applied to the candidate resource's IRI. These steps are consistent with RFC3986 [URIS], RFC3987 [IRIS] and URISpace [URISpace].

The following table gives some examples.

Table 3: Examples of canonicalized IRIs and URIs
Input IRI/URI                           Canonical form
www.example.com                         http://www.example.com/
http%3A%2F%2Fwww.example.com%2Ffoo      http://www.example.com/foo
HTTp%3a%2f%2fwww.Example.Com:80%2Ffoo   http://www.example.com/foo
http://www.example.com./foo             http://www.example.com/foo
HTTPS://WWW.EXAMPLE.COM/FOO             https://www.example.com/FOO
http://example.com/staff/Fran%c3%a7ois  http://example.com/staff/François
http://example.com/my%20doc.doc         http://example.com/my doc.doc

2.1.4 Data encoding

To complement the URI/IRI canonicalization steps described in the previous section, related processing steps must also be carried out on the strings supplied as set defining data, that is, the values for the RDF properties listed in Table 2.

Bear in mind that if the data is serialized in XML, IRI strings specified in the IRI constraint will be escaped according to the XML syntax using entity references for specific characters (escaping < with &lt; and & with &amp; is mandatory, others may also be used). Moreover, since IRI constraints take a white space-separated list of URI strings as their value, whenever a URI string contains an unescaped white space (i.e., a white space not encoded as %20), it will be substituted by %20.

The following steps should therefore be applied to each item in the list separately.

If the set definition includes values related to the port then matching of the data against the candidate resource's URI/IRI must be carried out as follows:

2.2 Grouping using Wildcards: The includeiripattern and excludeiripattern constraints

Enabling Read Access for Web Resources [WAF] defines a method for encoding the domains and sub-domains from which access to resources on a given Web site should be granted or denied. The includeiripattern and excludeiripattern properties support this syntax directly. Domains and sub-domains may be substituted by a wildcard character (*) according to the following EBNF:

access-item    ::= (scheme "://")? domain-pattern (":" port)? | "*" 
domain-pattern ::= domain | "*." domain

scheme and port are used as defined in RFC 3986. domain is an internationalized domain name as defined in RFC 3490.

It follows that:

<wdr:includehosts>example.com</wdr:includehosts>

and

<wdr:includeiripattern>example.com</wdr:includeiripattern>

are equivalent. However, *.example.com, meaning resources on sub-domains of example.com but not on example.com itself, is not a valid value for includehosts.

Note that paths and query strings MUST NOT be included in the pattern. If these are required in an IRI set definition, the relevant IRI constraints from Table 2 can be used.
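One possible reading of the access-item grammar is sketched below in Python. It is a simplification under assumptions: domains are compared as lower-cased strings without IDNA processing, a port in the pattern is compared only with an explicit port in the IRI, and the names ACCESS_ITEM and matches_pattern are hypothetical.

```python
import re
from urllib.parse import urlsplit

# access-item ::= (scheme "://")? domain-pattern (":" port)? | "*"
ACCESS_ITEM = re.compile(
    r"^(?:(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*)://)?"
    r"(?P<wild>\*\.)?(?P<domain>[^:/?#*]+)"
    r"(?::(?P<port>[0-9]+))?$"
)


def matches_pattern(iri: str, pattern: str) -> bool:
    """Test an IRI against a single includeiripattern value."""
    if pattern == "*":
        return True
    m = ACCESS_ITEM.match(pattern)
    if m is None:
        # Paths and query strings are not allowed in the pattern.
        raise ValueError(f"not a valid access-item: {pattern!r}")
    parts = urlsplit(iri)
    host = parts.hostname or ""
    domain = m["domain"].lower()
    if m["scheme"] and parts.scheme != m["scheme"].lower():
        return False
    if m["port"] and parts.port != int(m["port"]):
        return False
    is_subdomain = host.endswith("." + domain)
    if m["wild"]:
        return is_subdomain  # "*.domain" excludes the domain itself
    return host == domain or is_subdomain  # same as includehosts
```

A pattern that carries a path, such as example.com/foo, is rejected with a ValueError rather than silently accepted.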

2.3 Grouping by Regular Expression: The includeregex and excluderegex constraints

The IRI constraints discussed above all take white space-separated lists of strings as their values. It is believed that these properties will be easy to use and cover the overwhelming majority of cases. However, the use of strings with fixed matching rules clearly presents a restriction on flexibility. To support fully flexible set definition by IRI, the includeregex and excluderegex properties take a Regular Expression (RE) and should be applied to the candidate resource's complete IRI (after following the canonicalization steps above).

The RE syntax used is defined by XML schema as modified by XQuery 1.0 and XPath 2.0 Functions and Operators [XQXP].

N.B. The value of the includeregex and excluderegex properties MUST be a single Regular Expression, not a white space-separated list.

As an example, the set of all the resources hosted either by example.org or example.net, where the path component of their IRIs starts either with foo or bar, can be defined thus:

Example 2-3: IRI set definition by regular expression (not including character escaping)

XML:

<wdr:iriset>
  <wdr:includeregex>^(([^:/?#]+):)//([^:/?#]+.)?example.(org|net)/(foo|bar)</wdr:includeregex>
</wdr:iriset> 

RDF:

<wdr:iriset>
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;includeregex" />
      <owl:hasValue>^(([^:/?#]+):)//([^:/?#]+.)?example.(org|net)/(foo|bar)</owl:hasValue>
    </owl:Restriction>
  </owl:intersectionOf>
</wdr:iriset>

It is important to note that Example 2-3 does not take account of the need to escape certain characters.

The following characters are used as meta characters in Regular Expressions and MUST therefore be escaped if used in an RE pattern given as the value of the includeregex property:

. \ ? * + { } ( ) [ ]

In addition, the < (less than) character MUST always be escaped since, if the set definition is given in RDF/XML, it could be mistaken for the beginning of the closing </wdr:includeregex> tag.

As a safeguard against unintended consequences, other characters that always or typically have special meaning within URI strings and/or XML SHOULD also be escaped, namely:

! " # % & ' , - / : ; = > @ [ ] _ ` ~

As a result, Example 2-3 should properly be written as shown in Example 2-4 below.

Example 2-4: Set definition by regular expression, including character escaping

XML:

<wdr:iriset>
  <wdr:includeregex>^(([^\:\/\?\#]+)\:)//([^\:\/\?\#]+\.)?example\.(org|net)/(foo|bar)</wdr:includeregex>
</wdr:iriset> 

RDF:

<wdr:iriset>
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;includeregex" />
      <owl:hasValue>^(([^\:\/\?\#]+)\:)//([^\:\/\?\#]+\.)?example\.(org|net)/(foo|bar)</owl:hasValue>
    </owl:Restriction>
  </owl:intersectionOf>
</wdr:iriset>

2.3.1 Safe Use of includeregex

Example 2-4 uses a modified version of the RE given in Section 2.1. This is the safest method but is not, perhaps, the most natural way to proceed. If a less rigorous approach is taken it is easy to make mistakes when specifying REs, and incorrect REs in set definitions will have one of two possible (and obvious) consequences:

  1. the corresponding set does not include the intended resources;
  2. the corresponding set includes resources not intended to be included.

Example 2-5 shows how this can happen.

Example 2-5: An example of a bad set definition by regular expression

XML:

<wdr:iriset>
  <wdr:includehosts>example.org</wdr:includehosts>
  <wdr:includeregex>https</wdr:includeregex>
</wdr:iriset> 

The intention in the RE given in Example 2-5 is probably to say "all resources on example.org with a URI beginning with https." However, as the RE is not anchored at either end, what this actually means is "all resources on example.org where the URI includes https". Thus this Resource Set includes both:

Adding in anchors at the beginning and end of the RE can have equally undesirable consequences.

Example 2-6: A second example of a bad set definition by regular expression

XML:

<wdr:iriset>
  <wdr:includehosts>example.org</wdr:includehosts>
  <wdr:includeregex>^https$</wdr:includeregex>
</wdr:iriset> 

In Example 2-6, the intention is, again probably, to define the set of "all resources on example.org fetched using https only". However, adding both the ^ and $ anchors at the beginning and end of the RE means that the whole IRI must be https from start to finish — which can never be true so this Resource Set is equivalent to the empty set.

Example 2-7 shows one possible way to encode the intended set definition.

Example 2-7: An example of a correct set definition by regular expression

XML:

<wdr:iriset>
  <wdr:includehosts>example.org</wdr:includehosts>
  <wdr:includeregex>^https</wdr:includeregex>
</wdr:iriset>

Whilst Example 2-7 'works', the potential dangers of using REs mean that it is generally better to use component strings where possible. Example 2-7 is therefore better written as shown in Example 2-8 below.

Example 2-8: A re-write of Example 2-7 without using a regular expression

XML:

<wdr:iriset>
  <wdr:includehosts>example.org</wdr:includehosts>
  <wdr:includeschemes>https</wdr:includeschemes>
</wdr:iriset> 
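The effect of anchoring in Examples 2-5 to 2-7 can be observed with a short Python sketch. Python's re module is used here only as an approximation of the XML Schema RE syntax, and the second candidate IRI and the function name matched are hypothetical.

```python
import re

# Two candidate IRIs on example.org; only the first is fetched using https.
candidates = [
    "https://example.org/",
    "http://example.org/https-guide.html",
]


def matched(pattern: str) -> list[str]:
    """Return the candidates that the RE matches anywhere in the IRI."""
    return [iri for iri in candidates if re.search(pattern, iri)]
```

Here matched("https") returns both candidates, matched("^https$") returns none, and matched("^https") returns only the first, which is the behavior described for Examples 2-5, 2-6 and 2-7 respectively.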

2.4 Grouping by IP Address

We define includeips, which takes a list of single IP addresses, and includeipranges, which takes a CIDR block. Could we simplify this so that if an IP address were given without any /x at the end, /32 would be implicit? Or maybe we should explicitly allow IP addresses as values for includehosts?

A set of resources can be defined in terms of the IP address(es) from which the resources are served. To support this we define two IP constraints: includeips, which takes a white space-separated list of single IP addresses, and includeipranges, which takes a white space-separated list of CIDR blocks [CIDR]. Negative versions of these constraints are also defined: excludeips and excludeipranges respectively.

Note that includeips and includeipranges are mutually exclusive, that is, an IRI set may include one or other, but not both of these constraints.

The includeips constraint is simple enough: Example 2-9 defines the set of IRIs whose ihost component is equal to or can be resolved to IP address 123.123.123.123 (i.e., the set of resources available from IP address 123.123.123.123).

Example 2-9: An IRI set definition using the includeips constraint

XML:

<wdr:iriset>
  <wdr:includeips>123.123.123.123</wdr:includeips>
</wdr:iriset>

RDF:

<wdr:iriset>
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;includeips" />
      <owl:hasValue>123.123.123.123</owl:hasValue>
    </owl:Restriction>
  </owl:intersectionOf>
</wdr:iriset>
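
The matching rule described for includeips can be sketched in Python as follows. This is an illustration only, not part of POWDER: the names matches_includeips, dns_resolve and the resolve parameter are assumptions introduced here, and the default resolver is simply the standard library's DNS lookup.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def dns_resolve(host):
    """Return the set of IPv4 addresses a host name resolves to (network lookup)."""
    _name, _aliases, addresses = socket.gethostbyname_ex(host)
    return set(addresses)


def matches_includeips(iri, includeips, resolve=dns_resolve):
    """True if the IRI's host is, or resolves to, one of the listed IP addresses.

    includeips is the white space separated value of the constraint.
    """
    wanted = {ipaddress.ip_address(ip) for ip in includeips.split()}
    host = urlsplit(iri).hostname
    if host is None:
        return False
    try:
        # The host component may itself be an IP address literal.
        return ipaddress.ip_address(host) in wanted
    except ValueError:
        pass
    # Otherwise resolve the host name and compare each resulting address.
    return any(ipaddress.ip_address(a) in wanted for a in resolve(host))
```

Passing the resolver as a parameter keeps the membership test separate from the DNS lookup, so a processor could substitute a cached or stubbed resolver.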

The includeipranges constraint allows the definition of a resource set based on a range of IP addresses, specified in a CIDR block. A CIDR block has the form <IP address>/x, where the CIDR prefix x is a number ranging from 1 to 32, denoting the leftmost x bits which a set of IP addresses shares. For instance, the CIDR block 123.234.245.254/8 denotes the range of IP addresses sharing the leftmost 8 bits, i.e., those starting with 123.

As an example, suppose that an IRI Set definition should denote all the resources hosted by the machines with IP addresses 123.234.245.254 and 123.234.245.255. This can be expressed by the following IRI Set definition:

Example 2-10: An IRI set definition using the includeipranges constraint

XML:

<wdr:iriset>
  <wdr:includeipranges>123.234.245.254/31</wdr:includeipranges>
</wdr:iriset> 

RDF:

<wdr:iriset>
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;includeipranges" />
      <owl:hasValue>123.234.245.254/31</owl:hasValue>
    </owl:Restriction>
  </owl:intersectionOf>
</wdr:iriset>
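
Membership of an address in an includeipranges value can be checked with Python's standard ipaddress module, as in the following sketch. The function name in_ipranges is an assumption made for illustration; strict=False is used so that blocks such as 123.234.245.254/8, which have bits set beyond the prefix, are accepted.

```python
import ipaddress


def in_ipranges(address, includeipranges):
    """True if address falls inside any CIDR block in the constraint value."""
    ip = ipaddress.ip_address(address)
    blocks = [ipaddress.ip_network(b, strict=False) for b in includeipranges.split()]
    return any(ip in block for block in blocks)


# The block from Example 2-10 covers .254 and .255 but not .253.
covers_255 = in_ipranges("123.234.245.255", "123.234.245.254/31")
covers_253 = in_ipranges("123.234.245.253", "123.234.245.254/31")
```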

2.4.1 Safe Usage of the includeipranges Constraint

In order to use CIDR blocks correctly, it must be taken into account that a CIDR prefix refers to the binary representation of an IP address. For instance, the binary representation of IP address 123.234.245.254 corresponds to

01111011 11101010 11110101 11111110

A CIDR block 123.234.245.254/31 denotes a range of IP addresses

01111011 11101010 11110101 1111111b

i.e., the range of IP addresses sharing the leftmost 31 bits with b either 1 or 0 (formally b ∈ {0,1}). Consequently, the CIDR block 123.234.245.254/31 denotes the following IP addresses:

01111011 11101010 11110101 11111110 = 123.234.245.254

01111011 11101010 11110101 11111111 = 123.234.245.255

This also means that the CIDR block 123.234.245.255/31 is equivalent to 123.234.245.254/31.
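
The binary reasoning above can be reproduced with a short Python sketch using the standard ipaddress module. The helper name to_binary is an assumption introduced for this illustration.

```python
import ipaddress


def to_binary(address):
    """Binary form of an IPv4 address, grouped into four octets."""
    bits = format(int(ipaddress.ip_address(address)), "032b")
    return " ".join(bits[i:i + 8] for i in range(0, 32, 8))


# Both spellings normalise to the same block because only the
# leftmost 31 bits are significant.
a = ipaddress.ip_network("123.234.245.254/31", strict=False)
b = ipaddress.ip_network("123.234.245.255/31", strict=False)
same_block = (a == b)

# The addresses denoted by the block.
members = [str(ip) for ip in a]
```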

It is important to note that the number N of IP addresses denoted by a CIDR block with prefix x is N = 2^(32-x). Thus, if x = 32, N = 2^0 = 1; if x = 31, N = 2^1 = 2; and so on. Consequently, a single CIDR block can denote a range of IP addresses only when the number N of addresses in the range is a power of 2 and the first address of the range is a multiple of N. Otherwise, it is necessary to provide a white space separated list of CIDR blocks or, alternatively, individual IP addresses. For instance, the resources hosted by the machines with IP addresses 123.234.245.253, 123.234.245.254, and 123.234.245.255 can be expressed as shown in Example 2-11.

Example 2-11: IRI set definition across several IP addresses

<wdr:iriset>
  <wdr:includeipranges>123.234.245.253/32 123.234.245.254/31</wdr:includeipranges>
</wdr:iriset> 

OR

<wdr:iriset>
  <wdr:includeips>123.234.245.253 123.234.245.254 123.234.245.255</wdr:includeips>
</wdr:iriset> 
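
The list of CIDR blocks needed for an arbitrary range can be computed rather than worked out by hand. The following Python sketch uses the standard ipaddress module; the function names block_size and covering_blocks are assumptions introduced for illustration.

```python
import ipaddress


def block_size(prefix):
    """Number of addresses in an IPv4 CIDR block with the given prefix length."""
    return 2 ** (32 - prefix)


def covering_blocks(first, last):
    """White space separated CIDR blocks that exactly cover first..last inclusive."""
    nets = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return " ".join(str(n) for n in nets)


# Three addresses are not a power of 2, so two blocks are needed.
value = covering_blocks("123.234.245.253", "123.234.245.255")
```

The resulting string has the form expected as the value of includeipranges.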

Incidentally, as already noted, includeips and includeipranges are mutually exclusive. It is perhaps tempting to create an IRI set definition like that shown in Example 2-12. However, this would require a candidate resource to be available from both 123.234.245.253 AND either 123.234.245.254 OR 123.234.245.255, which is impossible, so Example 2-12 is tantamount to the empty set.

Example 2-12: Erroneous IRI set definition across several IP addresses

<wdr:iriset>
  <wdr:includeipranges>123.234.245.254/31</wdr:includeipranges>
  <wdr:includeips>123.234.245.253</wdr:includeips>
</wdr:iriset> 
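
Why the combined definition is empty can be seen by treating each constraint as a set of addresses and intersecting them. This Python sketch is a simplification that assumes each resource is served from a single IP address.

```python
import ipaddress

# Addresses admitted by each constraint taken on its own.
range_members = set(ipaddress.ip_network("123.234.245.254/31"))
ip_members = {ipaddress.ip_address("123.234.245.253")}

# Each constraint defines its own set; the IRI set is their intersection.
both = range_members & ip_members
```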

Defining IRI sets by IP address puts a burden on the processor since it will often have to perform a DNS lookup to determine whether a candidate resource is, or is not, a member of the IRI set. Furthermore, it is particularly easy to include resources in the set by accident using such a broad-sweep approach. If a Web site is hosted on a shared server, for example, it is very likely that the set will include resources by mistake.

Defining an IRI set by IP address would, however, be appropriate where a content provider operates a large network of servers, or where particular types of content to be described are hosted on servers that can easily be identified by their IP address.

2.5 Enumerating Elements of an IRI Set: the includeresources and excluderesources Constraints

It is useful to be able to include or exclude resources from sets by simple listing. The includeresources and excluderesources constraints support this; both take white space separated lists of IRIs. To give a simple example, the set of all resources on example.org except its stylesheet and JavaScript library can be encoded as shown in Example 2-13.

Example 2-13: IRI Set definition using excluderesources constraint

XML:

<wdr:iriset>
  <wdr:includehosts>example.org</wdr:includehosts>
  <wdr:excluderesources>http://www.example.org/stylesheet.css http://www.example.org/jslib.js</wdr:excluderesources>
</wdr:iriset>

RDF:

<wdr:iriset>
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;includehosts" />
      <owl:hasValue>example.org</owl:hasValue>
    </owl:Restriction>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&wdr;excluderesources" />
      <owl:hasValue>http://www.example.org/stylesheet.css http://www.example.org/jslib.js</owl:hasValue>
    </owl:Restriction>
  </owl:intersectionOf>
</wdr:iriset>

As emphasized throughout this document, each constraint and its value creates a set definition of its own and the full IRI Set is the intersection of those sets. Thus an alternative way of looking at Example 2-13 is to say that a candidate resource is a member of the Resource Set IF it is on example.org AND does not have the IRI http://www.example.org/stylesheet.css AND does not have the IRI http://www.example.org/jslib.js.
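
The IF ... AND ... AND reading of Example 2-13 can be sketched in Python as follows. The function names are assumptions introduced for illustration. The sketch also assumes that a host constraint matches the named host and its sub-domains, and it compares IRIs as plain strings without any canonicalization.

```python
from urllib.parse import urlsplit


def host_matches(host, includehosts):
    """True if host equals, or is a sub-domain of, any listed host."""
    host = host.lower()
    return any(
        host == h or host.endswith("." + h)
        for h in includehosts.lower().split()
    )


def in_iri_set(iri, includehosts, excluderesources):
    """Intersection of the two constraints used in Example 2-13."""
    host = urlsplit(iri).hostname or ""
    on_host = host_matches(host, includehosts)
    not_excluded = iri not in excluderesources.split()
    return on_host and not_excluded


EXCLUDED = "http://www.example.org/stylesheet.css http://www.example.org/jslib.js"
page_ok = in_iri_set("http://www.example.org/index.html", "example.org", EXCLUDED)
css_ok = in_iri_set("http://www.example.org/stylesheet.css", "example.org", EXCLUDED)
```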

2.6 Redirection: the includeredirection constraint

This section needs updating, with POWDER and POWDER-S examples

If a Resource Set is defined in terms of the IRIs of the resources that are elements of the set then resolving the IRIs may lead to redirection through 3xx HTTP status codes [HTTPCODE]. By default, such redirection MUST lead to the 'new' resource itself being compared with the Resource Set definition. That is, if the resource identified by IRI1 is an element of the Resource Set but, when resolving it, the user agent is redirected via a 3xx HTTP response code to IRI2, then the resource identified by IRI2 MUST itself be compared with the Resource Set definition to determine whether or not it is an element of the set.

Recognizing that there may be circumstances where this default behavior may cause unnecessary latency, redirected resources MAY be included by use of the includeRedirection property. This constraint takes one of HttpAnyRedirect, HttpPermRedirect or HttpTempRedirect as its value. These classes are all based on those defined in the HTTP in RDF vocabulary [HTTPRDF]. See the POWDER Vocabulary [VOC] for details. As their names suggest, the HTTP redirection classes allow Resource Set definitions to include the targets of any redirection, of permanent redirection only (i.e. HTTP response code 301), or of any of the temporary redirection HTTP response codes (302, 303 and 307).

Example 2-16: Resource Set definition using includeRedirection property

<wdr:ResourceSet>
  <wdr:includeHosts>example.org</wdr:includeHosts>
  <wdr:includeRedirection rdf:resource="http://www.w3.org/2007/05/powder#HttpPermRedirect" />
</wdr:ResourceSet>

Example 2-16 encodes that if, when resolving any URI on the example.org domain (or its sub-domains), the user agent is redirected through a 301 (permanent) HTTP response code then the target resources are elements of the Resource Set, even if those resources are on a different domain. Resources resolved following other redirects would not be included unless they were also on the example.org domain.
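
The decision described above can be sketched as a small Python function. This is an illustration under stated assumptions, not a normative algorithm: the function name target_included and its parameters are introduced here, and HttpAnyRedirect is taken to cover exactly the status codes listed in the text (301, 302, 303 and 307).

```python
PERMANENT = {301}
TEMPORARY = {302, 303, 307}

# Status codes covered by each redirection class named in the text.
REDIRECT_CLASSES = {
    "HttpPermRedirect": PERMANENT,
    "HttpTempRedirect": TEMPORARY,
    "HttpAnyRedirect": PERMANENT | TEMPORARY,
}


def target_included(status, include_redirection, target_matches_definition):
    """Decide whether the target of a redirect is an element of the set.

    status: the 3xx status code that caused the redirect.
    include_redirection: one of the class names above, or None when the
        constraint is absent.
    target_matches_definition: result of comparing the target resource with
        the set definition itself (the default behaviour).
    """
    if include_redirection is not None:
        if status in REDIRECT_CLASSES[include_redirection]:
            return True
    return target_matches_definition
```

With HttpPermRedirect, as in Example 2-16, a 301 target on another domain is included, while a 302 target is included only if it matches the definition in its own right.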

2.7 Complex Sets: Conjunction and Disjunction

Earlier versions of this document relied on using owl:unionOf and owl:intersectionOf. These are not applicable in the (XML-based) POWDER model and therefore a new encoding needs to be devised that maps to those OWL properties in POWDER-S.

3 Extension Mechanism

This section needs further work but the basic extension mechanism is to create an XSLT to transform custom IRI constraints into native POWDER. Note that in a POWDER document, only POWDER's IRI constraints can be used.

In this document we have specified a vocabulary [VOC] for defining sets of resources by making reference to resource identifiers and whitespace-separated lists thereof. This vocabulary is clearly designed to be used with information resources available on the Web, identified by URLs containing host names, directory paths, IP addresses, port numbers, and so on.

However, there is no fundamental reason to constrain the domain of such datatype properties to URLs, so there should not be unnecessary constraints on how the protocol works. In other words, the domain of these new properties does not need to be URL identifiers, but may be any kind of URI/IRI; for example, ISBN or ISAN numbers, encoded by using the corresponding URN namespace [URN].

Creators of POWDER documents may extend the vocabulary used in specifying IRI Sets, by defining new datatype properties for URIs/IRIs. All such extensions to the POWDER vocabulary MUST be defined by means of GRDDL transformations [GRDDL] to terms of the POWDER vocabulary.

Developers of POWDER tools MAY directly implement extensions they know about, but MUST include a mechanism for retrieving and applying the GRDDL transformations to extensions they do not know about.

3.1 Extension Example 1: Custom IRI Patterns

As an example of a service-specific extension, consider a service which uses Unix shell wildcards instead of regular expressions, so that www.example.org/* means "all the resources on www.example.org." Such a system is easily used within an IRI set, only requiring the definition of a single IRI constraint shell:pattern.

A publisher of a POWDER document using shell:pattern must include links to GRDDL transformations that will replace it with appropriate terms from the POWDER vocabulary, as shown in Example 3-1.

Example 3-1 An IRI set definition using a custom IRI pattern and the corresponding definition using standard regular expressions.

Custom IRI pattern:

<wdr:iriset>
  <shell:pattern>www.example.org/*</shell:pattern>
</wdr:iriset>

Corresponding POWDER IRI constraint:

<wdr:iriset>
  <wdr:includeregex>http://www.example.org/.*</wdr:includeregex>
</wdr:iriset>
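
The mapping performed by such a transformation can be illustrated in Python. In POWDER the transformation itself would be supplied through GRDDL, so the following is only a sketch of the string rewriting involved. The function name shell_pattern_to_regex and the default http scheme are assumptions. Unlike the expression in Example 3-1, the sketch escapes literal characters such as the dots in the host name, and it uses Python's regular expression syntax rather than the syntax adopted by POWDER.

```python
import re


def shell_pattern_to_regex(pattern, scheme="http"):
    """Translate a pattern such as www.example.org/* into a regular expression.

    Only the * wildcard is handled; every other character is matched literally.
    """
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return "^" + re.escape(scheme + "://") + ".*".join(literal_parts)


regex = shell_pattern_to_regex("www.example.org/*")
on_site = re.search(regex, "http://www.example.org/a/b.html") is not None
off_site = re.search(regex, "http://www.example.com/") is not None
```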

3.2 Extension Example 2: Custom Site Structure

Some content providers serve dynamic content stored in a database, so that IRIs express queries to the database. IRIs of this kind have a certain structure, but this structure is neither obvious nor easily interpreted by humans. Furthermore, conventional grouping mechanisms cannot be used to group resources, as the site structure does not match any directory hierarchy.

As an example, consider sport.example.gr, a Greek sports news site, where IRIs look like the one shown in Example 3-2-1. The adopted scheme is systematic so that sport=2&countryID=16 provides a front page with news about Greek basketball and links to various Greek basketball leagues, sport=3&countryID=16 a front page about Greek volleyball, etc.

Example 3-2-1 Sample IRI from site serving dynamic content. sport=1 stands for football and countryID=16 stands for Greece.

http://sport.example.gr/matches.asp?sport=1&countryID=16&champID=2

A POWDER document providing metadata about this web site would have to use regular expression matching with explicit reference to the numerical values in the country and sport fields of the query. This process is error-prone, and requires extensive changes if the underlying database schema is modified or extended.

As an alternative, the site developer may provide a POWDER vocabulary extension that abstracts away from the database schema to allow reference to sports and countries, as shown in Example 3-2-2. POWDER document authors can then use the properties in this extension to create POWDER documents that are valid even if the site schema is modified, as long as the site developer updates the relevant transformations.

Example 3-2-2 An IRI set definition using site-specific extensions and the equivalent definition using standard POWDER vocabulary.

Custom IRI constraint:

<wdr:iriset>
  <sport:countries>Greece</sport:countries>
  <sport:sports>Football Basketball</sport:sports>
</wdr:iriset>

Corresponding POWDER IRI constraint:

<wdr:iriset>
  <wdr:includeregex>countryID=16</wdr:includeregex>
  <wdr:includeregex>sport=[12]</wdr:includeregex>
</wdr:iriset>
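
The abstraction that the site developer maintains amounts to a lookup from names to database identifiers. The Python sketch below illustrates this, using the identifiers mentioned in the text (sport 1 to 3, countryID 16) and the query parameter names from the sample IRI in Example 3-2-1. The table and function names are assumptions, and the expressions are anchored more tightly than those in the example so that, for instance, sport=12 is not matched.

```python
import re

# Lookup tables the site developer would maintain (values from the text).
COUNTRY_IDS = {"Greece": 16}
SPORT_IDS = {"Football": 1, "Basketball": 2, "Volleyball": 3}


def alternatives(names, table):
    """Regex alternation of the ids for a white space separated list of names."""
    ids = sorted(table[name] for name in names.split())
    return "(" + "|".join(str(i) for i in ids) + ")"


def to_includeregex(countries, sports):
    """Return one regular expression per custom constraint."""
    return [
        "[?&]countryID=" + alternatives(countries, COUNTRY_IDS) + "(&|$)",
        "[?&]sport=" + alternatives(sports, SPORT_IDS) + "(&|$)",
    ]


regexes = to_includeregex("Greece", "Football Basketball")
iri = "http://sport.example.gr/matches.asp?sport=1&countryID=16&champID=2"
in_set = all(re.search(r, iri) for r in regexes)
```

If the database schema changes, only the lookup tables need updating; the custom constraints in existing POWDER documents stay the same.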

3.3 Extension Example 3: ISAN

The International Standard Audiovisual Number [ISAN1] is a voluntary numbering system for the identification of audiovisual works. Following ISO 15706, the numbers are written as 24 hexadecimal digits in the following format [ISAN2].

     -----root----- episode    -version-
ISAN 1881-66C7-3420 - 0000 -7- 9F3A-0245 -U

The root of an ISAN number is assigned to a core work with the other numbers being used for things like episodes, different language versions, promotional trailers and so on.

Since ISAN numbers are URNs [URN], and hence IRIs of the urn: scheme [URIS], a vocabulary can readily be defined to allow IRI Sets to be defined based on ISAN numbers. The terms might be along the lines of:

includeRoots: a white space separated list of hexadecimal digits and hyphens that would be matched against the first three blocks in the ISAN number.

includeEpisodes: a white space separated list of hexadecimal digits and hyphens that would be matched against the 4th block of 4 digits in the ISAN number.

includeVersions: a white space separated list of hexadecimal digits and hyphens that would be matched against the 5th and 6th blocks of 4 digits in the ISAN number.

The set of all audio visual resources that relate to two particular works might then be defined as shown in Example 3-3.

Example 3-3: An IRI set definition using an ISAN number pattern and the corresponding definition using standard POWDER vocabulary

Custom ISAN pattern:

<wdr:iriset>
  <ex_isan:includeRoots>1881-66C7-3420 1881-66C7-3421</ex_isan:includeRoots>
</wdr:iriset>

Corresponding POWDER IRI constraint:

<wdr:iriset>
  <wdr:includeregex>^urn:isan:(1881-66C7-3420|1881-66C7-3421)</wdr:includeregex>
</wdr:iriset>
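
Generating the regular expression from an includeRoots value can be sketched in Python as follows. The function name roots_to_regex is an assumption, as is the urn:isan: prefix taken from Example 3-3. The alternatives are placed inside a single group so that the ^ anchor and the prefix apply to every root, not just the first.

```python
import re


def roots_to_regex(include_roots, prefix="urn:isan:"):
    """Build a regex matching IRIs whose ISAN root is one of the listed values."""
    roots = "|".join(re.escape(r) for r in include_roots.split())
    # Group the alternatives so the anchor and prefix apply to each one.
    return "^" + re.escape(prefix) + "(" + roots + ")"


regex = roots_to_regex("1881-66C7-3420 1881-66C7-3421")
first = re.search(regex, "urn:isan:1881-66C7-3420-0000-7-9F3A-0245-U") is not None
stray = re.search(regex, "http://example.org/1881-66C7-3421") is not None
```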

References

Normative References

[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels, S. Bradner. IETF, March 1997. This document is at http://ietf.org/rfc/rfc2119.
[IRIS]
RFC 3987 — Internationalized Resource Identifiers (IRIs), M. Duerst and M. Suignard, IETF, January 2005. This document is at http://www.ietf.org/rfc/rfc3987.txt
[URIS]
RFC 3986 — Uniform Resource Identifiers (URI): Generic Syntax, T. Berners-Lee, R. Fielding and L. Masinter, IETF, January 2005. This document is http://tools.ietf.org/html/rfc3986.
[URN]
Official IANA Registry of URN Namespaces. This document is at http://www.iana.org/assignments/urn-namespaces.
[GRDDL]
Gleaning Resource Descriptions from Dialects of Languages (GRDDL) W3C Recommendation 11 September 2007. D. Connolly. This document is at http://www.w3.org/TR/grddl/
[UTF-8]
RFC 3629 — UTF-8, a transformation format of ISO 10646, F. Yergeau, November 2003. This document is at http://www.ietf.org/rfc/rfc3629.txt
[RFC3490]
RFC 3490 — Internationalizing Domain Names in Applications (IDNA) P. Faltstrom, P. Hoffman, A. Costello. This document is at http://www.ietf.org/rfc/rfc3490.txt
[WAF]
Enabling Read Access for Web Resources A van Kesteren. This document is at http://www.w3.org/TR/access-control/
[XQXP]
XQuery 1.0 and XPath 2.0 Functions and Operators, A. Malhotra, J. Melton, N. Walsh. W3C Recommendation 23 January 2007. This document is at http://www.w3.org/TR/xpath-functions/
[CIDR]
RFC 1518 — An Architecture for IP Address Allocation with CIDR, Y. Rekhter and T. Li, editors, IETF, September 1993. This document is http://tools.ietf.org/html/rfc1518.
[HTTPCODE]
Part of Hypertext Transfer Protocol -- HTTP/1.1, RFC 2616, Fielding, et al. This document is at http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
[HTTPRDF]
HTTP Vocabulary in RDF J Koch, C Velasco, S Abou-Zahra. This document is at http://www.w3.org/TR/HTTP-in-RDF/
[ATOM]
Atom Format Nottingham & Sayre. This document is at http://www.ietf.org/rfc/rfc4287.txt
[HTMLPROF]
HTML 4.01 D. Raggett, A. Le Hors, I. Jacobs. This document is at http://www.w3.org/TR/1999/REC-html401-19991224/struct/global.html#profiles
[OWLSO]
OWL Web Ontology Language Guide: Set Operators M. Smith, C. Welty, D. McGuinness. This document is at http://www.w3.org/TR/2004/REC-owl-guide-20040210/
[WSDL]
Web Services Description Language (WSDL) 1.1, E Christensen, F Curbera, G Meredith, S Weerawarana. This document is at http://www.w3.org/TR/wsdl

Sources

[USECASES]
POWDER: Use Cases and Requirements, W3C Working Group Note 31 October 2007, P. Archer. This document is at http://www.w3.org/TR/powder-use-cases/
[DR]
Protocol for Web Description Resources (POWDER): Description Resources, K Smith, P Archer, A Perego. This document is at http://www.w3.org/TR/powder-dr/
[VOC]
Protocol for Web Description Resources (POWDER): Web Description Resources (WDR) Vocabulary, A Perego, P Archer. This document is at http://www.w3.org/TR/powder-voc/
[WDRD]
Protocol for Web Description Resources (POWDER): Web Description Resources Datatypes (WDRD), A Perego, P Archer, K Smith. This document is at http://www.w3.org/TR/powder-xsd/
[PRIMER]
Protocol for Web Description Resources (POWDER): Primer 2008, K. Scheppe, D. Pentecost. (URI TBC)
[TESTS]
Protocol for Web Description Resources (POWDER): Test Suite 2008, P. Nasikas. (URI TBC)
[Rabin]
URI Pattern Matching for Groups of Resources J Rabin, Draft 0.1 17 June 2006. This document is at http://www.w3.org/2005/Incubator/wcl/matching.html
[WCL-XG]
W3C Content Label Incubator Group February 2006 - February 2007
[URISpace]
URISpace 1.0, M. Nottingham, W3C Note 15 February 2001
[SPARQL]
SPARQL Query Language for RDF E Prud'hommeaux, A Seaborne. This document is at http://www.w3.org/TR/rdf-sparql-query/
[SOAP]
See, for example, SOAP Version 1.2 Part 0: Primer (Second Edition) N Mitra, Y Lafon. This document is at http://www.w3.org/TR/soap12-part0/.
[ROBOTS]
robotstxt.org This document is at http://www.robotstxt.org/.
[ISAN1]
International Standard Audiovisual Number
[ISAN2]
ISAN FAQs: What is the ISAN? This document is at http://www.isan.org/portal/page?_pageid=166,41960&_dad=portal&_schema=PORTAL.
[Google]
Google Custom Search Engine URL Patterns

Acknowledgments

The editors duly acknowledge the earlier work in this area carried out by Jo Rabin. Jeremy Carroll and David Booth developed the operational and formal semantics model and Stasinos Konstantopoulos rewrote the extension section. The editors gratefully acknowledge the further contributions made by all members of the POWDER Working Group.

Change Log

Changes since First Public Working Draft

  1. Updated introduction to refer to vocabulary and XML data types documents. Corrected erroneous use of 'QNames'.
  2. Small addition to the introduction to Grouping by address paragraph.
  3. Update status section
  4. Renumbering of sections previous 2.2 - 2.5
  5. Insertion of Grouping using Wildcards following discussion with Web Application Formats Working Group
  6. Resolution of open question on choice of Regular Expression syntax. Now use XML Schema REs as modified by XPath/XQuery for consistency with other W3C work - the syntax more than meets POWDER's requirements. Data type to be defined in POWDER's own XML Schema
  7. Added hyperlinks to the first mention of each Class and property, pointing to its entry in the vocabulary document
  8. Removed includeUserInfo and includeFragments properties since these are not strictly part of HTTP, the former can cause security issues, especially when written as username:password, and grouping by fragments is very vague since there is no sure way to define the end of a fragment.
  9. Section 3 completely rewritten. Feature at Risk marker removed.

Changes since 31 October 2007 draft

  1. Status section updated to reflect substantial change since previous version
  2. Intro extended to include mention of primer and test suites, plus added namespace table etc.
  3. Section 1.2 amended and sections 1.3 and 1.4 added to explain XML to RDF/OWL model via GRDDL, with Semantic Extension defined in section 1.4
  4. Resource Set changed to IRI set, and all mention of URI changed to IRI throughout.
  5. Regular Expressions in examples 2-3 and 2-4 corrected
  6. The section on grouping resources by the properties of those resources has been removed completely - we now only support grouping by IRI constraint
  7. As noted in the text above, the section on conjunction and disjunction needs to be rewritten to work in the POWDER/POWDER-S model. The section on logical inconsistency has been removed for now too.
  8. The extension mechanism section has been re-written
  9. Several sections have been renumbered.
  10. Acknowledgements section extended to cite Jeremy Carroll, David Booth and Stasinos Konstantopoulos