Re: String Matching -> Reg Ex is not always easy

On Wed Mar 26 11:10:17 2008 Phil Archer said:

> True, but you've not negated the query strings... would you keep 
> excluderegex??

Well, you know, copy-pasting...

> Could I ask you please to create a POWDER-S OWL class that captured this?

No prob.

On Wed Mar 26 12:08:03 2008 Phil Archer said:

> More on this...
> 
> I've been playing with the regular expressions that one would need to 
> write to capture the meaning of the string elements. To do this I've set 
> up a little tool at [1] that allows you to put in a Reg ex and a string 
> and see if the two match.
> 
> Let's start with includehosts. The Reg Ex needs to be pretty specific so 
> 
> [...]
> 
> I ended up with
> 
> ^\w+://[^\:\/\?\#\@ ]+\/foo
> 
> And so on.
> 
> The question is... is mapping each IRI constraint to a regular 
> expression like this actually better than just using the element names? 
> What's the benefit Stasinos?

Trying to answer these and also react to other stuff happening around
the POWDER mailing list, I caught myself having a hard time remembering
all the various stuff I have proposed or said. So I sat down and put
everything together in a single text. It comprises the multi-iriset
idea, the it's-all-regexps-anyway idea, a new suggestion about
flat-string tags, and a revisit of the original resource sets.

There is some boring semantics stuff around the middle, involving two
alternative ways of substantiating the resource to IRI string leap.
Alternative 1 is more directly based on jjc's suggestion, but extends it
to handle regexps, port ranges, and IP ranges, as opposed to the original
hasValue restriction over string literals. Alternative 2 is an attempt
to restrict the added expressivity to exactly what is needed without
opening Pandora's box of expressivity. There's some stuff about XML types
that I had no idea about and had to read up on today. Kevin, please have
a look and let me know if it's all sound.

After the boring semantics stuff there are the IRI Sets and Extensions
sections.

s



Intro
=====

POWDER/XML documents receive formal semantics through a GRDDL
transform, associated with the POWDER namespace, that allows the XML
data to be rendered and processed as OWL/RDF. Or, rather, POWDER-S, a
fragment of OWL/RDF extended in a way that allows referring to and
operating upon the string representation of a resource.

The POWDER/XML format specifies a number of elements denoting
attribution, validity time, and other issues relating to the level of
trust assigned to a POWDER document. These fall through the transform
and are not meant to be interpreted in OWL/RDF; they are only
meaningful when used by POWDER tools that use them as input to an
extra-logical procedure which MAY use this data to decide whether the
POWDER document _as a whole_ should be taken into account or
discarded. We shall not deal with these elements any further, and
proceed under the assumption that our document has passed all relevant
tests.

Unqualified names should be assumed to be in the wdr: namespace.


DR Semantics
============

POWDER documents are used to describe sets of resources using
description vocabularies defined in RDF or plain string literals (tags).
POWDER/XML documents have <dr/> elements, each assigning all and every
member of a set of descriptors to a set of resources.

As an example, consider:

<dr>
 <iriset>...</iriset>
 <descriptorset>
   <voc:colour ref="http://rgb.org/colours.rdf#red"/>
   <voc:shape>square</voc:shape>
   <tag>red</tag>
   <tag>light red</tag>
   <taglist>light red</taglist>
 </descriptorset>
</dr>

where <iriset/> specifies a set of resources in a way that will be
dealt with later, and voc: is an arbitrary RDF vocabulary.

The <voc:colour/> element specifies that the <voc:colour/> relation
holds between all resources specified by <iriset/> and the
http://rgb.org/colours.rdf#red resource.

The content of <voc:shape/> is interpreted as a string literal. The
<voc:shape/> element specifies that all resources in <iriset/>
have the value "square" for the <voc:shape/> data property.

<tag/> is a string property defined by POWDER. Its content is a
single string literal, possibly including spaces.
<taglist/> is a string property defined by POWDER. Its content is a
space-separated list of string literals.

The overall description of the resources in <iriset/> is the union of
the descriptions in the <descriptorset/>. In our example:
 a voc:colour relation to http://rgb.org/colours.rdf#red
AND
 a voc:shape "square"
AND
 the tags "red", "light", and "light red"
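
As a rough illustration of how <tag/> and <taglist/> combine, here is a
minimal Python sketch. It is not part of any POWDER tool: the function
name collect_tags is made up, and the voc: elements are left out so that
the snippet parses without namespace declarations.

```python
import xml.etree.ElementTree as ET

def collect_tags(descriptorset_xml):
    """Return the set of tag literals asserted by a <descriptorset/>."""
    root = ET.fromstring(descriptorset_xml)
    tags = set()
    for element in root:
        text = (element.text or "").strip()
        if element.tag == "tag":
            # <tag/> holds one literal, spaces included.
            tags.add(text)
        elif element.tag == "taglist":
            # <taglist/> holds several literals separated by white space.
            tags.update(text.split())
    return tags

# Only the tag-related part of the example above (hypothetical cut-down input).
EXAMPLE = """
<descriptorset>
  <tag>red</tag>
  <tag>light red</tag>
  <taglist>light red</taglist>
</descriptorset>
"""

print(sorted(collect_tags(EXAMPLE)))
```

The duplicate "red" contributed by <taglist/> collapses because the
result is a set, which matches the three tags listed above.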

We formally interpret the above as follows: there is an OWL class
containing all resources that share all of these properties, and there
is an OWL class of all resources denoted by <iriset/>, and the latter
is a subset of the former. In OWL/RDF we say:

<RDF>

  <owl:Class rdf:ID="resourceset_1">
    all resources specified by <iriset>...</iriset>
  </owl:Class>

  <owl:Class rdf:ID="description_1">
     <owl:intersectionOf rdf:parseType="Collection">
       <owl:Restriction>
         <owl:onProperty rdf:resource="voc:colour"/>
         <owl:hasValue rdf:resource="http://rgb.org/colours.rdf#red"/>
       </owl:Restriction>
       <owl:Restriction>
         <owl:onProperty rdf:resource="voc:shape"/>
         <owl:hasValue>square</owl:hasValue>
       </owl:Restriction>
       <owl:Restriction>
         <owl:onProperty rdf:resource="wdr:tag"/>
         <owl:hasValue>red</owl:hasValue>
       </owl:Restriction>
       <owl:Restriction>
         <owl:onProperty rdf:resource="wdr:tag"/>
         <owl:hasValue>light</owl:hasValue>
       </owl:Restriction>
       <owl:Restriction>
         <owl:onProperty rdf:resource="wdr:tag"/>
         <owl:hasValue>light red</owl:hasValue>
       </owl:Restriction>
     </owl:intersectionOf>
  </owl:Class>
  
  <owl:Class rdf:about="#resourceset_1">
    <rdfs:subClassOf rdf:resource="#description_1"/>
  </owl:Class>

</RDF>

It is possible to have more than one <iriset/> element, in which case
a resource receives all of the descriptions by belonging to any
one of the corresponding resource sets. For example:

<dr>
 <iriset>.1.</iriset>
 <iriset>.2.</iriset>
 <descriptorset>
   <voc:colour ref="http://rgb.org/colours.rdf#red"/>
   <taglist>light red</taglist>
 </descriptorset>
</dr>

receives the following semantics:
 
<RDF>

  <owl:Class rdf:ID="resourceset_1">
    all resources specified by <iriset>.1.</iriset>
  </owl:Class>

  <owl:Class rdf:ID="resourceset_2">
    all resources specified by <iriset>.2.</iriset>
  </owl:Class>

  <owl:Class rdf:ID="description_1">
     <owl:intersectionOf rdf:parseType="Collection">
       <owl:Restriction>
         <owl:onProperty rdf:resource="voc:colour"/>
         <owl:hasValue rdf:resource="http://rgb.org/colours.rdf#red"/>
       </owl:Restriction>
       <owl:Restriction>
         <owl:onProperty rdf:resource="wdr:tag"/>
         <owl:hasValue>red</owl:hasValue>
       </owl:Restriction>
       <owl:Restriction>
         <owl:onProperty rdf:resource="wdr:tag"/>
         <owl:hasValue>light</owl:hasValue>
       </owl:Restriction>
     </owl:intersectionOf>
  </owl:Class>
  
  <owl:Class>
    <owl:unionOf rdf:parseType="Collection">
      <owl:Class rdf:about="#resourceset_1"/>
      <owl:Class rdf:about="#resourceset_2"/>
    </owl:unionOf>
    <rdfs:subClassOf rdf:resource="#description_1"/>
  </owl:Class>

</RDF>

The ordering of irisets is not important. A POWDER/XML implementation
is free to choose any traversal policy for treating multiple <iriset/>
elements in a DR (in the order listed, shortest irisets first, and so
on), as long as all irisets are tried before deciding that a resource
is outside the scope of the DR.

DR authors may use the order of the irisets to suggest an efficient
scope evaluation strategy, by putting the irisets with the widest
coverage first, so that an implementation that chooses to follow the
suggested evaluation order is more likely to terminate the evaluation
after fewer checks.


POWDER Semantics
================ 

A POWDER document may have any number of <dr> elements, all of which
are simultaneously asserted and ordering is not important. So, for
example:

<powder>
  <dr>
   <iriset>.1.</iriset>
   <descriptorset>
     <voc:shape>square</voc:shape>
   </descriptorset>
  </dr>
  <dr>
   <iriset>.2.</iriset>
   <descriptorset>
     <voc:colour ref="http://rgb.org/colours.rdf#red"/>
   </descriptorset>
  </dr>
</powder>

receives the following semantics:

<RDF>
  <owl:Class rdf:ID="resourceset_1">
    all resources specified by <iriset>.1.</iriset>
  </owl:Class>

  <owl:Class rdf:ID="resourceset_2">
    all resources specified by <iriset>.2.</iriset>
  </owl:Class>

  <owl:Class rdf:ID="description_1">
     <owl:intersectionOf rdf:parseType="Collection">
       <owl:Restriction>
         <owl:onProperty rdf:resource="voc:shape"/>
         <owl:hasValue>square</owl:hasValue>
       </owl:Restriction>
     </owl:intersectionOf>
  </owl:Class>
  
  <owl:Class rdf:ID="description_2">
     <owl:intersectionOf rdf:parseType="Collection">
       <owl:Restriction>
         <owl:onProperty rdf:resource="voc:colour"/>
         <owl:hasValue rdf:resource="http://rgb.org/colours.rdf#red"/>
       </owl:Restriction>
     </owl:intersectionOf>
  </owl:Class>

  <owl:Class rdf:about="#resourceset_1">
    <rdfs:subClassOf rdf:resource="#description_1"/>
  </owl:Class>

  <owl:Class rdf:about="#resourceset_2">
    <rdfs:subClassOf rdf:resource="#description_2"/>
  </owl:Class>
</RDF>

The <owl:intersectionOf/> of a singleton collection is the latter's
single element anyway, but it is better to keep the
<owl:intersectionOf/> element even though it is redundant, in order to
keep the transform simple and not require the extra check.

Note that resourceset_1 and resourceset_2 are not necessarily
disjoint, so that some resources may be both red AND square.
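
A minimal Python sketch of this "everything is asserted at once"
reading follows. The scope tests are hypothetical stand-ins for
<iriset>.1.</iriset> and <iriset>.2.</iriset>, and the function name
describe is made up for the illustration.

```python
def describe(iri, drs):
    """Union of the descriptors of every DR whose scope contains iri."""
    description = set()
    for in_scope, descriptors in drs:
        if in_scope(iri):
            description |= descriptors
    return description

# Hypothetical scopes; real ones would come from <iriset/> elements.
drs = [
    (lambda iri: "example.org" in iri,
     {("voc:shape", "square")}),
    (lambda iri: iri.endswith(".png"),
     {("voc:colour", "http://rgb.org/colours.rdf#red")}),
]

# This IRI is in both scopes, so it comes out both square and red.
print(describe("http://example.org/logo.png", drs))
```

Reversing the list of DRs gives the same result, which is the sense in
which ordering is not important here.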

A POWDER document may have an <ol/> element, which is an ordered list of
<dr> elements and receives first-match semantics. <ol/> elements
are meant to be used to express exceptions to more general rules. So,
for example:

<powder>
  <ol>
    <dr>
     <iriset>.1.</iriset>
     <descriptorset>
       <voc:shape>square</voc:shape>
     </descriptorset>
    </dr>
    <dr>
     <iriset>.2.</iriset>
     <descriptorset>
       <voc:shape>round</voc:shape>
     </descriptorset>
    </dr>
    <dr>
     <iriset>.3.</iriset>
     <descriptorset>
       <voc:shape>triangle</voc:shape>
     </descriptorset>
    </dr>
  </ol>
</powder>

receives the following formal semantics, where a resource in
resourceset_1 is not given description_2 or description_3 by this
list, and a resource in resourceset_2 is not given description_3:

<RDF>
  <owl:Class rdf:ID="resourceset_1">
    all resources specified by <iriset>.1.</iriset>
  </owl:Class>

  <owl:Class rdf:ID="resourceset_2">
    all resources specified by <iriset>.2.</iriset>
  </owl:Class>

  <owl:Class rdf:ID="resourceset_3">
    all resources specified by <iriset>.3.</iriset>
  </owl:Class>

  <owl:Class rdf:ID="description_1">
     <owl:intersectionOf rdf:parseType="Collection">
       <owl:Restriction>
         <owl:onProperty rdf:resource="voc:shape"/>
         <owl:hasValue>square</owl:hasValue>
       </owl:Restriction>
     </owl:intersectionOf>
  </owl:Class>
  
  <owl:Class rdf:ID="description_2">
     <owl:intersectionOf rdf:parseType="Collection">
       <owl:Restriction>
         <owl:onProperty rdf:resource="voc:shape"/>
         <owl:hasValue>round</owl:hasValue>
       </owl:Restriction>
     </owl:intersectionOf>
  </owl:Class>

  <owl:Class rdf:ID="description_3">
     <owl:intersectionOf rdf:parseType="Collection">
       <owl:Restriction>
         <owl:onProperty rdf:resource="voc:shape"/>
         <owl:hasValue>triangle</owl:hasValue>
       </owl:Restriction>
     </owl:intersectionOf>
  </owl:Class>

  <owl:Class rdf:about="#resourceset_1">
    <rdfs:subClassOf rdf:resource="#description_1"/>
  </owl:Class>

  <owl:Class>
    <owl:intersectionOf rdf:parseType="Collection">
      <owl:Class rdf:about="#resourceset_2"/>
      <owl:Class>
        <owl:complementOf rdf:resource="#resourceset_1"/>
      </owl:Class>
    </owl:intersectionOf>
    <rdfs:subClassOf rdf:resource="#description_2"/>
  </owl:Class>

  <owl:Class>
    <owl:intersectionOf rdf:parseType="Collection">
      <owl:Class rdf:about="#resourceset_3"/>
      <owl:Class>
        <owl:complementOf rdf:resource="#resourceset_2"/>
      </owl:Class>
      <owl:Class>
        <owl:complementOf rdf:resource="#resourceset_1"/>
      </owl:Class>
    </owl:intersectionOf>
    <rdfs:subClassOf rdf:resource="#description_3"/>
  </owl:Class>
</RDF>
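
Operationally, the first-match reading of <ol/> could be sketched in
Python as below. The three scope tests are hypothetical stand-ins for
<iriset>.1.</iriset> to <iriset>.3.</iriset>, chosen so that the first
is an exception to the second and the second to the third.

```python
def describe_ordered(iri, ordered_drs):
    """Descriptors of the first DR in an <ol/> whose scope contains iri."""
    for in_scope, descriptors in ordered_drs:
        if in_scope(iri):
            return descriptors
    # No DR in the list covers this IRI.
    return set()

# Hypothetical scopes, most specific first.
ol = [
    (lambda iri: iri.endswith("/logo.png"), {("voc:shape", "square")}),
    (lambda iri: iri.endswith(".png"), {("voc:shape", "round")}),
    (lambda iri: True, {("voc:shape", "triangle")}),
]

print(describe_ordered("http://example.org/logo.png", ol))
print(describe_ordered("http://example.org/photo.png", ol))
print(describe_ordered("http://example.org/index.html", ol))
```

The early return plays the role of the complementOf classes in the
OWL rendering: once an earlier resource set matches, later
descriptions are not reached.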


IRISet Semantics
================


The last missing bit of the transformation now is the one that builds
the <owl:Class rdf:ID="resourceset_X"/> descriptions from <iriset/>
elements.

<iriset/> elements contain one or more elements, each
representing a range of values for IRIs. An IRI is in the <iriset/> if
it is covered by ALL of the elements in <iriset/>. The following six
range specifications are supported:

 <includepattern/>,<excludepattern/>,
 <includeports/>,<excludeports/>,
 <includeCIDRranges/>,<excludeCIDRranges/>

Patterns are a single <xsd:pattern/> element, as defined in the XML
Schema [1]. <includepattern/> can be applied to any IRI, regardless of
whether it is resolvable or not. Ports are a space-separated list of
ports or port ranges. CIDR ranges are specified as a space-separated
list of CIDR IP range specifications. Port and CIDR range elements can
be applied to URLs (is there an IRL acronym?) only, and are
meaningless for other kinds of IRIs.

For example:

<iriset>
  <includepattern>
    <xsd:pattern value="^http://([\w\.]+\.)?example\.org(:(\d)+)?/" />
  </includepattern>
  <includeports>80 8080-8100</includeports>
  <excludeports>8085 8090-8095</excludeports>
</iriset>

specifies all resources on http://example.org and any subdomain
thereof, fetched from ports 80, 8080-8084, 8086-8089, or 8096-8100.
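
The port arithmetic in this example can be checked with a few lines of
Python. This is only a sketch of one way to read the space-separated
port lists; the function names parse_ports and allowed_ports are made
up.

```python
def parse_ports(spec):
    """Expand a list such as "80 8080-8100" into a set of port numbers."""
    ports = set()
    for item in spec.split():
        low, _, high = item.partition("-")
        # A single port such as "80" is the range 80-80.
        ports.update(range(int(low), int(high or low) + 1))
    return ports

def allowed_ports(include_spec, exclude_spec):
    """Ports covered by <includeports/> and not by <excludeports/>."""
    return parse_ports(include_spec) - parse_ports(exclude_spec)

allowed = allowed_ports("80 8080-8100", "8085 8090-8095")
print(sorted(allowed))
```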

It might sometimes be easier to concentrate on parts of an IRI and
specify constraints as a series of regexps, all of which must match.
For instance, the IRISet:

<iriset>
  <includepattern>
    <xsd:pattern value="^http://([\w\.]+\.)?example\.org(:(\d)+)?/" />
  </includepattern>
  <includepattern>
    <xsd:pattern value="^[^?]+\?(.*&amp;)?s=football(&amp;|$)" />
  </includepattern>
  <includepattern>
    <xsd:pattern value="^[^?]+\?(.*&amp;)?c=gr(&amp;|$)" />
  </includepattern>
  <includepattern>
    <xsd:pattern value="^[^?]+\?(.*&amp;)?l=first(&amp;|$)" />
  </includepattern>
</iriset>

is a way of requesting three query conjuncts in any order, and is much
shorter and clearer than having to list all possible permutations.
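
To see the "all patterns must match" reading at work, here is a small
Python sketch. Python's re module is used as a stand-in for
<xsd:pattern/> matching, which is an assumption (the two regexp
dialects differ in details), and the end-of-parameter test is written
as (&|$).

```python
import re

# Perl-style renderings of the four patterns in the IRISet above.
PATTERNS = [
    r"^http://([\w.]+\.)?example\.org(:\d+)?/",
    r"^[^?]+\?(.*&)?s=football(&|$)",
    r"^[^?]+\?(.*&)?c=gr(&|$)",
    r"^[^?]+\?(.*&)?l=first(&|$)",
]

def in_iriset(iri, patterns=PATTERNS):
    """True if the IRI matches every include pattern."""
    return all(re.search(pattern, iri) for pattern in patterns)

# The three query conjuncts may come in any order.
print(in_iriset("http://www.example.org/news?l=first&s=football&c=gr"))
print(in_iriset("http://example.org:8080/x?c=gr&l=first&s=football"))
# Missing l=first, so this one is out.
print(in_iriset("http://www.example.org/news?s=football&c=gr"))
```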

The <iriset/> mechanism allows a DR to express any grouping of
resources whatsoever, no matter how complex:

(A) each include* and exclude* element expresses an atomic
    proposition. For all X, if includeX exists, excludeX also exists
    and vice versa; furthermore includeX and excludeX are mutually
    exclusive. Hence, one can negate all atomic propositions, although
    not complex propositions.

(B) An <iriset/> may contain multiple include* and exclude* tags, and
    all must hold for the iriset to hold. Hence one can express the
    conjunction of any set of atomic propositions and negations of
    atomic propositions.

(C) A DR may contain multiple <iriset/> elements, and if any of them
    holds, then the DR holds. Hence one can express the disjunction of
    conjunctions of sets of atomic propositions and negations of
    atomic propositions.

The three observations above allow the expression of any proposition
in Disjunctive Normal Form. Since arbitrarily complex propositions can
be brought into DNF, they allow the expression of any proposition.
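
The DNF reading of (A), (B) and (C) can be made concrete with a short
Python sketch. Everything here is illustrative: the helper names are
made up, Python's re stands in for pattern matching, and only
include/exclude patterns are modelled as atomic propositions.

```python
import re

def literal(pattern, negated=False):
    """(A) An include pattern, or an exclude pattern when negated=True."""
    compiled = re.compile(pattern)
    return lambda iri: (compiled.search(iri) is None) == negated

def in_iriset(iri, literals):
    """(B) An iriset holds when all of its literals hold."""
    return all(lit(iri) for lit in literals)

def in_scope(iri, irisets):
    """(C) A DR applies when any of its irisets holds."""
    return any(in_iriset(iri, literals) for literals in irisets)

# (on example.org AND NOT under /private/) OR (on example.net)
irisets = [
    [literal(r"^http://example\.org/"),
     literal(r"^http://example\.org/private/", negated=True)],
    [literal(r"^http://example\.net/")],
]

print(in_scope("http://example.org/a", irisets))
print(in_scope("http://example.org/private/x", irisets))
print(in_scope("http://example.net/private/x", irisets))
```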


ALTERNATIVE 1

Providing OWL/RDF semantics for <iriset/> elements is not directly
possible, since RDF does not provide any means for accessing or
manipulating the string representation of an IRI. We extend OWL/RDF
with a built-in hasIRI data property as follows:

hasIRI rdf:type owl:DatatypeProperty .
hasIRI rdf:type rdf:Property .
hasIRI rdfs:domain owl:Thing .
hasIRI rdfs:range xsd:string .

and the further stipulation that
 R owl:hasIRI s .
iff the string representation of resource R is s.

It is now possible to provide semantics to <iriset/> by deriving the
XML datatype that only includes the strings specified by
pattern p [1]. So now:

<includepattern>
  <xsd:pattern value="p1"/>
</includepattern>

<excludepattern>
  <xsd:pattern value="p2"/>
</excludepattern>

specify these classes of resources:

<xsd:simpleType name="iritype_1">
  <xsd:restriction base="string">
    <xsd:pattern value="p1" />
  </xsd:restriction>
</xsd:simpleType>

<xsd:simpleType name="iritype_2">
  <xsd:restriction base="string">
    <xsd:pattern value="p2" />
  </xsd:restriction>
</xsd:simpleType>

<owl:Class>
  <owl:Restriction>
    <owl:onProperty rdf:resource="owl:hasIRI"/>
    <owl:hasValue rdf:datatype="&xsd;iritype_1" />
  </owl:Restriction>
</owl:Class>

<owl:Class>
  <owl:complementOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="owl:hasIRI"/>
      <owl:hasValue rdf:datatype="&xsd;iritype_2" />
    </owl:Restriction>
  </owl:complementOf>
</owl:Class>

which means: "Here are the definitions of xsd:iritype_1 and
xsd:iritype_2, sub-types of xsd:string. I don't know the exact value to
put in hasValue, but it must be of type xsd:iritype_1 in the first class
and must not be of type xsd:iritype_2 in the second."

Port ranges are treated similarly, by defining the relevant hasPort
property, ranging over an appropriate XML type. The xsd:pattern
restriction is not useful here, but xsd:integer supports the
xsd:maxInclusive and xsd:minInclusive numerical restrictions. So:

<includeports>80 8080-8100</includeports>

means:

<xsd:simpleType name="iritype_3">
  <xsd:restriction base="integer">
    <xsd:minInclusive value="80" />
    <xsd:maxInclusive value="80" />
  </xsd:restriction>
</xsd:simpleType>

<xsd:simpleType name="iritype_4">
  <xsd:restriction base="integer">
    <xsd:minInclusive value="8080" />
    <xsd:maxInclusive value="8100" />
  </xsd:restriction>
</xsd:simpleType>

<owl:Class>
  <owl:unionOf rdf:parseType="Collection">
    <owl:Restriction>
      <owl:onProperty rdf:resource="owl:hasPort"/>
      <owl:hasValue rdf:datatype="&xsd;iritype_3" />
    </owl:Restriction>
    <owl:Restriction>
      <owl:onProperty rdf:resource="owl:hasPort"/>
      <owl:hasValue rdf:datatype="&xsd;iritype_4" />
    </owl:Restriction>
  </owl:unionOf>
</owl:Class>

CIDR ranges are trickier, as they require bit-wise calculations.
Assume a hasIP property, as before, ranging over a complex
XML type [2] of 4 bytes.

<includeCIDRranges>x.y.z.w/r</includeCIDRranges>

<xs:complexType name="iritype_5">
  <xs:sequence>
    <xs:element>
      <xsd:enumeration base="byte">x</xsd:enumeration>
    </xs:element>
    <xs:element>
      <xsd:enumeration base="byte">y</xsd:enumeration>
    </xs:element>
    <xs:element>
      <xsd:enumeration base="byte">z</xsd:enumeration>
    </xs:element>
    <xs:element>
      <xsd:restriction base="byte">
        HARD, TO BE WORKED OUT.
        OTHERWISE JUST ENUMERATE (OUCH!).
      </xsd:restriction>
    </xs:element>
  </xs:sequence>
</xs:complexType>

<owl:Class>
  <owl:Restriction>
    <owl:onProperty rdf:resource="owl:hasIP"/>
    <owl:hasValue rdf:datatype="&xsd;iritype_5" />
  </owl:Restriction>
</owl:Class>

If no /r is given, the fourth byte of the IP is simply given as
a singleton enumeration, just like the first three.
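
For comparison, this is what the bit-wise calculation looks like
outside XML Schema. The Python sketch below uses the standard
ipaddress module and then spells the mask arithmetic out by hand for
IPv4. The function names are made up, and treating a missing /r as /32
is an assumption that matches the singleton reading above.

```python
import ipaddress

def in_cidr_ranges(ip, spec):
    """True if ip falls in any range of a space-separated CIDR list."""
    address = ipaddress.ip_address(ip)
    return any(
        address in ipaddress.ip_network(cidr, strict=False)
        for cidr in spec.split()
    )

def in_cidr_bitwise(ip, cidr):
    """The same test for one IPv4 range, with the mask made explicit."""
    base, _, prefix = cidr.partition("/")
    bits = int(prefix) if prefix else 32
    # Keep the top `bits` bits, clear the rest.
    mask = (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF
    ip_value = int(ipaddress.IPv4Address(ip))
    base_value = int(ipaddress.IPv4Address(base))
    return (ip_value & mask) == (base_value & mask)

print(in_cidr_ranges("192.0.2.130", "192.0.2.128/25"))
print(in_cidr_bitwise("192.0.2.1", "192.0.2.128/25"))
```

The mask is what makes the last byte hard to express as an XML type:
a /25 splits the fourth byte in the middle of its value range.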

For this to work, OWL needs to be extended to allow user-defined
types, which it currently does not [3].


ALTERNATIVE 2

Providing OWL/RDF semantics for <iriset/> elements is not directly
possible, since RDF does not provide any means for accessing or
manipulating the string representation of an IRI. We extend OWL/RDF
with a hasIRIFrom restriction as follows:

We assert the existence of the class of the various IRI classes:
  rdf:IRIClass rdf:type rdfs:Datatype .

We assert the existence of a new class of restriction nodes:
  owl:hasIRIFrom rdf:type rdfs:Class .

The members of this class are OWL restrictions, with
the following abstract OWL syntax:
  restriction(ID, hasIRIFrom(xs:iritype))
where ID is a node ID and xs:iritype is the ID of a user-defined
type, as above.

If T() is the mapping from node IDs to nodes,
the semantics of such a restriction is that the datatype is also an
rdfs:Class, with the constraint that resources in this class have an
IRI the string representation of which is in the scope of xs:iritype.
It is then straightforward to provide the semantics of the restriction:

  T(xs:iritype) rdf:type rdfs:Datatype .
  T(xs:iritype) rdfs:subClassOf rdf:IRIClass .
  T(xs:iritype) rdf:type rdfs:Class .
  _:x rdf:type owl:Restriction .
  _:x rdf:type owl:Class .
  _:x rdf:type T(xs:iritype) .

We can now say:

<owl:Class>
  <owl:Restriction>
    <owl:hasIRIFrom>
      <xsd:simpleType>
        <xsd:restriction base="string">
          <xsd:pattern value="p" />
        </xsd:restriction>
      </xsd:simpleType>
    </owl:hasIRIFrom>
  </owl:Restriction>
</owl:Class>

to mean "the class of all things that have an IRI that has a string
representation that matches p".

In Description Logic terms, we have allowed defining concepts based on
restrictions on the form of the string representations of abstract
instances, but restricted the usage of such concepts in universal
quantification constructs.


COMPARISON

I will have to look into this more closely, but my first impression is
that ALT 2 provides the necessary expressivity to enable resource
grouping, but restricts the extension so that it does not allow any
other kind of reference to IRI strings. The logic remains agnostic as
to the internal representation of resources, except for their appearing
as members of various IRI Classes for no (logically) apparent reason.

ALT 1, on the other hand, creates a hasIRI property which it then
exposes to the concrete domain of the logic, permitting the full
expressivity of the logic to operate on it.


IRISet Extensions
=================

In Sect "IRISet Semantics" above, a vocabulary of 6 tags was specified for
defining sets of resources through their IRIs. Except for the
numerical port and IP restrictions over URLs, the only operation
supported over generic IRIs is regular expression matching.

Creators of POWDER documents may extend the vocabulary used in
specifying IRI Sets, by defining new <iriset/> elements. All such
extensions to the POWDER vocabulary MUST be defined by means of GRDDL
transformations [GRDDL] to terms of the basic POWDER vocabulary in the
wdr: namespace.

Extensions do not need to, but are well advised to, define pairs
of complementary vocabulary items (includeX and excludeX) for the
reasons explained above.

Developers of POWDER tools MAY directly implement extensions they know
about, but MUST include a mechanism for retrieving and applying the
GRDDL transformations to extensions they do not know about.


The URLSet Extension
====================

POWDER's basic use cases involve information resources available on
the Web, identified by URLs containing host names, directory paths, IP
addresses, port numbers, and so on. POWDER-WG provides the URLSet
extension to IRISet, by defining the following vocabulary items under
the wdrurl namespace:

<wdrurl:includeschemes/>        <wdrurl:excludeschemes/>
<wdrurl:includehosts/>          <wdrurl:excludehosts/>
<wdrurl:includeexactpaths/>     <wdrurl:excludeexactpaths/>
<wdrurl:includepathcontains/>   <wdrurl:excludepathcontains/>
<wdrurl:includepathstartswith/> <wdrurl:excludepathstartswith/>
<wdrurl:includepathendswith/>   <wdrurl:excludepathendswith/>
<wdrurl:includequerycontains/>  <wdrurl:excludequerycontains/>
<wdrurl:includeexactqueries/>   <wdrurl:excludeexactqueries/>

pathcontains and querycontains may appear any number of times within
an IRI set definition, but the rest may appear up to once.

These receive semantics in terms of the POWDER IRISet vocabulary as
follows:

<wdrurl:includeschemes>sch1 sch2</wdrurl:includeschemes>

means:

<includepattern>
    <xsd:pattern value="^((sch1)|(sch2))://" />
</includepattern>

And

<wdrurl:includehosts>host1 host2</wdrurl:includehosts>

means:

<includepattern>
  <xsd:pattern value="^[^:]+://([\w\.]+\.)?((host1)|(host2))[:\?/]" />
</includepattern>

And so on. So that the URL Set:

<iriset>
  <wdrurl:includeschemes>http</wdrurl:includeschemes>
  <wdrurl:includehosts>example.org example.net</wdrurl:includehosts>
  <wdrurl:includequerycontains>s=football</wdrurl:includequerycontains>
  <wdrurl:includequerycontains>c=gr</wdrurl:includequerycontains>
  <wdrurl:includequerycontains>l=first</wdrurl:includequerycontains>
</iriset>

translates to this much more verbose, vanilla POWDER/XML IRI Set:

<iriset>
  <includepattern>
    <xsd:pattern value="^http://" />
  </includepattern>
  <includepattern>
    <xsd:pattern value="^[^:]+://([\w\.]+\.)?((example\.org)|(example\.net))[:\?/]" />
  </includepattern>
  <includepattern>
    <xsd:pattern value="^[^?]+\?(.*&amp;)?s=football(&amp;|$)" />
  </includepattern>
  <includepattern>
    <xsd:pattern value="^[^?]+\?(.*&amp;)?c=gr(&amp;|$)" />
  </includepattern>
  <includepattern>
    <xsd:pattern value="^[^?]+\?(.*&amp;)?l=first(&amp;|$)" />
  </includepattern>
</iriset>
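
The translation itself would be a GRDDL transformation, but its logic
can be sketched in Python. The function names below are made up, the
generated patterns are Perl-style renderings with each list of
alternatives grouped as a whole, and Python's re is only a stand-in
for <xsd:pattern/> matching.

```python
import re

def schemes_to_pattern(value):
    """Pattern for the content of <wdrurl:includeschemes/>."""
    alternatives = "|".join(re.escape(s) for s in value.split())
    return "^(" + alternatives + ")://"

def hosts_to_pattern(value):
    """Pattern for the content of <wdrurl:includehosts/>."""
    alternatives = "|".join(re.escape(h) for h in value.split())
    return r"^[^:]+://([\w.]+\.)?(" + alternatives + r")[:?/]"

def querycontains_to_pattern(value):
    """Pattern for the content of one <wdrurl:includequerycontains/>."""
    return r"^[^?]+\?(.*&)?" + re.escape(value) + "(&|$)"

patterns = [
    schemes_to_pattern("http"),
    hosts_to_pattern("example.org example.net"),
    querycontains_to_pattern("s=football"),
    querycontains_to_pattern("c=gr"),
    querycontains_to_pattern("l=first"),
]

def in_urlset(iri):
    return all(re.search(pattern, iri) for pattern in patterns)

print(in_urlset("http://www.example.net/scores?c=gr&s=football&l=first"))
print(in_urlset("https://www.example.net/scores?c=gr&s=football&l=first"))
```

Escaping the host names with re.escape keeps the dots in example.org
from matching arbitrary characters.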


The WAF Extension
=================

Q to group: does POWDER also need to provide this transformation?
Or have the WAF people already written it?

The Enabling Read Access for Web Resources WG has defined a Unix
shell-like wildcard mechanism.

<waf:includeiripattern>*.example.org</waf:includeiripattern>

means:

<wdr:includepattern>
    <xsd:pattern value="http://.*\.example\.org(/.*)?" />
</wdr:includepattern>
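
A guess at what such a transformation would do, sketched in Python. It
is extrapolated from the single example above and not taken from the
WAF documents: the function name waf_to_pattern is made up, and the
fixed http:// prefix and the optional path are assumptions.

```python
import re

def waf_to_pattern(wildcard):
    """Turn a shell-like host wildcard into a regular expression."""
    # Escape the literal parts and let each "*" match any string.
    host = ".*".join(re.escape(part) for part in wildcard.split("*"))
    return "http://" + host + "(/.*)?"

pattern = waf_to_pattern("*.example.org")
print(pattern)

# fullmatch, because this pattern carries no explicit anchors.
print(bool(re.fullmatch(pattern, "http://www.example.org/a")))
print(bool(re.fullmatch(pattern, "http://www.example.com/")))
```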


Multiple Layers of Extensions
=============================

It might sometimes be useful to also build upon already defined
extensions. For example, some content providers serve dynamic content
stored in a database, so that IRIs express queries to the database.
IRIs of this kind have a certain structure, but this structure is
neither obvious nor easily human-interpreted. Furthermore, conventional
grouping mechanisms cannot be used to group resources, as the site
structure does not match any directory hierarchy.

As an example, consider sport.example.com, a sports news site,
where IRIs look like the one shown below. The adopted
scheme is systematic so that sport=2&countryID=16 provides a front
page with news about Greek basketball and links to various Greek
basketball leagues, sport=3&countryID=16 a front page about Greek
volleyball, etc. E.g.:
  http://sport.example.com/matches.asp?sport=1&countryID=16&champID=2

A POWDER document providing metadata about this web site would have to
use regular expression matching with explicit reference to the
numerical values in the country and sport fields of the query. This
process is error-prone, and requires extensive changes if the
underlying database schema is modified or extended.

As an alternative, the site developer may provide a POWDER vocabulary
extension that abstracts away from the database schema to allow
reference to sports and countries. POWDER document authors can then
use the properties in this extension to create POWDER documents
that remain valid even if the site schema is modified, as long as the
site developer updates the relevant transformations.

So a POWDER/XML document might look like this:

<wdr:iriset>
  <wdrurl:includeschemes>http</wdrurl:includeschemes>
  <wdrurl:includehosts>sport.example.com</wdrurl:includehosts>
  <sport:countries>Greece</sport:countries>
  <sport:sports>Football Basketball</sport:sports>
</wdr:iriset>

A POWDER/XML tool specifically built for sport.example.com or another
site following the same query patterns will immediately know how to
handle this information. Other POWDER tools will apply the GRDDL transform
associated with the sport: namespace to get the following translation:

<wdr:iriset>
  <wdrurl:includeschemes>http</wdrurl:includeschemes>
  <wdrurl:includehosts>sport.example.com</wdrurl:includehosts>
  <wdrurl:includequerycontains>countryID=16</wdrurl:includequerycontains>
  <wdrurl:includequerycontains>sport=1 sport=2</wdrurl:includequerycontains>
</wdr:iriset>

A web-oriented POWDER/XML tool will immediately know what to do with 
wdrurl: vocabulary items. Other POWDER tools will apply the GRDDL transform
associated with the wdrurl: namespace to get the following translation:

<iriset>
  <includepattern>
    <xsd:pattern value="^http://" />
  </includepattern>
  <includepattern>
    <xsd:pattern value="^[^:]+://([\w\.]+\.)?(sport\.example\.com)[:\?/]" />
  </includepattern>
  <includepattern>
    <xsd:pattern value="^[^?]+\?(.*&amp;)?countryID=16(&amp;|$)" />
  </includepattern>
  <includepattern>
    <xsd:pattern value="^[^?]+\?(.*&amp;)?((sport=1)|(sport=2))(&amp;|$)" />
  </includepattern>
</iriset>

Finally, an even more generic RDF/OWL tool will apply the transform
associated with the wdr: namespace to get the even more verbose
RDF/OWL translation, as described above.
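
The first of these translation layers amounts to a lookup from names to
database IDs. In practice it would be a GRDDL transformation maintained
by the site developer; the Python sketch below only illustrates the
idea. The function name is made up, and the tables contain just the
IDs mentioned in this example.

```python
# IDs taken from the sport.example.com example; a real table would be
# maintained alongside the site's database schema.
COUNTRY_IDS = {"Greece": 16}
SPORT_IDS = {"Football": 1, "Basketball": 2, "Volleyball": 3}

def translate_sport_items(countries, sports):
    """Return (element name, content) pairs in the wdrurl: vocabulary."""
    country_values = " ".join(
        "countryID=%d" % COUNTRY_IDS[name] for name in countries.split()
    )
    sport_values = " ".join(
        "sport=%d" % SPORT_IDS[name] for name in sports.split()
    )
    return [
        ("wdrurl:includequerycontains", country_values),
        ("wdrurl:includequerycontains", sport_values),
    ]

for name, content in translate_sport_items("Greece", "Football Basketball"):
    print("<%s>%s</%s>" % (name, content, name))
```

If the database schema changes, only the two tables need updating;
documents that say "Greece" and "Football" stay as they are.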


Non-URL Identifiers
===================

Although POWDER is mostly involved with resources that are identified
by URLs, there are a number of other use cases; for example one might
use POWDER to provide meta-data about physical, off-line resources
like books or DVDs.

The International Standard Audiovisual Number [ISAN1] is a voluntary
numbering system for the identification of audiovisual works.
Following ISO 15706, the numbers are written as 24 hexadecimal
digits in the following format [ISAN2].

      -----root-----   episode      -version-
ISAN  1881-66C7-3420  -  0000  -7-  9F3A-0245  -U

The root of an ISAN number is assigned to a core work with the other
numbers being used for things like episodes, different language
versions, promotional trailers and so on.

Since ISAN numbers are URNs [URN], and hence IRIs of the urn: scheme
[URIS], a vocabulary can readily be defined to allow IRI Sets to be
defined based on ISAN numbers. The terms might be along the lines of:

includeroots: the value of which would be a white space separated list
of hexadecimal digits and hyphens that would be matched against the
first three blocks in the ISAN number.

includeepisodes: a white space separated list of hexadecimal digits
and hyphens that would be matched against the 4th block of 4 digits in
the ISAN number.

includeversions: a white space separated list of hexadecimal digits
and hyphens that would be matched against the 5th and 6th blocks of 4
digits in the ISAN number.

The set of all audiovisual resources that relate to two particular
works might then be so defined:

Custom ISAN pattern:

<wdr:iriset>
  <isan:includeroots>1881-66C7-3420 1881-66C7-3421</isan:includeroots>
</wdr:iriset>

Corresponding vanilla POWDER/XML:

<iriset>
  <includepattern>
    <xsd:pattern value="^urn:isan:((1881-66C7-3420)|(1881-66C7-3421))" />
  </includepattern>
</iriset>

This example demonstrates one major extensibility glitch in the
approach described here: numerical constraints (like, here, defining
numerical ranges for, say, the 3rd block) cannot be defined using wdr:
primitives. As the reader might also have noticed, port and IP ranges
(although specific to URLs) were hard-coded at the IRI level and not
defined as wdrurl: extensions. This is because XML types do not
provide a mechanism for using regexps to extract character groups from
strings, and then apply further numerical or other tests on the
extracted groups; a string either matches a regexp or does not, and
that is all.

One interesting approach would be to license use of XSLT 2 [XSLT2] in
the extension definitions, which provides for using regexps to extract
character groups. To be investigated.
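
To make the missing capability concrete, here is what "extract a group,
then test it numerically" looks like in a language that supports it.
This is a Python sketch and not a proposal for the extension mechanism.
The function name is made up, and the urn:isan: prefix follows the
pattern used in the example above.

```python
import re

# Capture the three 4-digit blocks of the ISAN root.
ISAN_ROOT = re.compile(
    r"^urn:isan:([0-9A-F]{4})-([0-9A-F]{4})-([0-9A-F]{4})"
)

def third_block_in_range(iri, low, high):
    """True if the 3rd root block, read as hex, lies in [low, high]."""
    match = ISAN_ROOT.match(iri)
    if match is None:
        return False
    # The step that pattern matching alone cannot express: treat the
    # extracted group as a number and compare it.
    return low <= int(match.group(3), 16) <= high

iri = "urn:isan:1881-66C7-3420-0000-7-9F3A-0245-U"
print(third_block_in_range(iri, 0x3400, 0x34FF))
print(third_block_in_range(iri, 0x3500, 0x35FF))
```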


Resource Sets
=============

One of the original desiderata of the group, later abandoned, was the
ability to group resources by property as well as by name. This is a
considerable expressivity leap for the POWDER/XML language.

This idea was abandoned at the Athens F2F, when it became obvious
that the POWDER grouping mechanism should not refer to the resources
themselves, but to the string representations of their IRIs. Since it
is the resources that have properties like being blue and not the
IRIs, the whole idea of grouping by property collapsed.

If it is important enough for POWDER, some limited expressivity
might be re-introduced in the form of a parallel grouping mechanism,
by intersecting the results of the two mechanisms before finally
applying the descriptors. In other words:

<dr>
 <iriset>
   <wdrurl:includehosts>example.com</wdrurl:includehosts>
 </iriset>
 <resourceset>
   <voc:colour ref="http://rgb.org/colours.rdf#blue"/>
 </resourceset>
 <descriptorset>
   <voc:shape>square</voc:shape>
 </descriptorset>
</dr>

might be used to express that "on example.com, all blue resources are
also square". A resource has to both be on example.com AND be blue in
order to also be described as square.

This can be very naturally expressed in OWL, and OWL tools will be
able to figure out which resources are blue, but it might be a
considerable strain on POWDER/XML tools which will care more about
efficiency than reasoning completeness. Furthermore, this opens a hole
through which circular definitions can creep, and loop detection will
also be a considerable strain on POWDER/XML implementations. My
suggestion is to drop it for the sake of efficiency or, at most, leave
an extension door open for logical statements that fall through to the
underlying POWDER-S; just in case one really needs to express such a
thing in POWDER/XML instead of OWL.



REFERENCES
==========

[1] http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes.html#rf-pattern
[2] http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html#Complex_Type_Definitions
[3] http://www.w3.org/TR/owl-semantics/syntax.html#2.1
[4] http://www.w3.org/TR/owl-semantics/mapping.html
[GRDDL] http://www.w3.org/TR/grddl/
[URN] http://www.iana.org/assignments/urn-namespaces
[ISAN1] http://www.isan.org/
[ISAN2] http://www.isan.org/portal/page?_pageid=166,41960&_dad=portal&_schema=PORTAL
[XSLT2] http://www.w3.org/TR/xslt20/

Received on Sunday, 30 March 2008 17:19:39 UTC