XML Attribute Value Namespace Expansion

Author: Eric Prud'hommeaux <eric@w3.org>

Status of this document

This is the first public draft of a discussion document for the RDF Interest Group. This document has no formal standing within W3C Process. This document is a work in progress, and does not represent the activity of any chartered working group within W3C process.

Comments from the public on this document are invited and should (with the exception of minor editorial suggestions) be sent to the RDF Interest Group, www-rdf-interest@w3.org which is an automatically archived email list. Information on how to join the RDF Interest Group mailing list can be found on the group's home page.

This document is made available for discussion only. This indicates no endorsement by W3C of its content, nor that the Consortium has, is, or will be allocating any resources to the issues addressed by this document.

Abstract

The purpose of this document is to address and remedy two practical limitations in a more abbreviated, "context-free" XML serialization of RDF: XML-native URI references, and a single, complete property per level of XML tag nesting. It is also possible and desirable to achieve this without sacrificing the design principle of having a parser capable of parsing the XML into triples without accessing a separate or embedded schema.

- may end up splitting these into separate documents - EGP

URI References
Property to Nesting Correlation
Eric's Metadata Rant

URI References

The current XML recommendation has no facilities for external references. The IDREF is defined to be an internal reference to a tag marked with an ID attribute. This is only enforceable, or even knowable, if the DTD has been processed and the attributes in the role of ID and IDREF are present in the DTD.

Why Have a Native Reference Attribute Value

RDF and other graph-describing semantics require a mechanism enabling one "node" to identify a link to another "node". While this may generate a graph without departing from the local scope of the document, that would needlessly prevent a uniform system for asserting simple links to external documents. This is key to producing useful metadata (see my rant below).

Namespaces

The XML Namespace document provides a mechanism to conveniently fully qualify tags and attributes. It does not, however, provide such a mechanism for expanding namespaces in the values of these attributes. There was little need of such a mechanism given that IDREFs may not be fully qualified. Given the desirability of natively understood external references, it becomes the responsibility of namespaces to conveniently scope these references.

Shoehorning This into Existing XML

While some may view this as a dubious goal, I believe it is a practical objective that will greatly accelerate the adoption of a formal data model. The requirements are:

The syntax must differentiate between attribute values that are resource references, and those that are text strings that may or may not look like references.
This differentiation must be in the attribute-value pair in order to associate it with the value which is being described.
It must fit in the XML production [41] Attribute ::= Name Eq AttValue

Further niceties are:

Use the ':' character as it is already associated with specifying and using namespaces.
Associate it as closely as possible with the value (Tim's observation).
It must not be in the actual contents of the value, so that it is trivially distinguishable from a coincidentally similar-looking string value.

At the bottom of this list of requirements, I was left with the syntax

attrns:attr:="valuens:value"

where the ':=' signals the namespace handler to perform a namespace expansion on the value as well as the attribute. Here attrns and valuens are namespaces defined with the

xmlns:ns="<identifying URI>"

syntax from the XML Namespaces document.
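
As a rough illustration of the expansion step, the sketch below treats a single attribute as a string, since ':=' is not accepted by existing XML parsers. The regular expression, the function names, and the hr namespace URI are assumptions made for this example and are not part of the proposal.

```python
import re

# Hypothetical prefix bindings; the "hr" URI is invented for illustration.
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "hr": "http://myCo/hr#",
}

# One attribute in the proposed syntax:
#   attrns:attr:="valuens:value"   value is a reference, expand it
#   attrns:attr="some text"        value is a plain string, leave it alone
ATTR = re.compile(r'(?P<name>[\w.-]+(?::[\w.-]+)?)(?P<op>:?=)"(?P<value>[^"]*)"')


def expand_qname(qname, namespaces):
    """Replace a declared prefix with its URI; return other names unchanged."""
    prefix, sep, local = qname.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix] + local
    return qname


def expand_attribute(text, namespaces):
    """Return (expanded name, value, is_reference) for one attribute."""
    match = ATTR.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an attribute: {text!r}")
    name = expand_qname(match["name"], namespaces)
    is_reference = match["op"] == ":="
    value = match["value"]
    if is_reference:
        # Only ':=' asks the namespace handler to expand the value.
        value = expand_qname(value, namespaces)
    return name, value, is_reference
```

Under these assumptions, expand_attribute('rdf:type:="hr:GruntLabor"', NAMESPACES) expands both the attribute name and the value, while the same text written with a plain '=' expands only the name and keeps "hr:GruntLabor" as a string.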

Property to Nesting Correlation

The motivating factor behind this refinement is the desire to extract reasonable and useful data from what appears to be the "colloquial" way of writing XML data. A typical XML document may look like this:

<HR>
  <Employee>
    <name>Renè</name>
    <addr>home</addr>
  </Employee>
</HR>

Note the lack of elements containing both #PCDATA and sub-elements. This appears to be a requirement for this "colloquial" form. The implicit relationships between these nodes appear to be:

   object1
     |HR
     V
  object2
     | Employee
     V
 object3
 |name  |addr
 V      V
"Renè" "home"

The RDF expression for this would be:

<rdf:Description rdf:about="object1">
  <HR>
    <rdf:Description rdf:about="object2">
      <Employee>
        <rdf:Description rdf:about="object3">
          <name>Renè</name>
          <addr>home</addr>
        </rdf:Description>
      </Employee>
    </rdf:Description>
  </HR>
</rdf:Description>

where the rdf:about="objectn" would likely be omitted; they are included here only to tie clearly to the previous example.

The previous example uses a fraction of the functionality of RDF. The above data could be written more descriptively:

<rdf:Description rdf:about="http://myCo/">
  <corp:HR>
    <rdf:Description rdf:about="http://myCo/hr-dept/">
      <corp:Employee>
        <rdf:Description rdf:about="http://myCo/employee-register/Descarte">
          <corp:name>Renè</corp:name>
          <corp:addr>home</corp:addr>
        </rdf:Description>
      </corp:Employee>
    </rdf:Description>
  </corp:HR>
</rdf:Description>

Note the "striping" effect in which alternating nestings are predicates. This allows the subject, predicate, and object types to all be fully qualified by a namespace. The "colloquial" and the "formal RDF" formats are very different in that the "colloquial" format requires no context for parsing the sub-elements contained in any element. An empirical study of the "colloquial" form gives these rules for triple generation:

Each tag name is a predicate in a triple where the subject is the object of the triple formed by the containing tag and the object is the subject of the contained predicate.
Predicates containing #PCDATA use the #PCDATA as the object.
Predicates containing both #PCDATA and sub-elements cause us to throw up our hands and cry "the sky is falling."
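
The rules above can be sketched with a small recursive walk. This is only one reading of them: the objectN naming follows the diagram earlier in this section, the (predicate, subject, object) ordering follows the triple notation used later in this document, and the function name and the choice to raise an error on mixed content are assumptions of the sketch.

```python
import itertools
import xml.etree.ElementTree as ET

HR_DOC = """
<HR>
  <Employee>
    <name>Renè</name>
    <addr>home</addr>
  </Employee>
</HR>
"""


def colloquial_triples(xml_text):
    """Return (predicate, subject, object) triples for "colloquial" XML."""
    ids = itertools.count(1)
    triples = []

    def visit(element, subject):
        text = (element.text or "").strip()
        children = list(element)
        stray_text = text or any((c.tail or "").strip() for c in children)
        if children and stray_text:
            # Rule 3: both #PCDATA and sub-elements; this sketch gives up.
            raise ValueError(f"mixed content in <{element.tag}>")
        if children:
            # Rule 1: the object of this predicate is the subject of
            # every predicate nested inside it.
            obj = f"object{next(ids)}"
            triples.append((element.tag, subject, obj))
            for child in children:
                visit(child, obj)
        else:
            # Rule 2: the #PCDATA is the object.
            triples.append((element.tag, subject, text))

    # The subject of the outermost predicate is an anonymous node.
    visit(ET.fromstring(xml_text), f"object{next(ids)}")
    return triples
```

Calling colloquial_triples(HR_DOC) yields one triple per element, with object1, object2 and object3 playing the same roles as in the diagram.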

Notably lacking from the "colloquial" data model are:

Subject and object (node) identification with formal URIs.
Subject and object type identification.
Native references to specify an object by a means other than nesting it.
This is necessary for describing a graph rather than a simple tree.
External references to specify resources about which this metadata applies.
Mixed Data Elements
Containers and iterators

Node Identification

This is easily handled with an attribute, just as it is in RDF. I propose the use of the same attributes as in RDF, i.e., rdf:ID and rdf:about. The series of nested predicates is most colloquial if an rdf:ID or rdf:about attribute identifies the object of that predicate. This calls for the addition of an extra attribute or tag to identify the subject of the outermost predicate. For consistency's sake, this should not be an attribute, as it would, if used on inner predicates, lead to type conflicts if it did not match or subclass the object type of the nesting predicate.

example: <Employee rdf:ID="employee#32">

Type Identification

Reuse of the rdf:type attribute seems appropriate as it has identical meaning and scope. This type identifier describes the object of the element predicate, just as an rdf:ID or rdf:about attribute identifies the object of that predicate.

example: <Employee rdf:type:="hr:GruntLabor">

External Subject Identification

Again I propose the reuse of an RDF attribute, rdf:about. The value for this attribute is an external reference to an object that may be in this resource but may be anywhere else in URI space. This replaces the rdf:ID, as the containing predicate must pick one (the rdf:about) to be the object of its triple.

example: <Employee rdf:about="http://myco/register/employee#32">

Namespace Expansion for RDF Values

The motivation for defining a way to identify and namespace-expand XML external references is to formalize and render consistent the above attributes. The values for these attributes may fully specify an external reference (except for rdf:ID) and may indicate that the namespace handler must expand these values.

Mixed Data Elements

Since there is no rule in RDF that there can't be triples (p1 s1 o1) and (p1 s1 o2), I believe that:

  <a>
    text1
    <b>text2</b>
    <b>text3</b>
  </a>

results in:

  (a genid1 text1)
  (a genid1 genid2)
  (a genid1 genid3)
  (b genid2 text2)
  (b genid3 text3)

Containers and iterators

(see RDF Bag and aboutEach). - Daniel LaLiberte's observation - @@@ needs work

I had a talk with Ralph about containers. I wanted to see why, if there are repeated properties, containers were needed and could not be handled by a reusable schema. The answer appears to be that if we want a way to add ordered elements to a container without assigning them each an ordinal, we must provide the mechanism in the basic syntax.

EXPLANATION OF THE ABOVE FAULKNERESQUE STATEMENT:

If I create an ordered list schema, I can stipulate that the ordinal must be provided as the predicate of each list entry. Reworking the above example:

  <b rdf:type:="list:ordered">
    <list:_1>text2</list:_1>
    <list:_2>text3</list:_2>
  </b>

The problem arises if I don't want the author to have to enter ordinals:

  <b rdf:type:="list:ordered">
    <list:li>text2</list:li>
    <list:li>text3</list:li>
  </b>

This doesn't work as there is no mechanism for the app that understands the list schema to retrieve these as anything but a series of unordered repeated properties:

  (b genid1 genid2)  <-- genid2 is the ordered list (like rdf:Seq)
  (list:li genid2 text3)
  (list:li genid2 text2)

I placed them in the reverse order from how it was serialized to start an argument. I contend that it would be prohibitive to make the triple database regurgitate triples in the same order they were added. I don't, however, see any reason why:

  <b rdf:type:="list:unordered">
    <list:li>text2</list:li>
    <list:li>text3</list:li>
  </b>

can't be implemented as a schema. If we dispense with the requirement for a mechanism to enter ordered repeated properties without a unique ordinal, I believe we can punt this issue for the MS-friendly syntax.
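
To make the ordinal argument concrete, the sketch below shows how an application that understands the list schema could recover the order from list:_1, list:_2 predicates no matter what order the triple store returns them in. The function name and the in-memory list of triples are assumptions of this example; with repeated list:li predicates there is nothing to sort on, so the same function recovers nothing.

```python
import re

# Ordinal predicates as used in the list:_1, list:_2 example above.
ORDINAL = re.compile(r"list:_(\d+)")


def ordered_members(triples, container):
    """Recover list order from ordinal predicates, whatever the storage order."""
    members = []
    for predicate, subject, obj in triples:
        match = ORDINAL.fullmatch(predicate)
        if subject == container and match:
            members.append((int(match.group(1)), obj))
    # Sorting on the ordinal restores the author's order.
    return [obj for _, obj in sorted(members)]


# Triples deliberately stored out of serialization order.
WITH_ORDINALS = [
    ("list:_2", "genid2", "text3"),
    ("b", "genid1", "genid2"),
    ("list:_1", "genid2", "text2"),
]

# The same list written with list:li carries no ordering information.
WITHOUT_ORDINALS = [
    ("b", "genid1", "genid2"),
    ("list:li", "genid2", "text3"),
    ("list:li", "genid2", "text2"),
]
```

Here ordered_members(WITH_ORDINALS, "genid2") returns the entries in ordinal order, while ordered_members(WITHOUT_ORDINALS, "genid2") returns an empty list.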

Colloquial XML

Incorporating node IDs into the "colloquial" example, we can enrich the XML to give us the same node specificity as the RDF example:

<rdf:Description rdf:about="http://myCo/">
  <corp:HR rdf:about="http://myCo/hr-dept/">
    <corp:Employee rdf:about="http://myCo/employee-register/Descarte">
      <corp:name>Renè</corp:name>
      <corp:addr>home</corp:addr>
    </corp:Employee>
  </corp:HR>
</rdf:Description>
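
A possible reading of this enriched form is sketched below: each rdf:about names the object of the predicate that carries it, and the outermost rdf:Description only supplies the first subject. The namespace declarations and the corp URI are added here so that a stock XML parser accepts the input; they, and the function name, are assumptions of the sketch.

```python
import itertools
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ABOUT = f"{{{RDF}}}about"

# The example above, with hypothetical namespace declarations added.
ENRICHED_DOC = """
<rdf:Description xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:corp="http://myCo/corp#"
                 rdf:about="http://myCo/">
  <corp:HR rdf:about="http://myCo/hr-dept/">
    <corp:Employee rdf:about="http://myCo/employee-register/Descarte">
      <corp:name>Renè</corp:name>
      <corp:addr>home</corp:addr>
    </corp:Employee>
  </corp:HR>
</rdf:Description>
"""


def identified_triples(xml_text):
    """Return (predicate, subject, object) triples, using rdf:about as node IDs."""
    ids = itertools.count(1)
    triples = []

    def node_id(element):
        # Fall back to a generated ID when the author gave none.
        return element.get(ABOUT) or f"genid{next(ids)}"

    def visit(element, subject):
        children = list(element)
        if children or element.get(ABOUT):
            obj = node_id(element)
            triples.append((element.tag, subject, obj))
            for child in children:
                visit(child, obj)
        else:
            text = (element.text or "").strip()
            triples.append((element.tag, subject, text))

    root = ET.fromstring(xml_text)
    # Assumption: the outermost element only names the first subject.
    subject = node_id(root)
    for child in root:
        visit(child, subject)
    return triples
```

ElementTree reports qualified names in the form {namespace-URI}local-name, so the predicates in the result are already namespace-expanded.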

The use of type identification allows us to map a type-rich form of the RDF:

<corp:Corporation rdf:about="http://myCo/">
  <corp:HR>
    <corp:Department rdf:about="http://myCo/hr-dept/">
      <corp:Employee>
        <corp:Person rdf:about="http://myCo/employee-register/Descarte">
          <corp:name>Renè</corp:name>
          <corp:addr>home</corp:addr>
        </corp:Person>
      </corp:Employee>
    </corp:Department>
  </corp:HR>
</corp:Corporation>

to:

<rdf:Description rdf:about="http://myCo/" rdf:type:="corp:Corporation">
  <corp:HR rdf:about="http://myCo/hr-dept/" rdf:type:="corp:Department">
    <corp:Employee rdf:about="http://myCo/employee-register/Descarte" rdf:type:="corp:Person">
      <corp:name>Renè</corp:name>
      <corp:addr>home</corp:addr>
    </corp:Employee>
  </corp:HR>
</rdf:Description>

(Note the addition of the outermost subject). This, combined with the attributes sprinkled liberally through the previous XML, greatly enhances the XML without using the "striped" RDF notation. All the information present in the RDF is present in the XML. The syntax does not conflict with the "colloquial" form; it merely provides a mechanism to enhance it when the author so chooses.

It is possible that, in certain parsing modes, the RDF parser may "switch on" when it sees an rdf:Description. This would allow an RDF parser to ignore sections of the XML that the author did not intend to be readable, for instance, XHTML.

syntax limitations

While this syntax is intended to allow people an alternative syntax, one which may better suit their preferred data format, it is still necessarily constraining and may seem awkward for that reason. The constraint is really just a requirement for consistency. The preferred data format may not be subject to this constraint.

@@@ need example of inconsistent but human-readable statements @@@

Statements about statements

The object of a property may be an entire statement rather than the subject of that statement. This can be done by adding an rdf:about attribute that points to the reification of the statement. The statement may also be placed inline by placing an rdf:Description in the object position of the property:

    <rdf:Description rdf:about="http://myCo/employee-register/Descarte" rdf:type:="corp:Person">
      <corp:name>Renè</corp:name>
      <corp:says>
        <rdf:Description rdf:ID="theStatement" rdf:type:="corp:Statement">
          <sophist:raison-d'être>I think</sophist:raison-d'être>
        </rdf:Description>
      </corp:says>        
    </rdf:Description>

This says that Descartes says that his reason for existence is that he thinks.

Other attribute-value pairs

The attributes ID, about, aboutEach, aboutEachPrefix, bagID, resource and type are reserved by the above syntaxes. Any attribute-value pair not listed there creates a triple where the predicate is the attribute, the subject is the object of this node, and the object is the value of the attribute.
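
A minimal sketch of this rule follows, assuming the reserved attributes live in the RDF namespace and that the node is identified by rdf:about. The corp:badge attribute, the corp URI, and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Reserved attribute names, qualified the way ElementTree reports them.
RESERVED = {
    f"{{{RDF}}}{name}"
    for name in ("ID", "about", "aboutEach", "aboutEachPrefix",
                 "bagID", "resource", "type")
}

# corp:badge is a hypothetical, non-reserved attribute.
EMPLOYEE = """
<corp:Employee xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:corp="http://myCo/corp#"
               rdf:about="http://myCo/employee-register/Descarte"
               corp:badge="32"/>
"""


def attribute_triples(element):
    """Turn every non-reserved attribute into a (predicate, subject, object) triple."""
    # The subject is the object of this node, identified here by rdf:about.
    node = element.get(f"{{{RDF}}}about")
    return [
        (name, node, value)
        for name, value in element.attrib.items()
        if name not in RESERVED
    ]
```

With this input, attribute_triples(ET.fromstring(EMPLOYEE)) produces a single triple for corp:badge; the rdf:about attribute is consumed as the node identifier and produces none.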

DTD Compliance

One interesting benefit of this reorganization is that the resulting XML conforms more closely to a DTD. This should enhance the effectiveness of tools that compile XML DTDs from RDF Schemas. This will allow greater use of existing SGML editing tools for authoring and validating RDF data.

Eric's Metadata Rant

Any piece of data that describes or refers to another piece of data may be termed metadata. In the publishing realm, the author, publisher, and copyright information are critical pieces of metadata. In database work, metadata is more likely to refer to table descriptions and valid ranges of the data contained therein. Each of these domains has its native referencing mechanism: book references for library schemas, and table references for database metadata. In the web world, the critical reference, the thing that all applications that read or process any web metadata at all must understand, is a resource reference.

A convenient artifact of our use of referencable media to record metadata is that we can make references to the metadata itself and extrapolate to arbitrary levels of meta. Try doing that in a card catalog or even an online catalog.

Eric Prud'hommeaux <eric@w3.org>
Last updated: 1999-03-29T13:43:47Z
CVS $Date: 2014/02/24 22:02:29 $ by $Author: sysbot $
