This paper describes a specific way to use XML in order to serialize graphs of data such as database tables and relations, nodes and edges from directed labeled graphs, and similar constructions. By graphs, we mean objects having properties and relations to other objects, where the relations are directed (and have inverses) and where there may be multiple paths to an object. A graph of data serialized according to the described rules is said to be in "canonical form." Other representations of the same data can be mapped into and out of the canonical form.
This paper does not change the fact that every validatable XML document conforms to a specific grammar. Rather, it proposes a way to mechanically generate, from a database's or graph's schema, a particular grammar that can be used to serialize data from the database or graph, and into which any other serialization of that data can be mapped.
When designing the canonical form, the following criteria were considered:
It is not a requirement that all serializations must be fully decodable without schema. That is, while it is beneficial if basic information can be extracted from a document lacking its schema, it is acceptable if full decoding requires schema.
Canonical syntax is syntax which obeys these five rules:
For example, consider a database or other graph (described by a UML diagram or other notation) and containing
A serialized instance would look like
<School>
  <Class id="Class:19" name="Western Civilization" taughtBy="Teacher:83"/>
  <Class id="Class:253" name="English Literature" taughtBy="Teacher:83"/>
  <Student id="Student:30006" name="Raphael" home="Address:1" attends="Class:19">
    <Address id="Address:1" street="950 Greenhill Rd" city="Mill Valley" state="CA"/>
  </Student>
  <Student id="Student:2567" name="Michael" home="Address:3" attends="Class:19 Class:253">
    <Address id="Address:3" street="28 Mountain Road" city="Lark Creek" state="CA"/>
  </Student>
  <Student id="Student:31415" name="Sandro" home="Address:4" attends="Class:253">
    <Address id="Address:4" street="14 16 Street" city="San Raphael" state="CA"/>
  </Student>
  <Teacher id="Teacher:83" name="Thorsten"/>
</School>
Entities may have relations to entities not in the serialized graph using the same general mechanism, but where the attribute's datatype is "uri". For example:
<Student id="Student:31415" name="Linda" webPage="http://home.navisoft.com/lindamann"/>
As mentioned earlier, if several entities are related by the same relation type, they are expressed as a single attribute with datatype "IDREFS". The order in which the ids are listed is presumed significant, and expresses the ordering (if any) of the collection of related entities (e.g. chapters in a book). When the order is significant, it is fundamentally an aspect of the relations between the elements (e.g. between the chapters, such that chapter one precedes chapter two, and so on).
This does not preclude application domains designing vocabulary for collections with more specialized semantics. In these cases, the semantics would be indicated by explicit collection elements, or by information in the schema for the relation attribute, as appropriate. Similarly, while these rules permit the serialization of any graph, they neither include nor preclude elements or attributes with specific semantics, including elements or attributes designed to layer on additional graph facilities such as reference, attribution or subsumption. All of these facilities can be effected by designing appropriate vocabularies and namespaces.
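The reading of relation attributes described above, including ordered IDREFS lists, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RELATIONS` table is a hypothetical stand-in for information that would normally come from the schema (the attributes typed "idref" or "idrefs"), and `load_graph` is a name invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical summary of the schema: for each element type, the attributes
# that are relations rather than properties.
RELATIONS = {
    "Class": {"taughtBy": "idref"},
    "Student": {"home": "idref", "attends": "idrefs"},
}

DOC = """
<School>
  <Class id="Class:19" name="Western Civilization" taughtBy="Teacher:83"/>
  <Class id="Class:253" name="English Literature" taughtBy="Teacher:83"/>
  <Student id="Student:2567" name="Michael" home="Address:3" attends="Class:19 Class:253">
    <Address id="Address:3" street="28 Mountain Road" city="Lark Creek" state="CA"/>
  </Student>
  <Teacher id="Teacher:83" name="Thorsten"/>
</School>
"""


def load_graph(xml_text, relations):
    """Return {id: {"type", "props", "rels"}}; each relation is an ordered list of ids."""
    root = ET.fromstring(xml_text)
    nodes = {}
    for elem in root.iter():
        node_id = elem.get("id")
        if node_id is None:
            continue  # elements without an id (here, the package root) are skipped
        rel_types = relations.get(elem.tag, {})
        props, rels = {}, {}
        for name, value in elem.attrib.items():
            if name == "id":
                continue
            if name in rel_types:
                # An IDREFS value is a whitespace-separated list; order is kept.
                rels[name] = value.split()
            else:
                props[name] = value
        nodes[node_id] = {"type": elem.tag, "props": props, "rels": rels}
    # Every reference must resolve to an entity present in the document.
    for node_id, node in nodes.items():
        for name, targets in node["rels"].items():
            for target in targets:
                if target not in nodes:
                    raise KeyError(f"{node_id}.{name} refers to unknown id {target}")
    return nodes


graph = load_graph(DOC, RELATIONS)
michael = graph["Student:2567"]
attended = [graph[i]["props"]["name"] for i in michael["rels"]["attends"]]
print(attended)
```

The sketch treats relations to entities outside the document (datatype "uri") as ordinary properties, since resolving them is outside what the instance alone can do.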
Given an arbitrary UML diagram, we can mechanically produce a canonical grammar.
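As a rough illustration of what "mechanically" could mean here, the following Python sketch emits XML-Data element types from a small model description. The `MODEL` dictionary format and the function names are assumptions made for this example, not part of the paper; the emitted vocabulary (`ElementType`, `AttributeType`, `attribute`, `dt:type`, `x:range`) follows the sample schemas in this paper.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical model description: properties are plain strings, relations
# name a target type and a cardinality ("one" or "many").
MODEL = {
    "Teacher": {"properties": ["name"], "relations": {}},
    "Class": {"properties": ["name"],
              "relations": {"taughtBy": ("Teacher", "one")}},
    "Student": {"properties": ["name"],
                "relations": {"attends": ("Class", "many")}},
}

SCHEMA_OPEN = (
    '<Schema xmlns="urn:schemas-microsoft-com:xml-data"\n'
    '        xmlns:dt="urn:schemas-microsoft-com:datatypes"\n'
    '        xmlns:x="urn:schemas-microsoft-com:xml-data-ex">'
)


def element_type(name, spec):
    """Emit one ElementType: an id, then properties, then relations."""
    lines = [f"<ElementType name={quoteattr(name)}>"]
    lines.append('  <AttributeType name="id" dt:type="id"/>')
    lines.append('  <attribute type="id"/>')
    for prop in spec["properties"]:
        lines.append(f'  <AttributeType name={quoteattr(prop)} dt:type="string"/>')
        lines.append(f"  <attribute type={quoteattr(prop)}/>")
    for rel, (target, cardinality) in spec["relations"].items():
        # A single-valued relation becomes an idref, a multi-valued one idrefs.
        datatype = "idref" if cardinality == "one" else "idrefs"
        lines.append(f'  <AttributeType name={quoteattr(rel)} dt:type="{datatype}"/>')
        lines.append(f"  <attribute type={quoteattr(rel)} x:range={quoteattr(target)}/>")
    lines.append("</ElementType>")
    return "\n".join(lines)


def generate_schema(model):
    body = "\n".join(element_type(name, spec) for name, spec in model.items())
    return f"{SCHEMA_OPEN}\n{body}\n</Schema>"


schema = generate_schema(MODEL)
print(schema)
```

The sketch omits the package element and the containment groups that appear in the sample schemas; those would need further input about which types are nested inside which.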
A fully-explicit, canonical syntax makes it easy to convert from syntax to a graph of objects. Provided one has a schema telling which attributes are IDREFs, one merely interprets all attributes as either properties or relations via IDREF. However, the canonical syntax is not the only syntax that could be used to serialize a graph. In many cases, alternative syntaxes may be used, either due to historical or political factors, or to take advantage of compressions that are available if one has domain knowledge. We call all of these "abbreviated syntaxes." For example, we might find an instance such as this:
<Class>
  <name>Western Civilization</name>
  <taughtBy>Thorsten</taughtBy>
  <attendedBy>Raphael</attendedBy>
  <attendedBy>Smith</attendedBy>
</Class>
Here, the class's name was expressed by a sub-element, and teachers and students were identified only by their names. We need a means to convert such abbreviated syntax to a fully-explicit (canonical) syntax. There are two basic approaches possible. One is to have some declarative information in the schema that restores the missing elements. The other is to use a transform language such as XSL to convert the abbreviated syntax to an explicit one.
The declarative approach is initially simpler. Each abbreviated syntactic schema declares its relation to a canonical schema and provides appropriate declarative mappings. The drawback is that this requires additions to the schema vocabulary, and can handle only a limited number of simple cases. In the real world, judging by the experience with Architectural Forms, and especially given the deployment of systems that evolve over several years, declarative mapping either fails or becomes very complex.
If we take the transform language approach, then each abbreviated syntactic schema declares its relation to a canonical schema and provides appropriate transforms to and from.
We currently favor a composite approach. For a small number of very common and simple cases we can annotate schemas with declarative mapping information in the form of attributes of the element types. The exact details of what constitutes "common and simple" remain to be determined, but candidates appear to be (a) simple renaming of elements or attributes, (b) conversion of a sub-element to an attribute, (c) inference of a relation based on element containment, (d) reference by a "foreign key" converted to reference by IDREF or URI. For more complex cases we should look to a transform language such as XSL.
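To make candidates (b) and (d) concrete, here is a minimal Python sketch that converts the abbreviated Class instance shown earlier into attribute-based syntax. Everything that drives the conversion is an assumption for illustration: the `MAPPING` table stands in for declarative annotations whose actual form this paper leaves open, `IDS_BY_NAME` is a hypothetical foreign-key lookup, the id given for "Smith" is made up, and the id of the Class itself is supplied by the caller.

```python
import xml.etree.ElementTree as ET

ABBREVIATED = """
<Class>
  <name>Western Civilization</name>
  <taughtBy>Thorsten</taughtBy>
  <attendedBy>Raphael</attendedBy>
  <attendedBy>Smith</attendedBy>
</Class>
"""

# Hypothetical declarative mapping: sub-element name -> (kind, target type).
MAPPING = {
    "name": ("property", None),               # (b) sub-element becomes attribute
    "taughtBy": ("foreign_key", "Teacher"),   # (d) name lookup becomes IDREF
    "attendedBy": ("foreign_key", "Student"),
}

# Hypothetical lookup from (type, name) to id; the id for "Smith" is invented.
IDS_BY_NAME = {
    ("Teacher", "Thorsten"): "Teacher:83",
    ("Student", "Raphael"): "Student:30006",
    ("Student", "Smith"): "Student:1",
}


def to_canonical(xml_text, mapping, ids_by_name, element_id):
    """Rewrite sub-elements as attributes; repeated relations join into one IDREFS value."""
    source = ET.fromstring(xml_text)
    result = ET.Element(source.tag, {"id": element_id})
    for child in source:
        kind, target_type = mapping[child.tag]
        text = (child.text or "").strip()
        if kind == "property":
            result.set(child.tag, text)
        else:
            ref = ids_by_name[(target_type, text)]
            previous = result.get(child.tag)
            # Document order is preserved, matching the ordering rule for IDREFS.
            result.set(child.tag, f"{previous} {ref}" if previous else ref)
    return result


canonical = to_canonical(ABBREVIATED, MAPPING, IDS_BY_NAME, "Class:19")
print(ET.tostring(canonical, encoding="unicode"))
```

A lookup by name fails as soon as two entities share a name, which is one reason the more complex cases are better left to a transform language.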
Finally, one might reasonably ask why we have a canonical syntax at all. Why not provide mappings directly to the graph's schema? But if we ask that, we also need to ask what those mappings would look like. In effect, they would map elements and attributes to objects and properties, much as XSL maps things today, but using new keywords to signal the difference in result types. Having done all that, introducing a new vocabulary for syntax-to-graph mapping, we would not have any greater functionality than that provided by the canonical syntax approach, but we would have doubled the vocabulary needed. Further, we would require that all clients of XML implement mapping machinery (while with the canonical syntax approach a server could choose to emit canonical syntax, thereby avoiding any need for a special mapper). We would not be able to leverage future developments in XSL. Finally, we would not be providing any clear suggestions for syntax that people should use, and would therefore greatly increase the actual amount of mapping that would need to occur.
This sample schema uses XML-Data notation to describe a vocabulary and syntax for serializing the example data of Classes, Students and Teachers.
<?xml version="1.0" encoding="windows-1252" ?>
<!-- Schema for package ClassesStudentsTeachers -->
<Schema xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes"
        xmlns:x="urn:schemas-microsoft-com:xml-data-ex">
  <!-- ***** TYPE Address ***** -->
  <ElementType name="Address">
    <AttributeType name="id" dt:type="id"/>
    <attribute type="id" />
  </ElementType>
  <!-- ***** TYPE Class ***** -->
  <ElementType name="Class">
    <AttributeType name="id" dt:type="id"/>
    <attribute type="id" />
    <AttributeType name="name" dt:type="string"/>
    <attribute type="name" required='yes' />
    <AttributeType name="taughtBy" dt:type="idref" />
    <attribute type="taughtBy" required='yes' x:range="Teacher"/>
  </ElementType>
  <!-- ***** TYPE Student ***** -->
  <ElementType name="Student">
    <AttributeType name="id" dt:type="id"/>
    <attribute type="id" />
    <AttributeType name="attends" dt:type="idrefs" />
    <attribute type="attends" x:range="Class"/>
    <AttributeType name="home" dt:type="idref" />
    <attribute type="home" x:range="Address"/>
    <AttributeType name="name" dt:type="string"/>
    <attribute type="name" required='yes' />
    <group seq='many'>
      <element type="Address" minOccurs="0" maxOccurs="1" />
    </group>
  </ElementType>
  <!-- ***** TYPE Teacher ***** -->
  <ElementType name="Teacher">
    <AttributeType name="id" dt:type="id"/>
    <attribute type="id" />
    <AttributeType name="name" dt:type="string"/>
    <attribute type="name" required='yes' />
  </ElementType>
  <!-- The PACKAGE -->
  <!-- ***** TYPE School ***** -->
  <ElementType name="School">
    <AttributeType name="id" dt:type="id"/>
    <attribute type="id" />
    <AttributeType name="classes" dt:type="idrefs" />
    <attribute type="classes" x:range="Class"/>
    <AttributeType name="students" dt:type="idrefs" />
    <attribute type="students" x:range="Student"/>
    <AttributeType name="teachers" dt:type="idrefs" />
    <attribute type="teachers" x:range="Teacher"/>
    <group seq='many'>
      <element type="Student" minOccurs="0" maxOccurs="*" />
      <element type="Class" minOccurs="0" maxOccurs="*" />
      <element type="Teacher" minOccurs="0" maxOccurs="*" />
    </group>
  </ElementType>
</Schema>
The "Northwind" database is a sample database supplied with Microsoft Access, containing representative tables for a hypothetical business.
<?xml version="1.0" encoding="windows-1252" ?>
<!-- Schema for package Northwind Database -->
<Schema xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes"
        xmlns:x="urn:schemas-microsoft-com:xml-data-ex">
  <!-- ***** TYPE Category ***** -->
  <ElementType name="Category">
    <AttributeType name="id" dt:type="id"/>
    <attribute type="id" />
    <AttributeType name="categoryID" dt:type="string" x:type="int"/>
    <attribute type="categoryID" required='yes' />
    <AttributeType name="categoryName" dt:type="string"/>
    <attribute type="categoryName" required='yes' />
    <AttributeType name="description" dt:type="string"/>
    <attribute type="description" required='yes' />
  </ElementType>
  <!-- ***** TYPE Customer ***** -->
  <ElementType name="Customer">
    <AttributeType name="id" dt:type="id"/>
    <attribute type="id" />
    <AttributeType name="address" dt:type="string"/>
    <attribute type="address" required='yes' />
    <AttributeType name="city" dt:type="string"/>
    <attribute type="city" required='yes' />
    <AttributeType name="companyName" dt:type="string"/>
    <attribute type="companyName" required='yes' />
    <AttributeType name="contactName" dt:type="string"/>
    <attribute type="contactName" required='yes' />
    <AttributeType name="contactTitle" dt:type="string"/>
    <attribute type="contactTitle" required='yes' />
    <AttributeType name="country" dt:type="string"/>
    <attribute type="country" required='yes' />
    <AttributeType name="customerID" dt:type="string"/>
    <attribute type="customerID" required='yes' />
    <AttributeType name="fax" dt:type="string"/>
    <attribute type="fax" required='yes' />
    <AttributeType name="phone" dt:type="string"/>
    <attribute type="phone" required='yes' />
    <AttributeType name="postalCode" dt:type="string"/>
    <attribute type="postalCode" required='yes' />
  </ElementType>
  <!-- ***** TYPE Employee ***** -->
  <ElementType name="Employee">
    <AttributeType name="id" dt:type="id"/>
    <attribute type="id" />
    <AttributeType name="address" dt:type="string"/>
    <attribute type="address" required='yes' />
    <AttributeType name="birthDate" dt:type="string" x:type="date"/>
    <attribute type="birthDate" required='yes' />
    <AttributeType name="city" dt:type="string"/>
    <attribute type="city" required='yes' />
    <AttributeType name="country" dt:type="string"/>
    <attribute type="country" required='yes' />
    <AttributeType name="employeeID" dt:type="string" x:type="int"/>
    <attribute type="employeeID" required='yes' />
    <AttributeType name="firstName" dt:type="string"/>
    <attribute type="firstName" required='yes' />
    <AttributeType name="hireDate" dt:type="string" x:type="date"/>
    <attribute type="hireDate" required='yes' />
    <AttributeType name="homePhone" dt:type="string"/>
    <attribute type="homePhone" required='yes' />
    <AttributeType name="lastName" dt:type="string"/>
    <attribute type="lastName" required='yes' />
    <AttributeType name="notes" dt:type="string"/>
    <attribute type="notes" required='yes' />
    <AttributeType name="postalCode" dt:type="string"/>
    <attribute type="postalCode" required='yes' />
    <AttributeType name="region" dt:type="string"/>
    <attribute type="region" required='yes' />
    <AttributeType name="reportsTo" dt:type="idref" />
    <attribute type="reportsTo" required='yes' x:range="Employee"/>
    <AttributeType name="title" dt:type="string"/>
    <attribute type="title" required='yes' />
    <AttributeType name="titleOfCourtesy" dt:type="string"/>
    <attribute type="titleOfCourtesy" required='yes' />
  </ElementType>
  <!-- ***** TYPE Order ***** -->
  <ElementType name="Order">
    <AttributeType name="id" dt:type="id"/>
    <attribute type="id" />
    <AttributeType name="customer" dt:type="idref" />
    <attribute type="customer" required='yes' x:range="Customer"/>
    <AttributeType name="employee" dt:type="idref" />
    <attribute type="employee" required='yes' x:range="Employee"/>
    <AttributeType name="freight" dt:type="string" x:type="float"/>
    <attribute type="freight" required='yes' />
    <AttributeType name="orderDate" dt:type="string" x:type="date"/>
    <attribute type="orderDate" required='yes' />
    <AttributeType name="orderID" dt:type="string" x:type="int"/>
    <attribute type="orderID" required='yes' />
    <AttributeType name="requiredDate" dt:type="string" x:type="date"/>
    <attribute type="requiredDate" required='yes' />
    <AttributeType name="shipAddress" dt:type="string"/>
    <attribute type="shipAddress" required='yes' />
    <AttributeType name="shipCity" dt:type="string"/>
    <attribute type="shipCity" required='yes' />
    <AttributeType name="shipCountry" dt:type="string"/>
    <attribute type="shipCountry" required='yes' />
    <AttributeType name="shipName" dt:type="string"/>
    <attribute type="shipName" required='yes' />
    <AttributeType name="shippedDate" dt:type="string" x:type="date"/>
    <attribute type="shippedDate" required='yes' />
    <AttributeType name="shipPostalCode" dt:type="string"/>
    <attribute type="shipPostalCode" required='yes' />
    <AttributeType name="shipRegion" dt:type="string"/>
    <attribute type="shipRegion" required='yes' />
    <AttributeType name="shipVia" dt:type="idref" />
    <attribute type="shipVia" required='yes' x:range="Shipper"/>
  </ElementType>
  <!-- ***** TYPE OrderDetail ***** -->
  <ElementType name="OrderDetail">
    <AttributeType name="id" dt:type="id"/>
    <attribute type="id" />
    <AttributeType name="discount" dt:type="string" x:type="float"/>
    <attribute type="discount" required='yes' />
    <AttributeType name="order" dt:type="idref" />
    <attribute type="order" required='yes' x:range="Order"/>
    <AttributeType name="product" dt:type="idref" />
    <attribute type="product" required='yes' x:range="Product"/>
    <AttributeType name="quantity" dt:type="string" x:type="float"/>
    <attribute type="quantity" required='yes' />
    <AttributeType name="unitPrice" dt:type="string" x:type="float"/>
    <attribute type="unitPrice" required='yes' />
  </ElementType>
  <!-- ***** TYPE Product ***** -->
  <ElementType name="Product">
    <AttributeType name="id" dt:type="id"/>
    <attribute type="id" />
    <AttributeType name="category" dt:type="idref" />
    <attribute type="category" required='yes' x:range="Category"/>
    <AttributeType name="discontinued" dt:type="string" x:type="boolean"/>
    <attribute type="discontinued" required='yes' />
    <AttributeType name="productID" dt:type="string" x:type="int"/>
    <attribute type="productID" required='yes' />
    <AttributeType name="productName" dt:type="string"/>
    <attribute type="productName" required='yes' />
    <AttributeType name="quantityPerUnit" dt:type="string"/>
    <attribute type="quantityPerUnit" required='yes' />
    <AttributeType name="reorderLevel" dt:type="string" x:type="int"/>
    <attribute type="reorderLevel" required='yes' />
    <AttributeType name="supplier" dt:type="idref" />
    <attribute type="supplier" required='yes' x:range="Supplier"/>
    <AttributeType name="unitPrice" dt:type="string" x:type="float"/>
    <attribute type="unitPrice" required='yes' />
    <AttributeType name="unitsInStock" dt:type="string" x:type="int"/>
    <attribute type="unitsInStock" required='yes' />
    <AttributeType name="unitsOnOrder" dt:type="string" x:type="int"/>
    <attribute type="unitsOnOrder" required='yes' />
  </ElementType>
  <!-- ***** TYPE Shipper ***** -->
  <ElementType name="Shipper">
    <AttributeType name="id" dt:type="id"/>
    <attribute type="id" />
    <AttributeType name="companyName" dt:type="string"/>
    <attribute type="companyName" required='yes' />
    <AttributeType name="phone" dt:type="string"/>
    <attribute type="phone" required='yes' />
    <AttributeType name="shipperID" dt:type="string" x:type="int"/>
    <attribute type="shipperID" required='yes' />
  </ElementType>
  <!-- ***** TYPE Supplier ***** -->
  <ElementType name="Supplier">
    <AttributeType name="id" dt:type="id"/>
    <attribute type="id" />
    <AttributeType name="address" dt:type="string"/>
    <attribute type="address" required='yes' />
    <AttributeType name="city" dt:type="string"/>
    <attribute type="city" required='yes' />
    <AttributeType name="companyName" dt:type="string"/>
    <attribute type="companyName" required='yes' />
    <AttributeType name="contactName" dt:type="string"/>
    <attribute type="contactName" required='yes' />
    <AttributeType name="contactTitle" dt:type="string"/>
    <attribute type="contactTitle" required='yes' />
    <AttributeType name="country" dt:type="string"/>
    <attribute type="country" required='yes' />
    <AttributeType name="fax" dt:type="string"/>
    <attribute type="fax" required='yes' />
    <AttributeType name="homePage" dt:type="string" x:type=""/>
    <attribute type="homePage" required='yes' />
    <AttributeType name="phone" dt:type="string"/>
    <attribute type="phone" required='yes' />
    <AttributeType name="postalCode" dt:type="string"/>
    <attribute type="postalCode" required='yes' />
    <AttributeType name="region" dt:type="string"/>
    <attribute type="region" required='yes' />
    <AttributeType name="supplierID" dt:type="string" x:type="int"/>
    <attribute type="supplierID" required='yes' />
  </ElementType>
  <!-- The PACKAGE -->
  <!-- ***** TYPE Nwind ***** -->
  <ElementType name="Nwind">
    <AttributeType name="id" dt:type="id"/>
    <attribute type="id" />
    <AttributeType name="categories" dt:type="idrefs" />
    <attribute type="categories" x:range="Category"/>
    <AttributeType name="customers" dt:type="idrefs" />
    <attribute type="customers" x:range="Customer"/>
    <AttributeType name="employees" dt:type="idrefs" />
    <attribute type="employees" x:range="Employee"/>
    <AttributeType name="orderDetails" dt:type="idrefs" />
    <attribute type="orderDetails" x:range="OrderDetail"/>
    <AttributeType name="orders" dt:type="idrefs" />
    <attribute type="orders" x:range="Order"/>
    <AttributeType name="products" dt:type="idrefs" />
    <attribute type="products" x:range="Product"/>
    <AttributeType name="shippers" dt:type="idrefs" />
    <attribute type="shippers" x:range="Shipper"/>
    <AttributeType name="suppliers" dt:type="idrefs" />
    <attribute type="suppliers" x:range="Supplier"/>
    <group seq='many'>
      <element type="Order" minOccurs="0" maxOccurs="*" />
      <element type="Customer" minOccurs="0" maxOccurs="*" />
      <element type="Employee" minOccurs="0" maxOccurs="*" />
      <element type="Supplier" minOccurs="0" maxOccurs="*" />
      <element type="Category" minOccurs="0" maxOccurs="*" />
      <element type="Product" minOccurs="0" maxOccurs="*" />
      <element type="OrderDetail" minOccurs="0" maxOccurs="*" />
      <element type="Shipper" minOccurs="0" maxOccurs="*" />
    </group>
  </ElementType>
</Schema>