ACTION-158 punctuation schema for the DM

During the last meeting, we discussed picking a punctuation schema while
also asking the community for feedback on a set of choices
(perfectly legit in an LC document). This summary can help us pick:


= Problem =
  Define rules which create unambiguous identifiers for database rows,
  columns and references (foreign keys).
  Extra credit if they are easy to parse by human or machine and easy
  to express in SPARQL, Turtle, RIF, RDF/XML ("STRR" below).

These URIs are composed from table and attribute names, attribute
values, and miscellaneous punctuation. This email is about tweaking
the punctuation to get the most simplicity in the most use cases.

Rules in <http://www.w3.org/2001/sw/rdb2rdf/directMapping/explicitFK>:
 Row IRI: base + table + '/' + attr¹ + '-' + val¹ + '.' … attrⁿ + '-' + valⁿ
 Column IRI: base + table + '#' + attr
 Reference IRI: base + table + '#' + 'ref-' + attr¹ + '.' … attrⁿ

This uses the '.' separator between attributes in both row IRIs and
reference IRIs. The attrⁿ/valⁿ separator is '-' (for simplicity in
STRR). Outlining some popular choices:

         row IRI              ref IRI
① attr¹-val¹.attrⁿ-valⁿ   ref-attr¹.attrⁿ
② attr¹.val¹-attrⁿ.valⁿ   ref-attr¹-attrⁿ
③ attr¹-val¹.attrⁿ-valⁿ   ref-attr¹-attrⁿ
④ attr¹=val¹,attrⁿ=valⁿ   ref-attr¹-attrⁿ
⑤ attr¹.val¹.attrⁿ.valⁿ   ref.attr¹.attrⁿ


= Examples =
 Given some tables with PKs:
┌┤Simple├────┬───────┐  ┌┤People├────┬─────────┐  ┌┤Events├────┬────────────┬─────────┐
│┌pk┐│       │       │  │┌──────────pk────────┐│  │┌────pk────┐│┌─────↬People.pk─────┐│
│ PK │ attrA │ attrB │  │   fname    │  lname  │  │    date    │    orgfn   │  orgln  │
│  1 │ valA1 │ valB2 │  │      "Bob" │ "Smith" │  │ 2012-01-01 │      "Bob" │ "Smith" │
│  2 │ valA2 │ valB2 │  │  "Madonna" │      "" │  │ 2011-12-25 │  "Madonna" │      "" │
└────┴───────┴───────┘  │     "T in" │ "Ya-Li" │  │ 2012-04-06 │     "T in" │ "Ya-Li" │
                        │ "أكرم.عبد" │   "كور" │  │ 2011-10-01 │ "أكرم.عبد" │   "كور" │
                        └────────────┴─────────┘  └────────────┴────────────┴─────────┘

┤Simple├ has your run-of-the-mill integer primary key and alphanumeric
attribute names and values. ┤People├ and ┤Events├ have alphanumeric attribute
names. (Attribute names which are not exclusively alphanumeric are
horrible no matter what; they don't help us discriminate between our options.)

== Example Row IRIs ==
We see these Row IRIs (eliding <base + ...>) for the first rows of
these tables, given the choices of punctuation listed above.

  ①  Simple/PK-1 │ People/fname-Bob.lname-Smith │ Events/date-2012-01-01
  ②  Simple/PK.1 │ People/fname.Bob-lname.Smith │ Events/date.2012%2D01%2D01
  ③  Simple/PK-1 │ People/fname-Bob.lname-Smith │ Events/date-2012-01-01
  ④  Simple/PK=1 │ People/fname=Bob,lname=Smith │ Events/date=2012-01-01
  ⑤  Simple/PK.1 │ People/fname.Bob.lname.Smith │ Events/date.2012-01-01

== Reference (predicate) IRIs ==
Reference (predicate) IRIs for ┤Simple├ are simple and boring: table#ref-attr .
┤Events├'s references to ┤People├ take two attributes:

  ①  Events#ref-orgfn.orgln
  ②  Events#ref-orgfn-orgln
  ③  Events#ref-orgfn-orgln
  ④  Events#ref-orgfn-orgln
  ⑤  Events#ref.orgfn.orgln
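
As a rough illustration, the row and reference IRIs listed above can be
generated from the separator choices alone. This is only a sketch: the
names SCHEMES, row_iri and ref_iri are made up for this mail, the
separators are read off the table of choices, reference IRIs follow the
"base + table + '#' + ..." rule, and no escaping is attempted.

```python
# Sketch only: hypothetical helper names, separators taken from the table of choices.
# Each entry: (attr/value sep, pair sep, sep after 'ref', sep between ref attrs).
SCHEMES = {
    1: ("-", ".", "-", "."),
    2: (".", "-", "-", "-"),
    3: ("-", ".", "-", "-"),
    4: ("=", ",", "-", "-"),
    5: (".", ".", ".", "."),
}


def row_iri(base, table, key, scheme):
    """Row IRI: base + table + '/' + the primary key's attr/value pairs."""
    av_sep, pair_sep, _, _ = SCHEMES[scheme]
    pairs = [f"{attr}{av_sep}{val}" for attr, val in key]
    return f"{base}{table}/{pair_sep.join(pairs)}"


def ref_iri(base, table, attrs, scheme):
    """Reference IRI: base + table + '#' + 'ref' + the foreign key's attributes."""
    _, _, ref_sep, attr_sep = SCHEMES[scheme]
    return f"{base}{table}#ref{ref_sep}{attr_sep.join(attrs)}"


# Values here need no escaping; the base is elided, as in the listings.
bob = [("fname", "Bob"), ("lname", "Smith")]
for n in SCHEMES:
    print(n, row_iri("", "People", bob, n), ref_iri("", "Events", ["orgfn", "orgln"], n))
```

Keeping the punctuation in one table makes it easy to see that the
choices differ only in four characters.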


= What needs escaping =
The character used to separate attr/value pairs dictates which
characters require escaping in values. ② requires escaping '-'s;
①③⑤ require escaping '.'s and ④ requires escaping ','s. Row
identifiers for rows 3 and 4 of ┤People├ illustrate this:

  ①  People/fname-T%20in.lname-Ya-Li   │ People/fname-أكرم%2Eعبد.lname-كور
  ②  People/fname.T%20in-lname.Ya%2DLi │ People/fname.أكرم.عبد-lname.كور
  ③  People/fname-T%20in.lname-Ya-Li   │ People/fname-أكرم%2Eعبد.lname-كور
  ④  People/fname=T%20in,lname=Ya-Li   │ People/fname=أكرم.عبد,lname=كور
  ⑤  People/fname.T%20in.lname.Ya-Li   │ People/fname.أكرم%2Eعبد.lname.كور

(We can also follow the HTML5, WSDL, ... url-encoding spec and
 turn ' ' into '+' instead of '%20'.)
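
A minimal sketch of the escaping rule described in this section: the
function name escape_value and the small set of always-escaped
characters (space, '%', '/', '#', '?') are assumptions of mine, not
part of the draft rules. Only the pair separator varies per choice, and
non-ASCII characters are left alone since these are IRIs.

```python
def escape_value(value, pair_sep):
    """Percent-encode characters that would break a row IRI segment.

    Assumption: space, '%', '/', '#' and '?' are always escaped; the pair
    separator of the chosen punctuation schema is escaped in addition.
    """
    reserved = set(" %/#?") | {pair_sep}
    return "".join(f"%{ord(ch):02X}" if ch in reserved else ch for ch in value)


# '-' only needs escaping when it separates attr/value pairs (choice 2).
print(escape_value("Ya-Li", "-"))
print(escape_value("Ya-Li", "."))
print(escape_value("T in", ","))
```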


= SPARQL, Turtle, RIF, RDF/XML =
RDF Rules (RIF BLD, SPARQL CONSTRUCT) generally express patterns over
predicates, without having to identify Row IRIs. Queries include Row
identifiers a bit more (the savvy user or tool will select an entity
by identifier rather than distinguishing attributes) and Turtle (the
data) will of course include both.

All of these languages allow the use of relative IRIs and prefixed
names. A prefixed query of a People table for ① looks like:

  PREFIX pplinst: <http://hr.myco.example/2011/schemas/People/>
  PREFIX pplschm: <http://hr.myco.example/2011/schemas/People#>
  SELECT ?event
   WHERE {
     pplinst:fname-Bob.lname-Smith pplschm:atEvent ?event
   }

And the relative IRI query looks like:

  BASE <http://hr.myco.example/2011/schemas/>
  SELECT ?event
   WHERE {
     <People/fname-Bob.lname-Smith> <People#atEvent> ?event
   }
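
To check that the two queries above name the same resources, here is a
small sketch using Python's urllib.parse.urljoin as a stand-in for
relative IRI resolution (an assumption that holds for these ASCII
examples), and plain string concatenation for prefix expansion.

```python
from urllib.parse import urljoin

BASE = "http://hr.myco.example/2011/schemas/"
PPLINST = BASE + "People/"  # expansion of the pplinst: prefix
PPLSCHM = BASE + "People#"  # expansion of the pplschm: prefix

local = "fname-Bob.lname-Smith"

# A prefixed name expands by concatenating the namespace and the local part.
prefixed_subject = PPLINST + local
prefixed_predicate = PPLSCHM + "atEvent"

# A relative IRI is resolved against the BASE declaration.
relative_subject = urljoin(BASE, "People/" + local)
relative_predicate = urljoin(BASE, "People#atEvent")

print(prefixed_subject == relative_subject)
print(prefixed_predicate == relative_predicate)
```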

Extending the use case to gain some SemWeb utility, we join two
databases, those of the HR and catering departments:

  PREFIX pplinst: <http://hr.myco.example/2011/schemas/People/>
  PREFIX pplschm: <http://hr.myco.example/2011/schemas/People#>
  PREFIX cater: <http://hr.myco.example/2011/schemas/People#>
  SELECT ?start ?end
   WHERE {
     pplinst:fname-Bob.lname-Smith pplschm:atEvent ?event .
     ?event cater:start ?start ; cater:end ?end
   }

The customary URI escape character, '%', is not permitted in prefixed
names (nor are ',' and '='). The various row ID schemas have different
impacts on the expressivity in prefixed names given different values:

         row ID            pos int   neg int   alphanum   date   float
① attr¹-val¹.attrⁿ-valⁿ       ✓         ✓         ✓        ✓
② attr¹.val¹-attrⁿ.valⁿ       ✓                   ✓                ✓
④ attr¹=val¹,attrⁿ=valⁿ
⑤ attr¹.val¹.attrⁿ.valⁿ       ✓         ✓         ✓        ✓

(③ varies from ① only in the reference IRIs)
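
The table above can be approximated mechanically. The sketch below is
hedged in several ways: fits_prefixed_name is only a rough
approximation of the local-name grammar (letters, digits, '_', '-' and
'.', not ending in '.'), the escaping set is my assumption, and the
sample values are picked just to stand for each column.

```python
# Hypothetical helpers; (attr/value sep, pair sep) come from the table of choices.
SEPARATORS = {1: ("-", "."), 2: (".", "-"), 4: ("=", ","), 5: (".", ".")}
SAMPLES = {"pos int": "1", "neg int": "-1", "alphanum": "Bob",
           "date": "2012-01-01", "float": "1.5"}


def escape_value(value, pair_sep):
    """Escape a few assumed reserved characters plus the pair separator."""
    reserved = set(" %/#?") | {pair_sep}
    return "".join(f"%{ord(ch):02X}" if ch in reserved else ch for ch in value)


def fits_prefixed_name(local):
    """Rough check: would this local part be legal without '%', ',' or '='?"""
    if not local or local.endswith("."):
        return False
    if not (local[0].isalnum() or local[0] == "_"):
        return False
    return all(ch.isalnum() or ch in "_-." for ch in local)


def expressible(scheme):
    """Kinds of value that survive as a prefixed name under this choice."""
    av_sep, pair_sep = SEPARATORS[scheme]
    return [kind for kind, value in SAMPLES.items()
            if fits_prefixed_name(f"attr{av_sep}{escape_value(value, pair_sep)}")]


for scheme in SEPARATORS:
    print(scheme, expressible(scheme))
```

A value drops out of a row as soon as it needs a '%' escape, which is
why the pair separator matters more than the attr/value separator here.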

For an example of negative integer primary keys, this table uses -2
and -1 to represent a couple of access control groups common to all
Apache servers:

┌┤AccessRoles├───────┐
│┌pk┐│               │
│ ID │  desc         │
│ -2 │ "known users" │
│ -1 │       "world" │
│  1 │   "marketing" │
│  2 │  "management" │
└────┴───────────────┘


= The balance =
I see us as pushing a slider between optimizing for readability
("attr¹=val¹,attrⁿ=valⁿ") and optimizing for usability (being able to
write/query the data with prefixed names). As Richard points out, we
can write/query the data for an individual database using an @base
directive and relative IRIs. The punctuation choice matters for users
who write data/queries with prefixed names (e.g. queries connecting
multiple databases).

IMO, ④ is the most readable and ⑤ is the most usable, with ① being my
idea of the sweet spot. ⑤ gives us the simplest encoding rules and ②
is less likely to be confused with the '.' addressing scheme used in
SQL.

-- 
-ericP

Received on Thursday, 8 September 2011 20:35:05 UTC