Re: Direct Mapping Spec - Comments

* David McNeil <dmcneil@revelytix.com> [2011-08-10 10:12-0500]
> I read the latest Direct Mapping spec [1] (only skimmed Appendices). Below,
> identified by section number, are the comments I had while reading it.
> 
> -David
> 
> [1] http://www.w3.org/2001/sw/rdb2rdf/directMapping/EGP
> 
> ====
> 
> 1 - "intended to provide a default behavior for R2RML" - It might be worth
> reconsidering the wording of this to avoid implying that R2RML prescribes
> this as default behavior.

PROPOSE: strike the "default behavior for R2RML" sentence and reduce
the paragraph to:
[[
The Direct Mapping can be used to materialize RDF graphs or define
virtual graphs, which can be queried by SPARQL or traversed by an RDF
graph API.
]]

> 1 - "It can be also used" - awkward sentence structure

fixed - "It can also be used"

> 2 - Wrong URL for RFC3987 link.

fixed (assuming you mean the internal href to the references section)

> 2 - I found the sudden transition to talking about FKs to be a bit jarring.
> Maybe there is a way to make this flow better?

I tweaked the wording and put it in a new paragraph.

> 2 - "This graph is composed of relative IRIs" - I know this has been
> discussed on the mailing list, but this is non-standard, eh? Isn't IRI
> prefixing a serialization issue? Also, does the user provide the base IRI?
> As I recall a goal was for the direct mapping to run without any user
> configurable options beyond pointing it at a database.

I don't believe we'll ever avoid the base IRI, but the nice thing
is that any web interface or any materialization accessible via a
filesystem or web will provide that base IRI. This text about the
base IRI serves to advise those who provide web-less access, e.g.
SMTP, to their database that in order to manipulate it with web
tools (say, query with Jena ARQ), they need to make up a base IRI.

I've tweaked the text a bit to hopefully provide that advice.

[[
The algorithms in this document compose a graph of relative IRIs which
must be resolved against a *base* IRI [RFC3987] to form an RDF graph.
]]

> 2.1 - For clarity, the "People" PRIMARY KEY clause should not be on the same
> line as the "addr" field.

moved to next to the ID declaration:

[[
CREATE TABLE Addresses (
 ID INT, PRIMARY KEY(ID), 
 city CHAR(10), 
 state CHAR(2)
)

CREATE TABLE People (
 ID INT, PRIMARY KEY(ID), 
 fname CHAR(10), 
 addr INT, 
 FOREIGN KEY(addr) REFERENCES Addresses(ID)
)
]]


> 2.1 - Per standard SQL, I think the string literals in the INSERT statements
> should have single quotes so they are not interpreted as identifiers.

done

> 2.1 - I think using the first row of a table for DB metadata is confusing
> given the widely understood model of having the column names as the first
> row. Especially considering that the fonts are the same. Maybe if the
> metadata were in non-bold italics it would be easier to read?

added
[[
tr.SQLconstraints th { font-weight: normal;
         font-style: italic; }
]]
to <local.css>. 'zat work?


> 2.2 - "compound and composite" - At first glance this seems redundant.

agreed. Every compound key is a composite key, and it's the
composite-ness we're after here:

[[
More complex schemas include composite primary keys.  In this example,
the columns deptName and deptCity in the People table reference name
and city in the Department table:
]]


> 2.2 - "People tables's" - It is still early, but that can't be the right
> apostrophe use?

sure, and we voice it "people tabelzezez", or better, a post-positive
adjective form, "People's table", which should be on a banner at a May
day rally.

now: "People table's"


> 2.2 - "The referent identifier (object of the above predicate)" - For
> clarity, I would just say "the object"

done

> 2.2 - +1 to Souri's observation that this approach does not handle multiple
> foreign keys from the same columns.

The current text says
[[
a reference triple for each <column name list> in a table's foreign
keys where none of the column values is NULL.
]]

which means that we generate multiple arcs for the same column
list. Given an access control scenario:

CREATE TABLE Principles (ID INT PRIMARY KEY, created STRING);
INSERT INTO Principles (ID, created) VALUES (2, "2011-09-10");
INSERT INTO Principles (ID, created) VALUES (3, "2011-09-10");
CREATE TABLE Users (ID INT PRIMARY KEY, name STRING, FOREIGN KEY (ID) REFERENCES Principles(ID));
INSERT INTO Users (ID, name) VALUES (2, "Bob");
CREATE TABLE IPAddrs (ID INT PRIMARY KEY, ip STRING, FOREIGN KEY (ID) REFERENCES Principles(ID));
INSERT INTO IPAddrs (ID, ip) VALUES (3, "81.23.2.200");
CREATE TABLE Roles (ID INT PRIMARY KEY, permissions STRING, FOREIGN KEY (ID) REFERENCES Users(ID), FOREIGN KEY (ID) REFERENCES Principles(ID));
INSERT INTO Roles (ID, permissions) VALUES (2, "rwx");

            ┌┤Principles├─────┐
            │ ID │ created    │
            │  2 │ 2011-09-10 │
            │  3 │ 2011-09-10 │
            └────┴────────────┘
            /    \
┌┤Users├─────┐  ┌┤IPAddrs├─────────┐
│ ID │ name  │  │ ID │ ip          │
│  2 │ "Bob" │  │  3 │ 18.23.2.200 │
└────┴───────┘  └────┴─────────────┘

┌┤Roles├─────────────┐
│ user │ permissions │
│    2 │       "rwx" │
└──────┴─────────────┘

Roles could be argued to be a foreign key to both Users and Principles
(though presumably, Users.ID already has a foreign key constraint on
Principles.ID so (Roles.user) → (Principles (ID)) is redundant). At
present, the DM gives you multiple arcs for the foreign key name (ID):

  <Roles/ID.2> a <Roles> ;
        <Roles#ID> <Users/ID.2> , <Principles/ID.2> ;
        <Roles#permissions> "rwx" .

which matches my intuition of what we've told the system with your two
foreign keys. BTW, you can go to
  <http://this-db-really.does-not-exist.org/>
and enter the above DDL and an identity CONSTRUCT: [[
CONSTRUCT {
  ?s ?p ?o .
} WHERE {
  ?s ?p ?o .
}
]] to see this in action.


> 2.2 - ":(deptName, deptCity) is a multi-column foreign key in the table
> People which references the multi-column candidate key (name, city) in the
> table Department." This is awkward to read and is just a repetition of the
> formal FK definition in the DDL. I would omit it.

gone

> 2.3 - I realized that I wasn't sure if I was reading the spec, or reading an
> example. It seems to me that the text needs to be more clearly identified,
> on a paragraph basis as to whether it is an informal description of the spec
> or a concrete example. For example, the R2RML spec highlights examples with
> an alternate color and a surrounding label/box. Personally I think I would
> swap the order of sections 3 & 2 or intersperse the examples from section 2
> into section 3.

I'm sympathetic to that, but I think it needs a real work-up and
presentation to the WG. I would work with you on it, but I'm not
likely to do this on my own.

> 2.3 "would have been generated" - Seems the text is clearer to read if we
> can stick to a more active voice.

I agree in principle but I don't think [Multi-column primary keys]
worth an entirely new example.

> 2.4 "(for keeping track of tweets in Twitter)" - I would find a way to
> remove the parens.

done -
[[
A Tweets table can be added to the above example to keep track of
employees' tweets in Twitter:
]]
(I know, subjunctive mood; wording proposals welcome.)

> 2.4 "It is not possible to dereference blank nodes" - I don't immediately
> see what the point of this statement is.

yeah, that's a linked data issue; not a big deal for folks working in
the query world (though an argument for linked data vigilance is that
motivates quality data which db-heads want to query).

trying [[
It is not possible to dereference blank nodes ("_:a" and "_:b" above)
per the dereferencability goals of Linked Data.  Queries or updates
may be made to these nodes via SPARQL queries.
]]
with a link to
<http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/#datamodel>
. (Request to Richard, can we have anchors on the minor headings like
"Dereferencing HTTP URIs"?


> 2.5 - I suspect this has been discussed at great length in the past, but
> from my perspective the way blank nodes identifiers are used in this example
> seems to create implementation pain. In particular the processing of a row
> in a table is not simply a function of that row. Rather, it must access the
> "global" list of what blank node identifier is used for each database value
> that is used as an FK to a non-PK. For this reason, the way we solved this
> problem at Revelytix was to use the data value itself to form the
> identifier. I think this applies whether an IRI or a blank node is used to
> identify the PK-less row.

That works until the lack of unique constraints reflects actual
redundancy in the data. This happens occasionally in base tables
but is pretty frequent in views which are used for aggregate metrics.


> 3 - At Revelytix we have found it useful to define two base URIs: one for
> ontology URIs and another for data

At first blush, I'd guess that the reason Revelytix finds this
ontology parameter useful is that you intend the structure to map into
an at least somewhat shared ontology. I believe the narrative around
the direct graph is that this shared ontology is the next step in the
tool chain.


> 3 - "all labels are generated by appending to a base." - I think someone
> else mentioned this already, but it seems referring to the IRIs as "labels"
> is confusing and we should use more precise words here.

But I'm searching for *less* precise words here.

<http://www.w3.org/TR/rdf-concepts/#section-data-model> says that RDF
is a graph. The fact that it has directional arrows with labels means
it's a Directed. Labeled Graph. The fact that the concepts doc doesn't
mention the labels on the nodes (it simply says they exists) is a
minor pain in my butt. Bringing the construction of these nodes into
the domain of discourse makes it difficult to not discuss the fact
that they do in fact provide the identifiers which adorn the graph.

<http://www.w3.org/TR/rdf-concepts/#section-URI-Vocabulary> implies
that DLG labels are called "names" in the RDF world:
[[
A blank node is a node that is not a URI reference or a literal. In
the RDF abstract syntax, a blank node is just a unique node that can
be used in one or more RDF statements, but has no intrinsic name.
]]
Should I use that?


> 3 - "the percent-encoded form of the column value" - This presupposes a text
> representation of the column value. Is it specified elsewhere how to get a
> text representation?

Good point, same is true for the column names etc.

SQL gives us a unicode version of each table or column name, as well
as the values (for e.g. equivalence testing). We need to work out the
related wording for both R2RML and this doc.

> 3 - "fresh blank node" - Personally, seems ok to me, but do we need more
> precise words for this?

I think "fresh" is the term of art in computer science, but maybe
there's something closer to the hearts of RDF modelers.

> 3 - "A (potentially unary)" - I encountered several places like this where I
> found the parens distracting.

Ditto Michael, but I'm pretty sure this is the text that minimizes the
opportunities for misinterpretation.
[[
A (potentially unary) list of column names in a table form a property
IRI
]]
Removing the ()s puts it more in the readers face than is necessary
for some readers (who are already aware that this is potentially
unary), but eliding it leaves the rest without a valuable reminder.


> 3 - "Definition property IRI:" At one point I found myself mis-reading this
> as a definition of the term "definition property IRI". The R2RML spec seems
> to define terms more clearly with a formatted construct like: "A _data
> error_ is a condition of the data in the"

Oops, forgot to commit a change to the stylesheet. Take a look now.

> A.1 - I think the English Syntax should be shown by default.

anyone want to second this?

> A & B - I stopped reading it closely, but (at the risk of stating the
> obvious and of stirring up previous compromises) it seems like an
> over-abundance of notations. Truly it is hard to tell how many of them there
> are and seems it will be challenging to keep them all in synch as the spec
> evolves. I would remove some of them.

I'm still unifying the set-builder notation and the set notation with
the current English. In principle, the English text can be identical
to the definition above, though then we lose the opportunity to
customize it to say e.g. "each row <r>" instead of "each row".

> Other thoughts, perhaps these have been addressed in past discussions
> already and I just don't know the answers:
> 
>  * do we need to say anything about how a direct mapping generator finds a
> database?

I think protocol and parameters are best left to tools and specs which
use the direct graph.

>  * do we need to say anything about which schema to map?

ditto.

>  * how about synonyms in the database? We have found this to be a pain point
> in practice.

Could you elaborate? I'm apparently not ignorant of their existence.

>  * does it need a mechanism for omitting the schema tables from the mapping?

ditto (-2 responses)

>  * I notice the spec is silent about case sensitivity of database
> identifiers. I suppose it is implied that the casing used in the database
> metadata is preserved?

I'd say that every way you can hand a schema to a DM tool will involve
some serialization and that will have one of ("FNAME", "fname",
"Fname", "fNaMe"...). I don't see sensitivity being an issue yet.
-- 
-ericP

Received on Wednesday, 10 August 2011 18:52:53 UTC