Re: Proposal for the Direct Mapping

* Richard Cyganiak <richard@cyganiak.de> [2011-08-04 12:30+0100]
> On 3 Aug 2011, at 18:46, Eric Prud'hommeaux wrote:
> >>> The functions scalar and reference extract the scalar and reference
> >>> attributes (those participating in a foreign key) respectively:
> >> 
> >> Why does this have to be formulated as “functions”?
> > 
> > Is there a more intuitive way to say that there's an exact mapping from the input onto the outputs?
> > And isn't that exactly what an implementor wants to know?
> 
> Well, technically speaking, “first letter of a word” is a function that exactly maps from the input to the outputs. But defining a first-letter function for that would be silly. It's more intuitive to just call it the “first letter of the word”. I think the same applies here -- just make really clear what exactly a “non-foreign key column” is, and then call it a non-foreign key column.

Ahh, your issue is not that I picked "function" from the myriad of words to describe the mapping, but that I voiced it at all. Dropped, and getting the reader into the term space with this introductory text:

  An (addressable) SQL table has a set of uniquely-named columns and a
  set of foreign keys, each mapping a <column name list> to a <unique
  column list> (a list of columns in some table).


> >>> dfn scalars: the attributes in a table which are NOT in any foreign
> >>>  key.
> >> 
> >> How about: The non-foreign key columns of a table are the columns which are not in any foreign key.
> > 
> > Looking at it in-situ <http://www.w3.org/2001/sw/rdb2rdf/directMapping/EGP#defn-scalars>, I'm not convinced that the "definition X: X is..." redundancy will be helpful.
> 
> Ah, ok. Still, I don't like this much. I think the boxes around the definitions, with bits of text announcing the boxes, breaks up the flow of the text. I'm fine with a style that uses basically pairs of “$term: $definition” instead of “A $term is $definition”.

I'm pretty confident that the boxes help the vast majority of people, who can scan the blue text to build their model and scan up for the introductory text when some of the blue text appears to be, well, out of the blue. This is how I've observed people reading the SPARQL definition: blah blah blah


> >>> dfn references: the attributes in a table's foreign keys.
> >> 
> >> How about: The foreign key columns of a table are the columns which are in some foreign key.
> > 
> > ditto
> 
> Note my comments about naming those things. “non-foreign key columns” and “foreign key columns” seems easier on the reader than “scalars” and “references”.

I get it now (had thought it was a comment about the narrative).

The tricky bit is that each reference is a list of column names so references is a set of lists of column names (which matters when we create reference triples). Choices seem to be:
  • Introduce a new term, e.g. "references".
  • Introduce a new phrase, e.g. "foreign key column lists".
  • Expand in-place, e.g. "for each <column name list> in a table's foreign keys where the list has more than one column name and none of the corresponding column values are NULL:"

== new term ==
[[
Definition references: the set of <column name list>s in a table's foreign keys.
Definition literals: the columns in a table which are NOT the sole column in any foreign key.
…
Definition row graph: an RDF graph consisting of the following triples:
  • the row type triple.
  • a reference triple for each <column name list> in the list of references where none of the column values is NULL.     
  • a literal triple for each column in the list of literals where the column value is non-NULL.
]]

== new phrase ==
[[
Definition foreign key lists: the set of <column name list>s in a table's foreign keys.
Definition columns not in unary foreign keys: the columns in a table which are NOT the sole column in any foreign key.
…
Definition row graph: an RDF graph consisting of the following triples:
  • the row type triple.
  • a reference triple for each <column name list> in the list of foreign key lists where none of the column values is NULL.     
  • a literal triple for each column in the list of columns not in unary foreign keys where the column value is non-NULL.
]]

== expand in-place ==
[[
Definition references: the set of <column name list>s in a table's foreign keys.
Definition literals: the columns in a table which are NOT the sole column in any foreign key.
…
Definition row graph: an RDF graph consisting of the following triples:
  • the row type triple.
  • a reference triple for each <column name list> in the table's foreign keys and where none of the column values is NULL.     
  • a literal triple for each column in the table which is NOT the sole column in any foreign key and where the column value is non-NULL.
]]

Of the three, I find <new term> the hardest to misread.
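
To make the <new term> reading concrete, here is a minimal Python sketch of the two definitions and of which triples a row graph would get. It is only an illustration under my own assumptions: the function names, the representation of foreign keys as lists of column names (targets omitted), and the use of None for NULL are all hypothetical, and it only lists the triples rather than building IRIs.

```python
def references(foreign_keys):
    """The set of <column name list>s in a table's foreign keys."""
    return {tuple(cols) for cols in foreign_keys}

def literals(columns, foreign_keys):
    """Columns that are NOT the sole column in any foreign key."""
    unary = {cols[0] for cols in references(foreign_keys) if len(cols) == 1}
    return [c for c in columns if c not in unary]

def row_graph_plan(columns, foreign_keys, row):
    """List the triples of a row graph by kind (None stands in for NULL)."""
    plan = [("type",)]  # the row type triple is always present
    for cols in sorted(references(foreign_keys)):
        # a reference triple only when none of the column values is NULL
        if all(row[c] is not None for c in cols):
            plan.append(("reference", cols))
    for c in literals(columns, foreign_keys):
        # a literal triple only when the column value is non-NULL
        if row[c] is not None:
            plan.append(("literal", c))
    return plan
```

Note that a column taking part in a multi-column foreign key still counts as a literal under this wording; only the sole column of a unary foreign key is excluded.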


> >>> In the direct graph, there is an identifier for each row in a database
> >>> table. If the row is in a table with a primary key, this is formed
> >>> from the table name and the attribute names and values of each attribute
> >>> in the primary key. If there is no primary key for the table, the row
> >>> identifier is a fresh blank node:
> >>> 
> >>> dfn row identifier:
> >>> 
> >>>  if the table has a primary key with attributes, the relative IRI for
> >>>  the row identifier is the concatenation of the table name, '/', and
> >>>  a ','-separated concatenation of each attribute name, '=', and the
> >>>  attribute value.
> >>> 
> >>>  if the table has no primary key, the row identifier is a fresh blank
> >>>  node.
> >> 
> >> This doesn't need to be repeated twice. I'd call it row IRI for maximum clarity.
> > 
> > I'm not sure what's repeated.
> 
> Read it. The second two sentences of the initial paragraph say the same as the definitions, while omitting some details. Why?

Got it, I thought you meant that something was repeated between the dfns for row identifier and property iri.

I've reduced the introductory text to:
[[
There is either a blank node or IRI assigned to each row in a table:
]]

> > If you mean that there are two clauses, they deal with different cases.
> 
> Sure.
> 
> > Re: "row IRI", we could say that "row identifier" is either a "row IRI" or "row blank node". 
> 
> Good point. In that case, “row node” or “row RDF term”, because a blank node is not an identifier.

done

> > Proposed text?
> 
> The “row node” for a row is the following:
> 1. If the table has a primary key, then it is a relative IRI obtained by concatenating:
>    - the percent-encoded form of the table name,
>    - the slash character '/',
>    - for each column in the primary key, in order:
>         - the percent-encoded form of the column name,
>         - an equals character '=',
>         - the percent-encoded form of the attribute value,
>         - if it is not the last column in the primary key, a comma character ','
> 2. If the table has no primary key, then it is a fresh blank node that is unique to this row.

done
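
For implementors, the proposed construction could be sketched roughly as below. This is an illustration, not spec text: the function names, the dict representation of a row, and the "_:bN" blank node labels are my assumptions, and I am taking "percent-encoded" to mean encoding everything outside the unreserved characters.

```python
from itertools import count
from urllib.parse import quote

_bnode_ids = count()  # hypothetical source of fresh blank node labels

def pct(value):
    """Percent-encode everything except unreserved characters."""
    return quote(str(value), safe="")

def row_node(table_name, primary_key, row):
    """Return a relative IRI string, or a fresh blank node label.

    primary_key is the ordered list of primary key column names
    (empty when the table has no primary key); row maps column
    names to values.
    """
    if not primary_key:
        # no primary key: a fresh blank node unique to this row
        return f"_:b{next(_bnode_ids)}"
    pairs = ",".join(f"{pct(col)}={pct(row[col])}" for col in primary_key)
    return f"{pct(table_name)}/{pairs}"
```

Encoding '/', ',' and '=' inside names and values keeps the separators in the IRI unambiguous.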

> >>> A (potentially unary) list of attribute names in a table form a
> >>> property IRI:
> >>> 
> >>> dfn property IRI: the concatenation of the table name, '/', and a
> >>>  ','-separated concatenation of each attribute name, and a '#' at
> >>>  the end of the property IRI.
> >> 
> >> This doesn't need to be repeated one-and-a-half times.
> > 
> > The property IRI is simpler than the earlier definition (doesn't include column values).
> 
> Again, the words before the definition just repeat what's in the definition in slightly less words, without benefit.

You're really not down with this (introductory text, defn) pairing, are you?

I adopted your earlier proposed style for the concatenation definition.
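
A rough sketch of the property IRI under that style, for comparison with the row node. The function name is hypothetical, and applying percent-encoding to the table and column names is my assumption carried over from the row node proposal.

```python
from urllib.parse import quote

def property_iri(table_name, column_names):
    """Table name, '/', the ','-separated column names, then '#'.

    column_names is a (potentially unary) list of column names.
    """
    cols = ",".join(quote(c, safe="") for c in column_names)
    return f"{quote(table_name, safe='')}/{cols}#"
```

Unlike the row node, no column values are involved, so the same property IRI is shared by every row of the table.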


> >> This should use the standard SQL 2008 types, including BOOLEAN and BINARY string types. (Probably the Direct Mapping can re-use the outcome of R2RML ISSUE-48 here.)
> > 
> > Labeled as an issue. Have you incorporated that into R2RML (when there's not rr:datatype) so I can steal the text?
> 
> No, as ISSUE-48 still is under exploration.
> 
> >> I'd say, the table graph of a table is the union of the row graphs for each row.
> > 
> > If I understand this, it implies the definition of table graph which might then be defined in terms of row graphs. Is this your proposal?
> 
> Yes.

So the mapping is a function from a database instance to a graph. I believe your proposal is to not voice the existence of that mapping but instead leave it implicit in the definitions. I'm down with that. Note that this bubbles upwards as well as downwards; rev 1.4 prototypes defining the direct graph instead of the direct mapping. Too bad we already have the short name "directMapping" as "directGraph" would be better.
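
In code terms, defining graphs rather than a mapping might look like the sketch below. It assumes a graph is modelled as a Python set of (subject, predicate, object) tuples, and that the direct graph is the union of the table graphs; that last step is my guess at how the definition bubbles upwards, not settled text.

```python
def table_graph(row_graphs):
    """The table graph is the union of the row graphs for each row."""
    return set().union(*row_graphs)

def direct_graph(table_graphs):
    """Assumed here: the direct graph is the union of the table graphs."""
    return set().union(*table_graphs)
```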


> >>> dfn row mapping: using a row identifier S for the row,
> >>> the type triple:
> >>>   (S, rdf:type, <table type>)
> >>> plus the scalar triples:
> >>>   for each attribute in the list of <scalars> where the attribute
> >>>     value is non-NULL:
> >>>     (S,
> >>>      the <property IRI> for the attribute,
> >>>      the <literal map> for the attribute value).
> >>> plus the reference triples:
> >>>   for each list of attributes in the <non-unary references> where none
> >>>     of the attribute values are NULL:
> >>>     (S,
> >>>      the <property IRI> for the attributes,
> >>>      the <row identifier> for the referenced triple)
> >>> ]]
> >> 
> >> I'd decompose this a bit: The row graph of a row is a graph consisting of the following triples:
> >> - the row type triple
> >> - a data triple for each non-foreign key column where the data value is non-null
> >> - a reference triple for each foreign key column ...
> >> 
> >> And then:
> >> 
> >> The row type triple of a row is an RDF triple with the following components:
> >> - subject: the row IRI of the row
> >> - predicate: rdf:type
> >> - object: the table class IRI of the row's table
> >> 
> >> et cetera.
> > 
> > I worked from this angle for a bit, but the challenging thing was ensuring the same subject without introducing some sort of hand-waving about "the current subject" or some such.
> > Recall that the containing table may not have a primary key (or even any candidate keys).
> 
> Just say that the subject is the “row node” of the row. “Row node” is a hyperlink to the place where “row node” is defined, see above. (I don't object to introducing “local variables”, but if “row node of the row” is already defined then why not just refer to that.)

Modulo some xmlspec XSLT work to fix up the references, I've prototyped this in rev 1.4.


> Best,
> Richard

-- 
-ericP

Received on Friday, 5 August 2011 14:26:20 UTC