Re: Proposal for the Direct Mapping from Richard Cyganiak on 2011-08-04 (public-rdb2rdf-wg@w3.org from August 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Thu, 4 Aug 2011 12:30:02 +0100
To: Eric Prud'hommeaux <eric@w3.org>
Cc: rdb2RDF WG <public-rdb2rdf-wg@w3.org>
Message-Id: <38DFCC08-D903-498F-A952-9065EACBA7DA@cyganiak.de>

On 3 Aug 2011, at 18:46, Eric Prud'hommeaux wrote:
>>> The functions scalar and reference extract the scalar and reference
>>> attributes (those participating in a foreign key) respectively:
>> 
>> Why does this have to be formulated as “functions”?
> 
> Is there a more intuitive way to say that there's an exact mapping from the input onto the outputs?
> And isn't that exactly what an implementor wants to know?

Well, technically speaking, “first letter of a word” is a function that exactly maps from the input to the outputs. But defining a first-letter function for that would be silly. It's more intuitive to just call it the “first letter of the word”. I think the same applies here -- just make really clear what exactly a “non-foreign key column” is, and then call it a non-foreign key column.

>>> dfn scalars: the attributes in a table which are NOT in any foreign
>>>  key.
>> 
>> How about: The non-foreign key columns of a table are the columns which are not in any foreign key.
> 
> Looking at it in-situ <http://www.w3.org/2001/sw/rdb2rdf/directMapping/EGP#defn-scalars>, I'm not convinced that the "definition X: X is..." redundancy will be helpful.

Ah, ok. Still, I don't like this much. I think the boxes around the definitions, with bits of text announcing the boxes, break up the flow of the text. I'm fine with a style that uses basically pairs of “$term: $definition” instead of “A $term is $definition”.

>>> dfn references: the attributes in a table's foreign keys.
>> 
>> How about: The foreign key columns of a table are the columns which are in some foreign key.
> 
> ditto

Note my comments about naming those things. “non-foreign key columns” and “foreign key columns” seem easier on the reader than “scalars” and “references”.
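
To illustrate that the two terms need nothing more than a plain membership test, here is a rough Python sketch. The function names and the representation (a list of column names, plus foreign keys given as tuples of column names) are assumptions made up for this example, not anything from the draft.

```python
def foreign_key_columns(columns, foreign_keys):
    """Columns of a table that are in some foreign key, in table order.

    Assumed representation: `columns` is a list of column names and
    `foreign_keys` is a list of tuples of column names.
    """
    in_some_fk = {col for fk in foreign_keys for col in fk}
    return [col for col in columns if col in in_some_fk]


def non_foreign_key_columns(columns, foreign_keys):
    """Columns of a table that are not in any foreign key, in table order."""
    in_some_fk = {col for fk in foreign_keys for col in fk}
    return [col for col in columns if col not in in_some_fk]


# Hypothetical table with one single-column and one two-column foreign key.
columns = ["id", "name", "dept_id", "office_building", "office_room"]
foreign_keys = [("dept_id",), ("office_building", "office_room")]

print(foreign_key_columns(columns, foreign_keys))
print(non_foreign_key_columns(columns, foreign_keys))
```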

>>> In the direct graph, there is an identifier for each row in a database
>>> table. If the row is in a table with a primary key, this is formed
>>> from the table name and the attribute names and values of each attribute
>>> in the primary key. If there is no primary key for the table, the row
>>> identifier is a fresh blank node:
>>> 
>>> dfn row identifier:
>>> 
>>>  if the table has a primary key with attributes, the relative IRI for
>>>  the row identifier is the concatenation of the table name, '/', and
>>>  a ','-separated concatenation of each attribute name, '=', and the
>>>  attribute value.
>>> 
>>>  if the table has no primary key, the row identifier is a fresh blank
>>>  node.
>> 
>> This doesn't need to be repeated twice. I'd call it row IRI for maximum clarity.
> 
> I'm not sure what's repeated.

Read it. The last two sentences of the initial paragraph say the same as the definitions, while omitting some details. Why?

> If you mean that there are two clauses, they deal with different cases.

Sure.

> Re: "row IRI", we could say that "row identifier" is either a "row IRI" or "row blank node". 

Good point. In that case, “row node” or “row RDF term”, because a blank node is not an identifier.

> Proposed text?

The “row node” for a row is the following:
1. If the table has a primary key, then it is a relative IRI obtained by concatenating:
   - the percent-encoded form of the table name,
   - the slash character '/',
   - for each column in the primary key, in order:
        - the percent-encoded form of the column name,
        - an equals character '=',
        - the percent-encoded form of the attribute value,
        - if it is not the last column in the primary key, a comma character ','
2. If the table has no primary key, then it is a fresh blank node that is unique to this row.
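
As a sanity check, the steps above can be sketched in Python roughly as follows. The function names, the dict-based row, and the choice of which characters to percent-encode are assumptions of this sketch; the proposed text does not pin down the encoding rules.

```python
import uuid
from urllib.parse import quote


def percent_encode(value):
    # Assumption: encode everything except letters, digits and "_.-~".
    # The exact set of characters to encode is not settled in the text above.
    return quote(str(value), safe="")


def row_node(table_name, primary_key, row):
    """Return the row node for `row` as a string.

    `primary_key` is the ordered list of primary key column names (empty if
    the table has no primary key); `row` maps column names to values.
    """
    if not primary_key:
        # No primary key: a fresh blank node. The caller has to keep this
        # value around and reuse it for every triple about the same row.
        return "_:row" + uuid.uuid4().hex
    pairs = [
        percent_encode(col) + "=" + percent_encode(row[col])
        for col in primary_key
    ]
    # Table name, slash, then comma-separated name=value pairs.
    return percent_encode(table_name) + "/" + ",".join(pairs)


print(row_node("Emp Table", ["id", "dept"], {"id": 7, "dept": "R&D"}))
# prints: Emp%20Table/id=7,dept=R%26D
```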

>>> A (potentially unary) list of attribute names in a table form a
>>> property IRI:
>>> 
>>> dfn property IRI: the concatenation of the table name, '/', and a
>>>  ','-separated concatenation of each attribute name, and a '#' at
>>>  the end of the property IRI.
>> 
>> This doesn't need to be repeated one-and-a-half times.
> 
> The property IRI is simpler than the earlier definition (doesn't include column values).

Again, the words before the definition just repeat what's in the definition in slightly fewer words, without benefit.

>> This should use the standard SQL 2008 types, including BOOLEAN and BINARY string types. (Probably the Direct Mapping can re-use the outcome of R2RML ISSUE-48 here.)
> 
> Labeled as an issue. Have you incorporated that into R2RML (when there's not rr:datatype) so I can steal the text?

No, as ISSUE-48 is still under exploration.

>> I'd say, the table graph of a table is the union of the row graphs for each row.
> 
> If I understand this, it implies a definition of table graph, which would then be defined in terms of row graphs. Is this your proposal?

Yes.
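
In code terms this is just a set union. A minimal Python sketch, assuming a row graph is represented as a set of (subject, predicate, object) tuples; the function name and the sample triples are made up for illustration:

```python
def table_graph(row_graphs):
    """Union of the row graphs of all rows in a table.

    Assumed representation: each row graph is a set of
    (subject, predicate, object) tuples.
    """
    graph = set()
    for row_graph in row_graphs:
        graph |= row_graph
    return graph


# Two hypothetical row graphs for a table "People".
row_1 = {("People/ID=7", "People/name#", "Bob")}
row_2 = {("People/ID=8", "People/name#", "Sue")}

print(len(table_graph([row_1, row_2])))
```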

>>> dfn row mapping: using a row identifier S for the row,
>>> the type triple:
>>>   (S, rdf:type, <table type>)
>>> plus the scalar triples:
>>>   for each attribute in the list of <scalars> where the attribute
>>>     value is non-NULL:
>>>     (S,
>>>      the <property IRI> for the attribute,
>>>      the <literal map> for the attribute value).
>>> plus the reference triples:
>>>   for each list of attributes in the <non-unary references> where none
>>>     of the attribute values are NULL:
>>>     (S,
>>>      the <property IRI> for the attributes,
>>>      the <row identifier> for the referenced triple)
>>> ]]
>> 
>> I'd decompose this a bit: The row graph of a row is a graph consisting of the following triples:
>> - the row type triple
>> - a data triple for each non-foreign key column where the data value is non-null
>> - a reference triple for each foreign key column ...
>> 
>> And then:
>> 
>> The row type triple of a row is an RDF triple with the following components:
>> - subject: the row IRI of the row
>> - predicate: rdf:type
>> - object: the table class IRI of the row's table
>> 
>> et cetera.
> 
> I worked from this angle for a bit, but the challenging thing was ensuring the same subject without introducing some sort of hand-waving about "the current subject" or some such.
> Recall that the containing table may not have a primary key (or even any candidate keys).

Just say that the subject is the “row node” of the row. “Row node” is a hyperlink to the place where “row node” is defined, see above. (I don't object to introducing “local variables”, but if “row node of the row” is already defined then why not just refer to that?)
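
To show that this works without any notion of a “current subject”, here is a rough Python sketch in which every triple definition simply takes the row node of the row as its subject. All names are hypothetical. The row node is computed once per row and passed in, which also covers the blank node case. The sketch leaves out reference triples, and it uses raw column values as objects where the draft would apply its literal mapping.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def row_type_triple(row_node, table_class_iri):
    # subject: row node, predicate: rdf:type, object: table class IRI
    return (row_node, RDF_TYPE, table_class_iri)


def data_triples(row_node, row, non_fk_columns, property_iri):
    # One triple per non-foreign key column whose value is non-NULL.
    # `property_iri` is a caller-supplied function from column name to IRI.
    return [
        (row_node, property_iri(col), row[col])
        for col in non_fk_columns
        if row[col] is not None
    ]


def row_graph(row_node, table_class_iri, row, non_fk_columns, property_iri):
    # Reference triples are omitted in this sketch.
    graph = {row_type_triple(row_node, table_class_iri)}
    graph.update(data_triples(row_node, row, non_fk_columns, property_iri))
    return graph


# Hypothetical row; the row node would come from the "row node" definition.
node = "Emp/id=7"
row = {"id": 7, "name": None, "city": "Bristol"}
graph = row_graph(node, "Emp", row, ["id", "name", "city"],
                  lambda col: "Emp/" + col + "#")
print(sorted(graph, key=str))
```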

Best,
Richard
Received on Thursday, 4 August 2011 11:30:46 UTC