Re: Addressing ISSUE-47 (invalid and relative IRIs) from Richard Cyganiak on 2011-07-11 (public-rdb2rdf-wg@w3.org from July 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Mon, 11 Jul 2011 18:56:28 +0100
To: David McNeil <dmcneil@revelytix.com>
Cc: RDB2RDF WG <public-rdb2rdf-wg@w3.org>
Message-Id: <F0C41860-33D2-4159-A258-FAEC6D4D45C7@cyganiak.de>
On 11 Jul 2011, at 16:03, David McNeil wrote:
> Richard - I appreciate the ongoing discussion. I find it helpful, I hope it is useful to you as well.

Absolutely. Your input is improving the spec. More eyes and more perspectives are always helpful.

> What are your thoughts on a usage pattern where a mapping is defined, SPARQL is executed against the mapping, and only column values are returned. The resource identifiers are used internally by the query processing, but are not produced as an output. Is there any merit in relaxing the requirements for valid IRIs in this case? 

I see two difficulties with this. First, currently the output of the mapping is defined in terms of an RDF dataset, and this way of defining things isn't really compatible with the idea of having different queries essentially work on different graphs. Second, it would have some unpleasant consequences.

   SELECT ?name ?jobdesc WHERE {
       ?emp ex:name ?name .
       ?emp ex:jobdesc ?jobdesc .
   }

This would produce results even if the mapping makes broken IRIs for ?emp. But now if you add ?emp to the SELECT list, these results would disappear. I think this would be a problem.

> I think the data validator spec would need to accommodate usage modes other than the batch processing mode that you describe. If the triples defined by the mapping are not materialized, but queried as virtual triples, then it seems we should allow the data validator to be executed at query-time on the virtual triples accessed by the query. 

Having thought more about this, I think a better option might be to state that a conforming R2RML processor MUST check for data errors at startup time. Implementations might still decide to offer a --nocheck option that switches off the check at startup time for performance reasons. Running in this mode would make them non-conforming, but this is a trade-off that some might be willing to make. The good thing from the editors' point of view is that we wouldn't have to say whether a data error with --nocheck results in dropped triples or a runtime error.

> So column references would never percent-encode, templates always would. If the user wants to build a URI from pre-encoded parts they would define it as part of a logical table (i.e. a SQLQuery in the mapping), reference the resulting column in a term map, and R2RML would not attempt to re-encode the column.

Exactly.

> For cases (like the WordPress example) where snippets of URLs are pre-built in the database columns. This could mean the columns contain URL separators or they are already URL-encoded. These column values could be in the underlying data or in the columns of logical tables defined in the mapping itself. I think this is a valid usage pattern that we would need to support. If I understand your position you would say R2RML can accommodate this because the user can always define a SQL query to produce the IRI and thus avoid the automatic URL-encoding of the R2RML templates?

Exactly.

> I can understand this position, although personally I would prefer to define a way for the user to control the URL-encoding performed by templates. 

I'm not vehemently opposed to this, although I'd strongly prefer if the default were to do the encoding. And I'd strongly prefer if the other user choice would be to just *reduce* the set of characters that would be encoded, similar to URI templates, rather than turning it off completely. I don't see the use case for turning it off completely.

> However, I can see that we might declare this to be a post-R2RML-1.0 feature.

This would work for me.

>> We have two transformations when generating RDF using an rr:sqlQuery:
>> 
>>  Values in base table ==1==> values in logical table ==2==> RDF terms
>> 
>> rr:template and percent-encoding are part of step 2. Step 2 is designed so that it is always reversible given the information provided by the user.
>> 
>> rr:inverseExpression is only about reversing step 1.
>> 
> 
> Perhaps I am confused, but I am not able to match this description to my understanding of inverseExpression. 
> 
> If I have a mapping that produces IRIs like: http://John%20Smith from a database column with values like "John Smith", then I would expect to be able to write a SPARQL query that selects data for http://John%20Smith. Furthermore I would expect to be able to write an inverseExpression that allowed the R2RML processor to deconstruct http://John%20Smith and obtain the original data value of "John Smith".

So, assuming that we keep the current design where rr:column values are not percent-encoded, this would have to be done using a template such as:

rr:template "http://{name}"

Now if a column contains "John Smith", you'd get <http://John%20Smith> as the resulting IRI.
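To make the forward direction concrete, here is a minimal Python sketch of template expansion with percent-encoding. It is only an illustration: the function name expand_template and the choice to encode everything except unreserved characters are my assumptions, not wording from the spec.

```python
import re
from urllib.parse import quote

def expand_template(template: str, row: dict) -> str:
    """Fill each {column} placeholder with the percent-encoded column value.

    Hypothetical helper for illustration; not an API defined by R2RML.
    """
    def encode(match):
        value = str(row[match.group(1)])
        # safe="" means every character except letters, digits,
        # "-", ".", "_" and "~" is percent-encoded (an assumption
        # about which characters the template step would encode).
        return quote(value, safe="")
    return re.sub(r"\{([^{}]+)\}", encode, template)

# A row whose "name" column holds "John Smith"
print(expand_template("http://{name}", {"name": "John Smith"}))
```

The literal parts of the template are copied through untouched; only the column values are encoded.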

Now if you query for that IRI, the R2RML processor would have to figure out that this IRI could have been produced from the rr:template above, and would perform the reversal. First it would figure out that the "John%20Smith" part of the IRI could have come from {name}. Then it would percent-decode that, yielding "John Smith". Then it would use the inverseExpression (if any) to search for that value in the database. The inverseExpression would have to mention "{name}". That string would be replaced with "'John Smith'". If there is no inverseExpression defined, then it would use the trivial "name = 'John Smith'".
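
The reversal steps just described could be sketched roughly as follows in Python. This is a hedged illustration, not how any particular processor works: reverse_template, sql_literal and sql_condition are hypothetical names, the regular expression assumes that encoded values contain only unreserved characters and "%", and the sample inverse expression is made up.

```python
import re
from urllib.parse import unquote

def reverse_template(template, iri):
    """Return {column: decoded value} if the IRI matches the template, else None."""
    # re.split with a capture group alternates literal text and column names.
    parts = re.split(r"\{([^{}]+)\}", template)
    columns = parts[1::2]
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)           # literal part, matched as-is
        else:
            pattern += r"([A-Za-z0-9._~%-]+)"    # percent-encoded column value
    match = re.fullmatch(pattern, iri)
    if match is None:
        return None
    # Only the parts that came from {column} are percent-decoded.
    return {col: unquote(val) for col, val in zip(columns, match.groups())}

def sql_literal(value):
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"

def sql_condition(values, inverse_expression=None):
    """Build the search condition from decoded values."""
    if inverse_expression is None:
        # Trivial case: compare each column with its decoded value.
        return " AND ".join(f"{c} = {sql_literal(v)}" for c, v in values.items())
    for col, val in values.items():
        inverse_expression = inverse_expression.replace("{" + col + "}", sql_literal(val))
    return inverse_expression

values = reverse_template("http://{name}", "http://John%20Smith")
print(values)
print(sql_condition(values))
print(sql_condition(values, "UPPER(name) = UPPER({name})"))
```

With no inverse expression the condition is the trivial column comparison; with one, each "{column}" reference is replaced by the quoted, decoded value.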

You only ever need to write an inverseExpression if you use derived columns ("SELECT expression AS derived_column, ...") as your rr:column or in your rr:template. If you just use unmodified “base columns” that exist in the underlying base table, then you don't need an inverseExpression, even if you use rr:template. Any rr:templates are reversed automatically by the R2RML processor.

> When I write the inverseExpression it seems that I need to know which parts of the IRI were pre-URL-encoded (i.e. the data value in the database is URL-encoded) and which were URL-encoded by the mapping, so that I could URL-decode the right parts.

The processor matches the IRI against the rr:template to figure out which parts of the IRI were derived from a {column} in an rr:template. Then it automatically percent-decodes these parts. All other parts are left intact.

This is what you already get just by applying the steps in the spec in reverse order.

Best,
Richard
Received on Monday, 11 July 2011 17:57:09 UTC