RDB2RDF Teleconference -- 15 May 2012

<nunolopes> Zakim, ??P26 is me

<dmcneil> I can do it

<Ashok> scribenick: David

<MacTed> scribenick: dmcneil

1. Admin

PROPOSAL: Accept the minutes of last meeting http://www.w3.org/2012/05/08-RDB2RDF-minutes.html

minutes accepted

2. Implementability for tables w/o primary key

where we were: we spoke last time about what to do

one thing we spoke about was writing some text describing the disconnect between the DM and the R2RML

there was a sense that we should bite the bullet and address this issue by extending R2RML

Richard & Eric wrote up a proposal

<cygri> proposal was here: http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2012May/0075.html

and he thought we had reasonable support for the proposal

but, then David argued that we should do nothing because you can use vendor-specific SQL in R2RML to accomplish the goal of mapping tables without primary keys

<cygri> dmcneil: core issue is that vendor-specific sql is needed

<cygri> ... i argue best approach is to do it with views

<cygri> ... because that puts the burden on the DM implementer

<Ashok> David: I think the problem can be done with views, so no change is needed

<cygri> ... whatever SQL query would be needed to actually compute the DM, just write that into a view

<juansequeda> Which vendor does not have a generate_series/rownum function?

<cygri> ... example, postgres has a function for computing sequences. same with other DBs

<cygri> ... then the usual R2RML mechanisms can be used to generate blank nodes
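Illustrative sketch, not from the call: the view-based approach described above can be tried out with Python's built-in sqlite3 module. SQLite's implicit rowid stands in here for whatever vendor-specific row-numbering feature a given database offers. The table name follows the Wonderland example used later in the discussion; the view name, column names, and blank node label format are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A table without a primary key that contains a duplicate row.
conn.execute("CREATE TABLE Wonderland (fname TEXT, lname TEXT)")
conn.executemany(
    "INSERT INTO Wonderland VALUES (?, ?)",
    [("Alice", "Liddell"), ("Alice", "Liddell"), ("Bob", "Smith")],
)

# Vendor-specific part: SQLite exposes an implicit rowid for ordinary tables.
# The view makes that identifier visible as a regular column (name assumed).
conn.execute(
    "CREATE VIEW Wonderland_numbered AS "
    "SELECT rowid AS row_id, fname, lname FROM Wonderland"
)


def row_triples(connection):
    """Emit (subject, predicate, object) tuples with one blank node per row."""
    triples = []
    query = "SELECT row_id, fname, lname FROM Wonderland_numbered ORDER BY row_id"
    for row_id, fname, lname in connection.execute(query):
        node = f"_:row{row_id}"  # blank node label built from the row identifier
        triples.append((node, "fname", fname))
        triples.append((node, "lname", lname))
    return triples


triples = row_triples(conn)
subjects = {subject for subject, _, _ in triples}
```

Because the two identical rows receive different row identifiers, they map to different blank nodes and the cardinality of the table is preserved in the output.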

<cygri> ashok: so we need to find out whether all the DBs have some mechanism like that?

<Ashok> Need to find out whether SQL Server and DB2 have such a function

<cygri> dmcneil: suppose DB2 doesn't have it. how does hiding this behind RowBlankNode help then?

cygri: one concern is that R2RML is defined in terms of Core SQL 2008
... we acknowledged that people would use vendor-specific SQL
... the way we dealt with that is that if you want to use a db-specific dialect of SQL then you are defining a vendor-specific extension of R2RML
... taking that to its logical conclusion, then we are saying that R2RML as specified cannot be used to implement the Direct Mapping
... since the general assumption is it is not possible to support this case with generic SQL
... so the user must immediately extend R2RML for this case

scribe: this is not a particularly pleasant situation

ashok: neither of these is perfect

dmcneil: I think that is a specious argument, because we expect that most mappings will have vendor-specific SQL in them

cygri: correction, by default embedded SQL is expected to be SQL CORE 2008

macted: if we did that (not sure what "that" references) it was a serious mistake

<cygri> http://www.w3.org/2001/sw/rdb2rdf/r2rml/#conformance

<Ashok> This specification defines R2RML for databases that conform to Core SQL 2008, as defined in ISO/IEC 9075-1:2008 [SQL1] and ISO/IEC 9075-2:2008 [SQL2]. Processors and mappings may have to deviate from the R2RML specification in order to support databases that do not conform to this version of SQL.

from the spec "The absence of a SQL version identifier indicates that no claim to Core SQL 2008 conformance is made."

MacTed: suggests that he thought we had agreement last week

discussion of who dissented

cygri: eric strongly believes that preserving cardinality is important

macted: let them choose the DM variant that preserves it then

cygri: the argument was made that the cardinality preserving option is more correct
... I argued that that is just an option, implementations are free to implement it
... was also ok with putting in warnings about the non-cardinality preserving option
... from last week we said if we had more time we would define a way to make this work in R2RML
... the point was raised that for backwards compatibility we cannot remove it later
... suggested wording that "we might remove this in the future" was not well received
... also, they don't want options in the DM, just a single monolithic approach

ashok: last week, we spoke about some text that ivan had crafted
... are you speaking about that, or a previous position they had

cygri: I am speaking about some text that I drafted
... ivan drafted some text proposing no change but saying they are incompatible

<cygri> http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2012May/0054.html

<Ashok> PROPOSAL A: In the DM spec, replace the following text: [[ If the table has no primary key, the row node is a fresh blank node that is unique to this row. ]] with this: [[ If the table has no primary key, the row node is a blank node. Distinct blank nodes MUST be generated for rows with distinct column values. For duplicate rows with identical values, implementations SHOULD generate a fresh blank for each duplicate row (resulting in a non-lean RDF graph [R

<juansequeda> I still support Proposal A :)

<Ashok> o In the DM, instead of "is intended to provide a default behavior for R2RML: RDB to RDF Mapping Language" say "is intended to provide a default behavior for R2RML: RDB to RDF Mapping Language for tables which have at least one unique key" o Add to the R2RML document (probably in the intro part): "R2RML implementations are encouraged to provide a default mapping equivalent to the Direct Mapping for tables which have at least one unique key" o Add a Note

cygri: last week we said we would explore adding something to R2RML
... then the objection from david, which i think is reasonable, I can see where he is coming from
... did eric come up with a use case?

<cygri> http://www.w3.org/2001/sw/rdb2rdf/wiki/Non-unique_Tables

ashok: yes, it is on the wiki, but it is a bit complicated
... do you support one of these two proposals

dmcneil: i think, that other than doing nothing, adding a rownum column to R2RML is the most interesting

cygri: the problem with that approach is that it leads too much to a particular implementation
... implementations could choose something more efficient than rownum
... for example, this mysql query pasted (not sure which query) numbers the rows, but doesn't use rownum

<Souri_> how about rr:genRowId ?

cygri: since rownum forces a particular approach that requires the user to spell out what the blank node identifier looks like

<Ashok> In a ROWNUM capable DB, the mapping processor implicitly converts it to the following R2RML mapping (the actual implementation may vary from DB to DB based upon how the equivalent of ROWNUM can be implemented) <Tmap1> rr:logicalTable [ rr:sqlQuery """ Select ROWNUM AS "rr:rownum", t.* from Wonderland t order by "rr:rownum" """ ] rr:subjectMap [ rr:template "http://Wonderland/my_rownum={\"rr:rownum\"}" ] We can also say that rr:rownum cannot be used when log

<cygri> A SQL query is a SELECT query in the SQL language that can be executed over the input database. The string must conform to the production <direct select statement: multiple rows> in [SQL2]

dmcneil: since SQL defaults to not be SQL 2008 in an R2RML view, this means adding an R2RML view by default makes a vendor specific mapping

cygri: actually the SQL version identifier doesn't affect the processing at all, per the spec

souri: regarding the rownum discussion
... for every row we need a unique id
... if we knew the target database, then we could write vendor-specific SQL
... but, if we present that to the user, then the user may not understand that and it is not portable to other database backends
... therefore we want a logical representation
... if "rownum" has too much meaning with it
... then we can use "rowidentifier" or something
... this could be used in blank node generation or URI generation
... we need one R2RML construct, which provides a point of indirection for whatever vendor-specific mechanism it will be translated to
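Illustrative sketch, not from the call: the "point of indirection" souri describes could look roughly like the following, where one logical row identifier is translated into a dialect-specific SQL expression. The function name, the dialect table, and the column alias row_identifier are hypothetical. Only the SQLite variant is executed here; the Oracle and PostgreSQL expressions are shown as plausible translations and are not exercised by this sketch.

```python
import sqlite3

# Hypothetical translation table: logical "row identifier" -> vendor SQL.
ROW_ID_EXPRESSIONS = {
    "oracle": "ROWNUM",
    "postgresql": "row_number() OVER ()",
    "sqlite": "rowid",
}


def logical_table_query(table, dialect):
    """Build the query a processor might run for a table without a key."""
    expression = ROW_ID_EXPRESSIONS[dialect]
    return f"SELECT {expression} AS row_identifier, t.* FROM {table} t"


# Try the SQLite translation against an in-memory table with duplicate rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Wonderland (fname TEXT, lname TEXT)")
conn.executemany(
    "INSERT INTO Wonderland VALUES (?, ?)",
    [("Alice", "Liddell"), ("Alice", "Liddell")],
)
rows = conn.execute(logical_table_query("Wonderland", "sqlite")).fetchall()
```

The mapping author only ever sees the logical name. Whether the generated values stay the same between queries depends on the database, which is the stability question raised later in the discussion.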

ashok: isn't the ability to add a column that gets its values from a function part of SQL 2008?

souri: not sure, but this is very common

ashok: so we could add a column in the SQL query that gives distinct numbers

souri: a sequence generator, right?
... in Oracle, access to the sequence generator is a DML operation

cygri: regarding souri's point about semantics of the pseudo-rownum column
... agree all we need is an identifier
... understand that calling it rownum, does not mean it must be literally the db's rownum
... just one tiny step from that to the RowBlankNode proposal
... we leave it completely up to the implementation what the blank node template is

scribe: since blank node identifiers have no semantics, they just must be unique
... so i can't see the usefulness of letting the user put the rownum into a template

dmcneil: we worked out how regular blank node ids work, we based that on whether the blank node template produces the same value
... would need that same capability for these new RowBlankNode things

MacTed: the question comes down to doing a dump of the data

<Zakim> cygri, you wanted to mention jena api

dmcneil: I am talking about within the context of a query, not between queries

cygri: ted, what you say is completely correct for SPARQL queries
... can't tell if the blank node ids between queries are the same resource or not
... per the spec
... but, looking at the jena api
... must look at constraints of jena api
... which expects blank nodes to have a persistent identity
... this leads to the need to go back to the database and grab properties for a specific blank node

ted: once you go back to the database, you cannot rely on the blank nodes IDs

yes

cygri: the RDF working group has been arguing about this for a year

several people said they don't want to talk about blank node ID

dmcneil: this is relevant

<cygri> dmcneil, i believe it is all worked out in the proposal. at least i thought about it hard. i need to answer your email, sorry i didn't get around to do that yet

dmcneil: because we worked out the blank node ID semantics carefully for the existing mechanism, not so much for the new RowBlankNode

ashok: we have two options on the table, how to proceed

<Souri_> Is that all the disagreement about? Nothing else?

dmcneil: there is a third option, let the DM use R2RML views to implement this

ashok: I like that option, but I thought richard disagreed

cygri: my position is: it can be done, but it cannot be done in a way that conforms with the spec
... because it requires vendor-specific SQL
... it is going to be slow, and no way to make it fast

ashok: some would argue that since SQL is such a sprawling spec, anytime someone writes SQL they are using vendor-specific SQL

cygri: if the argument is that the DM can be implemented on R2RML, then the question is how?

dmcneil: the DM is implemented on a specific database, so the DM generates an R2RML view that uses that database's features to implement the DM

cygri: that leads to very inefficient implementations

<Souri_> Identical rows: We do not care whether they are assigned Ids <1,2,...,n> or <n,...,2,1> (because their contents are identical)

cygri: stable identifiers are needed so the next query produces the same identifiers

macted: you are not going to get it
... the database is free to change the ids

ashok: saying the identifiers are stable is going beyond the spec
... three choices:
... 1) do nothing

<Souri_> Two rows <a, b, c1> and <a, b,c2> may be assigned Ids _:b1 and _:b2 during one access and _:b2 and _:b1 during another access. Is this acceptable or not?

ashok: 2) Richard's proposal
... 3) Souri's proposal


<cygri> souri, yes it is. but if they are _:b3 and _:b4 that's not acceptable

scribe: how do we come to agreement?

macted: do nothing is not an option because currently there is a "must" in there

cygri: "do nothing" means
... DM and R2RML are two entirely separate beasts
... just a violation of the basic premise of the DM

eric: it could still be the default behavior for all but the case of duplicate keys

souri: still trying to understand the requirement
... does the blank node ID need to be stable, or can it change?

cygri: yes, they can be scrambled but they cannot change to a different set of blank node IDs

macted: what!?

<MacTed> _blank1 a, b, c1

<MacTed> _blank2 a, b, c2

<MacTed> _blank2 a, b, c1

<MacTed> _blank1 a, b, c2

cygri: oh, then i misunderstood the example
... still the case without a primary key, right?
... in this case it would have to be a stable blank node label across queries

souri: inside the same translation, you may access the table twice, in two places
... the access order may be different
... based on that you may not be able to join the same rows
... so this even applies within the scope of the same query process

macted: how is this relevant?

souri: if we generate a blank node ID for a row
... then the same row from different parts of a query should generate the same blank node ID so they can be joined

cygri: yes, that is what is required, and that is part of what makes it so difficult
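Illustrative sketch, not from the call: the join problem souri describes can be reproduced in a few lines. Labels are assigned by scan position, which is a deliberately naive scheme assumed for illustration, and the same two rows are scanned in different orders. The rows are the <a, b, c1> and <a, b, c2> example from earlier in the discussion; the function names are hypothetical.

```python
def label_by_position(rows):
    """Naive scheme: the n-th row returned by a scan gets label _:bn."""
    return {f"_:b{i}": row for i, row in enumerate(rows, start=1)}


def mismatches(left_scan, right_scan):
    """Labels that denote different rows in the two scans."""
    return [label for label in left_scan if left_scan[label] != right_scan[label]]


rows = [("a", "b", "c1"), ("a", "b", "c2")]

# Two accesses to the same table within one query, in different orders.
first_scan = label_by_position(rows)
second_scan = label_by_position(list(reversed(rows)))
unstable = mismatches(first_scan, second_scan)

# Imposing a deterministic order before labelling makes both scans agree.
first_sorted = label_by_position(sorted(rows))
second_sorted = label_by_position(sorted(reversed(rows)))
stable = mismatches(first_sorted, second_sorted)
```

With position-based labels, a join on the blank node label pairs the c1 row with the c2 row. Sorting on all column values removes the mismatch for distinct rows, and for fully identical rows the assignment order does not matter, since their contents are the same.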

macted: my model was that the blank node IDs are generated on the result set, not during the query

<Souri_> {?p :fname ?fnm} ... complex stuff ... {?p :lname ?lnm}

macted: this group is not about translating SPARQL to SQL

cygri: the RDF concepts doc says "you don't know anything about blank nodes except whether they are the same"

macted: but that applies to query results, not the underlying data

<cygri> ericP: counter examples: jena, 4store, …

<MacTed> +1 ericP

dmcneil: I think Richard's earlier statement that "doing nothing means DM and R2RML are completely separate" is quite overstated

souri: I am still trying to understand the target for generating blank node IDs

macted: for DM we need to maintain cardinality
... i.e. every row in the result set

eric: the value of a bnode is not something that can be referenced later
... the jena API over a SPARQL endpoint does not allow bnode IDs to be submitted again in subsequent queries
... so it is ok that Jena over SQL does not provide persistent bnode IDs

ashok: we are over time
... it is not clear how to make progress
... need either new proposals, or someone to change their position

juan: can we summarize the current options and who supports them?

ashok: if we only talk about changes to R2RML, then yes there are 3 options
... 1) do nothing
... 2) Richard's idea - add RowBlankNode
... 3) Souri's idea - add pseudo-column: "rowidentifier"
... 4) add wording saying "the DM is different in this special case"

macted: I am less clear on these options than when we started

<cygri> 2) and 3) are variations of “fix R2RML”

<cygri> 4) is proposal A from last time

<cygri> 1) is B from last time

macted: what happened to last week's proposal to strike the word "should"

cygri: that is now option 4

macted: i still like 4

juan: me too

<MacTed> >>> [[

<MacTed> >>> If the table has no primary key, the row node is a blank node. Distinct blank nodes MUST be generated for rows with distinct column values. For duplicate rows with identical values, implementations SHOULD generate a fresh blank for each duplicate row (resulting in a non-lean RDF graph [RDF Semantics]). However, if the underlying database system does not provide any means to reliably differentiate among the rows, then

<MacTed> implementations MAY re-use the same blank node for multiple duplicate rows (resulting in a lean RDF graph). Implementations SHOULD document and advertise their chosen behavior.

<MacTed> >>> ]]

<MacTed> The above replaces the following sentence in the current DM spec --

<MacTed> >>> [[

<MacTed> >>> If the table has no primary key, the row node is a fresh blank node that is unique to this row.

<MacTed> >>> ]]
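Illustrative sketch, not from the call: the two behaviors allowed by the proposed wording can be compared on a small table without a primary key. With a fresh blank node per duplicate row the resulting graph is non-lean and keeps the row count; with a re-used blank node the duplicates collapse. The function names, label format, and column names are assumptions made for illustration.

```python
def row_nodes(rows, fresh_per_duplicate=True):
    """Return one blank node label per row of a table without a primary key."""
    labels = []
    seen = {}  # row values -> label, used only when duplicates share a node
    for index, row in enumerate(rows):
        if fresh_per_duplicate:
            labels.append(f"_:r{index}")
        else:
            labels.append(seen.setdefault(row, f"_:r{len(seen)}"))
    return labels


def to_graph(rows, columns, fresh_per_duplicate=True):
    """Build a set of (subject, predicate, object) triples for the rows."""
    graph = set()
    for label, row in zip(row_nodes(rows, fresh_per_duplicate), rows):
        for column, value in zip(columns, row):
            graph.add((label, column, value))
    return graph


rows = [("a", "b"), ("a", "b"), ("c", "d")]
columns = ("x", "y")

non_lean = to_graph(rows, columns, fresh_per_duplicate=True)
lean = to_graph(rows, columns, fresh_per_duplicate=False)
```

In both variants, rows with distinct column values get distinct blank nodes, as the MUST in the proposed text requires. Only the handling of the duplicate row differs, which is the cardinality question under discussion.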

eric: I object to losing cardinality on the basis of something that R2RML cannot do

ashok: would you be willing to word-smith it?

eric: no, because I disagree with the premise of losing cardinality
... from the DM's perspective there is no reason the MUST should be relaxed to SHOULD

<cygri> i don't know how to handle it *with acceptable performance*, i should say

eric: why are we breaking interop on the DM because R2RML cannot handle this case?

dmcneil: but R2RML can handle it, use R2RML views

eric: why don't we just tell R2RML users that they are losing cardinality in these cases?
... there is no issue in DM, the issue is an interop issue in R2RML

cygri: I was working on the assumption that the DM is a default mapping for R2RML
... that should answer the question of why I expect the DM to accommodate the capabilities of R2RML
... if there are restrictions in R2RML, which is 1.0, then...

eric: we have a case where R2RML cannot preserve cardinality
... we have a DM which provides default mapping for R2RML
... the places where R2RML cannot preserve cardinality, should be identified as the places where problems will occur

cygri: R2RML as it stands, the user specifies the identities of the rows

scribe: by specifying the columns or the templates
... if they lose cardinality they lose it because of how it was mapped, it is transparent

scribe: the other point is: what do you suggest for me as an R2RML implementor
... push a button and get an automatic mapping
... what should that mapping be in the case of a table without primary keys

eric: it should be what it is, just not promise to be the DM

cygri: how to communicate to users that it is not the DM?

eric: tell them R2RML does not have the ability to preserve cardinality in this case

cygri: how should we describe the default mapping we implement?

eric: say "it is similar to the DM except repeated rows will be collapsed into one"

cygri: can we write that into the R2RML spec?

eric: yes, that would be good

cygri: so remove the R2RML reference from DM
... instead add a sentence to the R2RML spec
... saying the default mapping is the "DM - repeated rows"

ashok: why remove R2RML ref from DM?

<cygri> The Direct Mapping is intended to provide a default behavior for http://www.w3.org/TR/2012/CR-r2rml-20120223/ [R2RML].

cygri: there is a sentence in DM saying DM is default behavior for R2RML
... that sentence must go

eric: we could add a caveat to it

ashok: eric can you work with richard and david on this?

eric: yes, i think the wording is close to what ivan proposed

cygri: it would also have to address the repeated rows caveat

seems to be consensus that we will try to develop wording around this

ashok: we will try to work it out in email

thanks :)
