RDB2RDF Working Group Teleconference -- 22 Oct 2009

<trackbot> Date: 22 October 2009

me dialing in

<iv_an_ru> I'm trying to connect...

<scribe> scribe: hhalpin

"WG Tools"

"Convene RDB2RDF Meeting"

<Souri> Are we doing a roll call?

<mhausenblas> yes, Souri

Ahmed is doing a roll call.

<iv_an_ru> wow, I've got the connection!

PROPOSED: to approve RDB2RDF Weekly -- 15th October 2009 as a true record

<mhausenblas> +1

(note that I fixed Angela's name!)

<angela_UNITN> thanks

http://www.w3.org/2009/10/15-RDB2RDF-minutes.html

(need another +1)

<MacTed> Orri will be here momentarily. timezone mixup.

RESOLUTION: approved RDB2RDF Weekly -- 15th October 2009 as a true record

RESOLUTION: RDB2RDF Weekly meets next Oct 29th.

<mhausenblas> Michael: background about DST see http://www.timeanddate.com/time/aboutdst.html

RESOLUTION: To meet at the same time for the next 2 weeks to prevent daylight saving errors, and then adopt whatever time the poll shows is best.

mhausenblas: Everyone tried wiki?
... your normal w3c login should work
... you can put your thoughts there

hhalpin: any first?

<Souri> Do you have the URL for the Wiki handy?

<mhausenblas> http://www.w3.org/2001/sw/rdb2rdf/wiki/Main_Page

hhalpin: is talking about scribing
... and I'm taking more notes
... and so on.

<cgi-irc> me: testing my username

<iv_an_ru> (as a last resort, I can provide some comments to Orri's slides, finally, that's my SPARQL2SQL translator. The problem is that my connection is bad as usual)

<cgi-irc> this is ben

http://www.w3.org/2001/sw/rdb2rdf/track/actions/open

To see open actions.

<Ben> That's better

<scribe> ACTION: hhalpin to double-check EricP's status and have him send in his proposals about mapping [recorded in http://www.w3.org/2009/10/22-RDB2RDF-minutes.html#action01]

<trackbot> Created ACTION-8 - Double-check EricP's status and have him send in his proposals about mapping [on Harry Halpin - due 2009-10-29].

http://www.w3.org/2001/sw/rdb2rdf/track/actions/open

ACTION [DONE]: Contact Soeren for presenting Triplify and ask Richard to provide D2R documentation a week in advance of his presentation

<trackbot> Sorry, couldn't find user - [DONE]

ACTION [DONE]: mhausenblas Contact Soeren for presenting Triplify and ask Richard to provide D2R documentation a week in advance of his presentation

<trackbot> Sorry, couldn't find user - [DONE]

<mhausenblas> ACTION-2

<mhausenblas> ACTION-2?

<trackbot> ACTION-2 -- Michael Hausenblas to contact Soeren for presenting Triplify and ask Richard to provide D2R documentation a week in advance of his presentation -- due 2009-10-08 -- OPEN

<trackbot> http://www.w3.org/2001/sw/rdb2rdf/track/actions/2

<mhausenblas> close ACTION-2

<trackbot> ACTION-2 Contact Soeren for presenting Triplify and ask Richard to provide D2R documentation a week in advance of his presentation closed

ACTION-3?

<trackbot> ACTION-3 -- Michael Hausenblas to draft a proposal for presentation order on the Wiki -- due 2009-10-08 -- OPEN

<trackbot> http://www.w3.org/2001/sw/rdb2rdf/track/actions/3

<Souri> A pointer to Zakim tutorial would be good.

close ACTION-3

<trackbot> ACTION-3 Draft a proposal for presentation order on the Wiki closed

http://www.w3.org/2001/12/zakim-irc-bot.html

http://www.w3.org/2002/01/UsingZakim

<mhausenblas> and also

<mhausenblas> http://www.w3.org/2002/03/RRSAgent

close ACTION-3

<trackbot> ACTION-3 Draft a proposal for presentation order on the Wiki closed

close ACTION-4

<trackbot> ACTION-4 Draft first invitation mail to WG closed

close ACTION-5

<trackbot> ACTION-5 Re-send Marcelo invited expert form closed

close ACTION-6

<trackbot> ACTION-6 Add material to the Wiki closed

close ACTION-7

<trackbot> ACTION-7 Send email out to europeans explaining daylight savings time. closed

http://www.w3.org/2001/sw/rdb2rdf/wiki/images/9/96/Relational2RDF.ppt

"Orri Erling (OpenLink)"

<mhausenblas> ACTION: mhausenb to put Ahmed's proposals regarding R2RML requirements onto the Wiki [recorded in http://www.w3.org/2009/10/22-RDB2RDF-minutes.html#action02]

<trackbot> Created ACTION-9 - Put Ahmed's proposals regarding R2RML requirements onto the Wiki [on Michael Hausenblas - due 2009-10-29].

I think Ivan can give the presentation.

<Ben> yes

ivan: I'll give the presentation
... I'm Ivan from OpenLink
... I'm responsible for all the SPARQL
... at least the transformation parts.

<mhausenblas> (background about Ivan see http://www.linkedin.com/in/ivanmikhailov)

ivan: mapping from relational to RDF why?

<MacTed> ah hah -- Zakim says conference is full, so Orri cannot get in.

ivan: we want to save time

<MacTed> I've dropped phone

ivan: we want to allow relational data, of course, to be accessed as RDF.

Orri?

<scribe> ACTION: hhalpin will add 10 more participants to telecon [recorded in http://www.w3.org/2009/10/22-RDB2RDF-minutes.html#action03]

<trackbot> Created ACTION-10 - Will add 10 more participants to telecon [on Harry Halpin - due 2009-10-29].

<Souri> Please mention the slide# or slide title

ivan - can you put the slide number in IRC?

orri: let's categorize structured data on the Web
... by exposing all content from any Web 2.0 application
... data-warehouse people who are deeply concerned with different identifiers, data not joining cleanly

orri: these two sides have different requirements

<mhausenblas> still slide #2 methinks

orri: if you have a straightforward mapping
... then you just extract RDF and then load it into an RDF store.
... if you put more effort into mapping
... you can do it both ways.
... translate SPARQL into SQL
... then you don't have to make a logically equivalent RDF store.

<mhausenblas> slide #4

orri: this involves converting everything
... to RDF but it has pros and cons
... the main pro is that in any query you can use a variable in the predicate position
... high risk that it won't work well in SQL, big unions
... another pro is if you want to materialize inference.
... union and join across inferred and non-inferred triples, no end of trouble
... lots of different data sources
... pouring them all into bucket of RDF might be a good solution
... so no a priori need to give an exact schema
... if it does not link or have an intersection, the benefit is negligible.
... the cons are updating
... latency
... large space, bigger than equivalent relational data.
... lots of research into compression of RDF.
... a task-specific representation, though, will always be more compact and more efficient.
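
A minimal sketch of why a variable in the predicate position tends to turn into big unions on the SQL side. It assumes a hypothetical `customer` table with columns `name` and `city`; the helper name and the generated SQL are illustrative only and are not taken from any product's actual translation.

```python
import sqlite3

def variable_predicate_sql(table, key, columns):
    """Build one SELECT per column: an unbound ?p may match any of them."""
    branches = [
        f"SELECT {key} AS s, '{col}' AS p, CAST({col} AS TEXT) AS o FROM {table}"
        for col in columns
    ]
    return "\nUNION ALL\n".join(branches)

# Hypothetical relational data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Alice', 'Oslo')")

# The pattern "?s ?p ?o" over this one table needs a branch per column.
sql = variable_predicate_sql("customer", "id", ["name", "city"])
triples = conn.execute(sql).fetchall()
```

With many tables and columns the number of branches grows accordingly, which is the optimization risk mentioned above; a physical triple store answers the same pattern from a single table.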

who's speaking there?

<mhausenblas> Ashok

Ashok: the queries will be slower against RDF, no?

orri: certainly the case right now

<mhausenblas> slide #5

orri: what are benefits of mapping on demand?

No synchronization to do.

orri: don't have to cross multiple databases
... if you can push all work into a single relational database
... you get all benefits of optimization
... if you want to add sources
... then you don't have to copy into RDF, saving space
... Cons are the same, non-SQL sources, inferences, but experience
... has shown that inference can be done in mapping
... in particular sub-classes and sub-properties
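
One way sub-class inference could be folded into the mapping, sketched under assumptions: the class hierarchy, the class-to-table assignment and all names below are hypothetical. A query for a class is expanded at mapping time to the tables of that class and its sub-classes, so no inferred triples need to be stored.

```python
# Hypothetical class hierarchy (child -> parent) and class-to-table mapping.
SUBCLASS_OF = {"Customer": "Agent", "Supplier": "Agent", "VipCustomer": "Customer"}
CLASS_TABLE = {"Customer": "customer", "Supplier": "supplier",
               "VipCustomer": "vip_customer"}

def classes_under(cls):
    """Return cls plus every transitive sub-class of it."""
    found = [cls]
    for child, parent in SUBCLASS_OF.items():
        if parent == cls:
            found.extend(classes_under(child))
    return found

def tables_for(cls):
    """Tables that hold instances of cls or of any of its sub-classes."""
    return sorted(CLASS_TABLE[c] for c in classes_under(cls) if c in CLASS_TABLE)

def select_instances(cls):
    """SQL for '?x a cls' with sub-class reasoning resolved in the mapping."""
    return "\nUNION ALL\n".join(f"SELECT id FROM {t}" for t in tables_for(cls))
```

For example, `tables_for("Agent")` covers all three tables, while `tables_for("Supplier")` touches only `supplier`.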

slide #6

orri: about virtuoso
... we do mapping of SPARQL to SQL schema from any relational database
... not just our own, but do the rest via data access drivers
... this would allow us to join DB2 and Oracle for example
... we map it all into Virtuoso SQL
... then we can deal with it as a single SQL dialect
... and so do RDF mapping across single SQL dialect
... we store physical quads
... we store up to 8 billion quads, with full text index
... can do ranking of entities, and do other relational database functionality, transactions etc.

slide #7

orri: for mapping to be useful
... one of the large factors
... SPARQL as a query language
... lacked aggregation and GROUP BY
... but all vendors added, albeit in a vendor-specific manner
... SPARQL 1.1 should standardize all of these.
... SPARQL can without reliance on extensions
... derive subqueries etc.
... within the standard.
... if you have this mapping.
... some mappings are still difficult
... we would prefer it if a single SQL statement were produced even if it was ugly
... so as not to have problems with execution plans
... you must be fairly clever in mapping
... to do mapping on fly
... must know when some kind of RDF entity (person, article) can come from any table
... union of customers across many tables, queries then become joins between unions, a mess to optimize
... so what can be optimized in mapping layer!
... do NOT do needless joins, this must be done in mapping

slide #8

orri: cases for integration
... similar but heterogeneous schemas combined
... each application has users, comment, comment field
... but each slightly different
... so we map them all into SIOC vocabulary
... six relational tables
... so union of all types of comments to union of all types of posts
... intelligence is needed there.
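
A rough sketch of the integration case, assuming two hypothetical application tables (`blog_comment`, `wiki_note`) whose comment columns differ slightly. A view lines them up under common column names that loosely correspond to SIOC terms such as `sioc:has_creator` and `sioc:content`; the table and column names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blog_comment (id INTEGER, author TEXT, body TEXT);
CREATE TABLE wiki_note (note_id INTEGER, user_name TEXT, text TEXT);
INSERT INTO blog_comment VALUES (1, 'ann', 'nice post');
INSERT INTO wiki_note VALUES (7, 'bob', 'needs a source');

-- One shape for both applications: item identifier, creator, content.
CREATE VIEW all_comments AS
  SELECT 'blog/' || id AS item, author AS creator, body AS content
    FROM blog_comment
  UNION ALL
  SELECT 'wiki/' || note_id, user_name, text
    FROM wiki_note;
""")

rows = conn.execute(
    "SELECT item, creator, content FROM all_comments ORDER BY item"
).fetchall()
```

Joining such a union against another union (for example, all comments to all posts) is where the mapping layer has to prune branches that cannot match.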

Ahmed: How is foreign key being handled?
... better covered by direct reference, what does that mean?

orri: we handle foreign keys
... just like primary keys
... so say order has primary key
... of order number and line number
... so we declare it exactly
... we use the names
... give exactly same declaration of one key
... as to another.
... say that ORDER has customer
... translate ORDER primary key
... into constant
... then translates foreign key to a URI.
... print some expression

cygri: I can explain it
... you define a function
... your input to a function is the IDs with an integer
... output is a URI
... if you define a subject
... you put the primary key into that function
... then the order number would go into the order identifier function, which generates the subject
... so to get foreign key to order table
... put foreign key value, a column on order table, into the new table
... and you do the same thing on the customer table
... as you use the same function
... these two will match in the end
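
The idea can be sketched as follows, assuming hypothetical `customer` and `orders` rows, an `example.com` namespace and made-up predicate names. The same function is applied to the customer's primary key (to build the subject) and to the order's foreign key column (to build the object), so the two URIs coincide.

```python
def customer_iri(customer_id):
    """Hypothetical key-to-URI function for customers."""
    return f"http://example.com/customer/{customer_id}"

def order_iri(order_no):
    """Hypothetical key-to-URI function for orders."""
    return f"http://example.com/order/{order_no}"

# Illustrative rows; customer_id on the order is the foreign key.
customers = [{"id": 7, "name": "Alice"}]
orders = [{"order_no": 42, "customer_id": 7}]

triples = []
for c in customers:
    # Subject comes from the primary key.
    triples.append((customer_iri(c["id"]), "ex:name", c["name"]))
for o in orders:
    # Object comes from the foreign key, through the same function.
    triples.append((order_iri(o["order_no"]), "ex:hasCustomer",
                    customer_iri(o["customer_id"])))

name_triple, has_customer_triple = triples
```

Here the object of the `ex:hasCustomer` triple is identical to the subject of the `ex:name` triple, which is what lets the two join in RDF.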

orri: when you are joining, you compare the function of the foreign key to the function of the primary key
... but if you know they are bijections
... you can just do normal joins without running functions
... normalization
... you may want to flatten things
... extra depth, as in SQL views
... more familiar way is to do it via SQL views
... qualify a mapping rule
... this subject will have this predicate with this value ONLY if a certain SQL condition holds
... so we can start making things arbitrarily complex
... policy functions can then be implemented in the view, but also in the database
... so we get benefits that you don't get from RDF warehouse view.
... kilometers to miles, store once, have a conversion function. like computed column in a relational table.
... for each entity for each mapping
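
A small sketch of the bijection point, under assumptions: the tables, the `example.com` URI pattern and the registered SQL function are all hypothetical. When the key-to-URI function is one-to-one, a join on the generated URIs returns the same rows as a plain join on the raw key columns, so the function calls can be dropped.

```python
import sqlite3

def customer_iri(customer_id):
    """Hypothetical one-to-one key-to-URI function."""
    return f"http://example.com/customer/{customer_id}"

conn = sqlite3.connect(":memory:")
conn.create_function("customer_iri", 1, customer_iri)
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_no INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO customer VALUES (7, 'Alice'), (8, 'Bob');
INSERT INTO orders VALUES (42, 7), (43, 8), (44, 7);
""")

# Naive translation: compare the URIs built on both sides.
via_iris = conn.execute("""
    SELECT o.order_no, c.name FROM orders o
    JOIN customer c ON customer_iri(o.customer_id) = customer_iri(c.id)
    ORDER BY o.order_no
""").fetchall()

# Simplified translation: compare the keys directly.
via_keys = conn.execute("""
    SELECT o.order_no, c.name FROM orders o
    JOIN customer c ON o.customer_id = c.id
    ORDER BY o.order_no
""").fetchall()
```

The second form also leaves the relational optimizer free to use its ordinary key indexes.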

<mhausenblas> slide #9

orri: you need a way to determine a URI from a key
... %s done like in C
... then a place holder for domain name
... multi-part keys
... line number, order number
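
A minimal sketch of a C-style format string for URIs with a multi-part key. The template, the `example.com` host and the function names are assumptions for illustration, not the actual syntax of any mapping language; the inverse pattern shows how the key parts can be recovered from a URI.

```python
import re

# Hypothetical template: host placeholder, then order number and line number.
TEMPLATE = "http://%s/order/%d/line/%d"
PATTERN = re.compile(r"^http://([^/]+)/order/(\d+)/line/(\d+)$")

def line_iri(host, order_no, line_no):
    """Build the URI for an order line from its two-part key."""
    return TEMPLATE % (host, order_no, line_no)

def parse_line_iri(iri):
    """Recover (host, order_no, line_no), or None if the URI does not match."""
    m = PATTERN.match(iri)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))

iri = line_iri("example.com", 42, 3)
key = parse_line_iri(iri)
```

Being able to invert the template is what allows a constant URI in a query to be turned back into key values for the WHERE clause.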

slide #10

TPC-H

minimum baseline

these queries given direct into SPARQL.

Souri: Question about multiple tables
... what if I made a function that gave things the same URIs?
... should we prevent that?

<mhausenblas> MacTed?

orri: we don't always forbid that.
... employee and dept and mixing, but blogposts and wiki articles you might want to combine
... ask for number 16
... so you might want that.

<mhausenblas> iv_an_ru or MacTed can you provide the correct URI for the TPC-H demo please?

<Marcelo> The URL http://demo.openlinksw.com/tpc-h/ does not work

orri: might want to infer that some URIs are distinct and should not join
... like joining a qualified subject to an unqualified one

<mhausenblas> well spotted Marcelo, hence I asked Ivan/Ted ... ;)

orri: I will fix URIs!
... all of these 22 queries
... they got SPARQL expressions of similar length, each single SQL query converted to a single SPARQL query
... for complicated queries, the overhead of mapping can be negligible
... some tweaks in the SQL
... but we take care of these idioms in a virtual database layer
... barring pathological queries
... the mapping should not generate substantial penalties

slide #11

orri: keys in URI
... but the URI does not specify what table it came from
... the relational database does not know some of these things aren't meaningful
... little any database can do
... intelligence must be in mapping layer
... one must know about databases
... in particular what a relational optimizer can and can't do.
... where data is located
... avoiding joining too much
... a cost model
... so we need to know how databases work.

Ahmed: Are you expecting mapping to have access to remote database stats

orri: If we have some idea
... we can say, import cardinality
... depends on type of remote database
... so we try to push things to remote database
... otherwise use a cost model to determine joins
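
A deliberately crude sketch of that decision, with everything assumed: the `TableInfo` record, the server names and the rule of scanning the smaller table first are illustrative, and a real cost model would weigh far more than row counts.

```python
from dataclasses import dataclass

@dataclass
class TableInfo:
    name: str
    server: str   # which remote database holds the table
    rows: int     # imported cardinality estimate

def plan_join(left, right):
    """Push the join down when both tables share a server, else join locally."""
    if left.server == right.server:
        return f"push join of {left.name} and {right.name} to {left.server}"
    # Toy cost rule: scan the smaller table and probe the larger one.
    outer, inner = sorted((left, right), key=lambda t: t.rows)
    return f"local join: scan {outer.name} ({outer.rows} rows), probe {inner.name}"

same_server = plan_join(TableInfo("orders", "oracle1", 1_000_000),
                        TableInfo("customer", "oracle1", 10_000))
cross_server = plan_join(TableInfo("orders", "oracle1", 1_000_000),
                         TableInfo("region", "db2", 5))
```

Stale cardinalities would change which branch of the toy rule is taken, which is the concern raised about keeping statistics current.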

Ahmed: We have a difficulty in keeping these statistics current even with just relational databases

Orri: We keep them current as much as we can.
... we have maps into integrating vocabularies
... we are presently working on enterprise accounts

slide #14

orri: we use this all internally at OpenLink
... all our customers have URIs.
... questions?

Ahmed: If you have a 50 terabyte warehouse
... if you convert that all into RDF
... how big is blow-up?

orri: depends on compression

<mhausenblas> we're already at the top of the hour, so we should do a round-up

orri: columns are generally preferred
... so we try to stick to that.

<mhausenblas> next week soeren is planned

<mhausenblas> see http://www.w3.org/2001/sw/rdb2rdf/wiki/Initial_Round_of_Presentations

orri: that's the theory
... practice is not quite as good
... 4-5 times bigger is not unreasonable
... so we need to get that better
... but reasons are not fundamental
... we imagine type-specific compression would work in rdf

Ahmed: What's average of slow-down of queries with mapping?

orri: an order of magnitude
... we don't want to replace a warehouse with RDF at the moment.

<mhausenblas> http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/

mhausenblas: send questions to mailing list
... soeren is up next week!
... I might become a father on that day!

<scribe> ACTION: hhalpin to ask EricP for back-up [recorded in http://www.w3.org/2009/10/22-RDB2RDF-minutes.html#action04]

<trackbot> Created ACTION-11 - Ask EricP for back-up [on Harry Halpin - due 2009-10-29].

<iv_an_ru> Congrats, Soeren!

<mhausenblas> http://www.slideshare.net/soeren1611/triplify-1341084

orri: so on warehouse side
... if you have regular relational warehouse
... it's not RDF's strong point
... RDF is unbeatable for structured queries against heterogeneous data

<iv_an_ru> (oops, I've lost the connection)

orri: strong need for queries and schema-less, then we win.

thanks orri!

trackbot, end meeting
