Re: Applying the relational model to SPARQL from Eric Prud'hommeaux on 2006-11-09 (public-rdf-dawg-comments@w3.org from November 2006)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Thu, 9 Nov 2006 19:37:55 +0100
To: Andrew Newman <andrewfnewman@gmail.com>
Cc: public-rdf-dawg-comments@w3.org
Message-ID: <20061109183755.GA4201@w3.org>
On Thu, Nov 09, 2006 at 05:52:25AM +1000, Andrew Newman wrote:
> 
> I'd like present to the DAWG public comments list my Honours thesis.
> It discusses a formal model, the relational model, to SPARQL.  It
> builds on the work of by Cyganiak, Frasincar et al., Harris and
> Shadbolt, Pérez et al. and others.  Hopefully, it's appropriate to
> some of the current discussions.
> 
> It's available at (~500K):
> http://jrdf.sourceforge.net/RelationalBasedSPARQL.pdf

thank you very much for this work! it looks very useful to the DAWG
and the community as a whole.

> In it I suggest that the current SPARQL specification is directly
> influenced by implementation specifics such as SQL and not RDF.  It is
> argued that the semantics of SQL is a poor match to the RDF data
> model.  Examples of this mismatch include:
> * The existence of NULL (section 2.3).
> * UNION and other operations may or may not return duplicates (section 2.4).
> * Lack of Compositional Semantics (section 2.5).
> * Order dependent OPTIONAL (like SQL's left outer join) (section 2.6).


====Duplicates====
You note that
[[
SQL is often seen as an implementation of the relational model even
though it has numerous incompatibilities with the relational model
such as bag instead of set semantics, column ordering, duplicates and
handling of nulls [6].
]]
and later state
[[
As both the RDF model and the relational model are both propositional
and set-based it is likely that a compatible model for querying RDF
can be provided.
]]

Does the relational model you propose preserve duplicates?
Specifically, are the ⋈(null-accepting join) and π (sorry, I may have
the wrong codepoint there, using a greek lower case pi) defined to
work on sets or bags?  Is this choice orthogonal to the rest of the
calculus?

§2.4 Duplicates implies that you favor bags. While I don't know all of
Date's arguments against duplicates, I suspect that the use cases for
aggregates motivate him to have duplicates in the algebra, just not in
the report (guessing here). A practical implication of closure is that
one should be able to implement the aggregate downstream of some
report {SELECT ?type WHERE { ?employee a ?type }} , without burdening
the querier to project in something distinct, like the employee Id.
Since the DAWG has posponed aggregates [AG], users will need a bag
semantics to be able to postprocess with their one aggregates.


====NULLs====
[[
Cyganiak says, “The SPARQL model does not, for example, distinguish
between an OPTIONAL variable that is unbound in some solutions, and a
variable that is not used in the query at all.” This result leads to
confusion as to what an unbound result means.
]]
I think this would be analogous to an SQL that allowed me to select
non-existent columns and responded with NULLs. Are there use cases
where this confusion matters?


====DEE and DUM====
I interprated the SPARQL Optional table as specifying the results of
row heading OPT column heading. Substituting in T, F and R:

 ┏━━━┯━━━━━━━━┯━━━━━━━━┯━━━━━━━━━┓
 ┃   │   T    │   F    │   R     ┃ T=DEE
 ┃ T │ T⟕T=T  │ T⟕F=T  │ T⟕Ro=T  ┃ F=DUM
 ┃ F │ F⟕T=F  │ F⟕F=F  │ F⟕Ro=F  ┃ R=non-optional relation
 ┃ R │ R⟕T=Rr │ R⟕F=Rr │ R⟕Ro=Rr ┃ Ro=option relation
 ┗━━━┷━━━━━━━━┷━━━━━━━━┷━━━━━━━━━┛ Rr=resulting relation

Searching for corresponding queries, I get arrive at:

 ┏━━━┯━━━━━━━━━━━━━━━━━━━━━━┯━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃   │     T                │ F │           R                      ┃
 ┃ T │ { OPT {} }           │   │ { OPT { ?sO ?pO ?oO } }          ┃
 ┃ F │                      │   │                                  ┃
 ┃ R │ { ?s ?p ?o OPT { } } │   │ { ?s ?p ?o OPT { ?sO ?pO ?oO } } ┃
 ┗━━━┷━━━━━━━━━━━━━━━━━━━━━━┷━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Basically, I didn't figure out how to write queries that used F (DUM).


====Order Independent Joins====
If the motivation for the full outer join was the freedom to optimize
by re-ordering, I think we also have to look at the performance of a
re-orderable full outer join vs. the non-reorderable left join.

I am generally in favor of left joins over full outer joins for
intuition and parallelism with SQL (which I suppose is the source of
much of the "intuition"). I need to explore the subsumption scheme you
outlined to see how it interacts with queris for the non-existance of
a particular pattern (NAF is spelled OPTIONAL !BOUND in SPARQL).


> Outcomes presented include:
> * A way of mapping RDF and SPARQL operations to the relational model
> (section 4).
> * Using tuple subsumption to implement UNION and OPTIONAL using
> previous optimisation techniques (section 2.7) that is up to twice as
> fast as an alternate implementation (using join, antjoin and union)
> and up to 8 times faster than ARQ (section 4.5).
> * An order independent version of OPTIONAL using full outer join and
> tuple subsumption (section 4.4).
> 
> Suggested future work includes:
> * Using SQL to implement tuple subsumption OPTIONAL and UNION  (section 5.3)
> * Alternative ways of implementing ASK and CONSTRUCT (section 5.1)
> using the relational model as a basis.
> * Aggregate functions (section 5.1).
> * Other optimisation techniques if compositional semantics are chosen
> (section 5.2).
> 
> The current code is only available through SF subversion:
> svn co https://svn.sourceforge.net/svnroot/jrdf jrdf

[AG] http://www.w3.org/2001/sw/DataAccess/issues#countAggregate
-- 
-eric

home-office: +1.617.395.1213 (usually 900-2300 CET)
     +33.1.45.35.62.14
cell:       +33.6.73.84.87.26

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Thursday, 9 November 2006 18:37:06 UTC