Re: proposal on ISSUE-94 ("lossless paging")

On 03/24/2014 07:44 PM, Ted Thibodeau Jr wrote:
> Sandro wrote:
>> PROPOSED: Change LDP paging to no longer be characterized as
>> "lossy". Specifically, a client who pages all the way through
>> (forward or backward) MUST be sent every triple in intersection
>> (G0, G1, ..., Gn) and MAY be sent any triples in union (G0,
>> G1, ..., Gn), where Gi is the paged graph at the time the client
>> makes request i.  Servers MAY refuse to serve any page because
>> they don't want to implement this kind of paging in certain
>> cases; if they do so, they MUST return a 410 GONE.
>
> I think Sandro's suggestion remains “lossy”, in that insertions
> *may* be made visible, but any deletions are to be concealed.
> This bothers me.

My intention, which I'm fairly confident is reflected in the proposed 
formulation, is that both insertions and deletions MAY be visible to the 
client.

If a triple has been deleted before Gi, then it will be absent from the 
intersection, and therefore the server is not mandated to send it to the 
client.  But it's in the union, so the server MAY send it to the 
client.  So deletions, like insertions, MAY be seen by the client.
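To make the guarantee concrete, here's a small sketch (my own illustration, not anything from the spec) of the "lossless" condition: after a complete traversal, the client must have received every triple in the intersection of G0..Gn and nothing outside their union. Triples are modeled as plain tuples; the function name is made up.

```python
# Hypothetical check of the lossless-paging guarantee being proposed:
#   intersection(G0..Gn)  MUST all be sent to the client;
#   union(G0..Gn)         is the most that MAY be sent.

def traversal_is_lossless(received, graph_states):
    """received: set of triples the client got across all pages.
    graph_states: list of sets [G0, G1, ..., Gn], the graph at each request."""
    required = set.intersection(*graph_states)   # MUST be sent
    allowed = set.union(*graph_states)           # MAY be sent
    return required <= received <= allowed

# A triple deleted between requests is absent from G1, so it drops out
# of the intersection: the server MAY still send it, but need not.
g0 = {("s", "p", "old"), ("s", "p", "keep")}
g1 = {("s", "p", "keep"), ("s", "p", "new")}

print(traversal_is_lossless({("s", "p", "keep")}, [g0, g1]))                 # True
print(traversal_is_lossless({("s", "p", "keep"), ("s", "p", "old")},
                            [g0, g1]))                                       # True: deletion visible
print(traversal_is_lossless({("s", "p", "new")}, [g0, g1]))                  # False: misses "keep"
```

So a server that exposes the deleted triple and one that hides it are both conformant; only dropping something present the whole time is ruled out.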

>
> As I said in the con-call, I think the current paging thoughts
> are a conflation of two concepts from the (R)DBMS world -
> “(scrollable) cursors” and “(transaction) isolation”.

I prefer to think of it as an elegant combination of two concepts. :-)

[mostly joking, read on....]

> I strongly recommend some investigation of these concepts.
> These wikipedia pages are imperfect but provide a good, if
> basic, starting point.
>
>     <https://en.wikipedia.org/wiki/Isolation_%28database_systems%29>
>
>     <https://en.wikipedia.org/wiki/Cursor_%28databases%29>
>
> I recommend particular attention to the “Isolation Levels vs
> Read Phenomena” table on the Isolation page.
>
> There are tradeoffs to each and every option — and whether any
> given tradeoff is more or less important depends on the use case —
> and this is why the DBMS world mostly sees every engine implement
> all cursors and isolation levels, and mostly allows clients to
> specify which ones they want/need to use.
>
> I would suggest that we consider writing this version of the
> LDP spec to only require implementation of the lowest-level,
> least-effort options in each.
>
> For the cursor, this is the “FORWARD ONLY” a/k/a “non-scrollable”
> cursor.  The scroll only goes from one end to the other, without
> any reversing midway.  When working with an unORDERed result set,
> you can only go in the arbitrary “forward”  direction.  When
> working with an ORDERed result set, it is really “unidirectionally
> scrollable” because it can be walked in either direction by simply
> inverting the ORDER BY.

That's exactly what I'm proposing.  The proposal only mandates what 
happens if the client does a complete traversal.  For any other use of 
first/last/prev/next pointers, the spec would not define what the client 
gets.  I didn't spell out "pages all the way through" in this brief 
proposal, but I mean the obvious thing: (1) follow the first [last] link 
from the paged resource to get the first sub-page; (2) repeatedly follow 
the next [prev] link from that sub-page to the adjacent sub-page; (3) stop 
when there is no next [prev] link, as that means you've reached the end.  
If the server is only allowing forward traversal, then there is no 
rel=last link, so reverse traversal is impossible.
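The three steps above can be sketched as a loop. Here `get(url)` is a hypothetical helper returning `(triples, links)`, where `links` maps a link relation ("first", "next", ...) to a URL; none of these names are real LDP API, they just make the traversal concrete.

```python
# Hedged sketch of "pages all the way through" in either direction.

def traverse(paged_url, get, forward=True):
    start_rel, step_rel = ("first", "next") if forward else ("last", "prev")
    received = set()
    _, links = get(paged_url)
    url = links.get(start_rel)            # (1) first [last] link from the paged resource
    while url is not None:                # (3) no next [prev] link means the end
        triples, links = get(url)
        received |= set(triples)
        url = links.get(step_rel)         # (2) step to the adjacent sub-page
    return received

# Toy forward-only server: it offers no rel=last link, so reverse
# traversal simply finds nothing to follow.
pages = {
    "paged": ([], {"first": "p1"}),
    "p1":    ([("a", "b", "c")], {"next": "p2"}),
    "p2":    ([("d", "e", "f")], {}),
}
get = lambda url: pages[url]
print(traverse("paged", get))                  # both triples
print(traverse("paged", get, forward=False))   # set(): reverse not offered
```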

> For isolation, this is “READ UNCOMMITTED”, which basically means
> that the client can’t rely on anything being the same *now* as it
> was *then*, no matter how short the window between.  The client
> gets a snapshot-in-time for each request, but has no way to tell
> the server “act on *that* snapshot-in-time” nor how to act based
> on *other* client requests received in the interim (or received
> after mine, but based on the same state as my request was...)

SPARQL and LDP don't use the terminology of "transactions" and don't 
have transactions as they are normally understood.  To me, given my 
years of database programming in the 80s and 90s, a transaction is 
relatively long-lived, often consisting of many operations, or one slow 
operation.  In particular, a transaction includes read operations.

What we have with HTTP and LDP is simple, short "modifications", which 
do not contain any reads.  Since modifications do not contain reads, 
the distinctions between, e.g., "repeatable read" and "read committed" 
are not observable... at least as far as I can tell.

But they are Isolated.  If we didn't have Isolation in LDP, you could 
be doing a PUT while someone else does a GET and they'd see only a few 
of the triples you PUT, but still get a 200 OK.  I think that's 
forbidden by the HTTP spec, and therefore not allowed in LDP.

I don't think paging can or should change that.  Any POST, PATCH, or 
PUT should appear to clients to be instantaneous, whatever GETs are 
happening at approximately the same time, with or without paging.  
(This is what Isolation means.)
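A toy model of that isolation point, assuming one common implementation strategy (build the new graph off to the side, then swap in the whole thing at once): a GET racing with a PUT sees either the complete old graph or the complete new one, never a few of the PUT's triples. The class and method names here are illustrative, not any LDP API.

```python
import threading

class Resource:
    def __init__(self, triples):
        self._graph = frozenset(triples)   # immutable snapshot of the graph
        self._lock = threading.Lock()

    def put(self, triples):
        new = frozenset(triples)           # assemble the new state first
        with self._lock:
            self._graph = new              # one atomic reference swap

    def get(self):
        return self._graph                 # always a complete snapshot

r = Resource([("s", "p", "o1")])
r.put([("s", "p", "o2"), ("s", "p", "o3")])
print(sorted(r.get()))
```

Because readers only ever dereference `self._graph` once and get an immutable set, the PUT appears instantaneous to them, with or without paging layered on top.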


> I think this leaves a clear path for future evolution (adding
> the other cursors and/or isolation levels), while making the
> current spec easier to write and early implementations easier
> to produce.

I don't see a spectrum of useful options here.  I just see (1) lossy 
paging (where the client is only guaranteed to see some subset of 
intersection(G0...Gn)), (2) lossless paging (my proposal), and (3) static 
paging (where after a complete traversal, the client has received G0).

Static paging isn't in scope right now, but it could easily be handled 
by having resources that offer it include Link rel=snapshot.  Then you 
do normal lossless paging through the snapshot resource, knowing that 
G0...Gn for the snapshot are all the same as G0 of the original resource.
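A sketch of that static-paging idea, with "snapshot" as the hypothetical link relation the text suggests and `get(url)` the same made-up `(triples, links)` helper as before: the client hops to the frozen snapshot and then does an ordinary first/next traversal, so every Gi it sees equals G0 of the original resource.

```python
def page_statically(resource_url, get):
    """get(url) -> (triples, links) is a hypothetical HTTP helper."""
    _, links = get(resource_url)
    snap = links.get("snapshot")          # hypothetical rel advertising a frozen copy
    if snap is None:
        return None                       # server doesn't offer static paging
    received = set()
    url = get(snap)[1].get("first")
    while url is not None:                # normal lossless traversal of the snapshot
        triples, links = get(url)
        received |= set(triples)
        url = links.get("next")
    return received

# Toy in-memory server: the snapshot's pages never change mid-traversal.
pages = {
    "res":  ([], {"snapshot": "snap"}),
    "snap": ([], {"first": "s1"}),
    "s1":   ([("a", "b", "c")], {"next": "s2"}),
    "s2":   ([("d", "e", "f")], {}),
}
print(sorted(page_statically("res", lambda u: pages[u])))
```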


> I hope this is more helpful than painful.

I agree it's important to be clear about all this.     I'm also aware 
that I have only a programmer's expertise in these matters, not any 
formal theory background.  (Ironically, I had lunch with Michael 
Stonebraker yesterday, but all we did was argue about whether RDF was 
useful for real world data integration.   He doesn't like the occasional 
claims that RDF is a panacea for data integration, but I'm not making 
those claims.)

Unfortunately, it's hard to explain what we're doing in LDP to a 
database-theory person.   Maybe one of us has such a person handy who 
wants to write a paper analyzing LDP and proving various properties?   
There are many in my building, but I don't know that I could convince 
any of them to do this.

      -- Sandro

> Ted
>

Received on Monday, 24 March 2014 20:17:37 UTC