The resource identity guessing game vs. bounded ambiguity [was Re: Comments on "SPARQL 1.1 Uniform HTTP Protocol for Managing RDF Graphs"]

Hi Nathan,

Excellent and very insightful analysis!  The "giant, global 
graph with unique identities" approach that you describe 
is fine for some limited application areas, such as:

 - within a relatively small, controlled environment; or

 - with applications that are willing to assume the risk of
   unstable definitions.

But it is not sufficient as a general approach at web scale.

The reason, in essence, is that it sets up an endless guessing
game between a URI's owner and its users: the URI owner
thinks of a unique resource, but provides a definition that
only gives hints about it, and the users of that URI must
guess its identity.  Each time the URI owner updates the
definition to add more hints, some of those users discover
that they guessed wrong, and, through no fault of their own,
their work is no longer consistent with the URI's definition.

I'll explain in more detail, but the explanation involves
multiple steps, so bear with me.

1. I assume you mean this "giant, global graph" to be
consistent, since otherwise it would be meaningless.
Incidentally, I've been referring to this as myth #2:
http://dbooth.org/2010/ambiguity/paper.html#myth2 .

But how on earth could we expect to know what that giant,
global graph should be?  Obviously we cannot assume that
it consists of the merge of *all* RDF graphs, since that
would clearly be an inconsistent mess.  On the web, anyone
can say anything about anything, and much of it is rubbish.
So we cannot, in advance, *assume* that we have such a graph
and use that as the basis for showing how a "unique identity"
approach works based on that assumption.

Instead, we need to go in the opposite direction: start with
two graphs that we *can* assume are (individually) consistent,
and then *merge* them to come incrementally closer to that
idealized, giant, global graph.  As with proof by induction,
if we can show that an approach to resource identity works
for *one* small graph, *and* we can show how it works when
two graphs are merged, then we have shown how it can work
on increasingly larger graphs.  Thus, in the limit as time t
goes to infinity we would reach nirvana, where all knowledge
of the universe has been formally encoded, and there is only
one, unique interpretation of the graph: every URI uniquely
identifies exactly one resource. ;)

I imagine this was the intent behind your idealized "giant,
global graph", so now let's proceed in this direction.

2. To avoid vagueness, and to prevent the possibility of any
hidden "then a miracle occurs" step,
http://star.psy.ohio-state.edu/coglab/Miracle.html 
let us assume that the resource definition is provided only in
RDF -- not natural language.  This assumption seems reasonable
because: (a) RDF definitions facilitate machine processing,
which is the whole point of using RDF to begin with; and (b)
in principle any natural language definition could be expressed
in RDF.

3.  Now suppose that a URI owner, Oliver, mints a URI u that is
intended to uniquely identify a particular resource that he has
in mind -- Nathan's TV.  As we know already, it is not possible
for Oliver to describe this resource unambiguously, so as a
simple example, let us assume that he (initially) provides a
definition containing only the following assertions in graph gd:

  # Oliver's definition of <u> -- graph gd
  <u> a :TV .
  <u> :hasOwner :Nathan .

4. Next, an RDF statement author, Alice, uses Oliver's URI to
publish a new RDF graph, ga:

  # Alice's graph ga
  <u> :alphaMax 27 .
  . . .

Since <u> is supposed to identify a unique resource globally,
Alice would like to verify that the resource she *thinks* <u>
identifies is the one Oliver has in mind: that is, whether her
new RDF graph, ga, gives the URI the same resource identity
that Oliver intended.  But given only the URI's resource
definition (graph gd), how can Alice possibly determine this?

Clearly it isn't reasonable at web scale to expect Alice to
personally ask Oliver for clarification.  So, barring magic
or miracles, the best Alice can do is to merge her graph ga
with Oliver's resource definition gd and check for consistency.
But, even if the merge is consistent, that does *not* indicate
that Alice's graph ga actually *does* use the URI to denote the
exact same resource that Oliver intended.  It only indicates
that it *could*: the merge admits at least one satisfying
interpretation.

In other words, all that Alice can determine is that the
clouds of possible resources that <u> *might* identify
in gd and in ga overlap, as illustrated in Figure 18 here:
http://dbooth.org/2010/ambiguity/paper.html#figure-18

5. Note that Alice's graph contains an assertion that makes
further assumptions about the identity of <u>.  In essence,
she has made a *guess* about the true, unique identity of <u>.
This is normal: *anything* that Alice's graph may say about
<u> that is not already entailed by Oliver's definition runs
the risk of being "wrong" when Oliver tightens his definition.
And it is likely that Alice *will* make statements about <u>,
because, after all, she has chosen to use <u> in her graph
for a reason.

To phrase this in terms of the RDF Semantics, Alice's
statements add constraints that reduce the set of satisfying
interpretations.  For example, in this case Alice has eliminated
all possible interpretations in which the thing's alpha --
characterized by a :alphaMax and :alphaMin -- is greater
than 27.
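
A rough Python sketch of steps 4 and 5, under loudly stated
assumptions: an interpretation of <u> is collapsed to a single integer
alpha value in a made-up range 0..100, :alphaMax and :alphaMin are read
as upper and lower bounds on that value, and the helper name
`satisfying` is hypothetical.  This is a toy model, not an RDF reasoner.

```python
# Toy model, not an RDF reasoner.  Assumption: an interpretation of <u>
# is reduced to one integer alpha value in a made-up range 0..100.
DOMAIN = range(0, 101)


def satisfying(triples, uri):
    """Return the alpha values for `uri` that every bound in `triples` allows."""
    allowed = set(DOMAIN)
    for s, p, o in triples:
        if s != uri:
            continue
        if p == "alphaMax":      # assumed meaning: alpha <= o
            allowed = {a for a in allowed if a <= o}
        elif p == "alphaMin":    # assumed meaning: alpha >= o
            allowed = {a for a in allowed if a >= o}
    return allowed


gd = {("u", "a", "TV"), ("u", "hasOwner", "Nathan")}   # Oliver's definition
ga = {("u", "alphaMax", 27)}                            # Alice's graph

before = satisfying(gd, "u")        # gd alone puts no bound on alpha
after = satisfying(gd | ga, "u")    # the merge of gd and ga

consistent = bool(after)            # non-empty: the merge *could* match Oliver
print(consistent, len(before), len(after))
```

In this toy model the merge is consistent, yet many candidate values
remain, so the check cannot tell Alice *which* one Oliver had in mind.
Her triple only shrinks the set of candidates.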

6. Next, a different RDF statement author, Bob, 
publishes a different graph gb using Oliver's URI:

  # Bob's graph gb
  <u> :alphaMin 43 .

Bob and Alice know nothing of each other's work.  Bob makes
the same consistency checks that Alice made, and his graph is
also consistent with Oliver's definition.

7. Next, Charlie wishes to merge Alice's graph ga with Bob's
graph gb, but since (we'll assume) something's alpha value
cannot have both a maximum of 27 and a minimum of 43, he finds
that the merge is inconsistent.  What can Charlie do?

Charlie cannot convince either Alice or Bob to "fix" their
data, because neither of them sees a problem with their data.
In theory Charlie could first try to convince Oliver to tighten
up the definition of <u>, and *then* he might convince Alice or
Bob -- whoever had guessed wrong about the alpha value -- to fix
his/her data, but this is not feasible to expect at web scale.

Probably the best that Charlie can do is to either: (a)
make his own guess about whether to side with Alice or Bob,
and manually discard some of the other's assertions;
or (b) split the identity of <u>, as described in
http://dbooth.org/2010/ambiguity/paper.html#splitting
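
Charlie's situation, and one simple reading of option (b), might be
sketched as follows.  Assumptions: the same toy model as the rest of
this example (one integer alpha value per interpretation, in a made-up
range 0..100), hypothetical helper names `satisfying` and `rename`, and
made-up replacement names "u_alice" and "u_bob".  The linked paper may
define splitting in more detail than this plain renaming.

```python
# Toy model, not an RDF reasoner.  Assumption: an interpretation of a
# URI is reduced to one integer alpha value in a made-up range 0..100.
DOMAIN = range(0, 101)


def satisfying(triples, uri):
    """Return the alpha values for `uri` that every bound in `triples` allows."""
    allowed = set(DOMAIN)
    for s, p, o in triples:
        if s != uri:
            continue
        if p == "alphaMax":      # assumed meaning: alpha <= o
            allowed = {a for a in allowed if a <= o}
        elif p == "alphaMin":    # assumed meaning: alpha >= o
            allowed = {a for a in allowed if a >= o}
    return allowed


def rename(triples, old, new):
    """Return a copy of `triples` with subject `old` replaced by `new`."""
    return {(new if s == old else s, p, o) for s, p, o in triples}


ga = {("u", "alphaMax", 27)}   # Alice's graph
gb = {("u", "alphaMin", 43)}   # Bob's graph

# Charlie's plain merge: no alpha value is both <= 27 and >= 43.
merge_is_consistent = bool(satisfying(ga | gb, "u"))

# Option (b), read simply: give each author's use of <u> its own name.
split = rename(ga, "u", "u_alice") | rename(gb, "u", "u_bob")
alice_ok = bool(satisfying(split, "u_alice"))
bob_ok = bool(satisfying(split, "u_bob"))

print(merge_is_consistent, alice_ok, bob_ok)
```

After the split both sets of assertions survive, at the cost of no
longer claiming that Alice and Bob were talking about the same thing.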

Observation: At web scale we cannot expect RDF statement authors
to be able to influence other people's URI definitions or RDF
data, but statement authors still need to be able to make RDF
statements using other people's URIs.

8. Now let's consider what happens when Oliver *does* decide
to refine his definition, since this is the only way he can
hint at the unique identity of <u>, and the objective is to
continually tighten our definitions until we reach nirvana.  :) 
Oliver adds a third triple to his definition, yielding gd2:

  # Oliver's new definition of <u> -- graph gd2
  <u> a :TV .
  <u> :hasOwner :Nathan .
  <u> :alphaMax 32 .

Through no fault of Bob, Oliver has just broken Bob's graph gb,
because gb is now inconsistent with Oliver's new definition,
gd2.  Regardless of the fact that Bob's graph gb may contain
valuable information, it is now clear that <u> cannot identify
the same resource in gb as it does in gd2.
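
The breakage can be checked mechanically in a toy model.  Assumptions,
none of them from the RDF specs: an interpretation of <u> is reduced to
one integer alpha value in a made-up range 0..100, :alphaMax and
:alphaMin are read as upper and lower bounds, and `satisfying` is a
hypothetical helper.

```python
# Toy model, not an RDF reasoner.  Assumption: an interpretation of <u>
# is reduced to one integer alpha value in a made-up range 0..100.
DOMAIN = range(0, 101)


def satisfying(triples, uri):
    """Return the alpha values for `uri` that every bound in `triples` allows."""
    allowed = set(DOMAIN)
    for s, p, o in triples:
        if s != uri:
            continue
        if p == "alphaMax":      # assumed meaning: alpha <= o
            allowed = {a for a in allowed if a <= o}
        elif p == "alphaMin":    # assumed meaning: alpha >= o
            allowed = {a for a in allowed if a >= o}
    return allowed


gd = {("u", "a", "TV"), ("u", "hasOwner", "Nathan")}   # original definition
gd2 = gd | {("u", "alphaMax", 32)}                      # tightened definition
ga = {("u", "alphaMax", 27)}                            # Alice's graph
gb = {("u", "alphaMin", 43)}                            # Bob's graph

bob_before = bool(satisfying(gd | gb, "u"))     # Bob against gd
bob_after = bool(satisfying(gd2 | gb, "u"))     # Bob against gd2
alice_after = bool(satisfying(gd2 | ga, "u"))   # Alice against gd2

print(bob_before, bob_after, alice_after)
```

Bob's graph passed the only check available to him when he published
it, and fails the same check once the definition changes.  Alice's
graph happens to survive this particular tightening.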

Moreover, if we play this out further, the more Oliver's
definition of <u> is updated and tightened to more precisely
identify the true resource that Oliver intended, the more it
becomes inconsistent with existing graphs that used <u>.

Finally, since Oliver's definition itself may have used other
URIs whose definitions may change, Oliver would likely be forced
to rewrite it *differently* -- not just tighten it -- when
some of those definitions change and it becomes inconsistent,
thus breaking Alice's and Bob's graphs in a different way.

In essence then, the very process that was intended to bring us
closer to the goal of a giant, global graph is the same process
that causes instability, and the more we advance toward that
goal, the more instability we create.

This kind of instability may be manageable in a small, closed
environment where you can control all of the definitions and
keep them all in sync.  And it may also be an acceptable risk
to *some* applications.  But it is not a workable approach at
web scale for applications that need a more stable foundation.

9. What is the alternative?  For semantic web architecture
to work at web scale, I see no option but to acknowledge the
essential ambiguity of resource identity, precisely *bound*
that ambiguity with URI definitions (a/k/a URI declarations),
and learn to live with it.  Each definition will be precise
*enough* for some applications even as it is ambiguous for
others.

Specifically, instead of assuming that a URI definition is
an incomplete description of a globally unique resource,
assume that the definition is the *complete* description
of the resource: the definition is all you get, and *any*
interpretation that is consistent with it is legitimate.

This permits an application to know just enough about a URI's
resource identity to get its job done, while providing a stable
foundation for RDF authors.

     ------------------

A few more inline comments below . . .

On Wed, 2011-03-23 at 01:12 +0000, Nathan wrote: 
> Hi Pat,
> 
> Here's how I see it (discussing things we can't see again).
> 
> On a universal scale (as in giant global graph) we have a set of nodes, 
> each node is associated with one or more unique names, and one or more 
> propositions. Each node can be seen as having a 1-1 relation with a 
> single distinct thing (whether real or abstract), and the set of 
> propositions bound to that node can be seen as characterizing (not 
> defining) the thing which the node is related to. Exactly what those 
> propositions characterize is open to interpretation, and when you're 
> only working with subsets of the global graph (as is the norm) what the 
> node is interpreted as characterizing gets increasingly less specific 
> ever more ambiguous.
> 
> If we split the previous paragraph in half, then by looking at only the 
> first half we can argue that each name has at most one referent, and 
> each thing can have multiple names (a many-1 relation). If we look at 
> the second half then we can argue that each name can have multiple 
> referents, and each thing multiple names (a many-many relation).
> 
> An application may not need to consider or know every property of a 
> thing to answer the question it is being asked, and may not need to (or 
> be able to) make distinctions between unique things.
> 
> So, to what does a name refer?
> 
> To me it is important to view each name as having at most one referent, 
> then if you tell me that you interpret the name as referring to 
> something else, I can offer some more propositions and refine my 
> description, in order that we may collectively describe the world and 
> hopefully start to understand each thing.

If you add those propositions to your existing URI definition then you
risk breaking downstream applications that used your URI.  This may be
the policy that you want, and if so it is important to publish your
change policy, so that others can choose whether to accept this risk.
But for more stability, you can instead mint a new URI with a tighter
definition.  There is a trade-off between the two policies.
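
The two policies can be contrasted in a small Python sketch.
Assumptions: a toy model in which an interpretation is just one integer
alpha value in a made-up range 0..100, a hypothetical helper
`satisfying`, a made-up new URI "u2", and definitions stored in a plain
dict keyed by URI.  None of this is a real RDF toolkit API.

```python
# Toy model, not an RDF reasoner.  Assumption: an interpretation of a
# URI is reduced to one integer alpha value in a made-up range 0..100.
DOMAIN = range(0, 101)


def satisfying(triples, uri):
    """Return the alpha values for `uri` that every bound in `triples` allows."""
    allowed = set(DOMAIN)
    for s, p, o in triples:
        if s != uri:
            continue
        if p == "alphaMax":      # assumed meaning: alpha <= o
            allowed = {a for a in allowed if a <= o}
        elif p == "alphaMin":    # assumed meaning: alpha >= o
            allowed = {a for a in allowed if a >= o}
    return allowed


base = {("u", "a", "TV"), ("u", "hasOwner", "Nathan")}
downstream = {("u", "alphaMin", 43)}   # an existing graph that uses <u>

# Policy 1: add the new proposition to the existing definition of <u>.
in_place = {"u": base | {("u", "alphaMax", 32)}}

# Policy 2: leave <u> alone and mint <u2> with the tighter definition.
minted = {
    "u": base,
    "u2": {("u2", "a", "TV"), ("u2", "hasOwner", "Nathan"),
           ("u2", "alphaMax", 32)},
}

breaks_in_place = not satisfying(in_place["u"] | downstream, "u")
breaks_minted = not satisfying(minted["u"] | downstream, "u")

print(breaks_in_place, breaks_minted)
```

The second policy keeps the downstream graph valid against the
definition it was written for, but anyone who wants the tighter meaning
has to move to the new URI.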

David Booth

> 
> So, whilst I understand that the distinctions don't always matter, and 
> that it's generally nigh on impossible to define a thing unambiguously, 
> I still feel it is critically important to view each name as having a 
> single referent, and to view each name as identifying a unique thing, 
> unless told otherwise (by proposition or inference).
> 
> in-line:
> 
> Pat Hayes wrote:
> > On Mar 20, 2011, at 10:30 AM, Nathan wrote:
> >> This is why we couple descriptions to names, to give an indication of what we are using a name to refer to, sure our descriptions may be ambiguous and open to refinement, but our names are not; because we are not using simple string token names "everest" or "lightbulb", we're using distinct URIs.
> > 
> > So, are you saying it is the *syntax* of URIs which gives them this magical quality? So one gets unambiguous reference by putting a colon in the name somewhere?  OK, forgive my sarcasm: but if this is not what you are saying, just what ARE you saying, that gives URIs this amazing ability to reach out into the world and seize upon their single unique referent?
> 
> The point I was trying to make (badly) was two fold:
> 
> 1: Rather than saying "when I say X I mean this" and "when you say X you 
> mean that" (where this != that) as humans with limited vocabulary often 
> do. We can instead use URIs with gives us a wider vocabulary and greater 
> opportunity to have one or more unique names for each referent.
> 
> 2: The magical quality is in the specs and a social agreement, that we 
> will typically consider each URI as having at most one referent, thus 
> allowing us to say that each URI unambiguously identifies a single 
> thing; even when the interpreted characterization of that thing is 
> ambiguous.
> 
> >[snip]
> >> So, I have to conclude that the names aren't ambiguous here
> > 
> > What would lead you to that conclusion? I don't see that you have argued for it anywhere. Like TimBL's claim, it seems to be a matter of W3C Dogma rather than an actual observation or even a rationally defended position. And as it is radically false, and indeed in many cases *provably* false, it seems rather obtuse to be defending it with so slender an excuse or argument. 
> 
> Hopefully the above helps explain my own personal thinking on it, well 
> as well as I can understand things given my limited knowledge.
> 
> Best,
> 
> Nathan
> 
> 
> 

-- 
David Booth, Ph.D.
http://dbooth.org/

Opinions expressed herein are those of the author and do not necessarily
reflect those of his employer.

Received on Saturday, 26 March 2011 21:26:32 UTC