SweoIG/TaskForces/CommunityProjects/LinkingOpenData/BrokenLinksInLOD

From W3C Wiki

SWEO Community Project: Linking Open Data on the Semantic Web

Broken Links in LOD

A page to gather ideas and consensus on best practice for dealing with broken links.

Started with a question and some answers on the LOD email list - see the end of this page.

Preamble

What is a broken link?

When does a LOD site generate a 404?

What does a 404 in response to a GET mean?

There are two issues: what happens at the "owning" site (the site where the URI failed to resolve), and what happens when a site gives you a URI that turns out to be dead (whatever that might mean).

The owning site

I think in a lot of the LOD world, a 404 means "I don't know anything about that URI", rather than a broken link. Certainly for us, that is all we can do. In fact, what we are actually doing is manually generating the 404 when we find there is nothing in the KB; we could instead return a blankish RDF document, but that didn't seem sensible.

Now I think about it, I have checked what dbpedia does with http://dbpedia.org/resource/Esperanta – it does the blank doc thing. (I guess we need to work out what is best practice for this and then add it to the How to Publish? I think my view is that something like http://dbpedia.org/data/Esperanta.rdf should 404.)

So in LOD sites of the sort that have DBs or KBs behind them, either it is not possible to get a 404 (dbpedia), or you can't distinguish between a rubbish URI that might have been generated and one you want to know about.

I find the idea that I might give people the expectation that I will create triples (as in point 2 of the email below) rather strange - if I knew the triples I would have served them in the first place. Of course, if we consider a URI I don't know as a request for me to go and find knowledge about it, fair enough, but I would expect a more explicit service for that. In that sense it would not be a "broken link".

Maybe the world is different for the other ways of publishing LD, such as RDFa, but in the DB/KB world, I don't see broken incoming links as something that can be usefully dealt with, other than the maintainer checking what is happening, as you do with a normal site.
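The 404-versus-blank-document choice can be sketched as follows. This is a minimal illustration, not any site's actual implementation; the `respond` function, the dict-based KB, and the example URIs are all hypothetical:

```python
# Sketch of the "owning site" policy: answer 404 for a URI the knowledge
# base (KB) knows nothing about, rather than serving an empty RDF
# document. The KB here is an illustrative stand-in dict.

def respond(uri, kb):
    """Return an (HTTP status, triples) pair for a GET on `uri`."""
    triples = kb.get(uri)
    if not triples:
        # Nothing in the KB: "I don't know anything about that URI".
        return 404, None
    return 200, triples  # serialisation to RDF elided

kb = {"http://example.org/id/alice": [("alice", "knows", "bob")]}

print(respond("http://example.org/id/alice", kb))    # -> (200, [('alice', 'knows', 'bob')])
print(respond("http://example.org/id/unknown", kb))  # -> (404, None)
```

The alternative (dbpedia's behaviour) would be to return `200` with an empty triple list in the miss case, which is exactly why a client cannot tell a rubbish URI from a known-but-empty one.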

The site that gave out a broken link

We are concerned with the place that gave you the URI, which is possibly more interesting, and I think this is actually the case for the TAG example in the email below. If I gave you (by which I mean an agent) such a link and you discovered it was broken, it would be helpful to me and the LOD world if you could tell me about it, so I could fix it. In fact, it would also be helpful if you had a suggestion as to the fix (i.e. a better URI), which is not out of the question. And if I trust you (when we understand what that means), I might even do a replacement or assert some equivalent triples without further intervention.
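The reporting idea could be sketched like this. The report shape and field names are invented for illustration; no such protocol is standardised here, and the agent URI is made up:

```python
# Hypothetical broken-link report from an agent back to the site that
# handed out the URI, optionally carrying a suggested replacement.

def make_report(reporter, broken_uri, suggested_uri=None):
    report = {"reporter": reporter, "broken": broken_uri}
    if suggested_uri is not None:
        # A candidate fix the owning site could apply, if it trusts us.
        report["suggest"] = suggested_uri
    return report

report = make_report(
    "http://agent.example/crawler",           # hypothetical agent
    "http://dbpedia.org/data/Esperanta.rdf",  # the dead link found
    suggested_uri="http://dbpedia.org/data/Esperanto.rdf",
)
```

The trust question above is what decides whether the receiving site applies `suggest` automatically or queues it for a maintainer.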

Examples in practice

RKBexplorer

In the case of our RKB system, we actually do something like this already. If we find that there is nothing about a URI in the KB that should have it, we don't immediately return 404, but look it up in the associated CRS (coreference service), and possibly others, to see if there is an equivalent URI in the same KB that could be used (we do not return RDF from other KBs, although we could).

So if you try to resolve http://southampton.rkbexplorer.com/description/person-07113, you actually get the data for http://southampton.rkbexplorer.com/id/person-0a36cf76d1a3e99f9267ce3d0b95e42e-06999d58799cb8a3a55d3c69efcc9ba6 and a message telling you to use the new one next time. (I'm not sure we have got the RDF perfectly right, but that is the idea.) In effect, if we are asked for a broken link, we have a quick look around to see if there is anything we do know, and give that back. Of course, the CRS also gives the requestor the chance to do the same fixing up.

The reason that there might be a URI in the KB that has no triples, but that we know about, is that we "deprecate" URIs to reduce their number, and then use the CRS to resolve from deprecated to non-deprecated. So a deprecated URI is one we used to know about, and may still be being used "out there", but don't want to continue to use - sort of a broken link. Hence our dynamic broken link fixing.
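The CRS fallback might look roughly like this. This is a sketch under stated assumptions, not RKBexplorer's actual implementation: the dict-based KB and CRS, the example URIs, and the use of a 303-style redirect status are all illustrative choices:

```python
# Sketch of dynamic broken-link fixing: on a KB miss, consult a
# coreference service (CRS) mapping deprecated URIs to current ones
# before giving up with a 404.

def resolve(uri, kb, crs):
    """Return (status, served_uri, triples)."""
    if uri in kb:
        return 200, uri, kb[uri]
    equivalent = crs.get(uri)  # deprecated -> non-deprecated
    if equivalent in kb:
        # Serve the equivalent URI's data, telling the caller to use it.
        return 303, equivalent, kb[equivalent]
    return 404, uri, None

kb = {"http://site.example/id/new": [("new", "label", "A Person")]}
crs = {"http://site.example/id/old": "http://site.example/id/new"}
```

A deprecated URI then resolves via the CRS entry, a current URI resolves directly, and only a URI unknown to both gets the 404.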

Wikipedia deleted pages

My choice of http://dbpedia.org/data/Esperanta.rdf as a misspelling of http://dbpedia.org/data/Esperanto.rdf turned out to be fascinating. It turns out that Wikipedia tells me that there used to be a page http://en.wikipedia.org/wiki/Esperanta, but it has been deleted. So what is returned is different from http://en.wikipedia.org/wiki/Esperanti, although http://dbpedia.org/data/Esperanta.rdf and http://dbpedia.org/data/Esperanti.rdf both return empty RDF documents, I think. It looks to me that this is trying to solve a similar problem to the one our deprecated URIs solve in our CRS.
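The behaviours being compared here (a hard 404, an empty RDF document, actual data) could be told apart by a checker along these lines. The labels and the triple-count heuristic are made up for the sketch:

```python
# Classify what a site's answer to a GET on a data URI tells us.

def classify(status, triple_count):
    if status == 404:
        return "unknown-uri"     # owning site disclaims the URI
    if status == 200 and triple_count == 0:
        return "empty-document"  # e.g. dbpedia's blank-doc behaviour
    return "has-data"

print(classify(404, 0))   # -> unknown-uri
print(classify(200, 0))   # -> empty-document
print(classify(200, 42))  # -> has-data
```

The point of the preceding paragraphs is that "empty-document" and "unknown-uri" often encode the same fact, which is what makes a shared best practice worth agreeing on.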

The email thread that started it, from the Linked Data community list <public-lod@w3.org> - I guess this will disappear one day.

------ Forwarded Message
From: "Hausenblas, Michael" <michael.hausenblas@deri.org>
Date: Sat, 14 Feb 2009 16:32:32 -0000
To: Hugh Glaser <hg@ecs.soton.ac.uk>
Cc: Kingsley Idehen <kidehen@openlinksw.com>, Bernhard Haslhofer <bernhard.haslhofer@univie.ac.at>, Linked Data community <public-lod@w3.org>
Subject: Re: Broken Links in LOD Data Sets


Hugh,

As often, you are right (with my sloppy usage of the term publisher)
and I think your analysis below is indeed close to what I was thinking
as well. Let's move over to ESW Wiki and write up stuff. A paste from
your email might be a good start! Mind minting a URI for it and starting
to fill in the Wiki page? I'm on travel and limited re my capabilities
currently ;)

Cheers, Michael

Sent from my iPhone

On 14 Feb 2009, at 16:00, "Hugh Glaser" <hg@ecs.soton.ac.uk> wrote:

> Hi Michael.
> I got thoroughly confused, I think, by your use of the "dataset
> publisher
> (the authoritative one who 'owns' it)".
> That made me think you were talking about the owner of the broken
> URI (ie,
> where it should have resolved to), rather than the place that gave
> you the
> URI. (Which was it? :-) )
>
> So the next bit is the first of those:
> ======================================
> I think in a lot of the LOD world, a 404 means "I don't know anything
> about that URI", rather than a broken link.
> Certainly for us, that is all we can do.
> In fact, what we are actually doing is manually generating the 404
> when we
> find there is nothing in the KB; we could instead return a blankish
> RDF
> document, but that didn't seem sensible.
> Now I think about it, I have checked what dbpedia does to
> http://dbpedia.org/resource/Esperanta - it does the blank doc thing.
> (I guess we need to work out what is best practice for this and then
> add it
> to the How to Publish? I think my view is that something like
> http://dbpedia.org/data/Esperanta.rdf should 404.)
> So either way, in LOD sites of the sort that have DBs or KBs behind
> them,
> either it is not possible to get a 404 (dbpedia), or you can't
> distinguish
> between a rubbish URI that might have been generated and one you
> want to
> know about.
> I find the idea that I might give people the expectation that I will
> create
> triples (as your point 2) rather strange - if I knew triples I would
> have
> served them in the first place. Of course if we consider a URI I
> don't know
> as a request for me to go and find knowledge about it, fair enough,
> but I
> would expect a more explicit service for that. In that sense it
> would not be
> a "broken link".
> Maybe the world is different for the other RDFa etc ways of
> publishing LD,
> but in the DB/KB world, I don't see broken incoming links as
> something that
> can be usefully dealt with, other than the maintainer checking what is
> happening, as you do with a normal site.
> ======================================
>
> Now turning to the second possible meaning.
> We are concerned with the place that gave you the URI, which is
> possibly
> more interesting. And I think this is actually the case for your TAG
> example.
> If I gave you (by which I mean an agent) such a link and you
> discovered it
> was broken, it would be helpful to me and the LOD world if you could
> tell me
> about it, so I could fix it. In fact it would also be helpful if you
> had a
> suggestion as to the fix (ie a better URI), which is not out of the
> question. And if I trust you (when we understand what that means), I
> might
> even do a replacement or some equivalent triples without further
> intervention.
>
> ======================================
> In the case of our RKB system, we actually do something like this
> already.
> If we find that there is nothing about a URI in the KB that should
> have it,
> we don't immediately return 404, but look it up in the associated CRS
> (coreference service), and possibly others, to see if there is an
> equivalent
> URI in the same KB that could be used (we do not return RDF from
> other KB,
> although we could). So if you try to resolve
> http://southampton.rkbexplorer.com/description/person-07113
> You actually get the data for
> http://southampton.rkbexplorer.com/id/person-0a36cf76d1a3e99f9267ce3d0b95e42e-06999d58799cb8a3a55d3c69efcc9ba6
> and a message telling you to use
> the new
> one next time.
> (I'm not sure we have got the RDF perfectly right, but that is the
> idea.)
> In effect, if we are asked for a broken link, we have a quick look
> around to
> see if there is anything we do know, and give that back.
> Of course, the CRS also gives the requestor the chance to do the
> same fixing
> up.
> The reason that there might be a URI in the KB that has no triples,
> but we
> know about, is because we "deprecate" URIs to reduce the number, and
> then
> use the CRS to resolve from deprecated to non-deprecated.
> So a deprecated URI is one we used to know about, and may still be
> being
> used "out there", but don't want to continue to use - sort of a
> broken link.
> Hence our dynamic broken link fixing.
>
> Best
> Hugh
>
> PS.
> My choice of http://dbpedia.org/data/Esperanta.rdf as a misspelling of
> http://dbpedia.org/data/Esperanto.rdf turned out to be fascinating.
> It turns out that wikipedia tells me that there used to be a page
> http://en.wikipedia.org/wiki/Esperanta, but it has been deleted.
> So what is returned is different from
> http://en.wikipedia.org/wiki/Esperanti.
> Although http://dbpedia.org/data/Esperanta.rdf and
> http://dbpedia.org/data/Esperanti.rdf both return empty RDF
> documents, I
> think.
> It looks to me that this is trying to solve a similar problem to
> that which
> our deprecated URIs is doing in our CRS.
>
>
> On 14/02/2009 14:06, "Hausenblas, Michael" <michael.hausenblas@deri.org
> >
> wrote:
>
>> Kingsley,
>>
>> Grounding in 404 and 30x makes sense to me. However I am still in the
>> conception phase ;)
>>
>> Sent from my iPhone
>>
>> On 12 Feb 2009, at 14:02, "Kingsley Idehen"
>> <kidehen@openlinksw.com> wrote:
>>
>>> Michael Hausenblas wrote:
>>>> Bernhard, All,
>>>>
>>>> So, another take on how to deal with broken links: couple of days
>>>> ago I
>>>> reported two broken links in a TAG finding [1] which was (quickly
>>>> and
>>>> pragmatically, bravo, TAG!) addressed [2], recently.
>>>>
>>>> Let's abstract this away and apply to data rather than documents.
>>>> The
>>>> mechanism could work as follows:
>>>>
>>>> 1. A *human* (e.g. through a built-in feature in a Web of Data
>>>> browser such
>>>> as Tabulator) encounters a broken link and reports it to the
>>>> respective
>>>> dataset publisher (the authoritative one who 'owns' it)
>>>>
>>>> OR
>>>>
>>>> 1. A machine encounters a broken link (should it then directly
>>>> ping the
>>>> dataset publisher or first 'ask' its master for permission?)
>>>>
>>>> 2. The dataset publisher acknowledges the broken link and creates
>>>> according
>>>> triples as done in the case for documents (cf. [2])
>>>>
>>>> In case anyone wants to pick that up, I'm happy to contribute.
>>>> The name?
>>>> Well, a straw-man proposal could be called *re*pairing *vi*ntage
>>>> link
>>>> *val*ues (REVIVAL) - anyone? :)
>>>>
>>>> Cheers,
>>>>      Michael
>>>>
>>>> [1] http://lists.w3.org/Archives/Public/www-tag/2009Jan/0118.html
>>>> <http://lists.w3.org/Archives/Public/www-tag/2009Jan/0118.html>
>>>> [2] http://lists.w3.org/Archives/Public/www-tag/2009Feb/0068.html
>>>> <http://lists.w3.org/Archives/Public/www-tag/2009Feb/0068.html>
>>>>
>>>>
>>> Michael,
>>>
>>> If the publisher is truly dog-fooding and they know what data
>>> objects
>>> they are publishing, condition 404 should be the trigger for a
>>> self-directed query to determine:
>>>
>>> 1. what's happened to the entity URI
>>> 2. lookup similar entities
>>> 3. then self fix if possible (e.g. a 302)
>>>
>>> Basically, Linked Data publishers should make 404s another Linked
>>> Data
>>> prowess exploitation point  :-)
>>>
>>>
>>> --
>>>
>>>
>>> Regards,
>>>
>>> Kingsley Idehen       Weblog: http://www.openlinksw.com/blog/
>>> ~kidehen
>>> <http://www.openlinksw.com/blog/~kidehen>
>>> President & CEO
>>> OpenLink Software     Web: http://www.openlinksw.com
>>> <http://www.openlinksw.com>
>>>
>>>
>>>
>>>
>>
>


------ End of Forwarded Message