RE: [open-bibliography] New BNB sample data available from Ford, Kevin on 2011-02-04 (public-lld@w3.org from February 2011)

From: Ford, Kevin <kefo@loc.gov>
Date: Fri, 4 Feb 2011 11:19:26 -0500
To: Antoine Isaac <aisaac@few.vu.nl>, "Deliot, Corine" <Corine.Deliot@bl.uk>
CC: List for Working Group on Open Bibliographic Data <open-bibliography@lists.okfn.org>, public-lld <public-lld@w3.org>
Message-ID: <1D525027B29706438707F336D75A279F167D92E8E4@LCXCLMB03.LCDS.LOC.GOV>

I think issue 3 comes very close - if, in fact, it isn't a fine example of the problem - to the "record" versus triples debate; the wholeness/completeness issue; the concise bounded description (CBD) discussion.

Corine decided to include information - rdf:type, skos:prefLabel, skos:inScheme - in addition to giving the URI of the resource at ID.  Furnishing the additional information can reduce network look-ups and makes it available for immediate indexing.  She also astutely pointed out that including the added info in the BL output provides some safeguard should the data no longer be accessible from the given HTTP URI.  On this last point, here's an example:

<dcterms:subject rdf:resource="http://id.loc.gov/authorities/sh85026362#concept" />

versus 

<dcterms:subject>
	<skos:Concept rdf:about="http://id.loc.gov/authorities/sh85026362#concept">
		<skos:prefLabel>Civil procedure (International law)</skos:prefLabel>
		<skos:inScheme rdf:resource="http://id.loc.gov/authorities#conceptScheme" />
	</skos:Concept>
</dcterms:subject>

"Civil procedure (International law)" was canceled in December 2010 and replaced by two new concepts.  So, if you follow the link, you no longer get the resource.  I'll be the first to acknowledge that this isn't ideal behavior from ID.  We're working on changing this, but we're not there yet.

Yes, the BL's approach introduces duplication and a synchronization issue.  Personally, however, I have no problem with this and, in fact, endorse it, with the caveat that the data be kept as up to date as possible (I do like the idea of a CBD).  And, while the decision to include this added information makes the BL's data independently understandable without further look-ups, it does not inhibit a BL data consumer from following the HTTP URI to ID if the user would like to.
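
To make the "safeguard" point concrete, here is a minimal Python sketch of how a consumer might use the embedded label as a fallback.  It is an illustration under assumptions, not anyone's actual implementation: subject_label and the fetch_label callable are hypothetical names, and the failed look-up is simulated rather than a real request to ID.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# The embedded form from the example above, with namespace declarations added
# so that the fragment parses on its own.
EMBEDDED = """<dcterms:subject xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://id.loc.gov/authorities/sh85026362#concept">
    <skos:prefLabel>Civil procedure (International law)</skos:prefLabel>
    <skos:inScheme rdf:resource="http://id.loc.gov/authorities#conceptScheme" />
  </skos:Concept>
</dcterms:subject>"""


def subject_label(subject, fetch_label):
    """Return (uri, label, source) for one dcterms:subject element.

    fetch_label is a hypothetical callable: it returns a label for a URI,
    or raises LookupError when the URI cannot be resolved.
    """
    # URI-only form: <dcterms:subject rdf:resource="..."/>
    uri = subject.get(f"{{{RDF}}}resource")
    embedded = None
    if uri is None:
        # Embedded form: the first child node carries rdf:about and the label.
        node = next(iter(subject), None)
        if node is not None:
            uri = node.get(f"{{{RDF}}}about")
            embedded = node.findtext(f"{{{SKOS}}}prefLabel")
    try:
        remote = fetch_label(uri)
    except LookupError:
        remote = None
    if remote is not None:
        return uri, remote, "remote"
    return uri, embedded, ("embedded" if embedded is not None else "none")


def unavailable(uri):
    """Simulate a URI that no longer resolves."""
    raise LookupError(uri)


subject = ET.fromstring(EMBEDDED)
print(subject_label(subject, unavailable))
```

With the URI-only form the same call would come back with no label at all once the look-up fails, which is the difference the example is meant to show.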

I suspect that until systems are sophisticated enough to follow the URIs in RDF data without significant human intervention, and gather the data about _that_ resource, this type of wholeness will be more beneficial than the alternative.  I feel it makes the data more immediately accessible.  I recognize that my position is one from a very practical standpoint.  Ideally, the whole system would be wonderfully interconnected with links and all the software designed to deal with this data would naturally and without prompting fetch the data from those links.

Now, as to defining the information to be included - should the altLabels also have been included?  They can be very valuable when indexed - that's part of the wholeness/completeness debate.  Perhaps it is sufficient to include the skos:prefLabel; further information can be had by following the HTTP URI.

Warmly,

Kevin

________________________________________
From: public-lld-request@w3.org [public-lld-request@w3.org] On Behalf Of Antoine Isaac [aisaac@few.vu.nl]
Sent: Friday, February 04, 2011 09:35
To: Deliot, Corine
Cc: List for Working Group on Open Bibliographic Data; public-lld
Subject: Re: [open-bibliography] New BNB sample data available

Hello Corine,

Re. 1 and 2, in fact your decision not to put the language tags is what saves you from the inconsistency Andrew has warned about. If you were using the same language tag as id.loc.gov, but a different literal (and adding one dot to a literal makes it an entirely different literal), then your data would be inconsistent with the id.loc.gov one.
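
A small Python sketch of that reasoning, for illustration only. It models a plain literal as a (text, language tag) pair and flags a resource that ends up with two different prefLabels under the same language tag (or two different untagged ones) once data from both sources is merged. The function names are hypothetical, and the LC label text is assumed to be the BL one minus the trailing dot.

```python
from collections import defaultdict


def literal(text, lang=None):
    """Model an RDF plain literal as a (lexical form, language tag) pair."""
    # Language tags compare case-insensitively, so normalise them.
    return (text, lang.lower() if lang else None)


def pref_label_clashes(statements):
    """Return {(resource, lang): [labels]} wherever labels disagree.

    statements is an iterable of (resource_uri, literal) pairs merged from
    several sources. Untagged literals form their own group (lang is None).
    """
    seen = defaultdict(set)
    for resource, (text, lang) in statements:
        seen[(resource, lang)].add(text)
    return {key: sorted(texts) for key, texts in seen.items() if len(texts) > 1}


uri = "http://id.loc.gov/authorities/sh2008107012#concept"
lc = (uri, literal("Literary landmarks--England--London", "en"))      # assumed LC form
bl_untagged = (uri, literal("Literary landmarks--England--London."))  # BL sample as published
bl_tagged = (uri, literal("Literary landmarks--England--London.", "en"))

print(pref_label_clashes([lc, bl_untagged]))  # different groups: no clash reported
print(pref_label_clashes([lc, bl_tagged]))    # same tag, different text: clash
```

The two literals differ in both cases; only the second merge puts them under the same language tag, which is where the inconsistency would arise.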

Now, on having a language tag or not, I see your issue, but personally I'm ok with originally Spanish labels being considered as English ones, if there's no English translation for them.
Anyway, the core issue to me here is that this language tag dilemma also applies for LoC, which made the opposite choice. Ideally if you publish data on LC concepts, it should be compatible with what LC has--"compatible" in the formal but also informal way: whether there is an inconsistency or not, a data consumer may still be extremely puzzled why LC and BL can't agree on their concepts' prefLabels!

Re. 3, getting data for indexing is a very valid concern. But it could also be done just before the indexing step, not in the data you publish. But well, you are perhaps in the best position to judge: as you have put it, this is about what you feel you should provide to your typical data consumers. Note, however, that including the labels re-introduces the risk of being out of sync with a central repository, which you correctly identified in your first move.

About the danger of a target source being put offline, that is also a valid point. But for id.loc.gov I wouldn't be so worried. In fact, BL starting to rely on it for its data would be a key motivation for LC not to put it offline :-)


Re. your last question, I guess I can only repeat what I've written above. My gut feeling would be to replicate as little as possible: ideally, the URI should be the only thing present in your data! But if you have clear ideas about the amount of effort your data consumers would be willing to make, you should adapt your data to make their lives easier.
Note that the data consumers who'd be interested in such caching might be the ones interested in accessing large dumps of data at once. So the "true linked data version" (what you get when following your nose over HTTP) could include only the URIs, but a fit-for-purpose dump of your entire catalogue may include a bit more.
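
As a rough sketch of producing both flavours from the same source data (an illustration only; subject_element is a hypothetical helper, not part of any existing conversion), something along these lines in Python would emit the URI-only form for the linked data version and the enriched form for a dump:

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
DCTERMS = "http://purl.org/dc/terms/"

# Use the conventional prefixes when serialising.
for prefix, ns in (("rdf", RDF), ("skos", SKOS), ("dcterms", DCTERMS)):
    ET.register_namespace(prefix, ns)


def subject_element(uri, pref_label=None, scheme=None):
    """Build a dcterms:subject element.

    Without pref_label: URI-only form (rdf:resource).
    With pref_label: embedded skos:Concept carrying the label and, if given,
    the concept scheme.
    """
    subject = ET.Element(f"{{{DCTERMS}}}subject")
    if pref_label is None:
        subject.set(f"{{{RDF}}}resource", uri)
        return subject
    concept = ET.SubElement(subject, f"{{{SKOS}}}Concept", {f"{{{RDF}}}about": uri})
    ET.SubElement(concept, f"{{{SKOS}}}prefLabel").text = pref_label
    if scheme is not None:
        ET.SubElement(concept, f"{{{SKOS}}}inScheme", {f"{{{RDF}}}resource": scheme})
    return subject


uri = "http://id.loc.gov/authorities/sh85026362#concept"
scheme = "http://id.loc.gov/authorities#conceptScheme"

lean = subject_element(uri)
dump = subject_element(uri, "Civil procedure (International law)", scheme)
print(ET.tostring(lean, encoding="unicode"))
print(ET.tostring(dump, encoding="unicode"))
```

Keeping the label optional means only one conversion needs maintaining, with the choice of flavour made at output time.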

Best,

Antoine



> Hi Antoine and all,
>
> Many thanks for the feedback and apologies for the length of this email.
>
> In answer to the questions about
> <dcterms:subject>
>>>          <rdf:Description
>>> rdf:about="http://id.loc.gov/authorities/sh2008107012#concept">
>>>            <skos:inScheme
>>> rdf:resource="http://id.loc.gov/authorities#conceptScheme" />
>>>            <skos:prefLabel>Literary landmarks--England--
>>> London.</skos:prefLabel>
>>>            <rdf:type
>>> rdf:resource="http://www.w3.org/2004/02/skos/core#Concept" />
>>>          </rdf:Description>
>>>        </dcterms:subject>
>
> And
>
> 1. Why does the literal value contained in <skos:prefLabel>Literary landmarks--England--London.</skos:prefLabel> not exactly match the one served by LC at id.loc.gov for http://id.loc.gov/authorities/sh2008107012#concept?
>
> The answer is that it should. We've matched the LCSH heading contained in the bibliographic record to the LCSH heading in the authority file. The issue is to do with punctuation (which is input at the end of the heading in the bib record but is not part of the heading in the authority file). We'll address this in the conversion - this is an issue in the LCSH headings and I believe in other parts of our output. [So no, we "are *not* essentially trying to say which of the SKOS preflabels the BL prefers" as one post tried to double-guess]
>
> 2. Why does our output not include xml:lang="en" in <skos:prefLabel>?
> This is because in some cases xml:lang="en", whilst true to the data served up by id.loc.gov, is actually not correct. For example, if you look at
> <http://id.loc.gov/authorities/sh94003128#concept>  for Parque Nacional Torotoro (Bolivia), we have
> <skos:prefLabel xml:lang="en">Parque Nacional Torotoro (Bolivia)</skos:prefLabel>
>
> instead of Spanish.
>
> I assume the reason for that is that there isn't the granularity in MARC 21 - where these headings originate from - to code the language of each data element. So when LC expresses LCSH in SKOS, they couldn't specify and went for the language of the majority of the headings, which is English.
>
> So we - ok, I ;-) thought we could do "without" the xml:lang attribute since it wasn't "correct" in all cases. I didn't realise the implications.
>
> 3. Why are we outputting both the literal value and the resource URI?
> In a very first attempt, we'd only included the resource URI as you suggest. There were concerns about the two being out of sync, e.g. when an LCSH heading is updated. In fact, this is one of the uses of those URIs - enabling easier updating of bibliographic data.
>
> But we got some advice to the contrary. Some linked data platforms index the literal values to improve searching; it was also pointed out that there may be a risk of the linked dataset we link to "disappearing".
>
> There are other considerations: we are putting our data out for people to use and re-use; and we are not too sure what they want to do with it yet - so as you suggest, some of them may not want or be able to go and fetch data from id.loc.gov or any other data sets we link to. A related question is to do with the time and resources to produce these files. At the moment, we are concentrating on the BNB but the intention is to work on other data sets. We are currently working on two versions of the file, a "non-URI" and a "with added-URI" version of the data and ideally, it would be good to have only one version - the "with added-URI" one - to maintain/produce if it meets the needs of all/most people.
>
> Now it's my turn for a question ;-)
>
> In your feedback, you highlight the risk "that your data is less complete than the one of other services"[1], e.g. if you don't have the skos:broader that id.loc.gov has for LCSH concepts.
>
> So to take the example of LCSH at id.loc.gov, how much of the data included there should I replicate in my instance data? Aren't the <skos:prefLabel> and the resource URI sufficient? If you need other info, like <skos:altLabel> or <skos:broader>, won't you be able to fetch it via the resource URI?
>
> That's it for now ;-)
>
> I would also like to say that from later today I shall be offline for the next two weeks. I mention this so that people don't think we are unwilling to engage if there is no post from us. I really appreciate the feedback.
>
> Cheers
>
> Corine
Received on Friday, 4 February 2011 16:21:36 UTC