Vocabulary Preservation Session, DC2013, Afternoon Session

03 September 2013

See also: IRC log

See also: Session description and agenda, with links to presentations

See also: Session Discussion paper


Bob Bailey (Thomson Reuters)
Tom Baker (DCMI)
Dan Brickley (Google)
Joseph Busch (Taxonomy Strategies)
Eric Childress (OCLC)
Karen Coyle (Consultant)
Michael Düro (Office des publications)
Daniel Garijo (UPM)
Alexander Haffner (Deutsche National Bibliothek)
Ivan Herman (W3C)
Gildas Illien (Bibliotheque Nationale de France)
Antoine Isaac (Vrije Universiteit Amsterdam)
Eva Mendez (University Carlos III of Madrid)
Felix Ostrowski (GraphThinking)
Andrea Perego (European Commission)
David S.H. Rosenthal (LOCKSS)
Stefanie Ruehle (SUB Goettingen)
Daniel Vila Suero (Ontology Engineering Group (UPM))
Lars Svensson (Deutsche National Bibliothek)
Malar Thomas (National Library Board Singapore)
Pierre-Yves Vandenbussche (Fujitsu)
Bernard Vatant (Mondeca)
Richard Wallis (OCLC)
Chair: Bernard Vatant (Mondeca), Ivan Herman (W3C)
Scribe: Tom Baker (DCMI)

Digital Vocabulary Preservation - DC2013 - Afternoon Session

David Rosenthal presents LOCKSS

David Rosenthal: Bane of digital preservation: the idea that there is a one-size-fits-all solution. There isn't.

David Rosenthal: Limiting ambition to "preserving Web" allowed us to make progress.

... In 2005, we found nobody had done a threat model (things that cause you to lose data).
... Data center operators see different threats (operator error, external attack, insider attack, economic or organizational failure).
... We set out to deal with these threats. Build on model of libraries (massively replicated, highly distributed...).
... Approached as software engineer.
... Failure of one library does not make the system fail.
... Each library runs a "persistent web cache"
... We get publishers to give us permission (or use CC license).
... Use Memento headers for access to past
... Digital info is not tamper-proof
... Compare contents of box by polling other sites, using hash values.
... If discrepancy is detected, examine and fix.
... Everything published in academic journals, ever, is 40TB.
... Tech is not expensive. Costs about the same as 1.5 hours of lawyer time.
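
[Editorial illustration, not part of the presentation.] The hash-based comparison described above can be sketched roughly as follows. This is a simplified majority vote over whole-file digests, not the actual LOCKSS polling protocol, which is more involved; the names `digest`, `audit`, and `repair` and the peer labels are hypothetical.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """SHA-256 hex digest of one preserved copy."""
    return hashlib.sha256(content).hexdigest()


def audit(copies: dict[str, bytes]) -> tuple[str, list[str]]:
    """Return the majority digest and the peers whose copy disagrees with it."""
    digests = {peer: digest(data) for peer, data in copies.items()}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(peer for peer, d in digests.items() if d != majority)
    return majority, damaged


def repair(copies: dict[str, bytes], damaged: list[str], majority: str) -> None:
    """Overwrite each damaged copy with a copy that matches the majority digest."""
    good = next(data for data in copies.values() if digest(data) == majority)
    for peer in damaged:
        copies[peer] = good


# Hypothetical example: three boxes hold the same article, one copy is corrupted.
copies = {"box-a": b"article v1", "box-b": b"article v1", "box-c": b"artic1e v1"}
majority, damaged = audit(copies)
repair(copies, damaged, majority)
```

Because no box is treated as the authoritative copy, corrupting any single box is not enough to change what the network agrees on.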

Karen Coyle (IRC scribe assist): copyright issues probably more difficult/expensive than technology
David Rosenthal: Easy to bring up a LOCKSS box - free - can be a virtual machine - can run in cloud (but more expensive).

Karen Coyle (IRC scribe assist): lockss network, about 20 libraries, "meta-archive"
David Rosenthal: MetaArchive network - runs a LOCKSS network for preserving special collections - started in SE USA.
... They get a lot of hurricanes, so wanted to do geographic dispersion.
... Organizations run out of money. Operators screw up.

Pierre-Yves Vandenbussche (IRC scribe assist): Tom Baker: In case a box fails, what happens?

Karen Coyle (IRC scribe assist): organized as peer-to-peer, no 'top box'
David Rosenthal: Here: asking what you need for preservation.

Karen Coyle (IRC scribe assist): it's a dark archive
David Rosenthal: CLOCKSS archive is "dark" - content only gets out if a trigger event has happened ("not available from any publisher").
... Idea of a "primary" source is a problem for digital preservation.
... If I'm a bad guy, if there is a "primary" source, all I have to do is corrupt that.
... We try to avoid this.

Bernard Vatant: Problem with XML schemas, need to access through URI.

David Rosenthal: Get libraries to use their persistent Web cache as a transparent proxy. If publisher not responding, deliver from cache? Not a great idea.
... Other option: Memento. Content negotiation (my browser wants French). Herbert and Michael Nelson expanded Conneg into time dimension. What was this URI ten years ago?
... Preserved copies can work at their normal URI - but does require a browser plugin.
... Problem of preservation not at original URI - ways around it.
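
[Editorial illustration, not part of the discussion.] The "what was this URI ten years ago?" request can be sketched as ordinary HTTP content negotiation with one extra header, `Accept-Datetime`, as defined by the Memento framework (RFC 7089). The TimeGate address below is a hypothetical placeholder, and the request is only constructed, not sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def accept_datetime(when: datetime) -> str:
    """Format a timezone-aware datetime as an HTTP date in GMT, as Accept-Datetime expects."""
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def memento_request(timegate_url: str, when: datetime) -> Request:
    """Build (but do not send) a request asking a TimeGate for the version near `when`."""
    return Request(timegate_url, headers={"Accept-Datetime": accept_datetime(when)})


# Assumption: timegate.example.org is a made-up TimeGate; a real one would be
# run by an archive and would redirect to the closest preserved copy.
ten_years_ago = datetime(2003, 9, 3, 12, 0, tzinfo=timezone.utc)
req = memento_request(
    "http://timegate.example.org/http://purl.org/dc/terms/", ten_years_ago
)
```

The original URI stays the identifier; only the requested date changes, which is why no time stamp has to be encoded into the URI itself.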

Felix Ostrowski: Not easy to set up crawler to handle PURLs - when I tried to serve vocabularies with LOCKSS.

David Rosenthal: We have configured LOCKSS crawler to make it hard to use without permissions.
... Probably hard to identify the license conditions.
... If someone didn't have to listen so carefully to lawyers, could build a crawler less demanding about license issues.
... I could not help you with that.

Karen Coyle: How active are these boxes?

David Rosenthal: Ingest is very active.
... Content delivered only if original is not available. Publishers do not want us to steal hits.

Felix Ostrowski: If you want canonical URIs, want metadata to show which version. In triple stores, you have "context". DURIs - encoding time stamp with URIs.

Ivan Herman: Adding micro syntax to URI, for many people, is a big no-no.
... Some triple stores let you add time stamps.
... What is the granularity?

Tom Baker: With DCMI Metadata Terms, snapshot of a term description as of a particular date (if anything in that description changes).

Ivan Herman: Had some discussion about mapping Memento onto CVS - takes work.

Bernard Vatant: Management versus preservation.
... We have a legacy of already-published vocabularies which are used - good thing to say people should have versioning policy - but what do you do with legacy?

Karen Coyle (IRC scribe assist): still not clear on the difference between maintenance and preservation

Ivan Herman: How do we decide what is worth preserving?

What are the criteria for deciding that this vocabulary should be preserved? In libraries, which publications are worth preserving, and which not?

Karen Coyle (IRC scribe assist): preservation = get content out of the custody of the original publisher

David Rosenthal: One essential aspect of preservation - get content out of the custody of the original publisher - cannot trust them (fundamentally).
... Need to talk about system for publishing vocabularies that is disconnected from system for preserving them.

Dan Brickley: Sometimes costs more to _not_ preserve something.

Karen Coyle: We have advantage over archives - we can count use.
... Big box of letters comes in. But with vocabularies, can track usage. Can become evidence of value.
... Whether you move things into preservation could be based on usage.

Ivan Herman: Extreme example. Only one CERN. I develop vocabulary for an experiment there.

Karen Coyle: Have to allow people to say they want something preserved.

Lars: We (libraries) collect things that are "published".
... When is it important enough? Like: What is "art"?
... Who decides?

Karen Coyle (IRC scribe assist): "preserve it all"

David Rosenthal: "Preserve it all". Example: Brewster Kahle's Internet Archive.

Bernard Vatant: People are asking: "Where can I find good vocabularies"?

Karen Coyle (IRC scribe assist): "preserve it all" works if you have a good, unambiguous retrieval ID (e.g. URI)

Pierre-Yves Vandenbussche: Is a Community Group a good place to get consensus on which vocabularies are useful for a community's domain?

Karen Coyle (IRC scribe assist): vocabularies are social

Karen Coyle (IRC scribe assist): preserving the vocabulary = preserving the community

Andrea Perego: Vocabularies are social. We have had discussion internally about how mechanism works. Meaning, and words used to describe meaning, can change. Choosing not to preserve can affect data.

Karen Coyle (IRC scribe assist): Joseph Busch: vocabularies and publications are very different

Joseph Busch: Vocabularies and publications are very different. Problem with vocabularies is to get them to be created and used.

Karen Coyle (IRC scribe assist): Vocabularies need to be used. Need to be put into some commons to be usable

Lars: They are living and breathing. Difficult to get them published.
... We have to be concerned that these are more dynamic than publications.
... Quite a different information object.

Richard: "Preserving 'quality' vocabularies". Lars is right - if it is published, we preserve.

[Lars agrees.]

Karen Coyle (IRC scribe assist): Ivan: what are you preserving it for?

Ivan Herman: Depends what you preserve it for. Archive of human endeavor?
... Goal of LOV: provide service to users of Linked Data to find vocabularies useful for their purposes.
... Meaningful level of quality control should be exercised.
... There are inconsistent vocabularies on the Web - we do not want to give them out.

Bernard Vatant: We have refused broken vocabularies.

Antoine: LOV is also not about preservation - more like access and preservation. Search and ranking of vocabularies.

Eva Mendez: Every time you put "quality" on table, problem. We are looking for standards. Who will legitimize? W3C can do this. But industry likes ISO standard. DC in 2001 - just the elements.
... We need layers. One level: save everything.

Felix Ostrowski: Web is messy place. Blogs are dynamic - more like vocabulary than traditional publication.
... Publishing something on Web is no longer a good measure of importance. Rather: what gets referenced.
... Thinking about LOCKSS - turning everything upside down. Run your own preservation node, then preserve what you reference.
... Cached in your LOCKSS cache.

Pierre-Yves Vandenbussche: [Small, isolated vocabularies].

Ivan Herman: What can we conclude in last 15 minutes?

Karen Coyle: Leads us towards criteria: preserve vocabularies without URIs?

Lars: German National Library: we collect complete crap along with everything else. Other libraries, more specialized, select.
... Maybe some of those libraries could collect, applying criteria.

David Rosenthal: Conflating two things. We need to lead people to useful vocabularies - want criteria for that. Whether those criteria are useful to apply to preservation of this information is a separate question.
... Figuring that out will cost something, and it will change over time. Are you going to save enough by being selective as opposed to saving everything?
... My take: unless the content is huge and expensive, the cost of selection is more than the cost of just preserving everything.
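
[Editorial illustration, not part of the discussion.] The trade-off between selecting and preserving everything can be written as a simple break-even check. All figures and the function name `selection_pays_off` are hypothetical; the point is only the shape of the comparison.

```python
def selection_pays_off(
    n_items: int,
    review_cost_per_item: float,
    preserve_cost_per_item: float,
    fraction_kept: float,
) -> bool:
    """True if reviewing every item and preserving only a fraction is cheaper
    than preserving all items without review."""
    cost_preserve_all = n_items * preserve_cost_per_item
    cost_selective = (
        n_items * review_cost_per_item
        + n_items * fraction_kept * preserve_cost_per_item
    )
    return cost_selective < cost_preserve_all


# Hypothetical numbers: small, cheap-to-store items versus large, expensive ones.
cheap_items = selection_pays_off(1000, 5.0, 1.0, 0.2)
expensive_items = selection_pays_off(1000, 5.0, 100.0, 0.2)
```

Under these assumed figures, selection only wins when storing an item costs far more than reviewing it, and the review cost recurs whenever the criteria change.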

Karen Coyle: How do you define "everything"?

Antoine: There are still some technical criteria.

Lars: If you publish it in German, we will probably collect it.

Gildas Illien: Reminds me of a discussion 7 years ago when my institution started archiving the Web. There is urgency. Recommend: take action without taking too much time discussing criteria.
... a) Publish, b) Turn to sustainable publishers (national libraries) - long-term, free access, c) send them list of vocabularies. Continue discussion later.

Karen Coyle (IRC): We haven't defined what is a vocabulary - so how will we know what to preserve?

Alexander Haffner: Not just one way to provide vocabularies. Agree that selection is too expensive.

Pierre-Yves Vandenbussche: Without labels [is it really usable]...?

Dan Brickley: Quality issue is already a rathole.
... If you publish a vocabulary for "anyone", we should talk to each other at least once per year. Check: has DNS been paid?

Karen Coyle: we have had criteria for vocabulary... URI... label... - but without some

Gordon Dunsire: LLD XG report - recommended that national libraries preserve element sets and value vocabularies.

Karen Coyle (IRC): if we can't define 'vocabulary' then how can we preserve them? we aren't preserving the entire web

Ivan Herman: I understand both. "Preserve everything" - okay. But from practical point of view, this is not enough.

Karen Coyle (IRC): yes, there do need to be recommender services and services for people seeking vocabularies

Karen Coyle (IRC): need to know what is in scope for preservation - then quality selection is a separate activity

Tom Baker: Agree with David that we should separate "quality" from "preservation". LOV and LOCKSS have two different functions.

Dan Brickley: 12 people in this room responsible for maintaining vocabularies. Telecon once per year to identify problems.

Felix Ostrowski: Practical: set up LOCKSS network - get LOV stuff in there, take it from there.

David Rosenthal: Very strong case for this to be done by national libraries. In US, LC has specific [?permission] from DMCA to collect and preserve.


[End of minutes]
