13:36:11 <RRSAgent> RRSAgent has joined #dcmi13pm
13:36:11 <RRSAgent> logging to http://www.w3.org/2013/09/03-dcmi13pm-irc
13:36:20 <ivan> rrsagent, set log public
13:36:21 <kcoyle> kcoyle has joined #dcmi13pm
13:36:54 <ivan> ivan has changed the topic to: Digital Vocabulary Preservation - DC2013 - Afternoon Session
13:37:02 <ivan> Topic: Digital Vocabulary Preservation - DC2013 - Afternoon Session
13:38:13 <ivan> scribe: ivan
13:38:30 <ivan> Presentation: LOCKSS by David Rosenthal
13:38:40 <ivan> (link to his blog follows)
13:39:50 <tbaker> tbaker has joined #dcmi13pm
13:40:01 <tbaker> Topic: David Rosenthal presents LOCKSS
13:40:36 <tbaker> Bane of digital preservation: the idea that there is a one-size-fits-all solution.  There isn't.
13:41:17 <ivan> scribenick: tbaker
13:41:24 <ivan> scribe: Tom Baker
13:41:36 <tbaker> ...Limiting ambition to "preserving Web" allowed us to make progress.
13:41:59 <tbaker> ...2005: we found nobody had done a threat model (things that cause you to lose data).
13:42:36 <tbaker> ...Data center operators see different threats (operator error, external attack, insider attack, economic or organizational failure).
13:44:03 <tbaker> ...We set out to deal with these threats.  Build on model of libraries (massively replicated, highly distributed...).
13:44:50 <tbaker> ...Approached as software engineer.
13:45:09 <tbaker> ...Failure of one library does not make the system fail.
13:45:55 <tbaker> ...Each library runs a "persistent web cache"
13:46:40 <Pierre-Yves_V> Pierre-Yves_V has joined #dcmi13pm
13:46:41 <tbaker> ...We get publishers to give us permission (or use CC license).
13:47:42 <tbaker> ...Use Memento headers for access to past
13:48:01 <W927C> W927C has joined #dcmi13pm
13:48:04 <tbaker> ...Digital info is not tamper-proof
13:48:11 <W927C> W927C has left #dcmi13pm
13:48:21 <chrpr> chrpr has joined #dcmi13pm
13:48:27 <kcoyle> kcoyle has joined #dcmi13pm
13:48:31 <aisaac> aisaac has joined #dcmi13pm
13:49:01 <tbaker> ...Compare contents of box by polling other sites, using hash values.
13:49:29 <tbaker> ...If discrepancy is detected, examine and fix.
13:50:11 <tbaker> ...Everything published in academic journals, ever, is 40TB.
13:50:37 <tbaker> ...Tech is not expensive.  Costs about the same as 1.5 hours of lawyer time.
13:50:50 <kcoyle> copyright issues probably more difficult/expensive than technology
13:52:54 <tbaker> ...Easy to bring up a LOCKSS box - free - can be virtual machine - can run in cloud (but more expensive).
13:53:44 <kcoyle> lockss network, about 20 libraries, "meta-archive"
13:53:53 <tbaker> ...MetaArchive network - run a LOCKSS network for preserving special collections - started in SE USA.
13:54:22 <tbaker> ...They get a lot of hurricanes, so wanted to do geographic dispersion.
13:55:09 <tbaker> ...Organizations run out of money. Operators screw up.
13:56:28 <Pierre-Yves_V> tbaker: In case of a box failure, what happens?
13:56:36 <kcoyle> organized as peer-to-peer, no 'top box'
13:56:57 <tbaker> ...Here: asking what you need for preservation.
13:57:11 <kcoyle> it's a dark archive
13:57:27 <tbaker> ...CLOCKSS archive is "dark" - content only gets out if a trigger event has happened ("not available from any publisher")
13:58:14 <tbaker> ...Idea of a "primary" source is a problem for digital preservation.
13:58:32 <tbaker> ...If I'm a bad guy, if there is a "primary" source, all I have to do is corrupt that.
13:59:25 <tbaker> ...We try to avoid this.
14:01:40 <tbaker> Bernard: Problem with XML schemas, need to access through URI.
14:02:28 <tbaker> David: Get libraries to use their persistent Web cache as a transparent proxy.  If publisher not responding, deliver from cache?  Not a great idea.
14:03:29 <tbaker> ...Other option: Memento. Content negotiation (my browser wants French). Herbert and Michael Nelson expanded Conneg into time dimension.  What was this URI ten years ago?
14:04:23 <tbaker> ...Preserved copies can work at their normal URI - but does require a browser plugin.
14:06:16 <tbaker> ...Problem of preservation not at original URI - ways around it.
14:08:24 <tbaker> Felix: Not easy to set up crawler to handle PURLs - when I tried to serve vocabularies with LOCKSS.
14:08:52 <tbaker> David: We have configured LOCKSS crawler to make it hard to use without permissions.
14:09:15 <tbaker> ...Probably hard to identify the license conditions.
14:09:53 <kcoyle> kcoyle has joined #dcmi13pm
14:09:55 <tbaker> ...If someone didn't have to listen so carefully to lawyers, could build a crawler less demanding about license issues.
14:10:08 <tbaker> ...I could not help you with that.
14:10:56 <tbaker> Karen: How active are these boxes?
14:12:15 <tbaker> David: Ingest is very active.
14:13:17 <tbaker> ...Content delivered only if original is not available.  Publishers do not want us to steal hits.
14:14:06 <aisaac> aisaac has joined #dcmi13pm
14:15:17 <tbaker> Felix: If you want canonical URIs, want metadata to show which version.  In triple stores, you have "context".  DURIs - encoding time stamp with URIs.
14:15:33 <tbaker> Ivan: Adding micro syntax to URI, for many people, is a big no-no.
14:16:33 <tbaker> ...Some triple stores let you add time stamps.
14:16:43 <tbaker> ...What is the granularity?
14:19:20 <tbaker> Tom: With DCMI Metadata Terms, snapshot of a term description as of a particular date (if anything in that description changes).
14:19:40 <tbaker> Ivan: Had some discussion about mapping Memento onto CVS - takes work.
14:21:04 <tbaker> Bernard: Management versus preservation.
14:21:36 <tbaker> ...We have a legacy of already-published vocabularies which are used - good thing to say people should have versioning policy - but what do you do with legacy?
14:22:06 <kcoyle> still not clear on the difference between maintenance and preservation
14:22:34 <tbaker> Ivan: How do we decide what is worth preserving?
14:23:05 <tbaker> ...What are the criteria for deciding that this vocabulary should be preserved?  In libraries, which publications are worth preserving, and which not?
14:23:40 <kcoyle> preservation = get content out of the custody of the original publisher
14:23:50 <tbaker> David: One essential aspect of preservation - get content out of original publisher - cannot trust (fundamentally).
14:24:10 <tbaker> ...Need to talk about system for publishing vocabularies that is disconnected from system for preserving them.
14:25:13 <tbaker> Dan: Sometimes costs more to _not_ preserve something.
14:25:24 <tbaker> Karen: We have advantage over archives - we can count use.
14:26:06 <tbaker> ...Big box of letters comes in.  But with vocabularies, can track usage.  Can become evidence of value.
14:26:35 <tbaker> ...Whether you move things into preservation could be based on usage.
14:27:25 <tbaker> Ivan: Extreme example.  Only one CERN.  I develop vocabulary for an experiment there.
14:27:46 <tbaker> Karen: Have to allow people to say they want something preserved.
14:28:02 <tbaker> Lars: We (libraries) collect things that are "published".
14:28:33 <tbaker> ...When is it important enough?  Like: What is "art"?
14:29:01 <tbaker> ...Who decides?
14:30:10 <kcoyle> "preserve it all"
14:30:12 <tbaker> David: "Preserve it all".  Example: Brewster's Internet Archive.
14:31:02 <tbaker> Bernard: People are asking: "Where can I find good vocabularies"?
14:31:16 <kcoyle> "preserve it all" works if you have a good, unambiguous retrieval ID (e.g. URI)
14:31:21 <Pierre-Yves_V> Is Community Group a good place to get a consensus on which vocabulary is useful for the community domain?
14:31:51 <kcoyle> vocabularies are social
14:32:04 <kcoyle> preserving the vocabulary = preserving the community
14:33:35 <tbaker> [?]: Vocabularies are social. We have had discussion internally about how mechanism works.  Meaning, and words used to describe meaning, can change.  Choosing not to preserve can affect data.
14:34:04 <kcoyle> j busch - vocabularies and publications are very different
14:34:20 <tbaker> Joseph Busch: Vocabularies and publications are very different.  Problem with vocabularies is to get them to be created and used.
14:34:22 <kcoyle> vocabularies need to be used
14:35:12 <kcoyle> need to be put into some commons to be usable
14:35:20 <tbaker> ...They are living and breathing. Difficult to get them published.
14:35:33 <tbaker> ...We have to be concerned that these are more dynamic than publications.
14:35:42 <tbaker> ...Quite a different information object.
14:36:58 <tbaker> Richard: "Preserving 'quality' vocabularies".  Lars is right - if it is published, we preserve.
14:37:04 <tbaker> Lars++
14:37:17 <kcoyle> ivan - what are you preserving it for?
14:37:25 <tbaker> Ivan: Depends what you preserve it for.  Archive of human endeavor?
14:37:45 <tbaker> ...Goal of LOV: provide service to users of Linked Data to find vocabularies useful for their purposes.
14:37:56 <tbaker> ...Meaningful level of quality control should be exercised.
14:38:23 <tbaker> ...There are inconsistent vocabularies on the Web - we do not want to give them out.
14:38:56 <tbaker> Bernard: We have refused broken vocabularies.
14:39:32 <tbaker> Antoine: LOV is also not about preservation - more like access and preservation.  Search and ranking of vocabularies.
14:41:56 <tbaker> Eva: Every time you put "quality" on table, problem.  We are looking for standards.  Who will legitimize?  W3C can do this.  But industry likes ISO standard.  DC in 2001 - just the elements.
14:42:45 <tbaker> ...We need layers.  One level: save everything.
14:43:21 <tbaker> Felix Ostrowski: Web is messy place.  Blogs are dynamic - more like vocabulary than traditional publication.
14:43:46 <tbaker> ...Publishing something on Web is no longer good measure of importance.  Rather: what gets referenced.
14:44:20 <tbaker> ...Thinking about LOCKSS - turning everything upside down.   Run your own preservation node, then preserve what you reference.
14:44:32 <tbaker> ...Cached in your LOCKSS cache.
14:45:59 <tbaker> Pierre-Yves: [Small, isolated vocabularies].
14:46:24 <tbaker> Ivan: What can we conclude in last 15 minutes?
14:47:14 <tbaker> Karen: Leads us towards criteria: preserve vocabularies without URIs?
14:47:45 <tbaker> Lars: German National Library: we collect complete crap along with everything else.  Other libraries, more specialized, select.
14:48:02 <tbaker> ...Maybe some of those libraries could collect, applying criteria.
14:48:40 <tbaker> David: Conflating two things.  We need to lead people to useful vocabularies - want criteria.  Separate question: whether that is useful to apply to preservation of this information.
14:49:17 <tbaker> ...Figuring that out will cost something, and it will change over time.  Are you going to save enough by being selective as opposed to saving everything?
14:49:51 <tbaker> ...My take: unless the content is huge and expensive, the cost of selection is more than the cost of just preserving.
14:50:11 <tbaker> Karen: How do you define "everything"?
14:50:44 <tbaker> Antoine: There are still some technical criteria.
14:50:59 <tbaker> Lars: If you publish it in German, we will probably collect it.
14:51:54 <tbaker> Gildas: Reminds me of discussion 7 years ago when my institution started archiving the Web.  There is urgency.  Recommend: take action without taking too much time discussing criteria.
14:52:45 <tbaker> ...a) Publish, b) Turn to sustainable publishers (national libraries) - long-term, free access, c) send them list of vocabularies. Continue discussion later.
14:53:14 <kcoyle> we haven't defined what's a vocabulary - so how will we know what to preserve?
14:54:17 <tbaker> Alexander Haffner: Not just one way to provide vocabularies.  Agree that selection is too expensive.
14:55:37 <tbaker> Pierre-Yves: Without labels...?
14:55:48 <tbaker> Dan: Quality issue is already a rathole.
14:56:35 <tbaker> ...If you publish a vocabulary for "anyone", we should talk to each other at least once per year.  DNS paid?
14:57:57 <tbaker> Karen: we have had criteria for vocabulary... URI... label... - but without some
14:59:00 <tbaker> Gordon: LLD XG report - recommended that national libraries preserve element sets and value vocabularies.
14:59:15 <kcoyle> if we can't define 'vocabulary' then how can we preserve them? we aren't preserving the entire web
14:59:45 <tbaker> Ivan: I understand both.  "Preserve everything" - okay. But from practical point of view, this is not enough.
15:00:13 <kcoyle> yes, there do need to be recommender services and services for people seeking vocabularies
15:00:53 <kcoyle> need to know what is in scope for preservation - then quality selection is a separate activity
15:01:23 <tbaker> Tom: Agree with David that we should separate "quality" from "preservation". LOV and LOCKSS have two different functions.
15:02:48 <tbaker> Dan: 12 people in this room responsible for maintaining vocabularies.  Telecon once per year to identify problems.
15:03:08 <tbaker> Felix: Practical: set up LOCKSS network - get LOV stuff in there, take it from there.
15:03:58 <tbaker> David: Very strong case for this to be done by national libraries.  In US, LC has specific [] from DMCA to collect and preserve.
15:05:26 <tbaker> [adjourned]
15:05:48 <tbaker> rrsagent, please draft minutes
15:05:48 <RRSAgent> I have made the request to generate http://www.w3.org/2013/09/03-dcmi13pm-minutes.html tbaker
15:27:04 <ivan> ivan has joined #dcmi13pm