13:36:11 RRSAgent has joined #dcmi13pm
13:36:11 logging to http://www.w3.org/2013/09/03-dcmi13pm-irc
13:36:20 rrsagent, set log public
13:36:21 kcoyle has joined #dcmi13pm
13:36:54 ivan has changed the topic to: Digital Vocabulary Preservation - DC2013 - Afternoon Session
13:37:02 Topic: Digital Vocabulary Preservation - DC2013 - Afternoon Session
13:38:13 scribe: ivan
13:38:30 Presentation: LOCKSS by David Rosenthal
13:38:40 (links to his blog follow)
13:39:50 tbaker has joined #dcmi13pm
13:40:01 Topic: David Rosenthal presents LOCKSS
13:40:36 Bane of digital preservation: the idea that there is a one-size-fits-all solution. There isn't.
13:41:17 scribenick: tbaker
13:41:24 scribe: Tom Baker
13:41:36 ...Limiting ambition to "preserving Web" allowed us to make progress.
13:41:59 ...2005: we found nobody had done a threat model (things that cause you to lose data).
13:42:36 ...Data center operators see different threats (operator error, external attack, insider attack, economic or organizational failure).
13:44:03 ...We set out to deal with these threats. Build on model of libraries (massively replicated, highly distributed...).
13:44:50 ...Approached as software engineer.
13:45:09 ...Failure of one library does not make the system fail.
13:45:55 ...Each library runs a "persistent web cache"
13:46:40 Pierre-Yves_V has joined #dcmi13pm
13:46:41 ...We get publishers to give us permission (or use CC license).
13:47:42 ...Use Memento headers for access to past
13:48:01 W927C has joined #dcmi13pm
13:48:04 ...Digital info is not tamper-proof
13:48:11 W927C has left #dcmi13pm
13:48:21 chrpr has joined #dcmi13pm
13:48:27 kcoyle has joined #dcmi13pm
13:48:31 aisaac has joined #dcmi13pm
13:49:01 ...Compare contents of box by polling other sites, using hash values.
13:49:29 ...If discrepancy is detected, examine and fix.
13:50:11 ...Everything published in academic journals, ever, is 40TB.
13:50:37 ...Tech is not expensive. Costs about the same as 1.5 hours of lawyer time.
13:50:50 copyright issues probably more difficult/expensive than technology
13:52:54 ...Easy to bring up a LOCKSS box - free - can be virtual machine - can run in cloud (but more expensive).
13:53:44 lockss network, about 20 libraries, "meta-archive"
13:53:53 ...MetaArchive network - run a LOCKSS network for preserving special collections - started in SE USA.
13:54:22 ...They get a lot of hurricanes, so wanted to do geographic dispersion.
13:55:09 ...Organizations run out of money. Operators screw up.
13:56:28 TBaker: In case of a box failure, what happens?
13:56:36 organized as peer-to-peer, no 'top box'
13:56:57 ...Here: asking what you need for preservation.
13:57:11 it's a dark archive
13:57:27 ...CLOCKSS archive is "dark" - content only gets out if a trigger event has happened ("not available from any publisher")
13:58:14 ...Idea of a "primary" source is a problem for digital preservation.
13:58:32 ...If I'm a bad guy, if there is a "primary" source, all I have to do is corrupt that.
13:59:25 ...We try to avoid this.
14:01:40 Bernard: Problem with XML schemas, need to access through URI.
14:02:28 David: Get libraries to use their persistent Web cache as a transparent proxy. If publisher not responding, deliver from cache? Not a great idea.
14:03:29 ...Other option: Memento. Content negotiation (my browser wants French). Herbert and Michael Nelson expanded Conneg into time dimension. What was this URI ten years ago?
14:04:23 ...Preserved copies can work at their normal URI - but does require a browser plugin.
14:06:16 ...Problem of preservation not at original URI - ways around it.
14:08:24 Felix: Not easy to set up crawler to handle PURLs - when I tried to serve vocabularies with LOCKSS.
14:08:52 David: We have configured LOCKSS crawler to make it hard to use without permissions.
14:09:15 ...Probably hard to identify the license conditions.
14:09:53 kcoyle has joined #dcmi13pm
14:09:55 ...If someone didn't have to listen so carefully to lawyers, could build a crawler less demanding about license issues.
14:10:08 ...I could not help you with that.
14:10:56 Karen: How active are these boxes?
14:12:15 David: Ingest is very active.
14:13:17 ...Content delivered only if original is not available. Publishers do not want us to steal hits.
14:14:06 aisaac has joined #dcmi13pm
14:15:17 Felix: If you want canonical URIs, want metadata to show which version. In triple stores, you have "context". DURIs - encoding time stamp with URIs.
14:15:33 Ivan: Adding micro syntax to URI, for many people, is a big no-no.
14:16:33 ...Some triple stores let you add time stamps.
14:16:43 ...What is the granularity?
14:19:20 Tom: With DCMI Metadata Terms, snapshot of a term description as of a particular date (if anything in that description changes).
14:19:40 Ivan: Had some discussion about mapping Memento onto CVS - takes work.
14:21:04 Bernard: Management versus preservation.
14:21:36 ...We have a legacy of already-published vocabularies which are used - good thing to say people should have versioning policy - but what do you do with legacy?
14:22:06 still not clear on the difference between maintenance and preservation
14:22:34 Ivan: How do we decide what is worth preserving?
14:23:05 What are the criteria for deciding that this vocabulary should be preserved? In libraries, which publications are worth preserving, and which not?
14:23:40 preservation = get content out of the custody of the original publisher
14:23:50 David: One essential aspect of preservation - get content out of original publisher - cannot trust (fundamentally).
14:24:10 ...Need to talk about system for publishing vocabularies that is disconnected from system for preserving them.
14:25:13 Dan: Sometimes costs more to _not_ preserve something.
14:25:24 Karen: We have advantage over archives - we can count use.
14:26:06 ...Big box of letters comes in. But with vocabularies, can track usage. Can become evidence of value.
14:26:35 ...Whether you move things into preservation could be based on usage.
14:27:25 Ivan: Extreme example. Only one CERN. I develop vocabulary for an experiment there.
14:27:46 Karen: Have to allow people to say they want something preserved.
14:28:02 Lars: We (libraries) collect things that are "published".
14:28:33 ...When is it important enough? Like: What is "art"?
14:29:01 ...Who decides?
14:30:10 "preserve it all"
14:30:12 David: "Preserve it all". Example: Brewster's Internet Archive.
14:31:02 Bernard: People are asking: "Where can I find good vocabularies"?
14:31:16 "preserve it all" works if you have a good, unambiguous retrieval ID (e.g. URI)
14:31:21 Is Community Group a good place to get a consensus on which vocabulary is useful for the community domain?
14:31:51 vocabularies are social
14:32:04 preserving the vocabulary = preserving the community
14:33:35 [?]: Vocabularies are social. We have had discussion internally about how mechanism works. Meaning, and words used to describe meaning, can change. Choosing not to preserve can affect data.
14:34:04 j busch - vocabularies and publications are very different
14:34:20 Joseph Busch: Vocabularies and publications are very different. Problem with vocabularies is to get them to be created and used.
14:34:22 vocabularies need to be used
14:35:12 need to be put into some commons to be usable
14:35:20 ...They are living and breathing. Difficult to get them published.
14:35:33 ...We have to be concerned that these are more dynamic than publications.
14:35:42 ...Quite a different information object.
14:36:58 Richard: "Preserving 'quality' vocabularies". Lars is right - if it is published, we preserve.
14:37:04 Lars++
14:37:17 ivan - what are you preserving it for?
14:37:25 Ivan: Depends what you preserve it for. Archive of human endeavor?
14:37:45 ...Goal of LOV: provide service to users of Linked Data to find vocabularies useful for their purposes.
14:37:56 ...Meaningful level of quality control should be exercised.
14:38:23 ...There are inconsistent vocabularies on the Web - we do not want to give them out.
14:38:56 Bernard: We have refused broken vocabularies.
14:39:32 Antoine: LOV is also not about preservation - more like access and preservation. Search and ranking of vocabularies.
14:41:56 Eva: Every time you put "quality" on table, problem. We are looking for standards. Who will legitimize? W3C can do this. But industry likes ISO standard. DC in 2001 - just the elements.
14:42:45 ...We need layers. One level: save everything.
14:43:21 Felix Ostrowski: Web is messy place. Blogs are dynamic - more like vocabulary than traditional publication.
14:43:46 ...Publishing something on Web is no longer good measure of importance. Rather: what gets referenced.
14:44:20 ...Thinking about LOCKSS - turning everything upside down. Run your own preservation node, then preserve what you reference.
14:44:32 ...Cached in your LOCKSS cache.
14:45:59 Pierre-Yves: [Small, isolated vocabularies].
14:46:24 Ivan: What can we conclude in last 15 minutes?
14:47:14 Karen: Leads us towards criteria: preserve vocabularies without URIs?
14:47:45 Lars: German National Library: we collect complete crap along with everything else. Other libraries, more specialized, select.
14:48:02 ...Maybe some of those libraries could collect, applying criteria.
14:48:40 David: Conflating two things. We need to lead people to useful vocabularies - want criteria. Whether that is useful to apply to preservation of this information.
14:49:17 ...Figuring that out will cost something, and it will change over time. Are you going to save enough by being selective as opposed to saving everything?
14:49:51 ...My take: unless the content is huge and expensive, the cost of selection is more than the cost of just preserving.
14:50:11 Karen: How do you define "everything"?
14:50:44 Antoine: There are still some technical criteria.
14:50:59 Lars: If you publish it in German, we will probably collect it.
14:51:54 Gildas: Reminds of discussion 7 years ago when my institution started archiving the Web. There is urgency. Recommend: take action without taking too much time discussing criteria.
14:52:45 ...a) Publish, b) Turn to sustainable publishers (national libraries) - long-term, free access, c) send them list of vocabularies. Continue discussion later.
14:53:14 we haven't defined what's a vocabulary - so how will we know what to preserve?
14:54:17 Alexander Haffner: Not just one way to provide vocabularies. Agree that selection is too expensive.
14:55:37 Pierre-Yves: Without labels...?
14:55:48 Dan: Quality issue is already a rathole.
14:56:35 ...If you publish a vocabulary for "anyone", we should at least talk to each other at least once per year. DNS paid?
14:57:57 Karen: we have had criteria for vocabulary... URI... label... - but without some
14:59:00 Gordon: LLD XG report - recommended that national libraries preserve element sets and value vocabularies.
14:59:15 if we can't define 'vocabulary' then how can we preserve them? we aren't preserving the entire web
14:59:45 Ivan: I understand both. "Preserve everything" - okay. But from practical point of view, this is not enough.
15:00:13 yes, there do need to be recommender services and services for people seeking vocabularies
15:00:53 need to know what is in scope for preservation - then quality selection is a separate activity
15:01:23 Tom: Agree with David that we should separate "quality" from "preservation". LOV and LOCKSS have two different functions.
15:02:48 Dan: 12 people in this room responsible for maintaining vocabularies. Telecon once per year to identify problems.
15:03:08 Felix: Practical: set up LOCKSS network - get LOV stuff in there, take it from there.
15:03:58 David: Very strong case for this to be done by national libraries. In US, LC has specific [] from DMCA to collect and preserve.
15:05:26 [adjourned]
15:05:48 rrsagent, please draft minutes
15:05:48 I have made the request to generate http://www.w3.org/2013/09/03-dcmi13pm-minutes.html tbaker
15:27:04 ivan has joined #dcmi13pm