16:50:08 RRSAgent has joined #dpub-arch
16:50:08 logging to http://www.w3.org/2016/03/24-dpub-arch-irc
16:50:31 Zakim has joined #dpub-arch
16:52:48 Meeting: DPub Archival Task Force
16:53:47 Agenda: https://lists.w3.org/Archives/Public/public-digipub-ig/2016Mar/0086.html
16:54:05 Topic: https://lists.w3.org/Archives/Public/public-digipub-ig/2016Mar/0086.html
16:54:31 agenda: https://lists.w3.org/Archives/Public/public-digipub-ig/2016Mar/0086.html
16:55:12 Agenda: http://lists.w3.org/Archives/Public/public-digipub-ig/2016Mar/0086.html
16:56:04 rrsagent, make log public
16:57:00 rrsagent, Agenda: https://lists.w3.org/Archives/Public/public-digipub-ig/2016Mar/0086.html
16:57:00 I'm logging. I don't understand 'Agenda: https://lists.w3.org/Archives/Public/public-digipub-ig/2016Mar/0086.html', TimCole. Try /msg RRSAgent help
17:00:03 lrosenth has joined #dpub-arch
17:00:27 I’ll be on WebEx in a sec - wrapping up another meeting
17:02:31 I thought pwd was dpubarch?
17:02:39 Bill_Kasdorf has joined #dpub-arch
17:02:57 ah, just darch
17:03:03 Tim, Nicholas, and I are on the call and Webex but not hearing anybody else
17:03:12 Present+ Bill_Kasdorf
17:03:24 present+ Leonard
17:05:28 Present+ Tim_Cole
17:05:39 Present+ Ayla_Stein
17:06:21 I meant Craig, Nicholas and now Leonard and I are on the call
17:08:07 scribenick: TimCole
17:08:35 Topic: Minutes Approval
17:08:51 TimCole: Any concerns about the minutes?
17:09:04 ... hearing none, they are approved.
17:09:15 Topic: CLOCKSS AND LOCKSS
17:10:23 Bill_Kasdorf: Do we need to give Craig and Nicholas an introduction?
17:10:53 Craig: we've read the documents, but would appreciate a brief intro
17:11:27 Bill_Kasdorf: First, DPub is a W3C Interest Group
17:12:02 ... IGs do not publish Recommendations but they inform the W3C about issues and coordinate with Working Groups that make Recommendations
17:12:44 ...
In the context of that work an initiative was undertaken to draft a vision of a Web-based format that is independent of whether a document is online or offline
17:13:01 ... This led to the Portable Web Publication document
17:13:24 ... when online all the components of a PWP document are available online
17:13:38 ... but whether online or offline to the user it's the same document.
17:13:57 ... in the course of this discussion the issue of forming archival came up
17:14:25 ... we want to make sure that PWP is useful and can be archived.
17:15:02 Craig: I am the exec dir of CLOCKSS after 37 years in publishing.
17:15:16 Nicholas: Web archiving service manager at Stanford
17:15:42 ... mostly work on Web archiving generally, but have been working a lot on LOCKSS
17:16:01 Craig: CLOCKSS is a free-standing org made up of publishers and libraries
17:16:17 ... pubs are billed both annually and by the article or book archived.
17:16:33 ... CLOCKSS builds on the LOCKSS protocol
17:17:02 ... CLOCKSS adds controlled -- LOCKSS typically has more copies...
17:17:41 ... libraries have a traditional role of archives, but in digital world they are not the holders of the digital copies
17:18:11 ... so libraries wanted trusted 3rd parties to whom publishers could provide content
17:18:50 ... a 3rd party (like CLOCKSS) is a dark archive for safekeeping in case the resources become unavailable on the web (e.g., publisher goes out of business)
17:19:21 ... CLOCKSS for example has 20,000+ journals, so if a journal goes away, CLOCKSS can provide archival access
17:19:43 ... scholars highly dependent on the literature, so access to lit is crucial
17:20:01 ... CLOCKSS helps ensure that access
17:20:28 ... CLOCKSS harvests or crawls - publisher agrees that LOCKSS can harvest
17:20:50 ... harvested content is then put in the 12 nodes that CLOCKSS has spread out across the globe
17:21:13 ... 2nd method is to accept files from the publisher (or retrieved from the publisher)
17:21:48 ...
Regardless of whether delivered by publisher or crawled, the norm is to simply archive the files, not to normalize
17:22:18 ... we do, however, perform quality assurance on the materials ingested.
17:22:41 ... LOCKSS allows nodes to confirm that all nodes have the same content
17:22:58 ... a voting system can be used among the nodes to validate that the data is correct
17:23:25 ... this checking that all copies match is constantly ongoing, and copies are repaired as needed
17:24:45 Bill_Kasdorf: question about nature of the content in CLOCKSS
17:25:03 ... when you sign on a publisher is it all content of that publisher?
17:25:14 ... is it mostly scientific literature?
17:25:30 Craig: broader than just science
17:25:42 ... anything scholars access from publishers
17:26:02 ... in theory all journal and book publications, but in practice books may not be added right away
17:26:23 ... databases, other content types not always in scope
17:26:52 Nicholas: drilling down on how CLOCKSS acquires content by harvest
17:27:04 ... one challenge is that publishers use different platforms
17:27:23 ... we have to account for differences in how content and related files are served
17:27:37 ... we create 'plugins' for each publisher
17:28:02 ... one thing that might be useful about PWP would be more consistent presentation across publisher platforms
17:28:12 ... this might make it easier to acquire content
17:28:48 ... but one concern would be that PWP might be a parallel - which is canonical version? Do they stay in sync?
17:29:17 lrosenth: Our thinking right now - we recognize that for any given publication there is a canonical version
17:30:06 ... our strategy is that there will be a locator associated with any 'copy' that refers back to canonical copy and/or breadcrumbs through versions that you have to follow
17:30:36 ... so for example if annotations are added you may have a new version and so a new locator, but you can still go back to the original
17:30:59 ...
There is no requirement that a PWP is served as a package
17:31:40 ... for example a publication talking about the Mona Lisa, and so the most correct version of the pub references the Mona Lisa at the Louvre
17:32:21 Nicholas: the manifest sounds like the Signposting being discussed in the Web Archiving Community more generally
17:32:24 q+
17:32:42 lrosenth: yes, sounds relevant
17:32:46 ack TimCole
17:34:44 q+
17:35:15 lrosenth: the manifest documents everything, every part that is necessary to 'present' / 'consume' the publication
17:35:25 ... this would include what a machine might need
17:35:53 ... if you utilize the manifest, and process the manifest, you have the set of elements needed for the publication
17:36:11 ack BI
17:36:15 ack Bill
17:36:27 q?
17:37:16 Bill_Kasdorf: If there are elements that the publisher wants to 'protect', the publisher can count on CLOCKSS not to release any of this until a trigger event occurs
17:37:37 Craig: Yes
17:38:12 Bill_Kasdorf: So there is another level of abstraction, potentially
17:38:28 Bill_Kasdorf: a font might be an example
17:38:46 ... some of these fonts may require licenses
17:38:59 ... so the PWP may name a fallback
17:39:29 lrosenth: Yes, that seems right. Important to look at.
17:39:42 ... may define which PWPs are truly archival
17:40:12 Nicholas: Seems like spec has a lot of potential
17:40:31 ... as it stands now we have this 2-pronged archival approach
17:40:48 ... PWP may bring these together a little more
17:41:20 ... the manifest idea or signposting or some level of semantic annotation that projects the publisher's perspective of the publication would be very useful
17:41:38 q+
17:41:42 ... right now we have a set of heuristics that we need to keep revisiting
17:42:06 ... a little difficult to know what it might look like in practice
17:42:08 ack lr
17:42:34 lrosenth: as I understand LOCKSS, you make no requirements on the content
17:42:50 ...
so if a trigger event occurs, you release what you have
17:43:35 Craig: not making any kind of legal warrant
17:43:47 ... but our expectation is to deliver content in a consumable way
17:44:00 ... we're not focused on user experience, but rather on content
17:44:20 lrosenth: Bill was talking about fonts
17:44:41 ... in case you have a document that references a font, but the font is not archived
17:44:58 ... you don't overtly address font issue?
17:45:31 Nicholas: Once the content is archived, we try to make sure that all the required content is archived
17:45:40 ... not sure if we are going after fonts
17:45:42 q+
17:46:06 lrosenth: has a lot of implications for what we are trying to do
17:46:45 Bill_Kasdorf: as an XML guy, ideally content is Unicode, so you should be able to consume the content, albeit without the proper glyph
17:47:12 lrosenth: But since they don't normalize, they don't ensure Unicode is being made
17:47:17 ack ti
17:48:29 TimCole: manifest may make it easier to make sure you get everything you need?
17:49:34 Nicholas: yes it can be difficult to know for each platform exactly which links should be collected (e.g., fonts vs. publisher home page)
17:50:13 Nicholas: is this related to packaging on the Web (W3C)?
17:50:46 lrosenth: that initiative is broader and separate, but as we start talking about what our packages look like, we will look at that work
17:53:53 Bill_Kasdorf: would Nicholas have more time to join DPub IG? Since Stanford is already a member of W3C
17:54:46 Nicholas: have already joined...
17:57:26 rrsagent, make log public
17:57:45 rrsagent, draft minutes
17:57:45 I have made the request to generate http://www.w3.org/2016/03/24-dpub-arch-minutes.html TimCole
17:58:12 rrsagent, make log public
18:08:42 bye, zakim
18:09:03 zakim, bye
18:09:03 leaving. As of this point the attendees have been Bill_Kasdorf, Leonard, Tim_Cole, Ayla_Stein
18:09:03 Zakim has left #dpub-arch
18:09:12 rrsagent, bye
18:09:12 I see no action items